In this post we introduce two important concepts in multivariate calculus: the gradient vector and the directional derivative. These both extend the idea of the derivative of a function of one variable, each in a different way. The aim of this post is to clarify what these concepts are, how they differ and show that the directional derivative is maximised in the direction of the gradient vector.

The gradient vector

The gradient vector is, simply, a vector of partial derivatives. To find it, we 1) find the partial derivatives and 2) put them into a vector. So far so good. Let’s start this on some familiar territory: a function of 2 variables.

That is, let f: \mathbb{R}^2 \rightarrow \mathbb{R} be a function of 2 variables, x,y. Then the gradient vector can be written as:

\nabla f(x,y) =    \left [ {\begin{array}{c}    \frac{\partial f(x,y)}{\partial x}  \\    \frac{\partial f(x,y)}{\partial y}   \\    \end{array} } \right]

For a more tangible example, let f(x,y) = x^2 + 2xy, then:

\nabla f(x,y) =    \left [ {\begin{array}{c}    2x + 2y  \\    2x \\    \end{array} } \right]
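As a quick sanity check, we can compare these analytic partial derivatives against central finite differences. This is just a sketch; the function names (`grad_f`, `numerical_grad`) are our own, not from any library.

```python
# Numerical check of the gradient of f(x, y) = x^2 + 2xy
# via central finite differences.

def f(x, y):
    return x**2 + 2*x*y

def grad_f(x, y):
    # Analytic partial derivatives from above.
    return (2*x + 2*y, 2*x)

def numerical_grad(f, x, y, h=1e-6):
    # Central differences approximate each partial derivative.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2*h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2*h)
    return (dfdx, dfdy)

print(grad_f(2.0, 3.0))             # (10.0, 4.0)
print(numerical_grad(f, 2.0, 3.0))  # approximately (10.0, 4.0)
```

The numerical gradient agrees with the analytic one to within the truncation error of the finite-difference scheme.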

So far, so good. Now we can generalise this for a function f: \mathbb{R}^n \rightarrow \mathbb{R} taking in a vector \mathbf{x} = (x_1, x_2, x_3, \dots, x_n). Our gradient vector is now a vector of length n and is written in the expected way:

\nabla f(x_1,x_2, \dots, x_n)= \nabla f(\mathbf{x}) =    \left [ {\begin{array}{c}    \frac{\partial f(\mathbf{x})}{\partial x_1}  \\    \frac{\partial f(\mathbf{x})}{\partial x_2}   \\    \vdots \\    \frac{\partial f(\mathbf{x})}{\partial x_n}   \\    \end{array} } \right]

When we evaluate this vector at a point, we get a vector whose components are the rates of change of the function f with respect to each component of the input vector \mathbf{x}.

The directional derivative

The most important thing to note about the directional derivative is that it is a scalar, whereas the gradient is a vector. The directional derivative of a differentiable function f is defined as the dot product between the gradient \nabla f and a unit vector \mathbf{u}:

D = <\nabla f , \mathbf{u}> and it tells us, at any given point, the rate of change of f in the direction \mathbf{u}.

That is to say, if we start at some values of our input variables and change them in some way, how much does the value of f change?

Let’s see this in action with our initial example. Let’s say we are at the point (x_0,y_0) = (2,3).

Now let’s see how fast we are moving in each direction. First we want to see the rate of change when we change only x and leave y constant, i.e. along \mathbf{u} = (1,0). We evaluate:

<    \left [ {\begin{array}{c}    2x + 2y  \\    2x \\    \end{array} } \right] ,    \left [ {\begin{array}{c}    1 \\    0 \\    \end{array} } \right]>    and get 2x+2y = 2(2) + 2(3) = 10. This is exactly the value of the partial derivative with respect to x. Similarly, if we took the dot product of the gradient with \mathbf{u} = (0,1), we would get 2x = 4.
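In code, recovering the two partial derivatives by dotting the gradient with the coordinate axes looks like this (a minimal sketch; the helper name `directional_derivative` is our own):

```python
# Directional derivatives of f(x, y) = x^2 + 2xy at (2, 3) along the
# coordinate axes, computed as dot products with the gradient.

def directional_derivative(grad, u):
    # Dot product <grad, u> of the gradient with a unit vector u.
    return grad[0]*u[0] + grad[1]*u[1]

grad_at_point = (2*2 + 2*3, 2*2)   # gradient (10, 4) evaluated at (2, 3)

print(directional_derivative(grad_at_point, (1, 0)))  # 10 — partial w.r.t. x
print(directional_derivative(grad_at_point, (0, 1)))  # 4  — partial w.r.t. y
```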

What about if we vary x and y by the same amount? Then we would take the dot product of our gradient with the unit vector \mathbf{u} = [\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}]. Taking the dot product and evaluating this we now have:

<    \left [ {\begin{array}{c}    2x + 2y  \\    2x \\    \end{array} } \right] ,    \left [ {\begin{array}{c}    \frac{1}{\sqrt{2}} \\    \frac{1}{\sqrt{2}} \\    \end{array} } \right] >    which evaluates to \frac{4x+2y}{\sqrt{2}} =  \frac{14}{\sqrt{2}}

So this would be the rate of change at the point (2,3) if we moved in the x and y directions at the same rate. Loosely speaking, the numerator is made up of the 10 from moving in the x direction and the 4 from moving in the y direction, while the denominator is a normalising constant that scales the derivative down to compensate for the fact that we are now moving in 2 directions at once (for a sensible comparison of the rates of change in different directions we need to travel the same distance in each direction; it is easily checked that the proposed vector has length 1).
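The diagonal example can be checked numerically too, including the claim that \mathbf{u} really has unit length (a sketch with our own variable names):

```python
import math

# Rate of change of f(x, y) = x^2 + 2xy at (2, 3) along the diagonal
# unit vector (1/sqrt(2), 1/sqrt(2)).

grad_at_point = (10.0, 4.0)              # gradient of f at (2, 3)
u = (1/math.sqrt(2), 1/math.sqrt(2))     # proposed unit vector

rate = grad_at_point[0]*u[0] + grad_at_point[1]*u[1]
print(rate)            # 14/sqrt(2), approximately 9.8995
print(math.hypot(*u))  # approximately 1 — u really is a unit vector
```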

Direction of maximum change

Great, so now we have our gradient and we know how to find how fast the function is changing at any point in any direction. But what if we want to move in the direction that increases the value of the function as quickly as possible? That is, we want to find a vector \mathbf{u} such that <\nabla f,\mathbf{u}> is maximised subject to ||\mathbf{u}|| = 1.

We can write this as a constrained optimisation problem: maximise \sum_i (\nabla f)_i u_i subject to \sum_i u_i^2 = 1. We won’t do this here, as it is an unnecessary amount of work, but anyone up for some Lagrangian optimisation should give it a try in the 2-variable case with the given example.

Instead we will look at the dot product definition:

<\nabla f,\mathbf{u}> = |\nabla f||\mathbf{u}| \cos{\theta}, where \theta is the angle between the two vectors. Since |\mathbf{u}| = 1, this is maximised when \cos(\theta) = 1 (as \cos is bounded between -1 and 1) and therefore \theta = 0. When the angle between the 2 vectors is 0, they point in the same direction.
Therefore the direction of the maximum rate of change of the function is given by the direction of the gradient vector i.e. \frac{\nabla f}{|\nabla f|}.
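We can also verify this empirically for our example: sweep a fine grid of unit vectors (\cos t, \sin t), compute the directional derivative for each, and check that the best one matches the normalised gradient. This is a sketch with our own variable names, not a substitute for the proof above.

```python
import math

# Empirical check that the normalised gradient maximises the directional
# derivative of f(x, y) = x^2 + 2xy at the point (2, 3).

grad = (10.0, 4.0)         # gradient of f at (2, 3)
norm = math.hypot(*grad)   # |grad f| = sqrt(116), approximately 10.770

# Sweep unit vectors (cos t, sin t) and keep the one with the largest rate.
best_rate, best_u = max(
    (grad[0]*math.cos(t) + grad[1]*math.sin(t), (math.cos(t), math.sin(t)))
    for t in (2*math.pi*k/100000 for k in range(100000))
)

print(best_rate)                     # close to |grad f| = sqrt(116)
print(best_u)                        # close to the normalised gradient
print((grad[0]/norm, grad[1]/norm))  # the normalised gradient, for comparison
```

The maximum rate found equals |\nabla f| (up to the grid resolution), attained in the direction of the gradient, as the argument above predicts.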

This can be a very useful fact to know when optimising functions, and it is exploited extensively in numerical optimisation of non-convex functions.


Hopefully it is now clear what the gradient vector is, how to find the directional derivative in any direction, and that the gradient vector gives us the direction of maximum change for a given function.
