Much of this content is based on lecture slides from Professor David Barber at University College London; resources relating to this can be found at www.cs.ucl.ac.uk/staff/D.Barber/brml

What is Autodiff?

Autodiff, or Automatic Differentiation, is a method of determining the exact derivative of a function with respect to its inputs. It is widely used in machine learning. In this post I will give an overview of what autodiff is and why it is a useful tool.

The above is not a very helpful definition on its own, so let us first compare autodiff to symbolic differentiation and numerical approximation before going into how it works.

Symbolic differentiation is what we do when we calculate derivatives by hand: given a function f, we find a new function f'. This is really useful when we want to know how a function behaves across all inputs. For example, if we had f(x) = x^2 + 3x + 1 we can find the derivative as f'(x) = 2x + 3, which gives us the derivative for every value of x; for x = 2 we have f'(2) = 7.
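This is also what a computer algebra system does. As a minimal sketch of symbolic differentiation in code, assuming the sympy library is available (the function is just the example above):

import sympy as sp

x = sp.symbols('x')
f = x ** 2 + 3 * x + 1
f_prime = sp.diff(f, x)     # symbolic derivative, returned as an expression
print(f_prime)              # 2*x + 3
print(f_prime.subs(x, 2))   # 7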

Numerical approximations come from the definition of the derivative: for the strict definition we take the limit as \delta \rightarrow 0; for the numerical approximation we just take a small \delta and calculate the approximation as follows:

f'(x) \approx \frac{f(x+\delta) - f(x)}{\delta},

Using the same example as above with \delta = 0.01 and x = 2, we have f'(x) \approx \frac{11.0701-11}{0.01} = 7.01. This is pretty close!
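Here is a minimal sketch of this finite-difference approximation in Python, using the same example function (the helper name numerical_derivative is just for illustration):

def f(x):
    return x ** 2 + 3 * x + 1

def numerical_derivative(f, x, delta=0.01):
    # (f(x + delta) - f(x)) / delta for a small delta
    return (f(x + delta) - f(x)) / delta

print(numerical_derivative(f, 2))   # roughly 7.01, versus the exact value of 7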

We use autodiff instead of these two options because:

  • It is computationally more efficient than symbolic differentiation
  • Often we only need the derivative at a point rather than an expression that gives the derivative for the whole function
  • It returns an exact value for the derivative
  • Sometimes we will be working with functions that include computational structures such as ‘for loops’ and ‘if’ statements. This makes getting a symbolic representation of the function as a whole a lot harder; breaking it into steps allows for easier computation (see the sketch after this list).
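To illustrate the last point, here is a minimal sketch of autodiff applied to a function containing a loop and an if statement, something that is awkward to write down as one symbolic expression. It assumes the PyTorch library is installed, and the function itself is made up purely for illustration:

import torch

def g(x):
    # A made-up function with control flow: repeatedly square x,
    # adding 1 only while the running value is still small.
    y = x
    for _ in range(3):
        y = y * y
        if y < 10:
            y = y + 1
    return y

x = torch.tensor(1.5, requires_grad=True)
y = g(x)
y.backward()    # reverse mode autodiff through the loop and the branches taken
print(x.grad)   # the exact derivative of g at x = 1.5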

Forward and reverse mode autodiff

Forward mode autodiff is less computationally efficient than reverse mode autodiff when a function has many inputs and only a few outputs, which is the typical situation in machine learning. We will consider both briefly below.

What will become apparent, looking at the calculations, is that doing differentiation in this way will get you the same answers as a direct application of the Chain Rule. The reason for building a computational tree and for following the schedules outlined below is that we want to get the same answer as the Chain Rule but in a computationally efficient way.

What I want to do here is to give an example of a computational tree, which is an idea that we will leverage for the autodiff calculations. We will use this tree as a very simple example to demonstrate:

  • how to do forward and reverse mode autodiff
  • why reverse mode autodiff is typically faster

 

[Figure: computational graph showing the three inputs x_1, x_2, x_3 feeding into the nested functions f_1, f_2 and f_3]

The above computational tree represents a nested function with three inputs. Let us define the function as follows:

f_3(f_2(f_1(x_1,x_2,x_3))) = \sin(\cos(x_1 + x_2))^2 + x_3

For forward-mode autodiff: if we want to find d_{x_2}, the derivative of f_3 with respect to x_2, we start at f_1 and step through the tree as follows:

  • d_{x_2} = \frac{\partial f_1}{\partial x_2}
  • d_{x_2} = d_{x_2} \times \frac{\partial f_2}{\partial f_1}
  • d_{x_2} = d_{x_2} \times \frac{\partial f_3}{\partial f_2}

and then we are done. This is clearly the same answer as given by the Chain Rule. To find the derivatives with respect to x_1 and x_3 we would do the corresponding calculations for those two inputs.
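As a concrete check, if we read the graph as f_1 = x_1 + x_2, f_2 = \cos(f_1) and f_3 = \sin(f_2)^2 + x_3 (one possible decomposition of the example function above), the three steps give:

d_{x_2} = 1 \times (-\sin(x_1 + x_2)) \times 2\sin(\cos(x_1 + x_2))\cos(\cos(x_1 + x_2)),

which is exactly what a direct application of the Chain Rule to f_3 produces.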

In the case where x_2 is an input to functions that are not nested (as shown below), we sum the contributions from each path through the graph, as in the steps and the short worked example that follow.

[Figure: computational graph in which x_2 is an input to both f_1 and f_2, which are not nested and which both feed into f_3]

  • d^{f_1}_{x_2} = \frac{\partial f_1}{\partial x_2}, \quad d^{f_2}_{x_2} = \frac{\partial f_2}{\partial x_2}
  • d_{x_2} = d^{f_1}_{x_2} \times \frac{\partial f_3}{\partial f_1} + d^{f_2}_{x_2} \times \frac{\partial f_3}{\partial f_2}
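For instance, if (purely for illustration) f_1 = x_2^2, f_2 = \sin(x_2) and f_3 = f_1 + f_2, these two steps give d_{x_2} = 2x_2 + \cos(x_2), which matches the derivative of x_2^2 + \sin(x_2) computed directly.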

Reverse mode differentiation follows the exact same process, except backwards. The reason why this is a good idea can be seen when we look at the first computational graph. If we want the derivative of f_3 with respect to x_1, x_2 \text{ and } x_3, starting from the top of the graph each time would result in some redundant calculations. In all three cases we need \frac{\partial f_3}{\partial f_2} and \frac{\partial f_2}{\partial f_1}, so forward mode autodiff would calculate these two partial derivatives three times each, instead of once each for reverse mode autodiff. Here that is only 4 extra calculations, which does not seem like a big deal, but if we had, say, 500 inputs into f_1, it might slow things down considerably.

So, starting from the bottom of the graph and working upwards, we can get all three derivatives while doing fewer calculations. We can find them as follows (here t_v denotes the derivative of f_3 with respect to the node v):

  • t_{f_2} = \frac{\partial f_3}{\partial f_2}
  • t_{f_1} = t_{f_2} \times \frac{\partial f_2}{\partial f_1}
  • t_{x_1} = t_{f_1} \times \frac{\partial f_1}{\partial x_1}
  • t_{x_2} = t_{f_1} \times \frac{\partial f_1}{\partial x_2}
  • t_{x_3} = t_{f_1} \times \frac{\partial f_1}{\partial x_3}

At the cost of some additional memory, reverse mode autodiff gives a faster, more efficient way of calculating exact derivatives of more complex computational routines.
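As a minimal sketch of what such a reverse sweep looks like in code, here is the example function again, read as f_1 = x_1 + x_2, f_2 = \cos(f_1), f_3 = \sin(f_2)^2 + x_3 (in this reading x_3 feeds f_3 directly, so its derivative is immediate), with a finite-difference check at the end. The decomposition and the variable names are my own choices, purely for illustration:

import math

def f(x1, x2, x3):
    return math.sin(math.cos(x1 + x2)) ** 2 + x3

def reverse_mode_gradient(x1, x2, x3):
    # Forward pass: evaluate and store the intermediate values.
    f1 = x1 + x2
    f2 = math.cos(f1)
    # Reverse sweep: push the derivative of the output back towards the inputs.
    t_f2 = 2 * math.sin(f2) * math.cos(f2)   # partial of f3 with respect to f2
    t_f1 = t_f2 * (-math.sin(f1))            # partial of f3 with respect to f1
    t_x1 = t_f1 * 1.0                        # partial of f1 with respect to x1 is 1
    t_x2 = t_f1 * 1.0                        # partial of f1 with respect to x2 is 1
    t_x3 = 1.0                               # x3 enters f3 directly, so its derivative is 1
    return t_x1, t_x2, t_x3

# Quick check against a finite-difference approximation.
x = (0.5, 1.0, 2.0)
delta = 1e-6
grad = reverse_mode_gradient(*x)
for i in range(3):
    bumped = list(x)
    bumped[i] += delta
    print(grad[i], (f(*bumped) - f(*x)) / delta)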

Applications

For non-convex optimisation problems, a common method of optimisation is through gradient-descent-based algorithms. These rely on calculating gradients of exactly the kind of functions discussed above. For example, ‘backpropagation’, which is used in training Deep Neural Networks, is a kind of reverse mode autodiff. Packages such as TensorFlow and PyTorch implement these kinds of calculations in order to train models/find optimal (in some sense) parameter values.
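As a toy sketch of this, again assuming PyTorch and using a made-up one-parameter objective, a gradient descent loop driven by reverse mode autodiff might look like:

import torch

w = torch.tensor(3.0, requires_grad=True)   # a single parameter, made up for illustration
learning_rate = 0.1

for step in range(50):
    loss = (w - 2.0) ** 2            # forward pass through a toy objective
    loss.backward()                  # reverse mode autodiff fills in w.grad
    with torch.no_grad():
        w -= learning_rate * w.grad  # gradient descent update
    w.grad.zero_()                   # clear the stored gradient for the next step

print(w)                             # close to the minimiser at w = 2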

 
