This is the second post in the blog series, and it is meant to give a broad narrative of the content of the next two blog posts. Like the previous post, it is more of an overview; the two posts that follow will unpack and discuss in depth everything that appears here, and slightly more.
Humble Beginnings: Ordinary Differential Equations
The story begins with differential equations. Consider $f: \mathbb{R}^n \to \mathbb{R}^n$ such that $f$ is a continuous function. We can construct a rather simple differential equation given this in the following way. We let
$$\dot{x}(t) = f(x(t)), \qquad x(0) = x_0.$$
A solution to this system is a continuously differentiable map $x$ that is defined in a neighbourhood of $t = 0$ such that this map satisfies the differential equation.
Ordinary differential equations are well-studied, and we know that, for example, a unique solution to the given differential equation will exist (at least locally) whenever the function $f$ satisfies the following: there is a constant $L \geq 0$ such that
$$\|f(x) - f(y)\| \leq L \, \|x - y\| \quad \text{for all } x, y.$$
This property is known as Lipschitz continuity. A function that satisfies this condition is said to be Lipschitz. We shall see that whenever we require this condition in upcoming situations, wonderful things happen!
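To make the Lipschitz condition concrete, here is a minimal Python sketch that estimates the largest difference quotient of a function on a sample grid. The helper name `max_difference_quotient`, the grid, and the two test functions are illustrative assumptions, not part of the original post. For $\sin$ the quotient never exceeds 1, so $L = 1$ works; for $\sqrt{|x|}$ the quotient blows up near zero, so no finite $L$ exists.

```python
import numpy as np

def max_difference_quotient(f, xs):
    """Largest |f(x) - f(y)| / |x - y| over all distinct pairs of samples."""
    fx = f(xs)
    dx = np.abs(xs[:, None] - xs[None, :])   # pairwise |x - y|
    df = np.abs(fx[:, None] - fx[None, :])   # pairwise |f(x) - f(y)|
    mask = dx > 0                            # skip the diagonal (x == y)
    return float(np.max(df[mask] / dx[mask]))

# Hypothetical sample grid on [-1, 1]; it contains a point at (or next to) 0.
xs = np.linspace(-1.0, 1.0, 401)

sin_ratio = max_difference_quotient(np.sin, xs)
sqrt_ratio = max_difference_quotient(lambda x: np.sqrt(np.abs(x)), xs)

print("sin:", sin_ratio)          # stays below 1
print("sqrt(|x|):", sqrt_ratio)   # large, and grows as the grid is refined
```

A finite grid can only suggest a Lipschitz constant, never prove one; the point of the sketch is the contrast between the two functions.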
A remarkable field that almost always couples with differential equations is numerical analysis, where we learn to solve differential equations numerically and study the resulting numerical schemes. We shall explore numerical integration briefly.
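We note that the solution to the differential equation above is a function:
$$x: [0, T] \to \mathbb{R}^n.$$
With the initial value $x(0) = x_0$, we can write the differential equation as
$$x(t) = x_0 + \int_0^t f(x(s)) \, ds$$
under suitable conditions; for example, it is sufficient for $f$ to be Lipschitz.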
One might then ask whether this integral can be approximated, which is what we need in order to solve problems numerically rather than analytically: if we can do that, then we can compute an approximate solution to the initial differential equation. There are a couple of ways to do this. We give examples.
Examples of Numerical Integration Schemes:
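As a hedged illustration of one-step schemes, the sketch below implements explicit Euler (which the post refers to later) and Heun's method, an explicit variant of the trapezoidal rule. The choice of Heun's method, the function names, and the test problem $\dot{x} = -x$, $x(0) = 1$ are assumptions made for this example only.

```python
import math

def euler_step(f, x, dt):
    """One explicit Euler step: x_{n+1} = x_n + dt * f(x_n)."""
    return x + dt * f(x)

def heun_step(f, x, dt):
    """One Heun step: average the slope at x_n and at the Euler prediction."""
    predictor = x + dt * f(x)
    return x + 0.5 * dt * (f(x) + f(predictor))

def integrate(step, f, x0, t_end, n_steps):
    """Apply a one-step scheme n_steps times on the interval [0, t_end]."""
    dt = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x = step(f, x, dt)
    return x

# Hypothetical test problem: dx/dt = -x, x(0) = 1, exact solution exp(-t).
f = lambda x: -x
exact = math.exp(-1.0)

euler_error = abs(integrate(euler_step, f, 1.0, 1.0, 100) - exact)
heun_error = abs(integrate(heun_step, f, 1.0, 1.0, 100) - exact)
print("Euler error:", euler_error)
print("Heun error: ", heun_error)
```

Both schemes advance from $x_n$ to $x_{n+1}$ using information from that single step only; Heun's method pays for a second evaluation of $f$ and gets a noticeably smaller error on this problem.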
Notice that these methods use only neighbouring discretisation points to do the estimation. It is possible to do more than this, and use more points. One method that allows you to do this is what’s known as the linear multi-step method. This is a numerical scheme of the form
$$\sum_{j=0}^{k} \alpha_j x_{n+j} = \Delta t \sum_{j=0}^{k} \beta_j f(x_{n+j}),$$
where, in the same way, we can specify the coefficients $\alpha_j$ and $\beta_j$. Here $x_n$ denotes the approximation to $x(n \Delta t)$ and $\Delta t$ is the step size.
This uses $k$ steps instead of only one step to do the approximations. We won’t devote much time to this, because we will not be seeing it again any time soon, but it is a really powerful tool, and offers great insight into numerical analysis as a field.
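One well-known instance of a linear multi-step method is the two-step Adams-Bashforth scheme, $x_{n+2} = x_{n+1} + \Delta t \left( \tfrac{3}{2} f(x_{n+1}) - \tfrac{1}{2} f(x_n) \right)$. The sketch below is a minimal Python version under stated assumptions: the scheme is chosen for illustration (the post does not single it out), the first step is bootstrapped with explicit Euler, and the test problem $\dot{x} = -x$ is hypothetical.

```python
import math

def adams_bashforth2(f, x0, t_end, n_steps):
    """Two-step Adams-Bashforth on [0, t_end] with n_steps equal steps."""
    dt = t_end / n_steps
    x_prev = x0
    # A two-step method needs two starting values; take one Euler step.
    x_curr = x_prev + dt * f(x_prev)
    for _ in range(n_steps - 1):
        # Combine the slopes at the two most recent points.
        x_next = x_curr + dt * (1.5 * f(x_curr) - 0.5 * f(x_prev))
        x_prev, x_curr = x_curr, x_next
    return x_curr

# Hypothetical test problem: dx/dt = -x, x(0) = 1, exact solution exp(-t).
f = lambda x: -x
ab2_error = abs(adams_bashforth2(f, 1.0, 1.0, 100) - math.exp(-1.0))
print("AB2 error:", ab2_error)
```

Unlike a one-step scheme, each update reuses the slope from the previous point, so only one new evaluation of $f$ is strictly needed per step if the old value is cached.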
Motivations and Mechanisms for Residual Networks as well as Connections to Ordinary Differential Equations
More recently, researchers found themselves with the following problem. Around 2017, it became clear that very often using deep neural networks as we know them can be quite inefficient. For example, if one compared two networks, one deep and the other shallow, it was observed that the accuracy of the deeper network would tend to saturate as more layers were added. This meant that gradients were not being computed efficiently. As a way around this, the following was done.
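Suppose we define a neural network, in the usual sense, as a sequence of transformations
$$h_{t+1} = \sigma(W_t h_t + b_t), \qquad t = 0, 1, \dots, T - 1,$$
where $h_0 = x$ is the input, $\sigma$ is a non-linear transformation and $h \mapsto W_t h + b_t$ is an affine transformation.
Then we overcome the stated problems by considering a model of the form
$$h_{t+1} = h_t + f(h_t, \theta_t),$$
where $f$ is a layer of the kind above and $\theta_t$ collects its parameters (for instance $W_t$ and $b_t$).
This is precisely what is known as a residual network.
The idea is that we would like to approximate a function $H(x)$, and instead of approximating this function directly we consider the function
$$F(x) = H(x) - x,$$
then we approximate this instead, as in the residual net equation, with $f$ playing the role of $F$. After each unit, we add the unit's input back to its output; here that input is simply $h_t$.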
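The difference between a plain layer and a residual unit can be sketched in a few lines of NumPy. Everything specific here is an assumption for illustration: $\tanh$ stands in for the non-linearity, the weights are random, and the names `layer` and `residual_block` are hypothetical.

```python
import numpy as np

def layer(h, W, b):
    """Plain layer: a non-linearity applied to an affine map of h."""
    return np.tanh(W @ h + b)

def residual_block(h, W, b):
    """Residual unit: the layer's output is added to the layer's input."""
    return h + layer(h, W, b)

rng = np.random.default_rng(0)
dim, depth = 4, 3

# Hypothetical small random parameters, one (W, b) pair per unit.
params = [(0.1 * rng.standard_normal((dim, dim)), np.zeros(dim))
          for _ in range(depth)]

x = rng.standard_normal(dim)   # network input
h = x.copy()
for W, b in params:
    h = residual_block(h, W, b)

print("input: ", x)
print("output:", h)
```

With all weights set to zero the residual unit reduces to the identity map, whereas the plain layer would output zero. That is one way to see why the unit only has to learn the residual $H(x) - x$.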
An example of this kind of structure is given below.
It is also quite interesting to notice that the form
$$h_{t+1} = h_t + f(h_t, \theta_t)$$
resembles the explicit Euler integration scheme for $\dot{h} = f(h, \theta)$. In fact, because the non-linear function $f$ is general enough to absorb a constant factor, we can write the residual network transformation as
$$h_{t+1} = h_t + \Delta t \, f(h_t, \theta_t).$$
In this form, the relationship really stands out. Note that here $\Delta t$ is just a constant, and need not be the time-step; we include it just to ease the reader’s intuition about what follows.
Establishing Neural Ordinary Differential Equations as Continuous Limits of Residual Networks
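Notice that a residual network can be written as
$$\frac{h_{t+1} - h_t}{\Delta t} = f(h_t, \theta_t),$$
where the $\theta_t$ are neural network parameters as we saw above and $\Delta t$ is a constant.
This takes the form of a numerical derivative for $h$ where the step is $\Delta t$.
We might ask: what if we considered $\Delta t \to 0$, and then we added many layers?
In this case the layer index plays the role of time, so that $h_t$ becomes $h(t)$ and $h_{t+1}$ becomes $h(t + \Delta t)$, and we get
$$\lim_{\Delta t \to 0} \frac{h(t + \Delta t) - h(t)}{\Delta t} = f(h(t), t, \theta),$$
which we realise as
$$\frac{dh(t)}{dt} = f(h(t), t, \theta).$$
Recall that $h(0) = x$ is the input, so this is simply a differential equation with an initial condition.
As before, we can write
$$h(T) = h(0) + \int_0^T f(h(t), t, \theta) \, dt.$$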
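The limit can be checked numerically on a toy problem. The sketch below is an illustration under strong assumptions, not a neural ODE implementation: the "layer" is a fixed linear map $f(h) = A h$ shared by all layers (no learned or time-dependent parameters), chosen because the exact flow is a rotation that we can write down. It runs a residual network with $N$ layers and $\Delta t = T / N$ and compares the output with the exact $h(T)$.

```python
import numpy as np

# Hypothetical fixed "layer" shared by all residual units: f(h) = A h.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

def f(h):
    return A @ h

def residual_forward(h0, n_layers, T=1.0):
    """Residual network with n_layers units, each h <- h + dt * f(h)."""
    dt = T / n_layers
    h = h0.copy()
    for _ in range(n_layers):
        h = h + dt * f(h)
    return h

def exact_flow(h0, T=1.0):
    """Exact solution of dh/dt = A h at time T: a rotation by angle T."""
    c, s = np.cos(T), np.sin(T)
    return np.array([[c, -s], [s, c]]) @ h0

h0 = np.array([1.0, 0.0])
depths = (10, 100, 1000)
errors = [float(np.linalg.norm(residual_forward(h0, n) - exact_flow(h0)))
          for n in depths]

for n, err in zip(depths, errors):
    print(f"{n:5d} layers: error {err:.2e}")
```

The error shrinks roughly in proportion to $\Delta t$, which is the expected behaviour of explicit Euler: more layers with smaller steps bring the discrete network closer to the continuous flow.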
Comparing Residual Nets to Neural ODE Nets in Behaviour
We can compare the fields of these models using the following picture:
Here the circles represent evaluation locations, and we see that a residual network defines a discrete sequence of finite transformations while an ODE network defines a vector field, which continuously transforms the state.
To summarize everything:
We saw that ordinary differential equations are creatures that can be studied and solved numerically. After this, we visited the problem of error saturation and gradient degradation in deep networks, at which point we also introduced residual networks. This led us right into neural ordinary differential equations. We will spend some time on the functionality of these two methods, which will take up the next two posts, one for each method.