This is the second post in the blog series, and it is meant to give a broad narrative of the content of the next two posts. Like the previous post, it will be more of an overview, but the two posts that follow it will unpack and discuss deeply whatever appears in this one, and slightly more.

Humble Beginnings: Ordinary Differential Equations

The story begins with differential equations. Consider a continuous function f:[0,T]\times \mathbb{R}^n\to \mathbb{R}^n. Given such an f, we can construct a rather simple differential equation in the following way. We let

\begin{cases}  {y'(t)}=f(t,y(t))\\  y(0)=y_0\in \mathbb{R}^n  \end{cases}

A solution to this system is a differentiable map, defined in a neighbourhood of t=0, that satisfies the differential equation.

Ordinary differential equations are well-studied, and we know, for example, that a unique solution to the given differential equation will exist whenever the function f satisfies the following:

(\exists C>0)(\forall t\in [0,T])(\forall x,y\in \mathbb{R}^n)(||f(t,y)-f(t,x)||\leq C||y-x||)

This property is known as Lipschitz continuity (here, in the second argument, uniformly in t). A function that satisfies this condition is said to be Lipschitz. We shall see that whenever we require this condition in upcoming situations, wonderful things happen!

A remarkable field that almost always couples with differential equations is numerical analysis, where we learn to solve differential equations numerically and study the resulting numerical schemes. We shall explore numerical integration briefly.

We note that the solution to the differential equation above satisfies the integral equation:

y(t)=y_0+\int_0^tf(s,y(s))ds

This suggests the iteration (known as Picard iteration)

y_{n+1}(t)=y_0+\int_0^tf(s,y_n(s))ds

where

\lim _{n\to \infty}y_n(t)=y(t)

under suitable conditions; for example, it is sufficient for f to be Lipschitz.
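To make this concrete, here is a minimal numpy sketch of the iteration above (the function name `picard_iterate` and the use of the trapezoidal rule for the inner integral are my own choices, not part of the original method description), applied to y'=y, y(0)=1, whose exact solution is e^t:

```python
import numpy as np

def picard_iterate(f, y0, t_grid, n_iters=10):
    """Approximate y(t) = y0 + int_0^t f(s, y(s)) ds by Picard iteration,
    using the trapezoidal rule on a fixed grid for the integral."""
    y = np.full_like(t_grid, y0, dtype=float)  # y_0(t) = y0, the constant initial guess
    for _ in range(n_iters):
        integrand = f(t_grid, y)
        # cumulative trapezoidal integral of f(s, y_n(s)) from 0 to each t
        integral = np.concatenate(([0.0], np.cumsum(
            0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t_grid))))
        y = y0 + integral  # y_{n+1}(t) = y0 + int_0^t f(s, y_n(s)) ds
    return y

# y' = y, y(0) = 1 has the exact solution y(t) = e^t
t = np.linspace(0.0, 1.0, 101)
y = picard_iterate(lambda t, y: y, 1.0, t)
```

After ten iterations, y[-1] agrees with e to within roughly the accuracy of the trapezoidal quadrature, which is the Lipschitz-driven convergence at work.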

One might then ask if it is possible to approximate this integration (which is something that is needed in order to solve problems numerically, as opposed to analytically) because if we can do that, then we can easily compute the solution to the initial differential equation. There are a couple of ways to do this. We give examples.

Examples of Numerical Integration Schemes:

  • Explicit Euler method: \hat y_{n+1}=\hat y_n+hf(t_n,\hat y_n)
  • Implicit Euler method: \hat y_{n+1}=\hat y_{n}+hf(t_{n+1},\hat y_{n+1})
  • Trapezoidal method: \hat y_{n+1}=\hat y_n+\frac{h}{2}(f(t_n,\hat y_n)+f(t_{n+1},\hat y_{n+1}))
  • Notice that these methods use only neighbouring discretisation points to do the estimation. It is possible to do more than this, and use more points. One family of methods that allows you to do this is what’s known as the linear multi-step methods. These are numerical schemes of the form

    \sum_{j=-1}^{k}\alpha _j \hat y_{n-j}=h\sum _{j=-1}^k\beta _j f(t_{n-j},\hat {y}_{n-j})

    where in the same way we specify the coefficients \alpha_j,\,\beta_j to pin down a particular scheme.
    Such a scheme uses k steps instead of only one step to do the approximations. We won’t devote much time to this, because we will not be seeing it again any time soon, but it is a really powerful tool, and offers great insight into numerical analysis as a field.
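As a concrete instance of the simplest scheme above, here is a minimal sketch of the explicit Euler method (the function name `explicit_euler` is mine), applied to y'=-2y, y(0)=1, whose exact solution is e^{-2t}:

```python
import numpy as np

def explicit_euler(f, y0, t0, T, h):
    """Explicit Euler: y_{n+1} = y_n + h * f(t_n, y_n), marched from t0 to T."""
    ts, ys = [t0], [y0]
    t, y = t0, y0
    while t < T - 1e-12:  # small tolerance guards against float round-off
        y = y + h * f(t, y)
        t = t + h
        ts.append(t)
        ys.append(y)
    return np.array(ts), np.array(ys)

# y' = -2y, y(0) = 1; exact solution y(t) = exp(-2t)
ts, ys = explicit_euler(lambda t, y: -2.0 * y, 1.0, 0.0, 1.0, 0.001)
```

Note that the implicit Euler and trapezoidal methods are not as direct to implement: since \hat y_{n+1} appears on both sides, each step requires solving an (in general nonlinear) equation for \hat y_{n+1}.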

Motivations and Mechanisms for Residual Networks as well as Connections to Ordinary Differential Equations

More recently, researchers found themselves with the following problem: around 2015, it became clear that using very deep neural networks as we know them can be quite inefficient. For example, if one had two networks, one deep and the other shallow, then it was observed that the accuracy of the deeper network would tend to saturate and then degrade as more layers were added, and gradients would not propagate efficiently through it. As a way around this, the following was done.

If we define a neural network, in the usual sense, as a sequence of transformations:

x_{k+1}=p(W_k\cdot x_k+b_k)

where x_0 is the input, p is a non-linear transformation and W_k\cdot x_k +b_k is an affine transformation, then we overcome the stated problem by considering a model of the form:

x_{k+1}=x_k+p(W_k\cdot x_k+ b_k)

This is precisely what is known as a residual network.

The idea is that we would like to approximate a function \mathcal{H}. Instead of approximating this function directly, we consider the residual

\mathcal{F}(x)\equiv \mathcal{H}(x)-x,

and approximate \mathcal{F} with the layer p in the residual net equation. After each unit, we add x back, so the unit outputs \mathcal{F}(x)+x, which is simply \mathcal{H}(x).

An example of this kind of structure is given below.

[Figure: a single residual unit.]

It turns out that this formulation makes training much easier and improves accuracy, for reasons we are going to explore in the next blog post.
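A minimal numpy sketch of a single residual unit may help here (ReLU stands in for the nonlinearity p, and the function name `residual_unit` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_unit(x, W, b):
    """One residual unit: x_{k+1} = x_k + p(W x_k + b), with p = ReLU here.
    The layer only has to learn the residual F(x) = H(x) - x;
    the skip connection adds x back, producing F(x) + x = H(x)."""
    return x + np.maximum(0.0, W @ x + b)

x = rng.standard_normal(4)
W = rng.standard_normal((4, 4))
b = rng.standard_normal(4)
out = residual_unit(x, W, b)
```

One hint at why training becomes easier: if the weights and biases are zero, the unit is exactly the identity map, so "doing nothing" is trivially representable, which a plain layer p(W x + b) cannot do.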

It is also quite interesting to notice that the form

x_{k+1}=x_k+p(W_k\cdot x_k+ b_k)

resembles the explicit Euler integration scheme with h=1. In fact, with the generality of the non-linear function, we can write the residual network transformation as

x_{k+1}=x_k+hp(W_k\cdot x_k+ b_k)

In this form, the relationship really stands out. Note that here h is just a constant, and need not be a time-step; we include it only to ease the reader’s intuition about what follows.

Establishing Neural Ordinary Differential Equations as Continuous Limits of Residual Networks

Notice that a residual network can be written as:

x_{k+1}=x_{k}+p(x_k,\theta_k)

where the \theta_k are the neural network parameters we saw above, and so

x_{k+1}-x_{k}=p(x_k,\theta_k)

This takes the form of a numerical derivative of x_k with step \Delta k=1.

We might ask: what happens if we let \Delta k\to 0 while adding more and more layers?

In this case we get

\lim_{\Delta k\to 0}\dfrac{x_{k+1}-x_k}{\Delta k}= \lim_{\Delta k\to 0}p(x_k,\theta_k)

which we realise as

\dfrac{dx}{dk}=p(k,x(k),\theta).

Recall that x_0 is the input, so this is simply a differential equation with an initial condition.

\begin{cases}  {x'}=p(k,x(k),\theta)\\  x(0)=x_0\in \mathbb{R}^n  \end{cases}

As before, we can write

x(t)=x_0+\int_0^tp(s,x(s),\theta)ds
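In other words, the forward pass of such a network is one solve of an ODE, which any numerical integrator can perform. Here is a minimal sketch under my own assumptions: p is a tanh layer with fixed parameters \theta=(W,b), and the integrator is a hand-rolled classical Runge–Kutta (RK4) scheme rather than any particular library solver:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3)) * 0.1
b = rng.standard_normal(3) * 0.1

def p(t, x):
    """The 'layer' p(t, x(t), theta): a tanh layer with fixed parameters theta = (W, b)."""
    return np.tanh(W @ x + b)

def odeint_rk4(f, x0, t0, t1, n_steps=100):
    """Integrate x'(t) = f(t, x) from t0 to t1 with the classical fixed-step RK4 scheme."""
    h = (t1 - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        k1 = f(t, x)
        k2 = f(t + h / 2, x + h / 2 * k1)
        k3 = f(t + h / 2, x + h / 2 * k2)
        k4 = f(t + h, x + h * k3)
        x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

x0 = np.array([1.0, -1.0, 0.5])   # the network's input is the initial condition
x_out = odeint_rk4(p, x0, 0.0, 1.0)  # the network's output is x(1)
```

The design choice of solver is free: replacing RK4 with explicit Euler at step size h recovers exactly a residual network with that h, which is the correspondence described above.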

Comparing Residual Nets to Neural ODE Nets in Behaviour

We can compare the behaviour of these models using the following picture:

[Figure: transformation behaviour of a residual network versus a neural ordinary differential equation network.]

Here the circles represent evaluation locations, and we see that a residual network defines a discrete sequence of finite transformations, while an ODE network defines a vector field which continuously transforms the state.

To summarise everything:
We saw that ordinary differential equations are creatures that can be studied and solved numerically. After this, we visited the problem of accuracy saturation and gradient degradation in deep networks, at which point we also introduced residual networks. This led us right into neural ordinary differential equations. We will spend some time on the functionality of these two methods, which will take at least two of the next posts, one for each method.
