Humble Beginnings: Ordinary Differential Equations

The story begins with differential equations. Consider $f$ such that $f:[0,T]\times \mathbb{R}^n\to \mathbb{R}^n$ is a continuous function. We can construct a rather simple differential equation given this in the following way. We let

$\begin{cases} {y'(t)}=f(t,y(t))\\ y(0)=y_0\in \mathbb{R}^n \end{cases}$

A solution to this system is a continuous map that is defined in the neighbourhood of $t=0$ such that this map satisfies the differential equation.

Ordinary differential equations are well-studied, and we know that, for example, a solution to the given differential equation will exist whenever the function $f$ satisfies the following:

$(\forall x,y\in \mathbb{R}^n)(\exists C>0 (\in \mathbb{R}))(||f(t,y)-f(t,x)||\leq C||y-x||)$

This property is known as Lipschitz continuity. A function that satisfies this condition is said to be Lipschitz. We shall see that whenever we require this condition, on up coming situations, wonderful things happen!

A remarkable field that almost always couples with differential equations is numerical analysis, where we learn to solve differential equations numerically and we study these numerical schemes. We shall explore numerical integration briefly.

We note that the solution to the differential equation above is a function:

$y(t)=y_0+\int_0^tf(s,y(s))ds$

We can write this equation as

$y_{n+1}(t)=y_n(t)+\int_0^tf(s,y_n(s))ds$

where

$\lim _{n\to \infty}y_n(t)=y(t)$

under suitable conditions–for example, it is sufficient for $f$ to be Lipschitz.

One might then ask if it is possible to approximate this integration (which is something that is needed in order to solve problems numerically, as opposed to analytically) because if we can do that, then we can easily compute the solution to the initial differential equation. There are a couple of ways to do this. We give examples.

Examples of Numerical Integration Schemes:

• Explicit-Euler’s-Method: $\hat y_{n+1}=\hat y_n+hf(t_n,\hat y_n)$
• Implicit-Euler’s-Method: $\hat y_{n+1}=\hat y_{n}+hf(t_n,\hat y_{n+1})$
• Trapezoidal Method : $\hat y_{n+1}=\hat y_n+h/2(f(t_n,\hat y_n)+f(t_n,\hat y_{n+1}))$
• Notice that these method use only neighbouring discretisation points to do the estimation. It is possible to do more than this, and use more points. One method that allows you to do this is what’s known as the linear multi-step method. This is a numerical scheme of the form

$\sum_{j=-1}^{k}\alpha _j \hat y_{n-j}=h\sum _{j=-1}^k\beta _j f(t_{n-j},\hat {y}_{n-j})$

where in the same way we can specify the coefficients $\alpha_j,\,\beta_j$.
This uses $k$-steps instead of only one step to do the approximations. We won’t devote much time to this, because will not be seeing it again any time soon, but it is a really, really powerful tool, and offers great insight into numerical analysis as a field.

Motivations and Mechanisms for Residual Networks as well as Connections to Ordinary Differential Equations

More recently, researchers at found themselves with the following problem: Around 2017, it became clear that very often using deep neural networks as we know them can be quite inefficient. For example, if one had two networks –one deep, and the other shallow–then it was observed that the accuracy in the larger network would tend to saturate as you add more layers. And this meant that gradients were not efficiently computed. As a way around this the following was done.

If we define a neural network, in the usual sense, as a sequence of transformations:

$x_{k+1}=p(W_k\cdot x_k+b_k)$

where $x_0$ is the input, $p$ is a non-linear transformation and $W_k\cdot x_k +b_k$ is an affine transformation.
Then we overcome the stated problem/s by considering a model of the form:

$x_{k+1}=x_k+p(W_k\cdot x_k+ b_k)$

This is precisely what is known as a residual network.

The idea is that we would like to approximate a function $\mathcal{H}$, and instead of approximating this function we consider the function

$\mathcal{H}(x)-x\equiv \mathcal{F}(x),$

then we approximate this instead as $p(x)$ in the residual net equation. After each unit, we concatenate $\mathcal{F}(x)+x$ which is simply $\mathcal{H}(x)$.

An example of this kind of structure is given below.

This is a single residual unit.

It turns out that this formulation makes training much easier and improves accuracy for reasons we are going to explore in the next blog post.

It is also quite interesting to notice that the form

$x_{k+1}=x_k+p(W_k\cdot x_k+ b_k)$

resembles the explicit euler integration scheme for $h=1$. In fact, with the generality of the non-linear function, we can write the residual network transformation as

$x_{k+1}=x_k+hp(W_k\cdot x_k+ b_k)$

In this form, the relationship really stands out. Note that here $h$ is just a constant, and need not be the time-step, and we put it just to ease the reader’s intuition about what follows.

Establishing Neural Ordinary Differential Equations as Continuous Limits of Residual Networks

Notice that a residual network can be written as:

$x_{k+1}=x_{k}+p(x_k,\theta_k)$

where the $\theta_k$ are neural network parameters as we saw above,
and so

$x_{k+1}-x_{k}=p(x_k,\theta_k)$

This takes the form of a numerical derivative for $x_k$ where the step is $\Delta k=1$.

We might ask: What if we considered $\Delta k\to 0$, and then we added many layers?

In this case we get

$\lim_{\Delta k\to 0}\dfrac{x_{k+1}-x_k}{\Delta k}= \lim_{\Delta k\to 0}p(x_k,\theta_k)$

which we realise as

$\dfrac{dx}{dk}=p(k,x(k),\theta).$

Recall that $x_k=x_0$ the input, so this is simply a differential equation with an initial condition.

$\begin{cases} {x'}=p(k,x(k),\theta)\\ x(0)=x_0\in \mathbb{R}^n \end{cases}$

As before, we can write

$x(t)=x_0+\int_0^tp(s,x(s),\theta)ds$

Comparing Residual Nets to Neural ODE Nets in Behaviour

We can compare the fields of these models using the following picture:

This image depicts transformation behaviour for residual networks as well as neural ordinary differential equation networks.

Here the circles represent evaluation locations, and we see that a residual network defines a discrete sequence of finite transformations while an ODE network defines a vector field, which continuously transforms the state.

To summarize everything:
We saw that ordinary differential equations are creatures than can be studied and solved numerically. After this, we visited the problem of error saturation and gradient degradation in deep networks at which point we also introduced residual networks. This led us right into neural ordinary differential equations. We will spend some time on the functionality of these two methods, which will take a few blog posts in the near future
.

Resources

1. Chen. R. T. Q. et al. Neural Ordinary Differential Equations. arXiv:1806.07366v5 [cs.LG] (2019).

2. He. K. et al. Deep Residual Learning for Image Recognition. arXiv:1512.03385v1 [cs.CV] (2015).

 How clear is this post?