This blog post is a direct translation of a talk that was given by the author on the 17th of February 2020. The ideas was to very briefly introduce Deep Q-Learning to an audience that was familiar with the fundamental concepts of reinforcement learning. If the person reading this is not familiar with these basics, then a very great introduction can be found here: An Introduction to Reinforcement Learning. Without the additional details from the talk, one will note that this post is rather brief, and should really be used as a tool to gain an overview for the method or a gateway to relevant resources. This will not the case for posts later in the series, because the intention is to deal more with the mathematical aspect of reinforcement learning.

Basic Reinforcement Learning Notions

The idea behind reinforcement learning is that there is an agent that interacts with the environment in order to achieve a certain task. The interaction can be explained by the following picture.

This is a flow diagram of the agent-environment interaction (Francois-Lavet, V., et al. 2018) As a reminder, the primary goal in reinforcement learning is to optimise the value functions, because through this process we obtain an optimal policy.

The Bellman optimality equations are: $v_*(s)=\max_{a\in \mathcal{A}(s)}\sum_{s',r}p(s',r|s,a)(r+\gamma v_*(s')),$
and $q_*(s,a)=\sum_{s',r}p(s',r|s,a)(r+\gamma max_{a'}q_*(s',a')).$

A few things to note:
1. We know that reinforcement learning requires computing policies in order to solve problems. The process with which this is achieved can be divided into stages: policy evaluation, policy improvement and policy iteration. This chain forms what is commonly referred to as policy control.

2. On-policy methods are those methods that estimate the value of a policy while they use it for control. In off-policy methods these two functions are separate. There is a behaviour policy which is used to generate behaviour, and the policy being evaluated which is called the target policy.

3. There is class of algorithms that fall under the name Temporal Difference Learning. The methods under this umbrella are like Monte Carlo methods in the sense that they can learn directly from experience without requiring a model. They are also like Dynamic Programming methods in the sense that they update estimates based on other learned estimates without waiting for the final outcome.

Moving forward, we note that classical reinforcement learning algorithms can be classified solely based on their update rules sometimes, and here we shall use this to our advantage for brevity.

Q-Learning

One of the most important algorithms in reinforcement learning is an off-policy-temporal-difference-learning-control algorithm known as Q-learning whose update rule is the following: $Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha (R_{t+1}+\gamma\max_aQ(S_{t+1},a)-Q(S_t,A_t))$

This method is quite effective, and often robust. For example, consider the problem where we have the usual grid traversal with a cliff incorporated. Most algorithms will find a ‘safe’ path that is possibly suboptimal, while Q-learning finds the optimal algorithm. This, however, is done at the expense of occasionally falling down if we use epsilon-greedy selection. The following diagram attempts to illustrate this (source: S&B) The game changes when we sometimes have a lot of states and actions, in which we cannot simply tabulate the values as is traditionally done. In cases like this, it is appropriate to use neural networks. The question then becomes that which concerns generalising existing methods, if not creatively dreaming up new ones. One needs to be creative when generalising methods, because brute-force rarely ever works\footnote{For example, one could end up with divergent updates, degeneracy of solution or saturation of the errors. These are words that may mean different things depending on what you are doing, but usually it is not good news for whoever is implementing the methods.}. The main algorithm presented here is a generalised form of Q-learning, and is a good example which serves to show exactly the kind of fire-control which comes with generalising ideas.

Deep Q-Learning

This brings us to Deep Q-Learning (or Deep-Q Networks), from Mnih, V., et al. 2013. The advantages we have in this case are the following:
1. Stable updates: The naive update is based on a target that this constantly changing. The method limits this change, which allows for more stable convergence to targets.
2. Reducing correlations between the estimates and states which are imposed by the regular Q-learning updates when thrown into deep architectures mindlessly. This follows as the first point does. By using memory techniques, and sampling on past experiences we reduce the correlations between estimate trajectories and states.

For this method we optimise the following loss function during training: $L_i(\theta_i)=\mathbb{E}_{(s,a,r,s')\sim U(D)}\bigg(r+\gamma \max_{a'}Q(s',a';\theta_i^-)-Q(s,a;\theta_i)\bigg)^2$

The idea is the following:

The authors introduce a network that you would train using the loss function that is given above, $L_i(\theta_i)$, but add to the usual procedure the following technicalities:
1. Keep a vector of experiences over a lot of episodes, and whenever you have to do your updates sample over this set, instead of confining yourself to only updating with data obtained most recently. The sampling is usually done uniformly, as the form of the loss function suggests.
2. Keep the parameters of the target network approximator fixed for $N$ steps, then whenever you reach a multiple of this number set $\theta_i^-=\theta_i$. This is to say that you update the parameters of this approximator with your current parameters for the model for every multiple of $N$.

Here is a diagram (Francois-Lavet, V., et al. 2018) that better visualises the mechanism. Examples of applications:

1. (Mnih, V., et al. 2013): Playing Atari with Deep Reinforcement Learning.

2. (Mnih, V., et al. 2015): Human-level control through deep reinforcement learning.

3. (Rahman, MDM., et al. 2018): Implementation of Q learning and deep Q network for controlling a self balancing robot model.

Appendix: Deep Q-Learning (Mnih, V., et al. 2015)

This is the algorithm as given on the original paper, as mentioned. I have attached a PDF version of the Jupyter Notebook for the Cartpole implementation below. I could not put the code in a more interactive format, but I hope that this is helpful regardless.