Recap

In the first post we introduced inverse reinforcement learning, then we stated some result on the characterisation of admissible reward functions (i.e reward functions that solve the inverse reinforcement learning problem), then on the second post we saw a way in which we proceed with solving problems, more or less, using a maximum entropy framework, and we encountered two problems:
1. It would be hard to use the method introduced if we did not know the dynamics of the system already, and
2. We have to solve the MDP in the inner loop, which may be an expensive process.

Here, we shall attempt to mitigate the challenges that we have encountered, as before, and we shall give a rather beautiful closing which shall link concepts in this space of inverse reinforcement learning to ‘general’ machine learning structures, in particular generative adversarial networks.

Inverse Reinforcement Learning with Unknown Dynamics and Possibly Higher Dimensional Spaces

As we saw previously, the maximum entropy inverse reinforcement learning approach proceeds by defining the probability of a certain trajectory under the expert as being, $p(\tau)=\dfrac{1}{Z}e^{R_\psi (\tau)},$

where $Z=\int e^{R_\psi(\tau)}d \tau.$

We mentioned that this is hard to compute in higher dimensional spaces. We overcome this by using sampling strategies as methods of obtaining estimates for $Z.$ In order to sample, we need a distribution. The optimal distribution for importance sampling, in this case, is one such that $q(\tau)\propto |e^{R_{\psi}(\tau)}|.$

Observe that this depends on $r_{\psi}$, so we cannot do much without knowing the true reward function.

In order to deal with this, we sample adaptively, instead, so we build a process such that we as we obtain better estimates for the reward function, we improve the needed distribution with the condition for importance sampling mentioned above. We introduce a method for doing this below, and this is known as guided cost learning.

Guided Cost Learning Algorithm:

Start loop

1. Given an initial policy, $\pi_1$, generate samples.
2. Update the reward function, $r_{\psi},$ using samples and demonstrations.
3. Update the policy $\pi_1\to \pi_2$ using the $r_{\psi}.$

Repeat Until Condition to End loop

In this algorithm, on the second step, the samples are used to estimate the partition function, while the demonstrations are used to update the first term of the gradient, as seen previously. At the end of this loop, one gets a better policy and a reward function, and one avoids the extra duty of completely solving the MDP by going through only one step of policy optimisation. This turns out to be much more computationally efficient than having to solve the MPD in the inner loop as outlined in the previous post, and can actually be understood as a method that updates the reward function in the inner loop of policy optimisation instead.

Here we start with the story of two children who grew up with guardians who loved art. The first child, $D$, also adopted this taste in life–becoming as invested as her guardians–while the second child, $G$, became not only interested, but could produce art himself.

One day they found themselves locked in the house together (a situation that we can no longer think is absurd to imagine) with a lot of paintings (in the basement, say) from their guardians most of which $D$ had never seen, but that $G$ had, and upon realising this they thought of playing a game (which they did play, as follows): Every day $G$ would show $D$ a painting, and if $D$ could say correctly whether this painting was painted by $G$ or was indeed from their guardians collection correctly, then she would have won, otherwise $G$ won. We assume that some guardian is present to facilitate the game (to ensure that there is honesty).

We expect that the following will happen here:
1. Initially, both $G$ and $D$ will be very bad at, respectively, reproducing paintings that look convincing enough, and $D$ at classifying the paintings that $G$ will show.
2. However, as the game proceeds: $D$ will get better at seeing which paintings are forgeries, and which ones are actually from the grand collection. When this happens, $G$ needs to step up the game, and make better forgeries. Both are able to improve what they are doing based on these interactions.

In the end, $G$ will be able to produce very convincing samples (i.e samples that look very much like those that belong to their guardians), and $D$ will be the master at telling whether an art piece is legit or a forgery in the given scope.

What we have described here is what happens in a generative adversarial network. We call $G$ the generator, and $D$ the discriminator. They are both neural networks, and usually their applications are in image generations as seen above. Given a sample of pictures, one trains a generator (which we usually care more about) to produce pictures that are convincingly similar to those that are in the samples. In recent developments, these have been used to produce videos from only a handful of pictures, which is quite remarkable (Zakharov. E. et al. (2019))

Here, we shall omit the theoretical treatment of the topic, but instead we shall state the results that are relevant to our discussion in what follows.

Mathematically, if we consider the value function as $V(G,D)$, and $D(x)$ as the probability that the sample $x$ came from some data rather than some distribution $p_g$ over which the generator is sampling. Then we would like $D$ to maximise the probability of assigning the correct label to both the samples from $G$ and the training examples. At the same time, we train $G$ to minimise $\log (1-D(G(z)))$. This is then a minimax game (Goodfellow. I. J. et al. (2014)), such that: $\min_G \max_G V(D,G)=\mathbb{E}_{x\sim p_{data}}(\log D(x))+\mathbb{E}_{z\sim p_z}(\log(1-D(G(z)))),$

where $p_z(z)$ are input noise variables over which a prior is defined.

Results:

1. For a fixed $G$, the optimal discriminator is $D_G^*(x)=\dfrac{p_{data}(x)}{p_{data}(x)+p_g(x)},$

where $p_{data}$ is the data distribution.

2. If $G$ and $D$ have enough capacity, and at each step in algorithm 1 (Goodfellow. I. J. et al. (2014)) the discriminator is allowed to reach its optimum given $G$, and $p_g$ is updated so as to improve $\mathbb{E}_{x\sim p_{data}}(\log D_G^*(x))+\mathbb{E}_{z\sim p_g}(\log(1-D_G^*(z))),$

then $p_g$ converges to $p_{data}$.

The second result is included for completeness, because we love convergence guarantees, but also as incentive for the reader to go over the paper, which I wholeheartedly recommend. We shall therefore not invest any time into unpacking this, but we need the first result in what follows.

Guided Cost Learning and Generative Adversarial Networks Ties

Recall that in guided cost learning the goal for the policy is to have samples that strongly resemble the demonstrations, while the reward function gives high reward to demonstrations and low rewards to trajectories that come from the policy (which cannot happen only when they have an uncanny resemblance to the demonstrations). This seems almost adversarial, in some sense, if we think about the feedback loop that is evident in the guided cost learning algorithm above.

We could give more direct comparisons as follows: The sampling of trajectories in guided cost learning can be viewed as being related to the sampling from a distribution (usually images) by the generator, the policy can be viewed as generator samples and the reward function as the discriminator, as hinted in the above paragraph.

In guided cost learning, we could qualify the optimal discriminator as $D^*(\tau)=\dfrac{p(\tau)}{p(\tau)+q(\tau)},$

where $q(\tau)$ is the policy distribution.

Of course, this is simply $D^*(\tau)=\dfrac{\dfrac{1}{Z}e^{R_\psi(\tau)}}{\dfrac{1}{Z}e^{R_\psi(\tau)}+q(\tau)}.$

The optimisation then also starts to resemble the generative adversarial networks, as we see from this resulting loss function. $L_D(\psi)=\mathbb{E}_{\tau\sim p}(-\log D_{\psi}(\tau))+\mathbb{E}_{\tau\sim q}(-\log(1-D_\psi(\tau))),$

which is the same as that in generative adversarial networks.

In generative adversarial networks, the loss associated with the generator is $L_G(\psi)=\mathbb{E}_{\tau\sim q}(\log(1-D_\psi(\tau))-\log D\psi(\tau)),$

which can also be written as: $\mathbb{E}_{\tau\sim q}(\log q(\tau)+\log Z-R_\psi(\tau)),$

and this leads us into a territory known as entropy regularised reinforcement learning, giving a hint that we can go the other way around.

For the purposes of our exploration, we shall cap the discussion at this point. We shall not be solving any practical problems just yet, since the github that we referenced in the last post gave an unmatched introduction to the practical component, but we shall make an appendix in the near future with this, for the sake of completeness. We hope that this series of posts has been informative, and fun to read.

Thank you.

Resources

1. Finn. C. et al. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. arXiv:1603.00448v3 [cs.LG] (2016).

2. Goodfellow. I. J. et al. Generative Adversarial Nets. arXiv:1406.2661v1 [stat.ML] (2014).

3. Zakharov. E. et al. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. arXiv:1905.08233v2 [cs.CV] (2019).

 How clear is this post?