## Inverse Reinforcement Learning: Guided Cost Learning and Links to Generative Adversarial Networks

Recap

In the first post we introduced inverse reinforcement learning and stated some results characterising admissible reward functions (i.e. reward functions that solve the inverse reinforcement learning problem). In the second post we saw one way of solving such problems, more or less, using a maximum entropy framework, and we encountered two problems:
1. It would be hard to use the method introduced if we did not know the dynamics of the system already, and
2. We have to solve the MDP in the inner loop, which may be an expensive process.

Here, we shall attempt to mitigate the challenges encountered so far, and we shall give a rather beautiful closing that links concepts in inverse reinforcement learning to more general machine learning structures, in particular generative adversarial networks.

Inverse Reinforcement Learning with Unknown Dynamics and Possibly Higher Dimensional Spaces

As we saw previously, the maximum entropy inverse reinforcement learning approach proceeds by defining the probability of a certain trajectory under the expert as being,

$p(\tau)=\dfrac{1}{Z}e^{R_\psi (\tau)},$

where

$Z=\int e^{R_\psi(\tau)}d \tau.$

We mentioned that this is hard to compute in higher dimensional spaces.…
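To see concretely why $Z$ becomes intractable, consider the following small sketch (a toy deterministic chain MDP of my own construction, not from the post): it enumerates every trajectory, which already means $|A|^T$ terms for horizon $T$, and this enumeration is exactly what we cannot afford in higher dimensional spaces.

```python
# Illustrative sketch of the maximum-entropy trajectory distribution
# p(tau) = exp(R(tau)) / Z on a toy deterministic chain MDP.
# All names here (N_STATES, HORIZON, reward, ...) are hypothetical.
import itertools
import math

N_STATES = 4          # states 0..3 on a chain
HORIZON = 6           # trajectory length
ACTIONS = (-1, +1)    # step left or step right

def reward(s):
    # hypothetical state reward: higher states are preferred
    return float(s)

def rollout(actions, s0=0):
    # deterministic dynamics: clip at the ends of the chain
    states = [s0]
    for a in actions:
        states.append(min(N_STATES - 1, max(0, states[-1] + a)))
    return states

# Enumerate every action sequence: |A|^HORIZON trajectories in total,
# which is what makes computing Z exactly infeasible at scale.
trajs = list(itertools.product(ACTIONS, repeat=HORIZON))
returns = [sum(reward(s) for s in rollout(acts)) for acts in trajs]
Z = sum(math.exp(R) for R in returns)
probs = [math.exp(R) / Z for R in returns]

print(len(trajs))   # 2**6 = 64 trajectories for this tiny problem already
```

Even here a six-step horizon with two actions gives 64 trajectories; a continuous state space or a long horizon makes the exact sum (or integral) over trajectories hopeless, which motivates the sampling-based estimates discussed below.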

## Maximum Entropy Inverse Reinforcement Learning: Algorithms and Computation

In the previous post we introduced inverse reinforcement learning. We defined the problem associated with this field, which is that of reconstructing a reward function from a set of demonstrations, and we saw what the ability to do this implies. In addition, we came across some classification results as well as convergence guarantees for selected methods that were only briefly referred to in that post. There were some challenges with the classification results that we discussed, and although there were attempts to deal with these, there is still quite a lot that we did not talk about.

Maximum Entropy Inverse Reinforcement Learning

We shall now introduce a probabilistic approach based on what is known as the principle of maximum entropy, which provides a well defined, globally normalised distribution over decision sequences, while providing the same performance assurances as previously mentioned methods. This probabilistic approach allows principled reasoning about uncertainty in the inverse reinforcement learning setting, and its assumptions further limit the space in which we search for solutions, which, as we saw last time, is quite massive.…
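The principle of maximum entropy itself is easy to demonstrate in miniature. The sketch below (my own illustration, not the post's code) shows that among all distributions over a small outcome set with a fixed mean, the exponential-family (Gibbs) distribution has the largest entropy; this is the same form that the maximum entropy approach assigns to trajectory distributions.

```python
# Illustrative sketch of the principle of maximum entropy: among all
# distributions over {0,1,2,3} with a fixed mean, the Gibbs form
# p(x) proportional to exp(lambda * x) maximises entropy.
import math

XS = [0, 1, 2, 3]
TARGET_MEAN = 2.0

def gibbs(lmbda):
    w = [math.exp(lmbda * x) for x in XS]
    Z = sum(w)
    return [wi / Z for wi in w]

def mean(p):
    return sum(p_i * x for p_i, x in zip(p, XS))

def entropy(p):
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

# Solve for the multiplier lambda by bisection so the Gibbs mean hits the target.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(gibbs(mid)) < TARGET_MEAN:
        lo = mid
    else:
        hi = mid
p_maxent = gibbs((lo + hi) / 2)

# Any other distribution with the same mean has lower entropy, e.g. this one:
p_other = [0.1, 0.1, 0.5, 0.3]   # mean = 0.1 + 2*0.5 + 3*0.3 = 2.0
print(entropy(p_maxent), entropy(p_other))
```

The same logic, lifted from outcomes to whole trajectories and from a mean constraint to matching expected feature counts of the demonstrations, gives the maximum entropy inverse reinforcement learning distribution.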

## Inverse Reinforcement Learning: The general basics

Standard Reinforcement Learning

The very basic ideas in Reinforcement Learning are usually defined in the context of Markov Decision Processes. For everything that follows, unless stated otherwise, assume that the structures are finite.

A Markov Decision Process (MDP) is a tuple $(S,A, P, \gamma, R)$ where the following is true:
1. $S$ is the set of states $s_k$ with $k\in \mathbb{N}$.
2. $A$ is the set of actions $a_k$ with $k\in \mathbb{N}$.
3. $P$ is the set of transition probability matrices $P_{sa}$, where $P_{sa}(s')$ is the probability of moving to state $s'$ after taking action $a$ in state $s$.
4. $\gamma$ is the discount factor in the unit interval.
5. $R$ is defined as the reward function, in general a map $S\times A\to \mathbb{R}$; in what follows we shall write it as a function of the state alone, $R(s)$.

In this context, we have policies as maps

$\pi:S\to A$,

state value functions for a policy, $\pi$, evaluated at $s_1$ as

$V^\pi(s_1)=\mathbb{E}\left[\sum_{i=0}^{\infty}\gamma ^i R(s_i)\,\middle|\,\pi\right]$,

and state action values defined as

$Q^\pi (s,a)=R(s)+\gamma \mathbb{E}_{s'\sim P_{sa}}[V^\pi (s')]$.
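These two definitions can be checked numerically. The sketch below (a toy two-state MDP of my own construction, not from the post) iterates the Bellman expectation backup to evaluate $V^\pi$ for a fixed policy, then computes $Q^\pi$ from it:

```python
# Illustrative sketch: policy evaluation on a hypothetical two-state MDP.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a][s'] : transition probabilities P_{sa}(s'); R[s] : state reward R(s)
P = {0: {0: [0.8, 0.2], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = [0.0, 1.0]
pi = {0: 1, 1: 1}    # a fixed deterministic policy, a map S -> A

# Iterate V(s) <- R(s) + gamma * E_{s' ~ P_{s,pi(s)}}[V(s')] to convergence.
V = [0.0, 0.0]
for _ in range(1000):
    V = [R[s] + GAMMA * sum(p * V[s2] for s2, p in enumerate(P[s][pi[s]]))
         for s in STATES]

# Q^pi(s, a) = R(s) + gamma * E_{s' ~ P_{sa}}[V^pi(s')]
Q = {(s, a): R[s] + GAMMA * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
     for s in STATES for a in ACTIONS}

print(V)
# At the fixed point, Q^pi(s, pi(s)) coincides with V^pi(s).
print(abs(Q[(0, pi[0])] - V[0]) < 1e-9)
```

The last line reflects the consistency between the two definitions: taking the expectation of $V^\pi$ under the action the policy itself selects recovers $V^\pi$.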

The optimal functions are defined as

$V^*(s)=\sup_\pi V^{\pi}(s),$

and

$Q^*(s,a)=\sup_\pi Q^\pi (s,a).$

Here we assume that we have a reward function, and this reward function is used to determine an optimal policy.
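The forward direction can be sketched as follows (again on a hypothetical two-state MDP of my own construction): given a known reward function, value iteration applies the Bellman optimality backup to obtain $V^*$, from which $Q^*$ and a greedy optimal policy follow.

```python
# Illustrative sketch: value iteration on a hypothetical two-state MDP,
# showing how a known reward function determines an optimal policy.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a][s'] : transition probabilities P_{sa}(s'); R[s] : state reward R(s)
P = {0: {0: [0.9, 0.1], 1: [0.3, 0.7]},
     1: {0: [0.6, 0.4], 1: [0.0, 1.0]}}
R = [0.0, 1.0]

# Bellman optimality backup: V*(s) = R(s) + gamma * max_a E_{s'~P_{sa}}[V*(s')]
V = [0.0, 0.0]
for _ in range(1000):
    V = [R[s] + GAMMA * max(sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                            for a in ACTIONS)
         for s in STATES]

Q = {(s, a): R[s] + GAMMA * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
     for s in STATES for a in ACTIONS}
# An optimal policy is greedy with respect to Q*: pick argmax_a Q*(s, a).
pi_star = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(pi_star)
```

Inverse reinforcement learning runs this pipeline backwards: instead of computing a policy from a given $R$, we observe (near-)optimal behaviour and try to recover an $R$ that explains it.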