Standard Reinforcement Learning

The very basic ideas in Reinforcement Learning are usually defined in the context of Markov Decision Processes. For everything that follows, unless stated otherwise, assume that the structures are finite.

A Markov Decision Process (MDP) is a tuple $(S,A, P, \gamma, R)$ where the following is true:
1. $S$ is the set of states $s_k$ with $k\in \mathbb{N}$.
2. $A$ is the set of actions $a_k$ with $k\in \mathbb{N}$.
3. $P$ is the matrix of transition probabilities for taking action $a_k$ given state $s_j$.
4. $\gamma$ is the discount factor in the unit interval.
5. $R$ is defined as the reward function, and is taken as a function from $A\times S\to \mathbb{R}$.

In this context, we have policies as maps

$\pi:S\to A$,

state value functions for a policy, $\pi$, evaluated at $s_1$ as

$V^\pi(s_1)=\mathbb{E}[\sum_{i=0}\gamma ^i R(s_i)|\pi]$,

and state action values defined as

$Q^\pi (s,a)=R(s)+\gamma \mathbb{E}_{s'\sim P_{sa}}[V^\pi (s')]$.

The optimal functions are defined as

$V^*(s)=\sup_\pi V^{\pi}(s),$

and

$Q^*(s,a)=\sup_\pi Q^\pi (s,a).$

Here we assume that we have a reward function, and this reward function is used to determine an optimal policy. It is interesting, however, to ask if we can determine, given a policy, a reward function that generates that policy, and this is the essence of Inverse Reinforcement Learning.

Inverse Reinforcement Learning

The problem was stated formally in (Russell, 1998) in the following way:

Suppose that one is given:
1. Measurements of an agent’s behaviour over time in a variety of circumstances.
2. If needed, suppose that measurements of sensory inputs are also given.
3. If available, a model of the environment too.

Is it possible to determine the reward function being optimised?

One motivation behind Inverse Reinforcement Learning is that humans learn a lot of things by imitation. For instance, if you did not know that tennis (as a sport) existed, and you came across someone with a tennis racket playing against a wall, like this:

Image source: Portable Tennis Trainer is Changing the Game

Then it would be quite natural for you (as well as most humans) to gather that all this person is trying to do is keep the ball in the air, with some allowed bounces– as we know is done in tennis. Once you realise this, you would have defined an implicit concept of a reward that is being optimised, and this is exactly what you need to be able to play the game yourself. This is all we want from our agents in Inverse Reinforcement Learning.

We are asking the question of whether it is possible for agents to infer reward functions given some actions from some a demonstrative source, because if an agent has this ability then it opens other doors for teaching machines or having them learn on their own. This means that whenever there is an expert performing an action, the agent can reason through the actions of the expert, and can attempt to perform these actions itself in a very natural way (sometimes given some constraints)

A comparison of Reinforcement Learning and Inverse Reinforcement Learning in a diagram

Image source: Inverse Reinforcement Learning

The name of the game from this point:

Inference of reward functions from demonstrations.

Recall that:

1. If we have an MPD $(S,A, P,\gamma, R)$ and a policy $\pi : S\to A$, then for all $s,a\in S, A$, it is the case that $V^\pi$ and $Q^\pi$ satisfy

a)

$V^\pi (s)=R(s)+\gamma \sum_{s'}P_{s\pi(s)}(s')V^\pi(s')$

b)

$Q^\pi (s,a)=R(s)+\gamma \sum_{s'}P_{sa}(s')V^\pi(s')$

2. If we have an MDP and policy as in 1., then the policy, $\pi$, is optimal if and only if for all $s\in S$,

$\pi (s)\in argmax _{a\in A}Q^\pi (s,a)$.

1. and 2. respectively give us the Bellman equations as well as their optimality.

Moving on,

We shall, without loss of generality write $\pi (s)=a_1$, and we can rename where necessary. This is to simplify computations.

It is posssible to show that if $S$ is finite, and $A=\{a_1,\cdots,a_k\}$, then transition probability matrices $\{P_a\}$, and a discount factor $\gamma \in (0,1)$ are given, then the policy $\pi$ given by $\pi (s)=a_1$ is optimal if and only if for all $a=a_2,\cdots,a_k$, the reward satisfies:

$(P_{a_1}-P_{a})(I-\gamma P_{a_1})^{-1}R\geq 0$

where we are assuming the comparison $\geq$ works on a componentwise level. This result is from (Ng, Russell. 2000. )

From this we get the following:
1. One can show that this condition is necessary and sufficient for $\pi =a_1$ to be a unique optimal policy.
2. For finite MDPs, this result gives a characterisation of all reward functions that are solutions to the inverse reinforcement learning problems.

The problem is that the zero reward function is admissible, i.e if the reward is the same no matter which action is taken then any policy is optimal. Demanding uniqueness would remedy this, but this move is unsatisfactory since then some reward functions arbitrarily close to zero would be admissible. Moreover, in many contexts, it seems like it would be the case that there would be many reward functions that are admissible, and it is expensive to evaluate the goodness of a learned reward function.

A way forward rephrases the problem in a Linear Programming setting wherein we obtain an efficiently solvable problem.
There are a couple of key observations to be made:
1. A way to choose the reward function is to choose those that satisfy the characterisation condition given above (and $|R(s)|\leq R_{\text{max}}$ ) for all $s\in S$, and one that makes a single step deviation away from $\pi$ as costly as possible, i.e we want one that maximises $\sum_{s}\big (Q^\pi(s,a_1)-\max_{a\in A\backslash a_1}Q^\pi (s,a)$ where $\pi =a_1$ in the shorthand notation from above.

2. If we believe that solutions with lower rewards are simpler, then we might add a regularisation term $-\lambda ||R||_1$ where $\lambda$ is an adjustable coefficient that balances maximising the above expression and doing this in a simple way.

Putting these details together, with some fine print acting as the glue, one gets that the optimisation problem we have in mind is actually the following:

$\max \sum_{i=1}^N \min _{a\in A\backslash a_1}\{(P_{a_1}(i)-P_a(i))(I-\gamma P_{a_1})^{-1}R\}-\gamma ||R||_1$

such that

$(P_{a_1}-P_{a})(I-\gamma P_{a_1})^{-1}R\geq 0 \forall a\in A\backslash a_1$

and

$|R_i|\geq R_{\max},\,i\in \{1,\cdots, N\}$

where $P_a(i)$ denotes the $i$th row of $P_a$.

This formulation can then be formulated as a linear programme and can be solved efficiently.

The paper (Ng, Russell. 2000. ) goes on to give linear function approximations for large spaces, and introduces a more realistic method where trajectories are sampled. This gives a bootstrapping procedure, which essentially formed the basis for the applications.

Another interesting paper is (Ng. Abbel. 2001) which has the following content.

The authors assume that an expert is trying to optimise an unknown reward function that can be expressed as a linear combination of known features, and show that there is an algorithm that will find a policy that performs as well as the expert. They present what they call an apprenticeship learning algorithm, which one can find on the papers if they wish. We are more interested in the results for now, and we will go through the algorithms at a later stage.

In order to proceed, we define some vector of features

$\phi : S\to [0,1]^k$

such that there is some $\textit{true}$ reward function

$R^*(s)=w^*\cdot \phi(s)$,

where $w^*\in \mathbb{R}^k$.

1. Let a MDP without the usual reward function, $\phi$ and an $\epsilon>0$ be given. Then the apprenticeship learning algorithm will terminate with $t^{(i)}\leq \epsilon$ after at most

$n=\mathcal{O}\bigg(\dfrac{k}{(1-\gamma)^2\epsilon^2}log\dfrac{k}{(1-\gamma)\epsilon}\bigg)$

where $t^{(i)}$ is a metric of how well the agent is doing compared to the expert.

2. Let a MDP without the usual reward function, $\phi$, $\delta>0$ and an $\epsilon>0$ be given. Suppose that the apprenticeship learning algorithm is run using an estimate of from $m$ Monte-Carlo samples. In order to ensure that with probability at least $1-\delta$ the algorithm terminates after at most a number of iterations $n$ given in 1., and outputs a policy $\tilde{\pi}$ so that for any true reward

$R^*(s)=w^{*T}\phi(s)$

$\big( ||w^*||_1\leq 1\big)$ we have

$\mathbb{E}(\sum _{t=0}\gamma ^tR^*(s_t)|\tilde{\pi})\geq \mathbb{R}(\sum_{t=0}\gamma ^tR^*(s_t)|\pi_E)-\epsilon$

it suffices that

$m\geq \dfrac{2k}{\epsilon^2(1-\gamma)^2}\log \dfrac{2k}{\delta}.$

These results give us convergence guarantees in a computational setting, at least, which is something that we would, of course, be interested in knowing.

The next post shows some mechanics behind the method, and we shall attempt to understand things better from both a mathematical perspective as well as a computational one.

Resources:

1. Ng. A. Y. and Russel. S. Algorithms for Inverse Reinforcement Learning. International Conference on Machine Learning (2000).

2. Abbeel. P. and Ng. A. Y. Apprenticeship Learning via Inverse Reinforcement Learning. International Conference on Machine Learning (2004).

 How clear is this post?