Standard Reinforcement Learning
The very basic ideas in Reinforcement Learning are usually defined in the context of Markov Decision Processes. For everything that follows, unless stated otherwise, assume that the structures are finite.
A Markov Decision Process (MDP) is a tuple where the following is true:
1. is the set of states with .
2. is the set of actions with .
3. is the matrix of transition probabilities for taking action given state .
4. is the discount factor in the unit interval.
5. is defined as the reward function, and is taken as a function from .
In this context, we have policies as maps
state value functions for a policy, , evaluated at as
and state action values defined as
The optimal functions are defined as
Here we assume that we have a reward function, and this reward function is used to determine an optimal policy. It is interesting, however, to ask if we can determine, given a policy, a reward function that generates that policy, and this is the essence of Inverse Reinforcement Learning.
Inverse Reinforcement Learning
The problem was stated formally in (Russell, 1998) in the following way:
Suppose that one is given:
1. Measurements of an agent’s behaviour over time in a variety of circumstances.
2. If needed, suppose that measurements of sensory inputs are also given.
3. If available, a model of the environment too.
Is it possible to determine the reward function being optimised?
One motivation behind Inverse Reinforcement Learning is that humans learn a lot of things by imitation. For instance, if you did not know that tennis (as a sport) existed, and you came across someone with a tennis racket playing against a wall, like this:
Image source: Portable Tennis Trainer is Changing the Game
Then it would be quite natural for you (as well as most humans) to gather that all this person is trying to do is keep the ball in the air, with some allowed bounces– as we know is done in tennis. Once you realise this, you would have defined an implicit concept of a reward that is being optimised, and this is exactly what you need to be able to play the game yourself. This is all we want from our agents in Inverse Reinforcement Learning.
We are asking the question of whether it is possible for agents to infer reward functions given some actions from some a demonstrative source, because if an agent has this ability then it opens other doors for teaching machines or having them learn on their own. This means that whenever there is an expert performing an action, the agent can reason through the actions of the expert, and can attempt to perform these actions itself in a very natural way (sometimes given some constraints)
A comparison of Reinforcement Learning and Inverse Reinforcement Learning in a diagram
Image source: Inverse Reinforcement Learning
The name of the game from this point:
Inference of reward functions from demonstrations.
1. If we have an MPD and a policy , then for all , it is the case that and satisfy
2. If we have an MDP and policy as in 1., then the policy, , is optimal if and only if for all ,
1. and 2. respectively give us the Bellman equations as well as their optimality.
We shall, without loss of generality write , and we can rename where necessary. This is to simplify computations.
It is posssible to show that if is finite, and , then transition probability matrices , and a discount factor are given, then the policy given by is optimal if and only if for all , the reward satisfies:
where we are assuming the comparison works on a componentwise level. This result is from (Ng, Russell. 2000. )
From this we get the following:
1. One can show that this condition is necessary and sufficient for to be a unique optimal policy.
2. For finite MDPs, this result gives a characterisation of all reward functions that are solutions to the inverse reinforcement learning problems.
The problem is that the zero reward function is admissible, i.e if the reward is the same no matter which action is taken then any policy is optimal. Demanding uniqueness would remedy this, but this move is unsatisfactory since then some reward functions arbitrarily close to zero would be admissible. Moreover, in many contexts, it seems like it would be the case that there would be many reward functions that are admissible, and it is expensive to evaluate the goodness of a learned reward function.
A way forward rephrases the problem in a Linear Programming setting wherein we obtain an efficiently solvable problem.
There are a couple of key observations to be made:
1. A way to choose the reward function is to choose those that satisfy the characterisation condition given above (and ) for all , and one that makes a single step deviation away from as costly as possible, i.e we want one that maximises where in the shorthand notation from above.
2. If we believe that solutions with lower rewards are simpler, then we might add a regularisation term where is an adjustable coefficient that balances maximising the above expression and doing this in a simple way.
Putting these details together, with some fine print acting as the glue, one gets that the optimisation problem we have in mind is actually the following:
where denotes the th row of .
This formulation can then be formulated as a linear programme and can be solved efficiently.
The paper (Ng, Russell. 2000. ) goes on to give linear function approximations for large spaces, and introduces a more realistic method where trajectories are sampled. This gives a bootstrapping procedure, which essentially formed the basis for the applications.
Another interesting paper is (Ng. Abbel. 2001) which has the following content.
The authors assume that an expert is trying to optimise an unknown reward function that can be expressed as a linear combination of known features, and show that there is an algorithm that will find a policy that performs as well as the expert. They present what they call an apprenticeship learning algorithm, which one can find on the papers if they wish. We are more interested in the results for now, and we will go through the algorithms at a later stage.
In order to proceed, we define some vector of features
such that there is some reward function
1. Let a MDP without the usual reward function, and an be given. Then the apprenticeship learning algorithm will terminate with after at most
where is a metric of how well the agent is doing compared to the expert.
2. Let a MDP without the usual reward function, , and an be given. Suppose that the apprenticeship learning algorithm is run using an estimate of from Monte-Carlo samples. In order to ensure that with probability at least the algorithm terminates after at most a number of iterations given in 1., and outputs a policy so that for any true reward
it suffices that
These results give us convergence guarantees in a computational setting, at least, which is something that we would, of course, be interested in knowing.
The next post shows some mechanics behind the method, and we shall attempt to understand things better from both a mathematical perspective as well as a computational one.
1. Ng. A. Y. and Russel. S. Algorithms for Inverse Reinforcement Learning. International Conference on Machine Learning (2000).
2. Abbeel. P. and Ng. A. Y. Apprenticeship Learning via Inverse Reinforcement Learning. International Conference on Machine Learning (2004).