In the previous post we introduced inverse reinforcement learning. We defined the problem that is associated with this field, which is that of reconstructing a reward function given a set of demonstrations, and we saw what the ability to do this implies. In addition to this, we also saw came across some classification results as well as convergence guarantees from selected methods that were simply referred to in the post. There were some challenges with the classification results that we discussed, and although there were attempts to deal with these, there is still quite a lot that we did not talk about.

Maximum Entropy Inverse Reinforcement Learning

We shall now introduce a probabilistic approach based on what is known as the principle of maximum entropy, and this provides a well defined globally normalised distribution over decision sequences, while providing the same performance assurances as previously mentioned methods. This probabilistic approach allows moderate reasoning about uncertainty in the setting inverse reinforcement learning, and the assumptions further limits the space in which we search for solutions which we saw, last time, was quite massive. A rather important point of concern is that demonstrations are prone to noise and imperfect behavior. The presented approach provides a principled method of dealing with this uncertainty.

In what follows we state the context in which we are working, the assumptions that characterise the ‘maximum entropy inverse reinforcement learning’ approach, then we show the algorithm as well as how to compute the elements that will be new to most readers who are starting out in exploring the field (inverse reinforcement learning).

We have a demonstrator, as before, whose behaviour that will be characterised by trajectories $\tau$ of states $s_i$ and $a_i$, i.e $\tau=\cup_{j}\{s_j,a_j\}.$

We are going to assume that everything is finite once again to simplify matters in both notation as well as the concepts themselves.

Define, $R_{\psi}(\tau)=\sum_{t}r_{\psi}(s_t,a_t),$

where this $\psi$ is a parameterisation of the the usual reward function $r_{\psi}$, and $R_\psi (\tau)$ is the total reward along a trajectory.

Then define $D:{\tau_i}\sim \pi^*,$

to be the set of expert demonstrations.

The assumption is that the learner observes these demonstrations, and is trying to model or imitate the demonstrator.

Maximum Entropy Formulation

The maximum entropy approach essentially moves forward by defining the probability of a certain trajectory under the expert as being, $p(\tau)=\dfrac{1}{Z}e^{R_\psi (\tau)}.$

This means that high reward trajectories are more likely to be sampled from the expert, and low reward trajectories are less likely.

The inference of the reward function in this case then essentially comes from the maximisation of the log-likelihood of the set of demonstrations with respect to the parameters of the reward function, i.e $max_{\psi}\sum_{\tau}\log (p_{r_{\psi}})$

In this formulation, $Z=\int e^{R_\psi(\tau)}d \tau,$

and evaluating this integral, also known as the partition function, turns out to be one of the harder things to do in high dimensional spaces. Next we give the algorithm for this approach.

Maximum Entropy Reinforcement Learning Algorithm:
1. Initialise $\psi$, and get a set of demonstrations $D.$
2. Obtain an optimal policy, $\pi(a|s)$, with regards to $r_\psi.$
3. Obtain $p(s|\psi)$.
4. Compute the gradients $\nabla_\psi L(\psi)$.
5. Update $\psi$ with one gradient step using the gradients.

Here $p(s|\psi)$ is the state visitation frequency, i.e the probability of visiting state $s$ under the demonstrative policy for a $\psi$ parameterised reward function. This function is important in the computation, and so we would like a way to go about recovering it.

In order to do this, we define $\mu_t(s)$ as the probability of visiting the state $s$ at time $t$, so that $\mu_1(s)=p(s_1=s)$,

and for $t\in \{1,\cdots,T\}$ $\mu_{t+1}(s)=\sum_{a\in A}\sum _{s'\in S}\mu(s')\pi(a|s')p(s|s',a)$,

where the policy $\pi$ needs to be optimal or near-optimal.

This means that $p(s|\psi)=\dfrac{1}{T}\sum_{t}\mu_t(s).$

Looking into the business of inferences, we have the following situation. Let $M$ be the number of demonstrations. Then observe that \begin{aligned} max_{\psi}L(\psi)&=\sum_{\tau\in D}\log( p_{r_{\psi}}(\tau))\\ &=\sum_{\tau\in D} \log \dfrac{1}{Z}e^{R_{\psi}(\tau)}\\ &=\sum _{\tau\in D}R_\psi(\tau)-M \log Z\\ &=\sum_{\tau\in D}R_\psi(\tau)-M \log \sum_{\tau \in D} e^{R_\psi(\tau)}\\ \end{aligned}

Taking the usual gradients, we obtain \begin{aligned} \nabla_\psi {L(\psi)}&=\sum_{\tau\in D}\bigg( \dfrac{dR_{\psi}(\tau)}{d\psi}-\dfrac{M}{\sum_{\tau\in D}e^{R_{\psi}(\tau)}}\sum_{\tau\in D}e^{R_\psi(\tau)}\dfrac{dR_\psi(\tau)}{d\psi}\bigg)\\ \end{aligned}

Observe, once again, that the second term can be simplified as follows: \begin{aligned} \dfrac{M}{\sum_{\tau\in D}e^{R_{\psi}(\tau)}}\sum_{\tau\in D}e^{R_\psi(\tau)}\dfrac{dR_\psi(\tau)}{d\psi}&=M\sum_{\tau\in D}\dfrac{e^{R_\psi(\tau)}}{{\sum_{\tau\in D}e^{R_{\psi}(\tau)}}}\dfrac{dR_\psi(\tau)}{d\psi}\\ &=M\sum_{\tau\in D}p(\tau|\psi)\dfrac{dR_\psi(\tau)}{d\psi}\\ &=M\sum_{s\in S} p(s|\psi)\dfrac{dr_\psi(s)}{d\psi}\\ \end{aligned}

All of this allows us to compute $\psi$ as required, which gets us to a reward function as needed.

One might notice that, in some sense, the definition of the probability distributions of trajectories that we have given carries a deterministic element to it, and we can do more than this (see (Ziebert. B. D., et al. 2008)).

Here, it is clear that one has to solve for $\pi$ in the inner loop in order to find the state visitation frequencies, so one has to know the dynamics of whatever context they work on, and has to work in a low dimensional space in order to be able to compute the policy and visitation frequencies at every iteration. In the next post, we shall talk about how one would go about removing this as an issue, and find a nice link between methods in inverse reinforcement learning and the generative adversarial networks.

If one would like to, immediately, see a practical implementation of these ideas, then I would suggest visiting: Max-Entropy-Git-Notebook .
Depending on how we do with time, given what would preferably be covered as far as concepts go, we might see very specific applications at the end.

Resources:

1. Ziebart. B. D. et al. Maximum Entropy Inverse Reinforcement Learning. Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008).

2. Wulfmeier. W. et al. Maximum Entropy Deep Inverse Reinforcement Learning. arXiv:1507.04888v3 [cs.LG] (2016).

 How clear is this post?