## CRL Task 5: Learning Causal Models

We’ve now come to one of the most important questions in this theory: how can we learn causal models? Learning causal models is often extremely computationally intensive, so getting this right is crucial. In this post we develop some mathematical results that guarantee bounds on our learning. We’ll start by discussing the current state of this field in relation to causal inference and reinforcement learning.

### This Series

1. Causal Reinforcement Learning
2. Preliminaries for CRL
3. CRL Task 1: Generalised Policy Learning
4. CRL Task 2: Interventions – When and Where?
5. CRL Task 3: Counterfactual Decision Making
6. CRL Task 4: Generalisability and Robustness
7. CRL Task 5: Learning Causal Models
8. (Coming soon) CRL Task 6: Causal Imitation Learning
9. (Coming soon) Wrapping Up: Where To From Here?

## Learning Causal Models

Perhaps one of the most computationally difficult processes in the field of causal inference is that of learning underlying causal structure by algorithmically identifying cause-effect relationships. In recent years there has been a surge of interest in learning such relationships in the fields of machine learning and artificial intelligence, though it has been relatively prevalent in the social sciences for many years now (e.g.…

## CRL Task 3: Counterfactual Decision Making

In the previous blog post we discussed some theory of how to select optimal and possibly optimal interventions in a causal framework. Readers interested in decision science may find this post especially engaging. This next task applies counterfactual quantities to boost learning performance. This matters a great deal for an RL agent, whose entire learning mechanism is based on intervening in a system. But what if intervention isn’t possible? Let’s begin!

## Counterfactual Decision Making

A key feature of causal inference is its ability to deal with counterfactual queries. Reinforcement learning, by its nature, deals with interventional quantities in a trial-and-error style of learning.…

## A challenging limit

This post comes mostly from the YouTube video by BlackPenRedPen found here: https://www.youtube.com/watch?v=89d5f8WUf1Y&t=3s

This in turn comes from Brilliant.com; details and links can be found in the original video.

In this post we will have a look at a complicated-looking limit that has an interesting solution. Here it is:

$\lim_{n \rightarrow \infty} ( \frac{n!}{n^n})^{\frac{1}{n}}$

This looks pretty daunting – but we will break the solution down into sections:

• taking the logarithms and rearranging
• recognising something familiar
• finding the numerical value

Step 1: Taking the Logarithm

The first step is to take the logarithm, a generally useful trick when evaluating limits. First we assign the variable $L$ to the limit (so that we can solve for it at the end). Now let's do some algebra:

$L = \lim_{n \rightarrow \infty} ( \frac{n!}{n^n})^{\frac{1}{n}}$

$\ln(L) = \ln(\lim_{n \rightarrow \infty} ( \frac{n!}{n^n})^{\frac{1}{n}})$

Since the natural logarithm $\ln$ is a continuous function, we can move the limit outside of the function:

$\ln(L) = \lim_{n \rightarrow \infty} \ln( (\frac{n!}{n^n})^{\frac{1}{n}})$

Next we can use the logarithm laws to bring down the exponent:

$\ln(L) = \lim_{n \rightarrow \infty} \frac{1}{n} \ln(\frac{n!}{n^n})$

Alright, now that we have taken the logarithm, step 1 is complete.…
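The next sketch is a numerical sanity check on the expression $\frac{1}{n}\ln\frac{n!}{n^n}$ for increasing $n$. It uses `math.lgamma(n + 1)` for $\ln(n!)$ so that huge integers are avoided. It also evaluates the same quantity written as $\frac{1}{n}\sum_{k=1}^{n}\ln\frac{k}{n}$, which is a Riemann sum for $\int_0^1 \ln x\,dx = -1$. The values approach $-1$, which is consistent with $L = 1/e$. This is a check, not a proof.

```python
import math

def log_term(n: int) -> float:
    """(1/n) * ln(n!/n^n), using lgamma(n + 1) = ln(n!) to avoid huge integers."""
    return (math.lgamma(n + 1) - n * math.log(n)) / n

def riemann_sum(n: int) -> float:
    """The same quantity rewritten as (1/n) * sum_{k=1}^{n} ln(k/n)."""
    return sum(math.log(k / n) for k in range(1, n + 1)) / n

for n in (10, 100, 1_000, 10_000):
    value = log_term(n)
    # Both forms agree; exp(value) is the original term (n!/n^n)^(1/n).
    print(f"n={n:>6}  log form={value:.6f}  riemann={riemann_sum(n):.6f}  "
          f"term={math.exp(value):.6f}")

print(f"1/e = {1 / math.e:.6f}")
```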

Introduction

In this post we will have a look at Parrondo’s paradox. In a paper* entitled “Information Entropy and Parrondo’s Discrete-Time Ratchet”** the authors demonstrate a situation where, by switching between two losing strategies, we can create a winning strategy.

Setup

The setup to this paradox is as follows:

We have two games that we can play. If we win we gain 1 unit of wealth, and if we lose it costs us 1 unit of wealth. Game A pays out 1 unit with a probability of slightly less than 0.5, so if we play it for long enough we will end up losing.

Game B is a little more complicated in that it is defined with reference to our existing winnings. If our current level of wealth is a multiple of M we play a game where the probability of winning is slightly less than 0.1. If it is not a multiple of M, the probability of winning is slightly less than 0.75.…
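The paradox can be checked exactly rather than by simulation. The sketch below makes three assumptions, none of which come from the paper above. It takes M = 3 and a bias ε = 0.005 for "slightly less than", which are values commonly used in the literature on this paradox. It also assumes the switching rule is a fair coin flip each round between A and B. Game B's win probability depends only on wealth modulo M, so the sketch finds the stationary distribution of that Markov chain. From that distribution it computes the long-run expected gain per round.

```python
from typing import List

EPS = 0.005  # assumed size of "slightly less than"
M = 3        # assumed modulus for game B

def stationary(q: List[float], iters: int = 5000) -> List[float]:
    """Stationary distribution of (wealth mod M), where q[i] is the probability
    of winning when wealth % M == i. Iterates the lazy chain (P + I) / 2, which
    has the same stationary distribution as P but is guaranteed aperiodic."""
    m = len(q)
    pi = [1.0 / m] * m
    for _ in range(iters):
        new = [0.0] * m
        for i, qi in enumerate(q):
            new[i] += 0.5 * pi[i]                       # lazy step: stay put
            new[(i + 1) % m] += 0.5 * pi[i] * qi        # win: wealth + 1
            new[(i - 1) % m] += 0.5 * pi[i] * (1 - qi)  # loss: wealth - 1
        pi = new
    return pi

def drift(q: List[float]) -> float:
    """Long-run expected change in wealth per round (+1 for a win, -1 for a loss)."""
    pi = stationary(q)
    return sum(p * (2 * qi - 1) for p, qi in zip(pi, q))

game_a = [0.5 - EPS] * M
game_b = [(0.1 - EPS) if i == 0 else (0.75 - EPS) for i in range(M)]
# Hypothetical switching rule: each round, play A or B with probability 1/2.
mixed = [0.5 * a + 0.5 * b for a, b in zip(game_a, game_b)]

for name, q in [("A", game_a), ("B", game_b), ("random A/B", mixed)]:
    print(f"{name:>10}: expected gain per round = {drift(q):+.4f}")
```

With these assumed values, A and B each come out with a negative expected gain per round, while the random mixture comes out positive.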

## Basic Reverse Image Search Using an Autoencoder

Introduction

In this post we are going to create a simple reverse image search on the MNIST handwritten image dataset. That is to say, given any image, we want to return images that look most similar to it. To do this, we will use an autoencoder, trained using TensorFlow 2.

The dataset

The MNIST dataset is a commonly used dataset in machine learning, comprising 28-by-28 images of handwritten digits from 0 to 9. For our purposes we want our image searcher to return images of the same digit as the query image, i.e. if we input a 3 we want all the returned images to be 3s. That said, if we got, say, four 3s and one 2, that mightn’t be too bad, considering how similar 2 and 3 look. However, if we got three 3s, one 1 and a 7, we might say that the performance is not up to standard.…

## A simple introduction to causal inference

Introduction

Causal inference is a branch of Statistics that is increasing in popularity. This is because it allows us to answer questions in a more direct way than do other methods. Usually, we can make inference about association or correlation between a variable and an outcome of interest, but these are often subject to outside influences and may not help us answer the questions in which we are most interested.

Causal inference seeks to remedy this by measuring the effect on the outcome (or response variable) that we see when we change another variable (the ‘treatment’). In a sense, we are looking to reproduce the situation that we have when we do a designed experiment (with a ‘treated’ and a ‘control’ group). The goal here is to have groups that are otherwise the same (with regard to factors that might influence the outcome) but where one is ‘treated’ and the other is not.…
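A small simulation makes the point about "groups that are otherwise the same" concrete. It is a sketch under made-up assumptions. A binary factor `z` (say, pre-existing health) raises the outcome by 3 and the treatment raises it by 2. In the observational setting, `z` also makes treatment more likely. Comparing group means then mixes the treatment effect with the effect of `z`. Coin-flip assignment, as in a designed experiment, balances `z` across the groups and recovers the true effect.

```python
import random

def difference_in_means(n: int, randomised: bool, seed: int = 0) -> float:
    """Mean outcome of treated minus mean outcome of controls in simulated data."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.random() < 0.5  # hypothetical confounder, e.g. pre-existing health
        if randomised:
            t = rng.random() < 0.5  # coin flip: assignment ignores z
        else:
            t = rng.random() < (0.8 if z else 0.2)  # z makes treatment more likely
        y = 2.0 * t + 3.0 * z + rng.gauss(0.0, 1.0)  # true treatment effect is 2
        (treated if t else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)

print("observational:", round(difference_in_means(100_000, randomised=False), 2))
print("randomised:   ", round(difference_in_means(100_000, randomised=True), 2))
```

With these settings the observational comparison lands near 3.8. That is 2 from the treatment plus 3 × (0.8 - 0.2) from the imbalance in `z`. The randomised comparison lands near the true effect of 2.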

## Inverse Reinforcement Learning: Guided Cost Learning and Links to Generative Adversarial Networks

Recap

In the first post we introduced inverse reinforcement learning and stated some results on the characterisation of admissible reward functions (i.e. reward functions that solve the inverse reinforcement learning problem). In the second post we saw one way of approaching such problems using a maximum entropy framework, and we encountered two problems:
1. It would be hard to use the method introduced if we did not know the dynamics of the system already, and
2. We have to solve the MDP in the inner loop, which may be an expensive process.

Here, we shall attempt to mitigate these challenges. We shall also close with a rather beautiful link between concepts in inverse reinforcement learning and more general machine learning structures, in particular generative adversarial networks.

Inverse Reinforcement Learning with Unknown Dynamics and Possibly Higher Dimensional Spaces

As we saw previously, the maximum entropy inverse reinforcement learning approach proceeds by defining the probability of a certain trajectory under the expert as being,

$p(\tau)=\dfrac{1}{Z}e^{R_\psi (\tau)},$

where

$Z=\int e^{R_\psi(\tau)}d \tau.$

We mentioned that this is hard to compute in higher-dimensional spaces.…
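To see why $Z$ is the bottleneck, the sketch below builds a toy version with only five trajectories, so $Z$ can be computed exactly. The reward values are made up. When enumeration is impossible, $Z$ can instead be estimated by importance sampling. This means drawing trajectories from some sampling distribution $q$ and averaging $e^{R_\psi(\tau)}/q(\tau)$. In guided cost learning, $q$ is a sampling policy that is adapted during training. Here it is simply uniform.

```python
import math
import random

# Made-up rewards R_psi(tau) for a tiny set of five trajectories.
rewards = [1.0, 0.5, -0.2, 2.0, 0.0]

# Exact partition function and trajectory distribution; feasible only because
# the trajectory set is tiny and can be enumerated (the integral becomes a sum).
Z = sum(math.exp(r) for r in rewards)
p = [math.exp(r) / Z for r in rewards]

def importance_sampled_Z(num_samples: int, seed: int = 0) -> float:
    """Estimate Z = sum_tau exp(R(tau)) from samples of a uniform q(tau)."""
    rng = random.Random(seed)
    q = 1.0 / len(rewards)  # uniform sampling distribution
    total = 0.0
    for _ in range(num_samples):
        tau = rng.randrange(len(rewards))
        total += math.exp(rewards[tau]) / q  # exp(R) reweighted by 1/q
    return total / num_samples

print("exact Z:", round(Z, 4))
print("IS estimate:", round(importance_sampled_Z(200_000), 4))
print("p(tau):", [round(x, 3) for x in p])
```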

## Maximum Entropy Inverse Reinforcement Learning: Algorithms and Computation

In the previous post we introduced inverse reinforcement learning. We defined the problem associated with this field, namely reconstructing a reward function from a set of demonstrations, and we saw what the ability to do this implies. We also came across some classification results, as well as convergence guarantees for selected methods that were only referred to in the post. There were some challenges with the classification results that we discussed. Although there were attempts to deal with these, there is still quite a lot that we did not talk about.

Maximum Entropy Inverse Reinforcement Learning

We shall now introduce a probabilistic approach based on what is known as the principle of maximum entropy. This provides a well-defined, globally normalised distribution over decision sequences, while providing the same performance assurances as the previously mentioned methods. This probabilistic approach allows moderate reasoning about uncertainty in the setting of inverse reinforcement learning. Its assumptions also further limit the space in which we search for solutions, which, as we saw last time, was quite massive.…
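The core of the maximum entropy approach can be shown for a reward that is linear in trajectory features, $R_\theta(\tau)=\theta^\top f(\tau)$. Under that assumption, the gradient of the demonstrations' log-likelihood is the expert's empirical feature expectations minus the feature expectations under $p(\tau)\propto e^{R_\theta(\tau)}$. The sketch below runs plain gradient ascent on a made-up set of four enumerable trajectories. Practical implementations replace the enumeration with dynamic programming over states, or with sampling when the dynamics are unknown. At convergence, the model's feature expectations match the expert's.

```python
import math
from typing import List, Sequence

def trajectory_distribution(theta: Sequence[float],
                            features: List[Sequence[float]]) -> List[float]:
    """p(tau) proportional to exp(theta . f(tau)) over an enumerable trajectory set."""
    scores = [sum(t * f for t, f in zip(theta, feat)) for feat in features]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def expected_features(p: Sequence[float], features: List[Sequence[float]]) -> List[float]:
    """Feature expectations sum_tau p(tau) f(tau)."""
    d = len(features[0])
    return [sum(pi * feat[k] for pi, feat in zip(p, features)) for k in range(d)]

def maxent_irl(features, demo_idx, lr=0.5, iters=2000):
    """Gradient ascent on the MaxEnt log-likelihood of the demonstrations."""
    demos = [features[i] for i in demo_idx]
    f_expert = expected_features([1.0 / len(demos)] * len(demos), demos)
    theta = [0.0] * len(features[0])
    for _ in range(iters):
        f_model = expected_features(trajectory_distribution(theta, features), features)
        # gradient = expert feature expectations - model feature expectations
        theta = [t + lr * (fe - fm) for t, fe, fm in zip(theta, f_expert, f_model)]
    return theta, f_expert

# Made-up example: four trajectories described by two binary features.
features = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
theta, f_expert = maxent_irl(features, demo_idx=[0, 2, 2, 1])
f_model = expected_features(trajectory_distribution(theta, features), features)
print("theta:", [round(t, 3) for t in theta])
print("expert:", f_expert, "model:", [round(x, 4) for x in f_model])
```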

## Inverse Reinforcement Learning: The general basics

Standard Reinforcement Learning

The very basic ideas in Reinforcement Learning are usually defined in the context of Markov Decision Processes. For everything that follows, unless stated otherwise, assume that the structures are finite.

A Markov Decision Process (MDP) is a tuple $(S,A, P, \gamma, R)$ where the following is true:
1. $S$ is the set of states $s_k$ with $k\in \mathbb{N}$.
2. $A$ is the set of actions $a_k$ with $k\in \mathbb{N}$.
3. $P$ is the set of transition probabilities, where $P_{sa}$ is the distribution over next states when action $a$ is taken in state $s$.
4. $\gamma$ is the discount factor in the unit interval.
5. $R$ is the reward function, taken here as a function of the state alone, $R:S\to \mathbb{R}$, as in the formulas below.

In this context, we have policies as maps

$\pi:S\to A$,

state value functions for a policy, $\pi$, evaluated at $s_1$ as

$V^\pi(s_1)=\mathbb{E}[\sum_{i=1}^{\infty}\gamma^{i-1} R(s_i)|\pi]$,

and state action values defined as

$Q^\pi (s,a)=R(s)+\gamma \mathbb{E}_{s'\sim P_{sa}}[V^\pi (s')]$.

The optimal functions are defined as

$V^*(s)=\sup_\pi V^{\pi}(s),$

and

$Q^*(s,a)=\sup_\pi Q^\pi (s,a).$

Here we assume that we have a reward function, and this reward function is used to determine an optimal policy.
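The next sketch runs value iteration on a made-up two-state MDP to make these definitions concrete. It uses the convention in the formulas above, where the reward is written $R(s)$. Action 0 stays put, action 1 moves to the other state, and only state 1 is rewarding. Value iteration repeatedly applies $V(s)\leftarrow \max_a Q(s,a)$ with $Q$ as defined above, and it converges to $V^*$ for $\gamma<1$.

```python
from typing import List

def value_iteration(P: List[List[List[float]]], R: List[float], gamma: float,
                    tol: float = 1e-12):
    """P[s][a][s2] = probability of moving to s2 after action a in state s.
    Returns (V*, Q*, greedy policy)."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states

    def q_value(s: int, a: int) -> float:
        # Q(s, a) = R(s) + gamma * E_{s' ~ P_sa}[V(s')]
        return R[s] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))

    while True:
        new_V = [max(q_value(s, a) for a in range(n_actions)) for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    Q = [[q_value(s, a) for a in range(n_actions)] for s in range(n_states)]
    policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)]
    return V, Q, policy

# Made-up MDP: action 0 = stay, action 1 = switch state; only state 1 pays 1.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1
]
R = [0.0, 1.0]
V, Q, policy = value_iteration(P, R, gamma=0.9)
print("V*:", [round(v, 4) for v in V], "policy:", policy)
```

Staying in state 1 forever is worth $1/(1-0.9)=10$. From state 0, the best move is to switch, which is worth $0.9 \times 10 = 9$.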

## Correlation vs Mutual Information

This post is based on a (very small) part of the (dense and technical) paper Fooled by Correlation by N.N. Taleb, found at (1)

Notes on the main ideas in this post are available from Universidad de Cantabria, found at (2)

The aims of this post are 1) to introduce mutual information as a measure of similarity and 2) to show the nonlinear relationship between correlation and information by means of a relatively simple example.

Introduction

A significant part of Statistical analysis is understanding how random variables are related: how much knowledge about the value of one variable tells us about the value of another. This post will consider this issue in the context of Gaussian random variables. More specifically, we will compare correlation and mutual information and discuss the relationship between them.

Mutual Information

The Mutual Information between 2 random variables is the amount of information that one gains about a random variable by observing the value of the other.…
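For jointly Gaussian variables with correlation $\rho$, the mutual information has the well-known closed form $I(X;Y) = -\tfrac{1}{2}\ln(1-\rho^2)$, measured in nats. The short sketch below tabulates this formula to show the nonlinearity. Information grows slowly for small $|\rho|$ and diverges as $|\rho|\to 1$.

```python
import math

def gaussian_mutual_information(rho: float) -> float:
    """Mutual information (in nats) between two jointly Gaussian variables
    with correlation rho, using I = -0.5 * ln(1 - rho^2). Requires |rho| < 1."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log(1.0 - rho * rho)

for rho in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"rho={rho:<5} I={gaussian_mutual_information(rho):.4f} nats")
```

Going from $\rho = 0.5$ to $\rho = 0.9$ multiplies the correlation by less than two but the mutual information by nearly six.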