Causal inference is a branch of statistics that is growing in popularity, because it lets us answer questions more directly than other methods do. Usually we can only make inferences about the association or correlation between a variable and an outcome of interest, but these associations are often subject to outside influences and may not help us answer the questions in which we are most interested.

Causal inference seeks to remedy this by measuring the effect on the outcome (or response variable) that we see when we change another variable (the ‘treatment’). In a sense, we are looking to reproduce the situation that we have when we run a designed experiment (with a ‘treated’ and a ‘control’ group). The goal here is to have groups that are otherwise the same (with regard to factors that might influence the outcome) but where one is ‘treated’ and the other is not.

The example below is a very simple demonstration of causal inference. There are a number of more flexible methods, but the key underlying concept of finding a more direct relationship between the treatment and the outcome remains unchanged.

Set up

We will consider a fictional example on a synthetic dataset. We are in a fictional country in which an outbreak of a lung disease, LD01, has just occurred. The population of the country is made up of farmers and city-dwellers (who all live in the capital, Smoketopia). We are interested in whether or not smoking causes LD01. We consult experts in smoking behaviour, and together draw up the graph (DAG) below:



This shows the causal relationships between the variables of interest. A directed arrow from one node (circle) to another shows the direction of the causal effect between the (binary) variables. The symbols in the circles indicate that smoking is our ‘treatment’ and that LD01 (L1 for short) is our ‘outcome’. We can infer the following from the graph:

  • LD01 is influenced by genetics, living in the city and smoking
  • Smoking is influenced by living in the city
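The structure of the graph can also be written down as a short list of edges. The sketch below is only an illustration: the node names and the helper functions `parents` and `common_causes` are labels invented here, not part of any library.

```python
# Edges of the DAG described above, as (cause, effect) pairs.
# The node names are illustrative labels chosen for this sketch.
EDGES = [
    ("genetics", "LD01"),
    ("city", "LD01"),
    ("smoking", "LD01"),
    ("city", "smoking"),
]


def parents(node, edges=EDGES):
    """Return the set of variables with an arrow pointing into `node`."""
    return {cause for cause, effect in edges if effect == node}


def common_causes(treatment, outcome, edges=EDGES):
    """Return the variables that directly influence both treatment and outcome."""
    return parents(treatment, edges) & parents(outcome, edges)


print(sorted(parents("LD01")))             # ['city', 'genetics', 'smoking']
print(sorted(parents("smoking")))          # ['city']
print(sorted(common_causes("smoking", "LD01")))  # ['city']
```

Living in the city is the only variable with arrows into both smoking and LD01, which is why it gets special attention in what follows.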

We use our domain knowledge to determine that both smoking and living in the city of Smoketopia will increase the probability of suffering from LD01.

Now, considering the two points above, we have a bit of a problem: we want to know the effect of smoking on developing LD01, but LD01 can also be caused by living in Smoketopia, and people who live in Smoketopia are more likely to smoke. This means that if we simply compare all the smokers with all the non-smokers, a greater proportion of the smokers will live in the city and so be at higher risk of developing LD01 regardless of their smoking. In a situation like this, the variable representing living in Smoketopia is known as a confounder. In causal inference we aim to control for confounders by conditioning on them.
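Conditioning on a confounder can be summarised with the standard adjustment formula. The symbols are shorthand introduced only for this sketch: S for smoking, C for living in Smoketopia and L for developing LD01. The formula assumes the graph drawn above is correct.

```latex
P(L = 1 \mid \mathrm{do}(S = s)) = \sum_{c \in \{0, 1\}} P(L = 1 \mid S = s,\, C = c)\, P(C = c)
```

The causal effect of interest is then the difference between this quantity at s = 1 and at s = 0. Under the graph above, genetics does not need to appear in the sum, because it influences LD01 but not smoking.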

Models and Results

In this section we consider 2 approaches to estimating the effects*:

  • first we estimate the difference between the prevalence rates of LD01 of smokers and non-smokers for the whole population, ignoring the effect of the confounder;
  • then we use the method of covariate matching to create a ‘treated’ and a ‘control’ group with equal proportions of Smoketopians, which we will use to determine the causal effect.

Approach 1:

Here we do not control for the confounder, and we find that the prevalence of LD01 in smokers is around 16%, whereas in non-smokers it is around 11%, leaving us with an estimated difference of about 5 percentage points. The output from our fitted GLM also suggested that smoking is a highly predictive feature of LD01.

However, we also note from this dataset that 61% of smokers live in the smokey city of Smoketopia, whereas only 42% of non-smokers do. This presents a problem: how do we know if it is the smokers’ smoking that is causing the problem, or whether it is just their living in the city that puts them at risk?
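To see how an imbalance like this can inflate the naive comparison, here is a minimal sketch that computes the exact group prevalences for a made-up population. Every probability in it is a hypothetical value chosen for illustration (including the assumption that risks simply add up); none of them are the values behind the figures quoted in this post.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_CITY = 0.5                         # P(lives in Smoketopia)
P_GENETIC = 0.1                      # P(carries the genetic risk factor)
P_SMOKE = {True: 0.4, False: 0.2}    # P(smoker | lives in city?)


def p_ld01(smoker, city, genetic):
    """Assumed additive risk of LD01; smoking adds 0.03 by construction."""
    return 0.05 + 0.03 * smoker + 0.08 * city + 0.10 * genetic


def group_summary(smoker):
    """Exact LD01 prevalence and city share among smokers or non-smokers."""
    weight = cases = in_city = 0.0
    for city, genetic in product([True, False], repeat=2):
        p_smoke = P_SMOKE[city] if smoker else 1 - P_SMOKE[city]
        w = ((P_CITY if city else 1 - P_CITY)
             * (P_GENETIC if genetic else 1 - P_GENETIC)
             * p_smoke)
        weight += w
        cases += w * p_ld01(smoker, city, genetic)
        in_city += w * city
    return cases / weight, in_city / weight


prev_s, city_s = group_summary(True)
prev_n, city_n = group_summary(False)
naive_difference = prev_s - prev_n
print(f"city share: smokers {city_s:.3f}, non-smokers {city_n:.3f}")
print(f"naive difference {naive_difference:.3f} vs assumed effect 0.030")
```

Under these assumed numbers the naive difference comes out at roughly 0.049, even though the effect of smoking built into `p_ld01` is only 0.03. The gap is entirely due to smokers being more likely to live in the city.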

We deal with this below.

Approach 2:

To get around the problem presented above we proceed by matching:

  • split the dataset into smokers and non-smokers
  • for each smoker from Smoketopia, find a smoker from outside of the city
    • continue this until you cannot find a ‘match’/you run out of smokers
  • for each non-smoker in Smoketopia do the same thing**

This means that we now have a (smaller) dataset made up of a treatment group (smokers) and a control group (non-smokers) which have the same proportion of residents of Smoketopia (i.e. we have controlled for the confounder).
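The matching steps above can be sketched in a few lines of Python. This is not the code used to produce the results in this post: the population is simulated with hypothetical probabilities, and `simulate`, `balance_on_city` and `prevalence` are names invented for the sketch.

```python
import random


def simulate(n, rng):
    """Draw a synthetic population; all probabilities are hypothetical."""
    people = []
    for _ in range(n):
        city = rng.random() < 0.5
        genetic = rng.random() < 0.1
        smoker = rng.random() < (0.4 if city else 0.2)
        # Assumed additive risk: smoking adds 0.03 by construction.
        risk = 0.05 + 0.03 * smoker + 0.08 * city + 0.10 * genetic
        people.append({"city": city, "smoker": smoker,
                       "ld01": rng.random() < risk})
    return people


def balance_on_city(group, rng):
    """Keep equal numbers of city and non-city residents from one group."""
    city = [p for p in group if p["city"]]
    rural = [p for p in group if not p["city"]]
    k = min(len(city), len(rural))   # stop when one side runs out
    return rng.sample(city, k) + rng.sample(rural, k)


def prevalence(group):
    """Share of the group that has LD01."""
    return sum(p["ld01"] for p in group) / len(group)


rng = random.Random(0)               # fixed seed for reproducibility
people = simulate(200_000, rng)
smokers = [p for p in people if p["smoker"]]
non_smokers = [p for p in people if not p["smoker"]]

naive = prevalence(smokers) - prevalence(non_smokers)
matched_smokers = balance_on_city(smokers, rng)
matched_non_smokers = balance_on_city(non_smokers, rng)
matched = prevalence(matched_smokers) - prevalence(matched_non_smokers)
print(f"naive difference {naive:.3f}, matched difference {matched:.3f}")
```

After balancing, exactly half of each group lives in the city, so the city can no longer explain any difference in prevalence between the two groups. The price is a smaller dataset, since unmatched individuals are discarded.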

Looking at our results now, we have an LD01 prevalence of 16% in smokers and 13.5% in non-smokers, and therefore a difference of 2.5 percentage points, which is roughly half of what we observed before. This tells us that the effect of smoking is smaller than the first approach suggested: it appeared larger there because smokers tend to live in the city. Additionally, the GLM fitted to the matched data gave a p-value that was a number of orders of magnitude larger than in the original fit.***

* for simplicity we are fitting a GLM with only the intercept and one \beta coefficient, which in terms of prediction is essentially comparing the means of our smoker and non-smoker groups

**note that the ‘matching’ done here does not mean that we now have a dataset with ‘matched pairs’ necessarily

***the exact p-values are not important for this discussion
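To make the first footnote concrete, here is a small sketch of why such a model reduces to comparing group means. It assumes a binomial GLM with a logit link, which the post does not state explicitly. With an intercept and a single binary predictor, the maximum likelihood fit reproduces the two group prevalences exactly, so the coefficients have a closed form. The function name `fit_binary_logistic` is invented for this sketch.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def expit(x):
    """Inverse of logit."""
    return 1 / (1 + math.exp(-x))


def fit_binary_logistic(p_control, p_treated):
    """Closed-form coefficients for logit P(LD01) = b0 + b1 * smoker.

    Assumes a logit link and a single binary predictor, in which case the
    fitted probabilities equal the observed group prevalences.
    """
    b0 = logit(p_control)
    b1 = logit(p_treated) - b0
    return b0, b1


# Approximate prevalences quoted under Approach 1 (non-smokers, smokers).
b0, b1 = fit_binary_logistic(0.11, 0.16)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"fitted prevalences: {expit(b0):.2f}, {expit(b0 + b1):.2f}")
```

The intercept encodes the prevalence among non-smokers and the coefficient encodes the shift for smokers on the log-odds scale, so the model's predictions are just the two group means.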

Extensions and resources

This post has served to introduce causal inference, but has not gone into any details on how conditional independence relationships work in DAGs or indeed why the conditioning that we have done is acceptable. As with most things, there are a number of online resources which can be helpful, listed below alongside extensions that can be made:

  • Different techniques of controlling for confounders, such as:
    • propensity score matching – matching probabilities of data points of being in the ‘treatment’ group
    • inverse probability of treatment weighting, where underrepresented data points are given a higher weighting
    • the use of Instrumental Variables, which is more common in observational studies
  • extending to numerical variables, rather than binary variables
  • fitting more complicated models than a simple comparison-of-means
  • a tutorial on CI, including an overview of Do-Calculus (which underlies the theory used in this post):
  • an interesting paper on fairness in Machine Learning that uses Causal Bayesian Networks (CBNs) as a method of understanding how unfair patterns in data can be avoided:
  • A rigorous explanation of DAGs can be found in Professor David Barber’s Bayesian Reasoning and Machine Learning
  • An online tool that is helpful in creating and understanding the (conditional) (in)dependence relationships between variables in a DAG can be found here:

