A key consideration when analysing stratified data is how the behaviour of each category differs and how these differences might influence overall observations about the data. For example, a data set might contain one large category that dictates the overall behaviour, or a category whose statistics differ significantly from the others and skew the overall numbers. It is important to be aware of these features, and to look for them, to avoid drawing erroneous conclusions from your analysis. Context, knowledge of the source of the data and a careful analysis can help prevent this. Simpson’s paradox is an interesting result of some of these effects.

The Paradox

Simpson’s paradox occurs in statistics when a trend that is observed in each of several groups disappears, or is reversed, when the groups are combined.

Looking only at the overall data might therefore lead us to one conclusion, while looking at the grouped data might lead us to a different one. This means that we must look at the whole story that the data tells us.

An Example

Say we have a town with a population of 200, made up of 100 men and 100 women. We are interested in the rate of a certain disease that is more common among women than among men: in year 1 the rates of disease are 2% for men and 20% for women. This gives us an overall rate of $\frac{20+2}{200} = 11\%$. In year 2, the overall rate of the disease decreases to 9.7%. The population numbers and rates of disease for men, women and the town as a whole are shown below:

Year 1:

Men:

  • Population: 100
  • Cases of disease: 2
  • Rate: 2%

Women:

  • Population: 100
  • Cases of disease: 20
  • Rate: 20%

Total:

  • Population: 200
  • Cases of disease: 22
  • Rate: 11%

Year 2:

Men:

  • Population: 200
  • Cases of disease: 6
  • Rate: 3%

Women:

  • Population: 110
  • Cases of disease: 24
  • Rate: 21.8%

Total:

  • Population: 310
  • Cases of disease: 30
  • Rate: 9.7%

Looking at the overall rates only, we might conclude that the disease is less of a problem, due to the lower overall rate. But say we know that something has happened in the town, for example a mine opening. There is a sudden influx of mineworkers (who, in this case, are mostly men). The population in year 2 then increases to 310: 200 men and 110 women. This makes the demographic of the town quite different: there are now proportionally more men, who are less susceptible to the disease, so we would expect the overall incidence rate to decrease.
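One way to make the effect of the demographic change explicit is to write the overall rate as a weighted average of the group rates. The symbols here are notation introduced only for this sketch: $n_m$ and $n_w$ are the numbers of men and women, and $r_m$ and $r_w$ are their rates of disease.

```latex
\begin{align*}
r_{\text{overall}} &= \frac{n_m}{n_m + n_w}\, r_m + \frac{n_w}{n_m + n_w}\, r_w \\[4pt]
\text{Year 1:}\quad r_{\text{overall}} &= \frac{100}{200}\cdot\frac{2}{100} + \frac{100}{200}\cdot\frac{20}{100} = \frac{22}{200} = 11\% \\[4pt]
\text{Year 2:}\quad r_{\text{overall}} &= \frac{200}{310}\cdot\frac{6}{200} + \frac{110}{310}\cdot\frac{24}{110} = \frac{30}{310} \approx 9.7\%
\end{align*}
```

The weight on the men's rate rises from 50% in year 1 to about 65% in year 2, so the overall figure is pulled towards the lower of the two group rates even if neither group rate falls.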

However, in year 2 there is an increase in the incidence rate amongst both men and women. Our conclusion might now change: maybe this disease is more of a problem than it was in the previous year. Say we also learn that the disease could be related to pollution. Now the data makes a lot more sense: the overall rate is lower because there are proportionally more men, and the rates within each group are higher, possibly because pollution has made people more susceptible to the disease.
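The comparison can be checked with a few lines of Python. This is a minimal sketch using the case counts and populations from the example; the names `data`, `rate` and `overall_rate` are chosen here for illustration only.

```python
# Case counts and populations from the example: (cases, population) per group.
data = {
    "year 1": {"men": (2, 100), "women": (20, 100)},
    "year 2": {"men": (6, 200), "women": (24, 110)},
}


def rate(cases, population):
    """Rate of disease within a single group."""
    return cases / population


def overall_rate(groups):
    """Rate of disease when all groups are pooled together."""
    total_cases = sum(cases for cases, _ in groups.values())
    total_population = sum(population for _, population in groups.values())
    return total_cases / total_population


# Print the group rates and the pooled rate for each year.
for year, groups in data.items():
    per_group = ", ".join(
        f"{name}: {rate(*counts):.1%}" for name, counts in groups.items()
    )
    print(f"{year}: {per_group}, overall: {overall_rate(groups):.1%}")

# The reversal: every group rate goes up while the pooled rate goes down.
every_group_up = all(
    rate(*data["year 2"][group]) > rate(*data["year 1"][group])
    for group in data["year 1"]
)
overall_down = overall_rate(data["year 2"]) < overall_rate(data["year 1"])
print("Group rates up, overall rate down:", every_group_up and overall_down)
```

Working from raw counts rather than from the rounded percentages avoids small rounding differences when the groups are pooled.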


The important thing here is to try to understand the data and the situation as much as possible. In the example above we saw that it was the demographic change that caused the overall rate to decrease, but that in fact the rates of disease had increased in both men and women (perhaps due to environmental changes). In this case it was important not to be hasty in drawing conclusions and to understand the context of where the data came from.
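One way to separate the demographic effect from the change in the group rates is to ask what the year 2 overall rate would have been had the town kept its year 1 population mix. The sketch below does this; it is an illustration added here rather than part of the original example, and the function name `fixed_mix_rate` is hypothetical. Epidemiologists call this kind of adjustment direct standardisation.

```python
# Year 2 group rates from the example (cases / population).
year2_rates = {"men": 6 / 200, "women": 24 / 110}

# Assumption for this sketch: use the year 1 population mix as the
# fixed reference population.
reference_population = {"men": 100, "women": 100}


def fixed_mix_rate(rates, population):
    """Overall rate if the given group rates applied to a fixed population."""
    expected_cases = sum(rates[group] * population[group] for group in population)
    return expected_cases / sum(population.values())


year1_overall = 22 / 200
year2_overall = 30 / 310
year2_fixed_mix = fixed_mix_rate(year2_rates, reference_population)

print(f"Year 1 overall rate:            {year1_overall:.1%}")
print(f"Year 2 overall rate:            {year2_overall:.1%}")
print(f"Year 2 rate with year 1 mix:    {year2_fixed_mix:.1%}")
```

With the population mix held fixed, the year 2 figure comes out above the year 1 rate of 11%, which is consistent with the fall in the overall rate being driven by the change in demographics rather than by any improvement in the disease itself.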

There is a caveat here: we must also be careful not to split the data into groups that don’t make any sense, and should try to split it into the most sensible groups possible. Otherwise we risk reading into trends that are not there, i.e. splitting the data until we have noise that suits us.

Finally, we should always keep the issue of causality in mind, along with the claims that can actually be made about the data (in our example, the reason for the increase in the disease could be completely unrelated to pollution).

