What is p-hacking?

You might have heard about a reproducibility problem with scientific studies. Or you might have heard that drinking a glass of red wine every evening is equivalent to an hour’s worth of exercise.

Part of the reason that you might have heard about these things is p-hacking: ‘torturing the data until it confesses’. It is driven mostly by pressure on researchers to find positive results (as these are more likely to be published), but it may also arise from misapplication of statistical procedures or bad experimental design.

Some of the content here is based on a more serious video from Veritasium: https://www.youtube.com/watch?v=42QuXLucH3Q. John Oliver has also spoken about this on Last Week Tonight, for those who are interested in some more examples of science that makes its way onto morning talk shows.

p-hacking can be done in a number of ways: basically, anything done, consciously or unconsciously, that produces statistically significant results where there is no real effect. Some of the most common ways of doing this are:

• formulating a hypothesis after seeing the data
• small sample sizes (a sample of 5 people is not sufficient to separate a real relationship from random variation)
• splitting data into groups (e.g. young people and old people) and performing the analysis again
• testing multiple hypotheses at once
• excluding observations from the analysis because they are ‘outliers’ or are ‘non-representative’

Testing multiple hypotheses is probably the most interesting way that this is done. If we test a single hypothesis with a false positive rate of $\alpha = 0.05$ (i.e. we look for p<0.05) then, when there is no real effect, we should expect to get a false positive about 5% of the time.

If we test 12 independent hypotheses at the same time, however, then the probability of a false positive increases to:

$P(\text{at least 1 false positive}) = 1 - P(\text{no false positives}) = 1 - 0.95^{12} \approx 0.46$
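As a rough sketch, the same calculation can be wrapped in a small Python function to see how quickly the chance of at least one false positive grows with the number of tests. The function name `familywise_error_rate` is just an illustrative choice, and the formula assumes the tests are independent, as above.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Show how the probability grows as more hypotheses are tested.
for n_tests in (1, 2, 5, 12, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:2d} tests: {rate:.2f}")
```

With 12 tests this gives the value of about 0.46 quoted above; with a single test it reduces to $\alpha$ itself.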

An example of finding significant results in random noise

To illustrate this I decided to run a few tests. I wanted to look for relationships between 12 outcomes (read: y-values) and 5 observations of some predictive variable that will be our x-values. The x-values were 0, 1, 2, 3, 4 (to keep the plots simple and the trends really easy to spot, but they could have been anything, and we shall see why shortly). I then populated each of these 12 outcomes with standard normal random samples (we can take the measurements as normalised).

Across the multiple hypotheses being tested I managed to get significant (p<0.05) results for the relationship between the number of pigeons encountered and $y_0$ and $y_3$, with p-values of 0.032 and 0.016 respectively.

I did require the sample size to be very small to get these results, and I did test 12 things. But considering that the ‘data’ had absolutely no real trend it’s pretty surprising that these results should occur.
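A minimal sketch of this kind of experiment is given below. It is not the code used to produce the results above: it assumes that each relationship is tested with a simple linear regression and a two-sided t-test on the slope, and the function names are hypothetical. To avoid external libraries it uses the closed form of the Student t distribution with 3 degrees of freedom, which only applies to the 5-point case here (in practice something like `scipy.stats.linregress` would be the more usual choice).

```python
import math
import random

X = [0, 1, 2, 3, 4]  # the five x-values from the example


def slope_p_value(x, y):
    """Two-sided p-value for a zero slope in simple linear regression.

    Assumption: exactly 5 points, so the t-statistic has 3 degrees of
    freedom and the t CDF has a simple closed form.
    """
    n = len(x)
    if n != 5:
        raise ValueError("this closed-form t CDF is only valid for n = 5")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and standard error of the slope.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t = abs(slope) / std_err
    # Student t CDF for 3 degrees of freedom, evaluated at |t|.
    u = t / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)


def count_significant(rng, n_outcomes=12, alpha=0.05):
    """Regress n_outcomes pure-noise outcomes on X; count p < alpha."""
    hits = 0
    for _ in range(n_outcomes):
        y = [rng.gauss(0, 1) for _ in X]  # standard normal noise, no trend
        if slope_p_value(X, y) < alpha:
            hits += 1
    return hits


def fraction_with_false_positive(n_studies=5000, seed=0):
    """Fraction of simulated 'studies' with at least one significant result."""
    rng = random.Random(seed)
    with_hit = sum(count_significant(rng) > 0 for _ in range(n_studies))
    return with_hit / n_studies


print(fraction_with_false_positive())
```

Under these assumptions the printed fraction should land close to the 0.46 calculated earlier, since every 'significant' slope found in pure noise is a false positive.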

Some of the studies mentioned in the above videos (and elsewhere) use small sample sizes and test multiple hypotheses. They may also have data with at least some trend, or data that the researcher has more control over than I did when just generating random samples. With that in mind, it is clear that p-hacking is fairly easy to do.

Of course, this is not to say that false positives can’t, or shouldn’t, happen (we have to tolerate at least some false positives if the system is going to work at all). However, it is important to be aware of the potential for results to be misinterpreted or for there to appear to be a trend in places where no such trend exists.

How to spot this

Spotting false claims can be difficult, especially without access to the original data, but there are some simple sanity checks that can be done. Firstly, find out what the original study claimed and not what is being reported: claims like ‘x may improve the value of y health indicator’ are easier to believe than are claims like ‘x is good for you’.

The next thing to ask is, ‘does that sound reasonable?’. Obviously there are counterintuitive results in the world that we don’t expect, but often claims are either vague, difficult to test or just ridiculous. For example, from our experience of both drinking wine and exercising, and from observing those around us who do either, we might treat the claim that ‘drinking a glass of wine is equivalent to an hour of exercise’ with a healthy degree of scepticism.

If possible, check the experimental procedure that was followed: was the sampling fair, were there enough test subjects, was there a control group, what outcome was measured, etc.

Finally, see if the result has been reproduced anywhere and/or keep it in your mind for when a new result comes along that either confirms or contradicts it.
