The starting point

This is the first of (at least) 3 posts on p-values. p-values are everywhere in statistics, especially in fields that require experimental design.

They are also pretty tricky to get your head around at first. This is because of the nature of classical (frequentist) statistics. So to motivate this I am going to talk about a non-statistical situation that will hopefully give some intuition about how to think when interpreting p-values and doing hypothesis testing.

My New Car

I want to buy a car. So I go down to the second-hand car dealership to get one. I walk around a bit until I find one that I like.

I think to myself: ‘this is a good car’. 

Now because I am at a second-hand car dealership I find it appropriate to gather some data. So I chat to the lady there (looks like a bit of a scammer, but I am here for a deal) about the car. She tells me all about it and lets me take it for a test drive. I find some problems:

  • It took 4 attempts to open the door (the handle didn’t really work)
  • The gearbox was very stiff
  • The brakes were unreliable
  • etc

In total I find 10 major faults with the car (and I could have found more). Now I am a little sceptical about whether it is a good car or not. If it were a good car, it probably wouldn’t have this many faults. So I decide not to buy the car.

Back to Statistics

Normally when doing hypothesis testing, we start with a hypothesis, which we will call H_0, and then gather some data. From this data we then have to make a decision about our hypothesis. Note how this is similar to the set-up above. Let’s do a more stats-y example and then we will draw together the intuition.

Let us say that our hypothesis, H_0, is that the average cost of a coffee in Cape Town is R20. The alternative to this (a hypothesis called H_1) is that the average cost is higher. We will call the actual average cost the ‘true’ mean, \mu.

We think to ourselves: ‘The average cost of coffee (\mu) in Cape Town is R20’.

Now we can’t verify that exactly because there is a lot of coffee in Cape Town. So we gather some data. We go to n=30 coffee shops and calculate that the sample mean (which we will call \bar x) is R22 and that the sample standard deviation (which we will call s) is R4.

We also know from our Stats book that t = \frac{\bar x - \mu}{s/\sqrt{n}} has a t distribution with 29 degrees of freedom (the 29 is n-1). We calculate a value that comes from this distribution as \frac{22 - 20}{4/\sqrt{30}} = 2.74

So now we test our hypothesis. We find the probability that, if our null hypothesis is true, we observe data at least as extreme as this, i.e. P(T>2.74), where T follows the above-mentioned t-distribution. This probability turns out to be about 0.005. That is pretty low. So we think to ourselves: if the average cost were R20, this data would be pretty unlikely. So we reject that hypothesis.
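The arithmetic above can be checked with a short sketch that uses only the Python standard library. The function names (t_pdf, upper_tail) and the choice of Simpson's rule to integrate the t density are my own illustrative assumptions; in practice a stats package would do this step for you.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The density is symmetric about 0, so P(T > 0) = 0.5.
    return 0.5 - total * h / 3

# Numbers from the coffee example.
n, x_bar, s, mu_0 = 30, 22.0, 4.0, 20.0
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
p_value = upper_tail(t_stat, df=n - 1)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```

This should give a t of about 2.74 and a one-sided p-value of roughly 0.005, in line with the numbers quoted above.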

The similarities

Hopefully the similarities between my car story and the coffee experiment were clear, but I will make them explicit anyway:

  • In both cases we had some hypothesis (in quotes, above) about the state of the world
  • In both cases we gathered some data (the faults in the car and the prices of coffee)
  • In both cases we looked at that data under the assumption that our hypothesis was true and concluded that seeing data like this (or more extreme) is improbable if our hypothesis about the state of the world is correct.

This is the core principle of hypothesis testing.


With all of this in mind we can now talk about p-values. A p-value is:

The probability of observing data that is equal to or more extreme than the data that is observed, given that the null hypothesis is true.

Note that the probability statement is a probability statement about the data, given an assumption made about the true state of the world. 
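One way to see that this is a statement about the data is to simulate it. The sketch below pretends the null hypothesis is true, generates many fake samples of 30 coffee prices, and counts how often the t statistic is at least as large as the 2.74 we observed. The normal distribution for prices, the spread of R4, the number of simulations and the seed are all assumptions I have made purely for illustration (the spread should not matter much, since the t statistic rescales by the sample standard deviation).

```python
import math
import random

def t_statistic(sample, mu_0):
    """One-sample t statistic for the hypothesised mean mu_0."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu_0) / math.sqrt(var / n)

def simulated_p_value(observed_t, mu_0=20.0, sigma=4.0, n=30, sims=20000, seed=1):
    """Fraction of samples drawn under H_0 whose t statistic is >= observed_t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        # Assumption: prices are normally distributed around the H_0 mean.
        sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
        if t_statistic(sample, mu_0) >= observed_t:
            hits += 1
    return hits / sims

p_sim = simulated_p_value(observed_t=2.74)
print(f"simulated p-value: {p_sim:.4f}")
```

The world where the mean is R20 is fixed by assumption throughout; only the data varies from run to run. The fraction that comes out is an estimate of the p-value, so it will wobble a little around the exact figure.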

This means that it would be incorrect to say: ‘the probability that the average cost of a cup of coffee in Cape Town is R20 is 0.005’, as this is a statement about the true cost of coffee and not about the data. And, rather weirdly, it would mean that we could not say ‘this is probably not a good car’ (this is the counterintuitive bit and, in hindsight, shows how the way that I was thinking at the car dealership is a little weird).

A final point to emphasise is that the p-value includes the probability of seeing data that is more extreme than the data that we have observed. In the car example that is the probability of 10 or more faults, given that the car is good (which we did not calculate but just decided was low) and in the coffee example it is the probability of observing data with a mean of 22 (and s=4) or more given that the true mean is 20. A good way to remember to do this is to think of continuous random variables, where P(X=x)=0 \forall x, so in order to get a non-zero answer you pretty much have to include the other probabilities.
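To make the 'or more extreme' part concrete for the car, here is a small sketch. We never calculated this probability in the story, so the model here is entirely hypothetical: I assume the number of major faults in a good car follows a Poisson distribution with an average of 3 faults. Both the distribution and the rate are invented for illustration.

```python
import math

def poisson_pmf(k, rate):
    """P(X = k) for a Poisson random variable with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def poisson_upper_tail(k, rate):
    """P(X >= k): one minus the probability of k - 1 or fewer."""
    return 1.0 - sum(poisson_pmf(i, rate) for i in range(k))

rate_good_car = 3.0  # hypothetical average number of faults in a good car

exactly_10 = poisson_pmf(10, rate_good_car)
ten_or_more = poisson_upper_tail(10, rate_good_car)
print(f"P(exactly 10 faults | good car) = {exactly_10:.5f}")
print(f"P(10 or more faults | good car) = {ten_or_more:.5f}")
```

The p-value is the second number, not the first: it adds up the probability of 10 faults, 11 faults, 12 faults and so on, all under the assumption that the car is good.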


