Introduction
So far we have discussed what p-values are and how they are calculated, as well as how bad experiments can lead to artificially small p-values. The next thing we will look at comes from a paper by N.N. Taleb (1), in which he derives the meta-distribution of p-values, i.e. the range of p-values we might expect if we repeatedly ran an experiment sampling from the same underlying distribution.
The derivations are quite involved, and both the content and the implications of the results are fairly new to me, so please point out and/or discuss any discrepancies or misinterpretations you find.
Thankfully, this video (2) offers an explanation covering some of what the paper says, along with some Monte Carlo simulations. My discussion will focus on some simulations of my own, based on those done in the video.
What we are talking about
We have already discussed what p-values mean and how they can go wrong. Now we want to know how they vary under repeated experiments. We are drawing a random sample from an underlying distribution, which means that the test statistic we get is random, which, in turn, means that our p-value is a random variable. So if we are going to interpret this random variable, then we might want to know something about it, i.e. its distribution. We will now demonstrate a simple method for approximating this distribution.
The procedure
Since the analytical derivation is quite involved, I will limit this to simulating, by sampling, the distribution of the p-values of different experiments.
This is done as follows:
- Specify the true distribution that the data will be drawn from. In practice this is not known, but here we sample from a known distribution so that we can observe the behaviour of p-values under different circumstances that we already understand.
- Draw 10 000 samples (corresponding to 10 000 repeated experiments) of size 30 from that distribution.
- For each experiment, calculate a test statistic and a p-value.
- Plot the distribution of the (10 000) p-values.
The distributions considered will be Gaussian with a standard deviation of 1 and varying means. The hypothesis being tested is that the mean is greater than 0. Thus, the test statistics will follow a t-distribution with 29 degrees of freedom (which we used in the first post on p-values).
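The procedure above can be sketched as a short simulation (a sketch using NumPy and SciPy; the function name and defaults are my own, not from the paper or video):

```python
import numpy as np
from scipy import stats

def simulate_pvalues(true_mean, n_experiments=10_000, n=30, seed=0):
    """Simulate p-values from repeated one-sided t-tests on samples of
    size n drawn from a Normal(true_mean, 1) distribution."""
    rng = np.random.default_rng(seed)
    # One row per experiment: n_experiments samples of size n
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_experiments, n))
    # One-sample t-test of the mean against 0, one-sided alternative
    # "mean > 0"; under the null the statistic follows a t-distribution
    # with n - 1 = 29 degrees of freedom.
    _, pvals = stats.ttest_1samp(samples, popmean=0.0, axis=1,
                                 alternative="greater")
    return pvals

# Under the null (true mean 0), roughly 5% of experiments reject at 5%
pvals = simulate_pvalues(0.0)
print(round(float((pvals < 0.05).mean()), 3))
```

The p-values returned by this function are what get binned into the histograms below.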
Results
The plots below show the results of the simulated experiments for data with different means (0, 0.05, 0.1 and 0.3). The red line in each plot shows the 5% cutoff value: anything to the left of this line means we reject the null hypothesis. The blue line shows the ‘true’ (or ‘typical’) p-value that we would get if our sample mean were exactly the true mean of the distribution from which we are sampling.
The first plot shows the case where the data is sampled from a distribution under which the null hypothesis is true, i.e. a Normal(0, 1) distribution. The plot shows that the distribution of p-values is approximately uniform. We note that about 5% of the time we reject the null hypothesis (matching our 5% false positive rate) and that about half of the observations have p-values below 0.5 (where we find the blue line). This is what we might expect in this situation. Looking at the first column of the table, the proportion of observations below some typical cutoff points is roughly what would be expected from the uniform distribution.
Fig 1: mean = 0
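The near-uniformity under the null follows from the probability integral transform: the p-value is p = 1 − F(t), where F is the (continuous) CDF of the test statistic, and F(T) is itself Uniform(0, 1) when T is drawn from the distribution with CDF F. A quick numerical check (a sketch; the setup mirrors the simulation described above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 10 000 experiments of size 30 drawn under the null: Normal(0, 1)
samples = rng.normal(0.0, 1.0, size=(10_000, 30))
_, pvals = stats.ttest_1samp(samples, 0.0, axis=1, alternative="greater")

# For a Uniform(0, 1) variable, the proportion below x should be close to x
for x in (0.1, 0.25, 0.5, 0.75):
    print(x, round(float((pvals < x).mean()), 3))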
It is in figures 2-4 that the pattern of interest begins to emerge. As the mean increases, the ‘typical’ p-value decreases and moves towards the 5% cutoff. Additionally, along with this move, the number of experiments that lead to a rejection of the null hypothesis increases. In a way this is good: even if a ‘typical’ draw from that distribution does not result in a rejection, the mean is different from 0, so it makes sense that we should be rejecting the null hypothesis more often. The flip side is that these rejections come from slightly atypical draws. But this is a little tangential to the point of this post.
As Taleb notes in the video: it is not the fact that the p-values are stochastic that is surprising, but the asymmetry.
So what we should really be looking at is the shape of this distribution. The most striking features here are the fat tails and the large number of experiments with p-values very close to 0. It is clear that as the true mean of the underlying distribution steadily increases, the bin that grows the fastest is the one at the far left of the plot (corresponding to p-values less than 0.02).
So what does this mean? It means that as soon as we have a distribution that is even slightly different from the distribution under the null hypothesis (where a ‘typical’ experiment will not give a significant result), the distribution of the p-values resulting from repeated experiments has density that grows fastest in the region of highly significant results.
Consider figure 4 and the final column in the table (mean = 0.3). If we had drawn a sample with a mean equal to the true mean, we would have a p-value of just over 0.05, i.e. borderline significant. We note that about half the time we get a significant result, i.e. the p-value is below 0.05 about half the time. That’s fine: if something is borderline and stochastic, we might expect the split to be roughly 50-50. What is more interesting is that, of the significant experiments, one-third have a p-value of less than 0.005, i.e. a full order of magnitude lower. This means that about 1 out of every 6 experiments run on a phenomenon whose true underlying distribution is typically borderline significant will produce highly significant results.
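The ‘typical’ p-value mentioned above can be checked directly (a sketch; it assumes the sample standard deviation is exactly 1, so the t-statistic is just the mean scaled by the square root of the sample size):

```python
import math
from scipy import stats

n = 30
true_mean = 0.3

# t-statistic if the sample mean equals the true mean (and s = 1):
# t = mean / (s / sqrt(n)) = mean * sqrt(n), approximately 1.643
t_typical = true_mean * math.sqrt(n)

# One-sided p-value from a t-distribution with 29 degrees of freedom;
# this lands just above the 0.05 cutoff, i.e. borderline significant
p_typical = stats.t.sf(t_typical, df=n - 1)
print(round(p_typical, 4))
```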
Fig 2: mean = 0.05
| p<x     | mean = 0 | mean = 0.05 | mean = 0.1 | mean = 0.3 |
|---------|----------|-------------|------------|------------|
| 0.001   | 0.0007   | 0.0027      | 0.0048     | 0.0554     |
| 0.005   | 0.0048   | 0.0111      | 0.0185     | 0.1544     |
| 0.01    | 0.0093   | 0.02        | 0.0389     | 0.2269     |
| 0.05    | 0.0465   | 0.0849      | 0.1369     | 0.48       |
| typical | 0.5017   | 0.5018      | 0.4965     | 0.4992     |
The columns in this table correspond to the true mean of the distribution from which the data was sampled. The rows correspond to various cutoff values, where ‘typical’ is the p-value calculated with the sample mean equal to the true mean. The entries show the proportion of experiments/simulations that gave p-values below each point. Notice that as the true mean increases, the proportion of experiments giving extremely low p-values increases.
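The table can be reproduced with a short self-contained simulation (a sketch using NumPy and SciPy; the seed is arbitrary, so the exact proportions will differ slightly from those above):

```python
import numpy as np
from scipy import stats

def pvalue_table(means=(0, 0.05, 0.1, 0.3),
                 cutoffs=(0.001, 0.005, 0.01, 0.05),
                 n=30, n_experiments=10_000, seed=0):
    """For each true mean, return the proportion of simulated experiments
    with p-values below each cutoff (and below the 'typical' p-value)."""
    rng = np.random.default_rng(seed)
    table = {}
    for true_mean in means:
        samples = rng.normal(true_mean, 1.0, size=(n_experiments, n))
        _, pvals = stats.ttest_1samp(samples, 0.0, axis=1,
                                     alternative="greater")
        # 'Typical' p-value: sample mean equal to the true mean, s = 1
        p_typical = stats.t.sf(true_mean * np.sqrt(n), df=n - 1)
        row = {x: float((pvals < x).mean()) for x in cutoffs}
        row["typical"] = float((pvals < p_typical).mean())
        table[true_mean] = row
    return table

for mean, row in pvalue_table().items():
    print(mean, row)
```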
Discussion
This is a problem without a clear solution, beyond the likes of ‘use a far stricter threshold for significance’ (which, of course, comes at a cost). It does, however, indicate that the behaviour of p-values can lead to unwanted results, even once we understand them and design experiments properly (the subjects of part 1 and part 2).
As noted, this post contains less established knowledge than the others, so any challenges/questions/complaints/edits are more than welcome.
(1) https://arxiv.org/pdf/1603.07532.pdf
(2) https://www.youtube.com/watch?v=8qrfSh07rT0