No particular prevalence of p values just below .05

Like workers in all trades, scientists produce things. Bakers produce bread, construction workers produce buildings, and so on. And we scientists… well, we produce p values that are smaller than .05.

So what exactly is a p value? If a scientist wants to prove a point, she generally does so by testing a hypothesis. For example, she might hypothesize that rich people are happier than poor people. She could test this hypothesis by collecting happiness ratings from fifty rich and fifty poor people, and calculating a p value for the difference. The p value then expresses the chance that these happiness ratings would be as different as they are, or more different, if rich and poor people were really just as happy. (For a more detailed discussion, see my previous post.)
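To make this concrete, here is a minimal sketch in Python with made-up numbers (the group means and spreads are invented purely for illustration): fifty 'rich' and fifty 'poor' happiness ratings, compared with an independent-samples t test.

```python
# Minimal sketch with invented data: happiness ratings for fifty
# rich and fifty poor people, compared with an independent t test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rich = rng.normal(7.0, 1.5, 50)  # hypothetical ratings, rich group
poor = rng.normal(6.5, 1.5, 50)  # hypothetical ratings, poor group

# The p value: the chance of a difference at least this large if
# rich and poor people were really just as happy.
t, p = stats.ttest_ind(rich, poor)
print(f"t(98) = {t:.2f}, p = {p:.3f}")
```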

Are you still with me? Maybe not, but no matter: The important point is that a low p value means that your hypothesis is probably correct. (Actually, it means that the data is unlikely given the null hypothesis, but let's gloss over this important detail for now.) The commonly accepted threshold is .05: If your p value is below .05, you have found something worthy of publication; otherwise you haven't.

So there is a clear incentive for scientists to find p values that are smaller than .05. What do you do if you get a p value of .051? Well, you do what any sensible scientist would do: You test a few more participants, analyze the data a bit differently, maybe forget about some data points that on second thought were not reliable anyway. And… voilà, your p value just became .049.
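To see what this kind of flexibility does, here is a small simulation of my own (a sketch, not part of the analysis described below): both groups are drawn from the same distribution, so there is no real effect, yet re-testing after every added participant inflates the false-positive rate well beyond the nominal 5%.

```python
# Simulate 'testing a few more participants' until p < .05.
# Both groups come from the same distribution: every significant
# result is a false positive. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_simulations = 2000
false_positives = 0

for _ in range(n_simulations):
    a = list(rng.normal(0, 1, 20))
    b = list(rng.normal(0, 1, 20))
    while len(a) <= 40:  # peek at p after each added participant
        if stats.ttest_ind(a, b).pvalue < .05:
            false_positives += 1
            break
        a.append(rng.normal(0, 1))
        b.append(rng.normal(0, 1))

# Prints a rate well above 5%, despite there being no effect at all
print(f"False-positive rate: {false_positives / n_simulations:.1%}")
```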

It is generally accepted that scientists engage in all kinds of dubious practices to get low p values (e.g., Simmons, Nelson, & Simonsohn, 2011), but Masicampo and Lalande (2012) showed this in a particularly elegant way. They collected a large number of p values from three scientific journals in the field of psychology, and analyzed the resulting distribution. The journals that they selected were Psychological Science, Journal of Experimental Psychology: General, and Journal of Personality and Social Psychology. Their crucial finding was 'a peculiar prevalence of p values just below .05'. In other words, it appeared that when researchers find a p value that is just above .05, they tweak their results in such a way that the p value changes to just below .05. And just to be clear: You are not supposed to do this.

The little bump that corresponds to this peculiar prevalence is indicated by the orange dot in the figure below.

Figure adapted from Masicampo & Lalande (2012).

I thought this was pretty cool and elegant, so I did something similar. First, I downloaded a lot of content1 from Journal of Vision, which is a journal in the field of vision science. Although I've questioned this journal's open-access status, its content can be downloaded freely, which made it suitable for the purpose of automated data mining. Next, I extracted all statistics that were formatted exactly as t(X) = X, p = X or F(X,X) = X, p = X. In other words, I analyzed only t tests and ANOVAs.

Of course, it would be better to retrieve the statistics by going through each and every paper manually, like Masicampo and Lalande appear to have done. But I decided to save a few lifetimes of work and go with a regular-expression search, which is an automated way to search text for specific patterns. And although this approach no doubt missed a lot of p values, I nevertheless ended up with 2063 p values across 298 papers. In total there were 1999 papers, but this total also included many conference abstracts, which generally don't contain any statistics. So, assuming that there were about 1000 full articles, I managed to extract statistics from about 30% of them.
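For the curious, the search might look something like the sketch below. This is an illustrative reconstruction, not necessarily the exact patterns I used, and it ignores real-world variations such as 'p < .001' or extra whitespace.

```python
# Illustrative regular expressions for statistics formatted exactly
# as 't(df) = X, p = X' or 'F(df1,df2) = X, p = X'.
import re

t_test = re.compile(r"t\((\d+)\) = (-?\d+\.?\d*), p = (\d*\.\d+)")
anova = re.compile(r"F\((\d+),(\d+)\) = (\d+\.?\d*), p = (\d*\.\d+)")

text = ("We found an effect, t(49) = 2.31, p = .034, "
        "but no interaction, F(1,49) = 0.42, p = .52.")
for df, t, p in t_test.findall(text):
    print(f"t({df}) = {t}, p = {p}")
for df1, df2, f, p in anova.findall(text):
    print(f"F({df1},{df2}) = {f}, p = {p}")
```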

This set of 2063 p values is smaller than the set of 3494 p values analyzed by Masicampo and Lalande. But it comes close and is enough to fit a nice distribution.
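What does 'fitting a distribution' look like? One plausible choice, and only an assumption on my part, is an exponential curve, since p values pile up near zero when many studies test real effects. A minimal sketch:

```python
# Sketch: fit an exponential curve to a set of p values. The choice
# of an exponential is my assumption, not necessarily the curve
# behind the figures discussed here.
import numpy as np
from scipy import stats

def fit_exponential(p_values):
    p = np.asarray(p_values)
    loc, scale = stats.expon.fit(p, floc=0)  # fix the location at 0
    return scale  # fitted mean of the distribution
```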

So was there also a ‘peculiar prevalence of p values just below .05’ in this dataset?

No, much to my disappointment there was not.

On the left side of the figure above, you can see the full distribution of p values. The dashed line indicates the .05 threshold, and everything to the left of this line is 'significant'. A large proportion (47%) of all p values is significant. This is typical, among other things because, as I already mentioned, it is difficult to publish non-significant results (i.e., publication bias).

Now look at the right part of the figure, which zooms in on the range between .01 and .1. In Masicampo and Lalande's dataset, there was a bump in the shaded region between .045 and .05 (i.e. just below .05). But in this dataset there is no such bump: p values just below .05 are just as common as you would expect them to be based on the rest of the distribution.

We do see a number of clear and regularly spaced peaks, but these are simply due to rounding: Many authors round their p values to two decimals, so a lot of p values are exactly .03, .04, .05, etc. This rounding phenomenon is not evident in Masicampo and Lalande's distributions, probably because they re-calculated p values themselves whenever the exact p value was not reported. What is actually quite striking is that the .05 rounding peak is not too different from the other rounding peaks. When you think about it, this suggests that authors have occasionally chosen to round up their p value of, say, .04912 to .05. Well, that's noble in a way, I suppose.
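For completeness, a crude way to quantify such a bump is to compare the count of p values just below .05 with the counts in the flanking bins. The function below is my own sketch (the bin width and names are arbitrary choices), not the analysis behind the figure.

```python
import numpy as np

def bump_size(p_values):
    """Compare the [.045, .05) bin with the average of its neighbors."""
    p = np.asarray(p_values)
    observed = np.sum((p >= .045) & (p < .050))
    left = np.sum((p >= .040) & (p < .045))
    right = np.sum((p >= .050) & (p < .055))
    # Note: the right-hand bin contains the rounding spike at exactly
    # .05, so rounded values should really be treated separately.
    expected = (left + right) / 2
    return observed, expected
```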

So, at the risk of taking this analysis too seriously, what can we conclude from this? An optimistic conclusion is that Journal of Vision authors are on their best behavior, and don’t tweak their results in the same way that contributors to Psychological Science, Journal of Experimental Psychology: General, and Journal of Personality and Social Psychology do.

Alternatively, Journal of Vision authors simply tweak their p values in more sophisticated ways.

References

Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology, 65(11), 2271–2279. doi:10.1080/17470218.2012.711335

Mathôt, S. (2013). 2063 F and t tests extracted from articles in Journal of Vision. FigShare. doi:10.6084/m9.figshare.832497

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. doi:10.1177/0956797611417632


  1. I downloaded from volumes 2 to 13, numbers 2 to 12, and articles 1 to 19. You can use these numbers to retrieve the full text via direct links, such as http://www.journalofvision.org/content/12/12/14.full. You can download the full dataset from FigShare (Mathôt, 2013).