Let’s consider a biologist with an interest in swan coloration. She goes on an expedition to an area where two groups of swans live, to investigate whether the two groups have different colors. The biologist takes her job very seriously, and first calibrates a photometer against two reference colors: One for the ideal black swan; one for the ideal white swan. She then measures the color (or rather luminance) of ten specimens from each group, obtaining a range of values where 0 is ideal black and 100 is ideal white:
To analyze her results, she runs an independent samples t-test on the measurements, which tells her that p = .0001. This leads her to conclude that the two groups have different colors. Just as she suspected all along:
Our biologist is probably satisfied at this point. But we are not. What exactly has she learned from this t-test and the resulting p-value? Let’s start with the basics: What exactly does p = .0001 mean? Well … it means that if the two groups were really of the same color, the chance of observing a color difference as extreme as she observed, or more extreme, is .0001. This is an odd and counter-intuitive statement. Yet it is the foundation of most research.
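To make this definition concrete, here is a minimal sketch that estimates a p-value by brute force. Note that it uses a permutation test rather than the t-test that the biologist ran: it shuffles the group labels many times and counts how often the shuffled difference is at least as extreme as the observed one. The luminance values and the function name `permutation_p_value` are made up for illustration; they are not the biologist’s data.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=1):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the group labels are arbitrary,
        # so any reshuffling of the labels is equally plausible.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Hypothetical luminance values (0 = ideal black, 100 = ideal white).
dark_group = [12, 18, 9, 22, 15, 11, 20, 14, 17, 10]
light_group = [81, 77, 90, 85, 79, 88, 92, 83, 76, 86]

p = permutation_p_value(dark_group, light_group)
print(f"p = {p:.4f}")
```

The returned number is exactly what the definition says: the share of ‘null worlds’ in which a difference this large, or larger, shows up.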
Personally, I have a hard time understanding statistics, and null hypothesis testing (the type of statistics that gives you p-values) in particular. And when I finally think I have some grasp on it, the paradoxes and weirdness that follow from this understanding have no end.
First the obligatory disclaimer: I’m not a statistician. But I don’t think you need to be in order to see that p-values are strange creatures that lead to odd conclusions and force researchers into an odd way of working.
Let’s start with the most striking property of the p-value: Its asymmetry. You generally compare two hypotheses: two models of what the world might be like. The null hypothesis states that nothing interesting is going on. In the case of the swans, this means that the two groups have the same color. The alternative hypothesis is generally the interesting one, and in this case states that the two groups have different colors. What the p-value allows you to do is reject, with a seemingly objective level of confidence, the null hypothesis, thus in a way accepting the alternative hypothesis. But it does not allow you to reject the alternative hypothesis and accept the null hypothesis. You can deduce this from the definition of the p-value given above: It is essentially a description of the world under the null hypothesis. This is also why this approach is called ‘null hypothesis testing’.
Why this asymmetry? If you ask a scientist, chances are that you get a profound-sounding Popperian explanation. Something like: “You can disprove that all swans are white by finding a single black swan. But you will never be able to prove that all swans are white, no matter how many white swans you see, because there is always the chance that the next one will be black. That is why you can reject, but never accept the null hypothesis.”
But how sound is the black-swan argument really? Very sound, at least if we forgive the fact that swans are countable: There are only a finite number of swans, and you could in theory observe every single one of them and know with absolute certainty whether or not there are any black ones. But some things are really infinite, and even more things are infinite in practice, and in that case the black-swan argument does indeed make sense.
However, as a psychologist I have never seen a black or a white swan. Gray swans a-plenty, sure. Some are pretty white, some are pretty black, but never perfectly so. And what is more: If I were to observe a hundred swans that were pretty white, and a single one that was pretty black, I would still conclude that all swans are probably white: The seemingly black swan must have been a measurement error, its latent whiteness obscured by noise. (Perhaps it was in the shadows?) In practice, therefore, there is not really a difference between hypothesizing that all swans are white and hypothesizing that they are not. Neither hypothesis can be conclusively proven or disproven, and either one can be very likely after you have seen a lot of swans. (Personally, I have seen so many swans that appeared quite black that I feel confident that some swans are, as a matter of fact, not white.)
Another property of the p-value is the lack of a reference to an effect size. With ‘effect size’ I don’t mean a fancy statistical measure, but simply–and to stick with our example–the magnitude of the color difference between the two groups of swans. Our biologist probably had an idea of how much coloration should differ: One group should be pretty black, the other pretty white. She did not expect both groups to be gray, differing only slightly in coloration. That wouldn’t be interesting at all. Yet the p-value is oblivious to this important detail. And this has unfortunate consequences, both practically and theoretically.
From a theoretical point of view, the lack of any role for effect size means that p-values are nonsensical for things that are finite and perfectly measurable. For example, as I already hinted at, you could observe every single swan from both groups. The outcome of this exhaustive experiment is known in advance: You will find that the two groups differ in color. Even if all swans are pretty much the same shade of gray, there is some variation from swan to swan. Therefore, the average color of one group will always differ at least a little from that of another group. Once you have observed all swans, all uncertainty is gone, and you can always reject the null hypothesis with complete certainty: p = 0. (Remember this if you work with patient groups or other finite populations. It can save a lot of participant money!)
From a practical point of view, the lack of a predicted effect size underlies much of the trouble with chance capitalization, which is so bothersome when you do null hypothesis testing. First, let’s consider this basic fact: If the null hypothesis is true (no color difference), the p-value fluctuates randomly between 0 and 1. It does not converge onto a particular value, or gradually get higher. (This is the asymmetry again: Under the alternative hypothesis the p-value does converge onto a particular value, namely 0.) This means that if you calculate the p-value many times over, for example after every time that you have determined the color of a swan, the chance that you encounter a small p-value at some point, just by chance, is quite high. Astoundingly high sometimes.
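The claim that the p-value wanders anywhere between 0 and 1 under the null hypothesis can be checked with a small simulation. The sketch below makes some simplifying assumptions that are not in the text: luminance is normally distributed with a known standard deviation of 10, and a z-test is used instead of a t-test so that only the Python standard library is needed. All names and numbers are illustrative.

```python
import random
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()

def z_test_p(group_a, group_b, sd):
    """Two-sided p-value for a difference in means, assuming a known common SD."""
    standard_error = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def simulate_null_p_values(n_experiments=5000, n_per_group=10, sd=10.0, seed=1):
    """Run many experiments in which both groups truly have the same color."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_experiments):
        a = [rng.gauss(50, sd) for _ in range(n_per_group)]
        b = [rng.gauss(50, sd) for _ in range(n_per_group)]
        p_values.append(z_test_p(a, b, sd))
    return p_values

p_values = simulate_null_p_values()

# Count how many p-values fall in each tenth of the range from 0 to 1.
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1

print("share per tenth:", [round(b / len(p_values), 2) for b in bins])
print("share below .05:", sum(p < 0.05 for p in p_values) / len(p_values))
```

If the assumptions hold, every tenth of the range should receive roughly the same share of p-values, and about one in twenty should fall below .05.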
However, effect sizes do not randomly fluctuate in the same way. For example, if we compare two groups of equally gray swans, the difference in observed swan coloration decreases as you observe more swans. This is simply because the random individual variation in coloration averages out.
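A quick way to see this averaging-out is to simulate two groups with identical true color and look at the typical observed difference for different sample sizes. This is only a sketch: the normal distribution, the mean of 50, the standard deviation of 10, and the function name are assumptions made for illustration.

```python
import random
from statistics import fmean

def mean_abs_difference(n_per_group, n_experiments=1000, sd=10.0, seed=1):
    """Average absolute difference between two group means when there is no true difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_experiments):
        mean_a = fmean([rng.gauss(50, sd) for _ in range(n_per_group)])
        mean_b = fmean([rng.gauss(50, sd) for _ in range(n_per_group)])
        diffs.append(abs(mean_a - mean_b))
    return fmean(diffs)

for n in (5, 20, 80, 320):
    print(f"{n:>3} swans per group: mean |difference| = {mean_abs_difference(n):.2f}")
```

Under these assumptions the observed difference should roughly halve every time the number of swans is quadrupled.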
Let’s consider the practical implications of this by conducting a simple thought experiment. Our biologist does what psychologists sometimes do, but really shouldn’t. She engages in a procedure called ‘optional stopping’, which works as follows: She measures a pair of swans, one from each group, and conducts a t-test to see if the coloration differs significantly (using the common criterion of p < .05). If yes, she stops the experiment and triumphantly declares that the two groups have different colors. If no, she measures two additional swans and does the same thing again. And she keeps doing so, until, at some point, she finds that p < .05. The scary thing is that this is pretty much guaranteed to happen, at least if she is (very) persistent. You can see this in the graph below: By engaging in this extreme form of optional stopping, the chance of encountering a false alarm (i.e. concluding a color difference when there is none) keeps increasing as you measure more swans, approximately as a log function of the number of swans measured. (Just to be clear: Here we assume that the null hypothesis is true.)
But now look at the average (color) difference that is actually observed: This decreases as you measure more swans, simply because there really is no difference and the error averages out. So optional stopping (i.e. calculating p-values over and over again, and stopping when you have a significant result) leads to a high number of false alarms in combination with very small effect sizes. If only the p-value would know what effect size to expect! But alas, it doesn’t. The p-value is interested in any effect, no matter how big or small.
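The optional-stopping procedure described above can be sketched in a few lines. As before, this is a simplified simulation and not the analysis behind the original graph: it assumes normally distributed luminance with a known standard deviation of 10, and it uses a z-test instead of a t-test (which also makes it possible to test after the very first pair). The function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()

def optional_stopping(max_pairs, rng, sd=10.0, alpha=0.05):
    """Measure one swan per group at a time and stop as soon as p < alpha.

    Returns (false_alarm, absolute_observed_difference).
    Both groups have the same true color, so every 'hit' is a false alarm.
    """
    sum_a = sum_b = 0.0
    diff = 0.0
    for n in range(1, max_pairs + 1):
        sum_a += rng.gauss(50, sd)
        sum_b += rng.gauss(50, sd)
        diff = sum_a / n - sum_b / n
        z = diff / (sd * (2 / n) ** 0.5)
        p = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
        if p < alpha:
            return True, abs(diff)
    return False, abs(diff)

def simulate(max_pairs, n_runs=1000, seed=1):
    """Return (false alarm rate, mean observed difference among the false alarms)."""
    rng = random.Random(seed)
    outcomes = [optional_stopping(max_pairs, rng) for _ in range(n_runs)]
    false_alarm_diffs = [diff for hit, diff in outcomes if hit]
    rate = len(false_alarm_diffs) / n_runs
    mean_diff = fmean(false_alarm_diffs) if false_alarm_diffs else 0.0
    return rate, mean_diff

results = {max_pairs: simulate(max_pairs) for max_pairs in (10, 100, 1000)}
for max_pairs, (rate, mean_diff) in results.items():
    print(f"up to {max_pairs:>4} pairs: false alarms = {rate:.2f}, "
          f"mean |difference| at stopping = {mean_diff:.1f}")
```

The more persistent the simulated biologist is allowed to be, the more false alarms she should collect, and the smaller the ‘significant’ differences that she reports.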
It’s all in the mind
The properties of the p-value described above can be dealt with by taking a very peculiar course of action: You take into account what’s in the mind of the researcher. And this, of course, is just asking for trouble.
Consider a lazy research assistant who works for two professors. The professors work in the same field of research, and independently have the same idea for an experiment. What is more, they both ask the research assistant to carry this experiment out. The assistant decides that there’s really no point in running the same experiment twice, so he just provides the two professors (who are unaware of the situation) with the same data. The plan is to test twenty participants, but the assistant sends the data immediately after every participant that he tests. Professor A is conscientious and, aware of the rules of the p-value, looks at the data only after all participants have been tested. He finds that the desired effect is significant with p = .03: A nice publication in the pocket! Professor B, on the other hand, is impatient and reluctant to waste resources. He checks after every participant with the intent to stop the experiment as soon as p < .05. But his impatience does not pay off: Only after the final participant is tested does the p-value drop below the golden threshold.
We are now presented with the following situation: When professor A informs us about the results of the experiment, we believe him when he says that the null hypothesis should be rejected. However, when professor B concludes the same thing on the basis of the exact same experiment, we are skeptical, because he engaged in optional stopping. (Assuming that he is honest about this, which he probably wouldn’t be.) In other words: Depending on the mind state of the researcher, the exact same data can lead to different conclusions.
With some creativity, you can concoct many such weird examples. But I will restrain myself and only give you one more. Let’s return once again to our swan expert, who, after a lifetime of expeditions to swan habitats all over the planet, has compared the coloration between many groups of swans. Usually she found differences, sometimes not (or not significantly so), but her methodology was always impeccable: She predicted a color difference, specified the methods, analysis, and number of specimens to be measured in advance, and used appropriate statistics to analyze her results. The perfect example of confirmatory research, for which the p-value is so exquisitely suitable. Over the course of her career, she measured 100 swan-group pairs, and found color differences between 90 of them. (It is rumored that her research suffered from a circularity issue, and that she defined her groups by coloration–but that’s beside the point.) Unfortunately, sometime after her death, all her papers were lost by a freak accident, and the only thing that remained was a list of all color measurements that she ever made. In order to reconstruct the results from the great swan master, a young post-doc takes the raw data, and feeds it into a statistics package. With the click of a button he re-checks which pairs of groups differ significantly in color. The results are shocking: Only 25 pairs! Was the ancient master a cheat after all? A Diederik Stapel avant la lettre?
No! The statistics package has automatically applied a correction for multiple comparisons, which is needed when you do 100 comparisons at once. Otherwise you are likely to obtain false alarms. This means that you have to use a much more stringent criterion than when you just make a single comparison. Consequently the chance of finding a significant color difference is smaller, and the number of ‘hits’ goes down. The point here is that whether you analyze something as an experiment in itself, or as part of a larger dataset, makes a huge difference. Even when it’s the same data!
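To illustrate how the same p-values can yield a different number of ‘hits’, here is a minimal sketch that uses the Bonferroni correction, which simply divides the criterion by the number of comparisons. The story does not say which correction the statistics package applied, so Bonferroni is an assumption, and the p-values are invented for the example.

```python
def count_significant(p_values, alpha=0.05, correct=False):
    """Count p-values below alpha, optionally with a Bonferroni correction."""
    if not p_values:
        return 0
    # Bonferroni: divide the criterion by the number of comparisons made.
    threshold = alpha / len(p_values) if correct else alpha
    return sum(p < threshold for p in p_values)

# Hypothetical p-values; these are not the swan master's results.
p_values = [0.0001, 0.0004, 0.003, 0.012, 0.03, 0.04, 0.2, 0.6]

print("judged one at a time:", count_significant(p_values))
print("judged as one family:", count_significant(p_values, correct=True))
```

Nothing about the measurements changes between the two calls. Only the framing of the analysis does.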
A form of communication
The problems outlined above are well known. They are just my rendition of issues that statisticians can explain in far greater detail, and often with a sense of humor (an excellent read, even for the non-statistically informed, is Wagenmakers, 2007). However, there is far less agreement on what should be done, if anything, to resolve these issues.
The dominant stream of thought appears to be that nothing should be done. The p-value is the way that it is, and we, as scientists, should submit to the rules that it imposes on us. As I pointed out in a previous blog that drew some fairly critical reactions, an example of this is the recent move towards pre-registration in fundamental research. By limiting ourselves to purely confirmatory research, which includes being explicit about your state of mind before you do an experiment, you can largely avoid the pitfalls of the p-value. This is sensible in many ways, and addresses many real problems. (See, for example, Chris Chambers’ recently accepted proposal for a pre-registration format in Cerebral Cortex.) So don’t mistake my argument for a blatant criticism. But at the same time, pre-registration is the ultimate submission to the p-value.
But what are statistics, really? I think they are, at least largely, a form of communication. We need statistics to communicate our findings in summary form, and to understand them better. To do so effectively, statistics should be intuitive, informative, and practical. And the p-value is none of these, or at least not all at the same time. The p-value is not intuitive, because it’s difficult to wrap your head around its definition, and it does not always tell you what you really want to know (such as how likely it is that the null hypothesis is true). The p-value is often not very informative, because it depends on unknowables, notably the researcher’s state of mind. And when we take measures to make the p-value as informative as possible, by having researchers write down their state of mind before an experiment takes place, it often becomes impractical.
I think that there is a point to be made for presenting results in a more descriptive fashion, without referring to predictions or the mind state of the researcher. From what I understand, Bayesian statistics go some way in this direction. This approach allows you to say things like “these results are twice as likely to emerge under model A as under model B”. This type of model comparison strikes me as intuitive, and honest about what it is and what it is not. (Unfortunately, I have a hard time implementing it in practice, but I’m making baby steps.)
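As a very rough sketch of what such a statement could look like in code, the example below compares two fully specified models of the true color difference by their likelihood ratio. This is a deliberate simplification: a full Bayesian analysis would typically average each model’s likelihood over a prior distribution, whereas here each model predicts a single value. The observed difference, the standard error, and both model predictions are invented numbers.

```python
from statistics import NormalDist

def likelihood_ratio(observed_diff, standard_error, diff_under_a, diff_under_b):
    """How much more likely the observed difference is under model A than under model B."""
    likelihood_a = NormalDist(diff_under_a, standard_error).pdf(observed_diff)
    likelihood_b = NormalDist(diff_under_b, standard_error).pdf(observed_diff)
    return likelihood_a / likelihood_b

# Hypothetical numbers: model A says 'no color difference',
# model B says 'a difference of 10 luminance points'.
ratio = likelihood_ratio(observed_diff=3, standard_error=4, diff_under_a=0, diff_under_b=10)
print(f"The data are {ratio:.1f} times as likely under model A as under model B")
```

Unlike the p-value, this ratio is symmetric: a value well below 1 would count as evidence for model B in exactly the same way that a value well above 1 counts as evidence for model A.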
More generally, I think that there are many ways to do science, and different approaches benefit from different types of analyses and statistics. Some may not even benefit from statistics at all. Granted: Many studies, at least within psychology and related fields, are more-or-less confirmatory and benefit from explicit predictions and null hypothesis testing. As Wagenmakers and colleagues put it in their famous treatment of Bem’s studies on psi, “this [explicit prediction] is particularly important if one wants to convince a skeptical audience of a controversial claim: after all, confirmatory studies are much more convincing than exploratory studies.” But one can easily think of studies in which predictions are hard to define. Take, for example, corpus analyses, in which large, pre-existing datasets are analyzed in the hope of finding interesting patterns. In this case, p-values lose much of their meaning, because it’s difficult to clearly distinguish the prediction from the experiment and the analysis: It’s just a big pile of data. Of course, one option is to calculate p-values anyway, and concede that there is no unambiguous interpretation in terms of probabilities. But, arguably, one might be better off with a dry description of the pattern that has been observed in the data. Something like: “Based on color-measurements of hundreds of thousands of swans across the world, it struck us that most swans in Botswana are white (M=94, SD=3), whereas most swans in China are black (M=3, SD=4).” And in rare cases, studies might not benefit from much in the way of statistics at all. For example, if you measure the pupillary light reflex in various patient groups for diagnostic purposes, you are basically just interested in the absolute measurements (although some measure of variation would probably be helpful).
So what’s the bottom line here? I suppose that if there even is a bottom line, it’s that null hypothesis testing is just a tool for communicating and understanding scientific results. The p-value is not perfect or special in any way that I can see. That doesn’t mean that you shouldn’t use it (I use it all the time) or shouldn’t sometimes constrain your research to make the p-value optimally informative (such as by pre-registering your experiments). But it does mean that a certain level of skepticism is warranted. Beware! Here be dragons.
Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779-804.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., & Van Der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426-432.