Based on (critical) responses I received, and discussions I had after publishing this post, I have added some footnotes to elaborate on certain aspects. The main criticisms are that pre-registration is not necessarily as rigid as I depict it to be (which may be true), and that questioning statistical guidelines is dangerous (which is certainly true, but also a moralistic fallacy: something can be correct and dangerous to say at the same time). Also, see my (sort of) follow-up post The Black Swan and NeuroSkeptic’s response.
In response to the many recent cases of scientific fraud, a debate has ignited about how science can be made more transparent, and how some of the public trust can be regained. Suggestions include …
- making all research data publicly available, not just the summarized results.
- making all scientific papers publicly available (i.e. open access).
- investing more time in replicating results, those of others as well as your own (e.g., the reproducibility project).
- and pre-registering all studies.
A slightly mysterious, but influential voice in this debate is Neuroskeptic. In a recent post, Neuroskeptic interviews Jona Sassenhagen, a neurolinguist from the University of Marburg, who decided to pre-register his EEG study. So what does it mean to pre-register a study, and why would anyone do this?
The idea behind pre-registration is simple: Before you conduct your experiment, you publicly list exactly what kind of experiment you are going to conduct, how many participants you will test, and what the predicted outcome is. Once you have done this, you have very few degrees of freedom to tweak your results afterwards. For example, if your results are not statistically significant, you cannot keep running participants until you obtain the desired result (i.e. optional stopping), because you have specified the number of participants in advance. Similarly, if you obtain results that don’t match your hypothesis, you cannot confabulate a post-hoc hypothesis that matches the outcome of your study.
Pre-registration addresses a very real problem, particularly when it comes to research funded by the pharmaceutical industry. Pharmaceutical companies are not, by and large, hindered by much in the way of ethics, and will happily tweak their results until they ‘clearly show’ the efficacy of their product. Pre-registration makes it considerably more difficult for these companies to engage in this serious form of bad science.
In my opinion, no sensible person could be opposed to compulsory pre-registration for clinical trials. However, Neuroskeptic goes one step further by arguing that pre-registration should also be made mandatory for fundamental research, and celebrates Sassenhagen’s EEG study as a guiding beacon. This set me thinking, because I’m not convinced that this is necessarily a good idea. Or let me rephrase that: I’m not convinced that pre-registration for fundamental research is a good idea if it imposes the same restrictions that it imposes on clinical trials.
The argument in favor of pre-registration for fundamental research is clear: Just like pharmaceutical companies, fundamental researchers routinely tweak their results in any number of scientifically dubious ways. Make no mistake, almost all researchers engage in some form of (usually mild) scientific malpractice at one point or another. Human nature: There is as little point in denying it as there is in being sanctimonious about it. As a result, a large number of ‘untrue’ findings presumably end up in the literature. Pre-registration would combat this problem very effectively. The question, however, is whether the downsides outweigh this gain.
More specifically, pre-registration relies strongly on the notion that science is always confirmatory: You make a prediction and you test it. This is often how it goes, but not always. Sometimes (not often enough!) you just stumble across something cool. Another, more subtle point is that pre-registration implies that science should be made to fit statistics, rather than the other way around: That we should only conduct the type of experiments that can be properly analyzed using standard statistical techniques (i.e. null-hypothesis testing), rather than collecting evidence in whatever way is most efficient and then choosing a statistical technique that can deal with it.
This may sound esoteric, so let me give a concrete example. My colleagues and I recently conducted an experiment in which we recorded eye movements of participants while they viewed photos of natural scenes. On half of the trials we manipulated the scene based on where participants were looking. The other half of the trials served as a control condition, in which nothing special happened. I won’t bother you with the details of our manipulation, not because I want to be secretive, but because it turned out not to have the predicted effect. According to the rules of pre-registration, this means that our study was worthless: We made a prediction, it didn’t come out, and any attempt to use this dataset for another purpose borders on scientific fraud. However, we stumbled across an unexpected, but interesting and statistically highly reliable phenomenon in the control trials. So what now? Are we not allowed to look at this effect, because we did not predict it in advance? Should we run a new study, in which we predict what we have already found, and use only the data from the new experiment[1]?
Your intuition, no doubt, screams ‘no’, or at least mine does. However, the logic behind pre-registration says ‘yes’. The essential conflict here is that pre-registration discourages exploratory research[2][3], and assumes that a finding is not a real finding unless it was predicted – a questionable assumption at best. True, a finding is much more convincing when it was predicted, but even when a prediction is lacking the evidence can still be overwhelming. If pre-registration is to be made compulsory for fundamental research, this basic fact should be accommodated.
Let’s consider another situation: You predict an effect, test 20 participants, and obtain a p-value of .15. This means that the chance of obtaining this result (or one that is more extreme), if there were really no effect, is 15%. This is promising, but certainly not enough to base any conclusion on (the commonly accepted threshold is 5%, or p < .05). So what should you do now? Your first intuition is to test another batch of participants, because a larger sample size increases statistical power and thus, hopefully, results in a lower p-value. But the problem is that the meaning of the p-value assumes that you don’t do this: You should only look at your data once. The reason for this is quite subtle, but basically it comes down to this: Every time that you calculate a p-value there is a chance of getting a false positive: The p-value could fall below .05 just by chance, even if there is really no effect. And if you calculate the p-value multiple times, after every participant say, the chance of obtaining a false positive at some point increases accordingly. If you check often enough, you are almost guaranteed to obtain a ‘significant’ result. (The cornerstone of research on parapsychology.)
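How quickly these false positives pile up is easy to demonstrate with a small simulation. The sketch below is my own (not from the original post); for simplicity it uses a z-test with known variance rather than a t-test, and the sample sizes, number of simulations, and seed are arbitrary choices:

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a z-statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_max, peek_every, n_sims=2000, seed=1):
    """Fraction of null simulations that hit p < .05 at *any* look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0, 1)  # data generated under the null
            if n % peek_every == 0:   # a 'look' at the data so far
                z = total / math.sqrt(n)
                if two_sided_p(z) < .05:
                    hits += 1
                    break
    return hits / n_sims

# A single, pre-planned look at n = 100 keeps the error rate near 5% ...
print(false_positive_rate(n_max=100, peek_every=100))
# ... but peeking every 10 participants inflates it well beyond 5%.
print(false_positive_rate(n_max=100, peek_every=10))
```

Even though every single test uses the conventional .05 threshold, checking repeatedly gives chance many more opportunities to produce an apparently significant result.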
For this reason, the rules of pre-registration are clear: You cannot recruit a different number of participants from what you initially intended, because p-values become meaningless when you do. If your p-value happens to be ‘promising’, but insignificant (a value of .11, say), tough luck: Your data are worthless[4].
But does this make sense? Take a step back and think about it. Should you throw away a dataset, because it is by itself insufficiently convincing? Let’s say that you conduct one experiment and obtain a p-value of .10. Being a good boy or girl, you throw away the data and conduct another experiment (although even that is statistically questionable). Again, you get a p-value of .10. What now? Should you conclude: I failed to obtain a significant result twice, so the effect probably does not exist? Of course not! The chance of obtaining a p-value below .10 in two experiments, given that there is really no effect, is only 1% (.10 * .10 = .01)[5]! So two experiments that by themselves provide weak evidence can together provide very strong evidence. Evidence accumulates[6].
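The arithmetic here follows from the fact that, under the null hypothesis, p-values are uniformly distributed between 0 and 1, so each experiment independently has a .10 chance of landing below .10. A toy Monte Carlo check (my own sketch, not from the post):

```python
import random

# Under the null hypothesis, a p-value is uniformly distributed on [0, 1],
# so P(p < .10) = .10 for each experiment, and for two independent
# experiments P(both < .10) = .10 * .10 = .01.
rng = random.Random(42)
n_sims = 100_000
both = sum(1 for _ in range(n_sims)
           if rng.random() < .10 and rng.random() < .10)
print(both / n_sims)  # close to .01
```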
Phrased differently, pre-registration (and null-hypothesis testing in general) forces scientists to use what you might call a one-shot approach: You get only one chance of obtaining a significant result. You are not allowed to accumulate evidence, for example by testing more participants, until you are convinced that there is an effect (or not) or until you give up. Even though, as proponents of Bayesian statistics point out, there is nothing inherently wrong with accumulating evidence over time: It is a problem that stems mostly from the fact that null-hypothesis testing is not equipped to deal with this.
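To illustrate what accumulating evidence can look like in a Bayesian framework, here is a minimal sketch of my own (not from the post): a Bayes factor for a coin-flip experiment, comparing a fair-coin null hypothesis against an alternative with a uniform prior on the bias. Unlike a p-value, it can be recomputed after every observation, although whether Bayes factors fully license optional stopping is itself a matter of debate:

```python
from math import comb

def bayes_factor_10(heads, n):
    """Bayes factor favoring H1 (unknown bias, uniform prior) over
    H0 (fair coin) after observing `heads` heads in `n` flips.

    P(data | H0) = C(n, heads) * 0.5**n
    P(data | H1) = integral of C(n, heads) * t**heads * (1 - t)**(n - heads)
                 = 1 / (n + 1)
    """
    return (1 / (n + 1)) / (comb(n, heads) * 0.5 ** n)

# Hypothetical data; the Bayes factor can be inspected after every flip.
flips = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
heads = 0
for n, flip in enumerate(flips, start=1):
    heads += flip
    print(n, round(bayes_factor_10(heads, n), 2))
```

After these twelve flips (ten heads) the Bayes factor is close to 5, meaning the data are roughly five times more likely under the biased-coin hypothesis; with evenly balanced data it drops below 1, favoring the fair coin.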
So what’s the upshot of this? Should we pre-register fundamental studies or not? I think pre-registering could be beneficial, because it enforces transparency on a crowd that could certainly use some, but it should not impose the same restrictions that are applied to clinical trials, where the stakes are so high that extreme rigor is warranted[7]. Science is often exploratory, and ‘exploration’ is not a dirty word. Not as long as you don’t pretend to have predicted something that you did not. Furthermore, it’s silly to throw away perfectly good evidence just because it is by itself insufficiently strong. When in doubt, testing additional participants is a sensible thing to do, as long as you don’t pretend to have fixed your number of participants in advance.
But there is a downside to this as well: If we loosen the rules, it becomes more difficult to evaluate how strong the evidence in a particular study is. We suddenly have to think about what evidence actually means, and how it is affected by (a lack of) predictions. This is really complicated stuff, but statistical experts argue that the tools are there, even though no-one uses them. And, after all, should we conduct experiments that match statistics, or should we choose statistics that fit our experiments?
[1] Phrasing this as analyzing the control trials of a failed experiment probably invokes strong images of me taking out my fishing gear and hunting for p-values. However, you could equally well say that this is a corpus-based analysis, because the control trials for the failed experiment are simply a corpus of free-viewing eye-movement data. Corpus-based analyses are very common in psycholinguistics and eye-movement research, and even increasingly in EEG research, where large numbers of participants are tested with the goal of creating a large dataset that can be analyzed in many (unspecified) ways afterwards. This is an exploratory approach that would be difficult to fit in the pre-registration protocol described below.
[2] Thanks to Hans Ijzerman for pointing out a pre-registration protocol that is part of a Frontiers in Cognition research topic. The protocol states that exploratory analyses are admissible but must be clearly justified in the text, caveated, and reported in a separate section of the Results titled “Exploratory Analyses”. Editors must ensure that authors do not base their conclusions entirely on the outcome of significant Exploratory Analyses. So when it comes to exploration, this particular protocol (others may differ) offers some room, and rightly emphasizes the need to clearly label exploratory analyses as such. But I think it’s nevertheless fair to say that exploration is discouraged. This is not a criticism of the protocol: It strikes me that for the studies that this Research Topic calls for, there is relatively little room for sensible exploratory analyses. And with this example salient in your mind, it might seem preposterous to suggest that a similar protocol might not work in all cases. But see the footnote above.
[3] Another pre-registration protocol, pointed out by Jona Sassenhagen, has been proposed by Chris Chambers as part of a special article format for the journal Cortex. It’s in many ways similar to the protocol described above, in the sense that authors won’t be able to base the conclusions of their study on the outcome of unplanned analyses. However, Chambers makes a compelling point by saying that serendipitous findings are, by their nature, rare. A far greater problem is the proliferation of false positives due to excessive post-hoc flexibility in analysis approaches. So let’s deal with the big problem first. In other words, he argues that although you might (as I wrote) occasionally stumble across something cool, this is simply too rare an occasion to be a strong argument for anything.
[4] Some (sensible) pre-registration protocols will promise to publish studies regardless of outcome, in which case it’s not true that a study is worthless when p = .051. However, the fact remains that it’s a one-shot approach, which makes it difficult to deal with ambiguous results (even when they are published by prior agreement).
[5] Although it’s true that the chance of obtaining p < .10 in two consecutive studies is 1%, you cannot multiply p-values across experiments to obtain a ‘grand’ p-value. That is, the chance that the multiplied p-value of two experiments falls below .10 (note the subtle difference with the statement above) is 33% (script). The reason is essentially that p-values fluctuate between 0 and 1, so multiplying two p-values by definition results in a smaller value than either one. As Rogier Kievit points out in the comment section, going from p-values to the amount of evidence for an effect is not trivial. Also see this post about the Fisher method for combining p-values.
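For illustration, the Fisher method mentioned in this footnote can be sketched in a few lines (my own sketch, not the linked script; it relies on the closed-form survival function of the chi-square distribution for even degrees of freedom):

```python
import math

def fisher_combined_p(p_values):
    """Combine k independent p-values with Fisher's method.

    Under the null, -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom; for even degrees of freedom its survival
    function has a closed form, used below.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # chi-square statistic / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(k))

print(round(fisher_combined_p([.10, .10]), 3))  # 0.056: combined p, not .01
print(round(fisher_combined_p([.10, 1.0]), 3))  # 0.33: P(product < .10)
```

Two experiments with p = .10 combine to p ≈ .056, not .01, and the second call reproduces the 33% figure mentioned above: under the null, the product of two p-values falls below .10 about a third of the time.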
[6] Just to be perfectly clear: The fact that evidence accumulates in principle does not excuse optional stopping in combination with null-hypothesis testing, so practically speaking this point is of questionable usefulness. It does, however, suggest that there is something odd, Schrödinger’s-cat-like about this form of analysis, and at least some proponents of Bayesian statistics argue that Bayes factors do not suffer from this problem in the same way (see e.g. Dienes, 2011 for a readable introduction).
[7] It might sound cynical to argue that clinical trials are more important than fundamental research, and should therefore be controlled more strictly, although I do think it’s true (nobody died because of Diederik Stapel’s fraud!). However, another point is that the nature of clinical trials is generally such that exploration makes little sense: An intervention either works or it doesn’t. A pre-registration protocol as described in footnote 2 therefore fits this type of research well.