I'm writing this on my way back from London, where I attended a workshop on publication bias that was organized by the NC3Rs (the British National Center for the Replacement, Refinement, and Reduction of Animals in Research). Publication bias arises when not all scientific studies are published, and the probability that a study is published depends on its outcome. More specifically, studies that show a 'positive' result (e.g. a treatment effect, or something that supports a researcher's hypothesis) are published more often than studies that show a 'negative' result (e.g. no treatment effect, or something that doesn't support a researcher's hypothesis). Publication bias distorts scientific evidence. In most cases, it makes treatments (drugs, therapies, etc.) seem more effective than they are, simply because we only see studies that show positive treatment effects.
Publication bias is increasingly recognized as a severe problem that affects all areas of science. It's not new. It's just that until recently little was done about it. It was therefore great to see this workshop bring researchers, funders, publishers, and people from industry together with the aim of discussing concrete ways of reducing publication bias. In this post I would like to tell you about some of the things that were discussed.
There were many excellent speakers, but I will first highlight the opening talk by Emily Sena. Her talk was partly based on a meta-analysis in which she investigated publication bias in animal research on stroke treatment. Her work nicely shows how you can answer a seemingly unanswerable question: How many studies were never published, and what did these invisible studies find?
To tackle this question, she used (among other things) a technique called trim-and-fill, which works as follows. First, you assume that an effect (of a stroke treatment in this case) has some unknown true size, which you would like to find out. Next, you assume that the outcome of each individual experiment is an estimate of this true effect plus some noise. Not all experiments are equally noisy. For example, the estimate of a large experiment (with many animals or participants) is likely closer to the truth than the estimate of a small experiment. Therefore, if you look at a large number of experiments, you expect to see that large experiments are tightly clustered around the true effect, and that small experiments are more widely spread, but also centered on the true effect. In other words, the distribution of experiments should be symmetrical with the true effect in the middle. However, what you often see is that small studies overwhelmingly show large positive effects, as if small studies that showed no effect, or a negative effect, have simply disappeared. And they have: Those missing experiments are the scientific dark matter that has never been published¹.
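To see how this disappearance inflates the evidence, here is a toy simulation of the mechanism just described. All specifics (the true effect, the study sizes, and the publication filter) are made up for illustration:

```python
import random

random.seed(1)

TRUE_EFFECT = 0.3   # hypothetical true treatment effect
N_STUDIES = 2000

def run_study(n_animals):
    """Simulate one experiment: the true effect plus noise that
    shrinks as the study gets larger."""
    noise_sd = 1.0 / (n_animals ** 0.5)
    return random.gauss(TRUE_EFFECT, noise_sd)

published, all_results = [], []
for _ in range(N_STUDIES):
    n = random.choice([5, 10, 20, 40, 80])   # study size
    est = run_study(n)
    all_results.append(est)
    # Publication filter (hypothetical): clearly 'positive' results
    # are always published, null/negative results only rarely.
    if est > 0.2 or random.random() < 0.1:
        published.append(est)

mean = lambda xs: sum(xs) / len(xs)
print(f"true effect:         {TRUE_EFFECT}")
print(f"mean of all studies: {mean(all_results):.2f}")
print(f"mean of published:   {mean(published):.2f}")
```

Because null and negative results are mostly filtered out, the mean of the published estimates ends up well above the true effect, while the mean of all studies, published or not, sits right on it.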
Left: The distribution of effect estimates is asymmetrical, as you can see by comparing the number of dots in the red (few dots) and green (many dots) ovals. Right: By adding hypothetical experiments (red dots) until the distribution is symmetrical, you can estimate which experiments are missing. Adapted from Sena et al. (2010).
To quantify publication bias, you can take a graph like the one above, which shows the effect size of experiments (on the X axis) as a function of their precision (on the Y axis; noisy experiments at the bottom). You can see that this graph is not symmetrical: There are too many points at the bottom right, in and around the green oval. You then determine which points need to be added to make the graph symmetrical. Each (red) point that needs to be added corresponds to one experiment that never saw the light of day.
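The trim-and-fill procedure (due to Duval and Tweedie) can be sketched in code. The following is a minimal, simplified version of its core idea, run on made-up study data; the real method comes in several estimator variants with significance tests, all omitted here:

```python
def weighted_mean(studies):
    """Fixed-effect (inverse-variance weighted) mean of (effect, se) pairs."""
    w = [1 / se ** 2 for _, se in studies]
    return sum(wi * y for wi, (y, _) in zip(w, studies)) / sum(w)

def estimate_missing(studies, center):
    """Simplified L0 estimator: rank studies by how far they sit from the
    funnel's center; if the high ranks pile up on the right-hand side,
    some left-hand studies appear to be missing."""
    n = len(studies)
    dev = [y - center for y, _ in studies]
    order = sorted(range(n), key=lambda i: abs(dev[i]))
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    t = sum(rank[i] for i in range(n) if dev[i] > 0)
    return max(0, round((4 * t - n * (n + 1)) / (2 * n - 1)))

def trim_and_fill(studies, max_iter=50):
    """Trim the k most extreme right-side studies, re-estimate the center,
    iterate until k is stable, then 'fill' with mirror-image studies."""
    by_effect = sorted(studies)           # sort by effect size
    k = 0
    for _ in range(max_iter):
        center = weighted_mean(by_effect[: len(by_effect) - k])
        new_k = estimate_missing(by_effect, center)
        if new_k == k:
            break
        k = new_k
    mirrored = [(2 * center - y, se) for y, se in by_effect[len(by_effect) - k:]]
    return k, weighted_mean(by_effect + mirrored)

# Hypothetical funnel: large precise studies near an effect of 1.0, plus
# small noisy studies that only survived publication with large effects.
studies = [(0.98, 0.05), (1.02, 0.05), (1.00, 0.06), (0.97, 0.07),
           (1.03, 0.08), (1.4, 0.30), (1.6, 0.35), (1.8, 0.40), (2.0, 0.45)]
k, corrected = trim_and_fill(studies)
print(f"estimated missing studies: {k}")
print(f"naive mean: {weighted_mean(studies):.3f}, corrected: {corrected:.3f}")
```

The `mirrored` studies are the red dots from the figure: hypothetical experiments added to the left of the funnel until it is symmetrical, after which the pooled effect is re-estimated from the completed set.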
Based on these types of analyses, Sena and colleagues estimated that stroke treatments are about 30% less effective than published studies suggest: a severe distortion of scientific evidence. Worryingly, many researchers (myself included) would say that this 30% number is conservative. Furthermore, Sena and colleagues focused on a field in which there is a fair amount of scientific rigor, at least compared to fields like psychology. In other words, the situation is likely even worse in many fields of research.
Moving on, another interesting talk was given by Tom Walley of the British National Institute for Health Research (NIHR), an organization that funds clinical trials for new treatments. One remark in particular stuck in my mind: Only about 50% of NIHR-funded trials find positive treatment effects, and this is exactly what they want. This goes against how many researchers think, because we want to find an interesting effect 100% of the time. And when we don't find an effect, our experiment is considered to have 'failed'. But this doesn't make sense: If all your experiments find effects, they are not worth doing, because you're testing things that you already knew. No information is gained. On the other hand, if your experiments never find effects, your hypotheses are too improbable, and no information is gained either. So there is an optimum that lies somewhere between never and always finding an effect. According to Tom Walley, this optimum is around 50%.
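This intuition can be made precise with Shannon information: a binary outcome (effect / no effect) carries the most information when both outcomes are equally likely. A quick sketch, deliberately ignoring real-world complications like study cost and effect size:

```python
import math

def bits_per_experiment(p):
    """Shannon entropy of a binary outcome: how much information (in bits)
    one experiment yields if the prior chance of a positive result is p."""
    if p in (0.0, 1.0):
        return 0.0  # outcome already certain: nothing is learned
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 0.9, 1.0):
    print(f"P(positive) = {p:.1f} -> {bits_per_experiment(p):.2f} bits")
```

The information gained peaks at exactly one bit when P(positive) = 0.5 and falls to zero at 0 and 1, where the outcome was never in doubt; on this simplified view, a portfolio of experiments that always 'works' teaches us nothing.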
This highlights a fundamental problem with how researchers think. We want to test revolutionary hypotheses, which, by definition, are unlikely to be correct. But at the same time, we want to find positive results in all of our experiments, and this is only possible if you test boring hypotheses that you know are correct (and your experiment is well conducted and sufficiently powered!). These two desires are clearly incompatible. This is not something that I had given much thought to before, at least not as explicitly. And, I suspect, neither have most of my colleagues.
Finally, I would like to highlight the talk by Glen Begley, from TetraLogic Pharmaceuticals, which was related to this paper. For many years, Glen Begley and his colleagues at Amgen (another pharmaceutical company) tried to reproduce published findings from preclinical (i.e. not on humans) cancer research. They focused on 53 findings, which were considered very important, and had formed the basis for many subsequent studies. Shockingly, they were only able to reproduce 6 findings (11%). Most of his talk consisted of showing figures from high-profile publications, and pointing out the obvious (to experts, not to me) flaws in them.
To sum up, I want to thank the NC3Rs for organizing this workshop, and for inviting me. I felt a bit out of place as an experimental psychologist among this crowd of big-shot funders, publishers, pseudonymous bloggers, and (pre)clinically oriented researchers. But I learned a lot, and—more importantly—it is clear that science is slowly moving in the right direction.
¹ Publication bias is not the only reason for an asymmetrical distribution of effect estimates, and you therefore have to be careful in drawing conclusions. But the general assumption is that an asymmetrical distribution is a tell-tale sign of publication bias. This is a general limitation of this type of data science: Scientific dark matter is invisible, so you can never be completely sure what it looks like.