Sharing science: Searchability, accessibility, and the future of academic publishing // Cogsci

The first journal purely dedicated to science was the Paris-based Journal des sçavans, founded in January 1665. The London-based Philosophical Transactions of the Royal Society was founded two months later. The Journal des sçavans appears to have died a fairly quiet death in 2007. But the Philosophical Transactions is still around, is still prestigious, and still publishes papers in pretty much the same way as it has done for centuries. I published a paper there myself not too long ago.

When these journals were founded, they provided an excellent platform for scientists to share their work. Science was mostly a regional affair, and very few people were wealthy enough to engage in the leisure of science. And, of course, there was no internet. In this small analog world, papers were the best form of communication.

Things change, though.

Some time ago I wrote about a dataset that I had downloaded from PubMed, which is a more-or-less comprehensive database of scientific articles. I downloaded information about only 38 journals, but even from this small selection it is clear that papers are being published at an exponentially increasing rate, and that this trend has been going on for some time.

Across these 38 journals, 14,544 articles were published in 2010. Not up until 2010, mind you: in 2010 alone. It is estimated that there are approximately 25,000 journals. Of course, not all journals publish the same number of articles. But still …

So what does this mean?

Firstly, it means that a journal is not really a journal anymore, and a paper is no longer really a paper. We are far removed from the times when eminent scientists published lengthy monographs in printed journals. For all practical purposes, a journal is a website and a paper is a PDF file. Therefore, the restrictions of print no longer apply to scientific output. And with these restrictions gone, scientists could, in principle, share their work in whatever format they consider most suitable. As a video, for example. Or as an interactive website. Or as a piece of software.

Secondly, the enormous number of publications that appear each year has made it extraordinarily difficult to find the information that you are looking for. But, you might ask, surely Google will tell you which of these millions of papers is most relevant for your research on, say, the visuospatial abilities of the naked mole rat? And yes, of course, Google excels in searching through lots of information and returning some relevant selection. But the problem is that a search engine does not provide an exhaustive overview of matching results. This doesn’t matter when you order a pizza: You just want to find one nearby pizzeria, and there is no point in knowing all of them. But this does matter when you write a scientific paper: You don’t want a handful of arbitrary search results, but a list of everything that was ever written on a certain topic and under certain conditions. Therefore, even in the era of Google, searchability poses a serious challenge to science.

Thirdly, academic publishing has become a financial burden. Publishers charge high subscription fees, even though the real cost of maintaining a (mostly digital) journal is presumably very low. And it’s not just the Journal des sçavans and the Philosophical Transactions anymore: we’re talking about thousands of journals. Consequently, total subscription costs for university libraries are counted in millions of euros, pounds, or dollars. To get away from these subscription costs, many scientists, including myself, support an open-access model of academic publishing. This means that scientists pay a fee for publishing their paper, but they retain copyright and, importantly, their work is made freely accessible to anyone. Open-access publishing fees can be very high, over €/£/$1000 per paper. Too high, probably. But still, the open-access model is in many ways fairer than the subscription model. And even with exorbitant publishing fees the overall cost is probably lower than what is currently being spent on subscription fees.

So some progress is already being made in the form of a slow shift away from subscription journals towards open-access journals. This makes science accessible to the public and potentially reduces overall cost. But the problem of searchability remains: Millions of open-access papers are still very difficult to search through. And what is more, open access may drive the number of publications up even further, because publishers are paid for each article and thus have a strong incentive to publish all submissions. Consequently, rejection rates for many open-access journals, including reputable ones, are very low. For example, PLoS ONE rejects only about 30% of all submissions that it receives.

But what’s the real problem here? Is it the number of papers that are published? Or is it how they are published: as separate papers across many journals, all of which archive their content in a non-standardized, usually non-open, and always hard-to-search-through way? Is this 17th-century model of academic publishing still the way to go?

I think it is not. And I think that the number of publications is not the main issue here: It’s not necessarily a bad thing to have a lot of scientific output, as long as it is stored in a systematic way that allows for exhaustive search. Ideally, there would be a system that is able to provide a list of all research that was ever done on the visuospatial abilities of the naked mole rat. And I don’t mean a disorganized Google results page. That’s great for finding a nearby pizzeria, but horrible for a serious literature review. Of course, ideals are rarely attainable. But such a system should come close. And given modern technologies, there is no reason why it could not.

So what would, or could, such a system look like? And what would we need to build it?

The first thing that we need is a simple and standardized way to refer to things. Which we have! The Digital Object Identifier (DOI) is a suitable candidate with the benefit of being widely adopted already. Each piece of scientific output should have a DOI: each dataset, each experimental script, each figure, each manuscript section, etc. There’s nothing new about this idea. Websites such as FigShare already provide DOIs for all content. And the journal eLife provides separate DOIs for each figure in a manuscript.

Furthermore, things such as datasets, analyses, and software should all be counted as valid scientific output. This doesn’t mean that scientists shouldn’t write papers anymore. It just means that papers shouldn’t be the only acceptable way to share science. And papers shouldn’t be perceived as single, indivisible entities, but as collections: A paper is a collection of figures, datasets, and manuscript sections, each of which should be citeable on its own and, in some cases, could have been published on its own. For example, you may want to share a minor reanalysis of an old dataset. This may be worth sharing, even though it might not warrant a full paper. In the current paper-centric system, such minor forms of scientific output would either be lost, or would have to be artificially blown up to meet the publication threshold.

So what place is there for peer review in a system like this? Should each little piece of scientific output be peer reviewed before it becomes citeable? I would say not. Peer review provides a decent way to check the quality of scientific work. But it’s not a gold standard, and not everything has to be peer reviewed all the time. It’s just too time-consuming for authors as well as reviewers, so peer review has to be applied with moderation. I think it would be enough to clearly indicate whether something has been peer reviewed or not, so that the reader can take this into account. And there should be a place for post-publication peer review as well: Something that was initially posted without peer review could be commented on and thus lose or gain credibility. And what are those comments? They are citeable objects, of course! Just like everything else in the scientific database.
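To make this a bit more concrete, here is a minimal sketch in Python of how such citable objects could be represented. Everything in it is an assumption made for illustration: the class name `ScientificObject`, its fields, and the DOIs (which use a made-up `10.9999` prefix) are hypothetical, and a real system would surely need more than this.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScientificObject:
    """One citable piece of scientific output (hypothetical model)."""
    doi: str                      # every object gets its own DOI
    kind: str                     # e.g. "paper", "figure", "dataset", "comment"
    peer_reviewed: bool = False   # clearly flagged, so the reader can take it into account
    parts: List["ScientificObject"] = field(default_factory=list)

    def walk(self):
        """Yield this object and, recursively, everything it is made of."""
        yield self
        for part in self.parts:
            yield from part.walk()


# A paper is just a collection of smaller objects, each citable on its own.
# All DOIs below are made up.
paper = ScientificObject(
    doi="10.9999/paper",
    kind="paper",
    peer_reviewed=True,
    parts=[
        ScientificObject("10.9999/paper.fig1", "figure", peer_reviewed=True),
        ScientificObject("10.9999/paper.data", "dataset"),
    ],
)

all_dois = [obj.doi for obj in paper.walk()]
not_reviewed = [obj.doi for obj in paper.walk() if not obj.peer_reviewed]
```

In this sketch the dataset was shared without peer review, and the flag simply says so. A post-publication comment would be one more `ScientificObject` of kind "comment", with a DOI of its own.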

So far, what I have described is a system that allows scientists to share their work in a more diverse way than is currently done. But we are still left with the thorny issue of searchability. Of course, devising a way to make all scientific output properly and systematically searchable is really difficult. But I would like to make one simple suggestion that would take us a few steps in that direction: Citations should be qualitative. Imagine, for example, that we are skeptical about a particular study, say one of Rolf Zwaan’s tongue-in-cheek social priming studies. And we want to find out whether this study is replicable or not. (Let’s disregard publication bias for now.) What we could do is get a list of (almost) all papers that cite this particular study, for example using Google Scholar. This list can be very long if the paper is a big deal. But most citations are irrelevant, because they are just passing mentions in an introduction or a discussion. Therefore, you will have to resort to hard manual labor: You will have to work your way through each individual paper to find out whether it constitutes a replication attempt or not.

A possible solution would be to add a description to each citation. Basically, a citation would be a tuple of the cited object (a DOI) and a keyword that describes the nature of the citation. Keywords can be things like: ‘comment on’ (for comments), ‘using’ (for software and algorithms), ‘successful replication of’ (for studies), ‘reanalysis of’ (for data), ‘consistent with’ (for studies), ‘relevant reading’ (for studies), etc. These are just some arbitrary examples, and there are probably better keywords. Ideally, there would be a ‘soft standard’ for commonly used keywords. For non-standard keywords we could rely on the same fuzzy-natural-language techniques that allow Google to know that ‘free software’ is similar to ‘shareware program’. Eventually, it should be possible to systematically search for scientific works that relate in a specific way to the study that you are interested in.
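As a rough illustration, a keyword-annotated citation and a search over such citations could look like the Python sketch below. The names `Citation` and `find_citing` are hypothetical, the DOIs are made up, and the keyword ‘failed replication of’ is my own addition to the examples above. I also store the DOI of the citing object explicitly, which the tuple described above leaves implicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """A citation with a keyword describing its nature (hypothetical model)."""
    citing: str   # DOI of the object that makes the citation
    cited: str    # DOI of the object being cited
    keyword: str  # e.g. "comment on", "using", "successful replication of"


def find_citing(citations, cited, keywords):
    """Return the DOIs of all objects citing `cited` with one of `keywords`."""
    wanted = set(keywords)
    return sorted(
        c.citing for c in citations
        if c.cited == cited and c.keyword in wanted
    )


# Made-up citation records around one (made-up) priming study.
CITATIONS = [
    Citation("10.9999/paper-a", "10.9999/priming", "relevant reading"),
    Citation("10.9999/paper-b", "10.9999/priming", "successful replication of"),
    Citation("10.9999/paper-c", "10.9999/priming", "failed replication of"),
    Citation("10.9999/comment-d", "10.9999/priming", "comment on"),
    Citation("10.9999/paper-e", "10.9999/other", "successful replication of"),
]

replication_attempts = find_citing(
    CITATIONS,
    "10.9999/priming",
    ["successful replication of", "failed replication of"],
)
```

Here the passing mention in paper-a and the comment are filtered out without anyone having to read them. This sketch only does exact matching on keywords; the fuzzy matching of non-standard keywords is left out.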

But we’re still missing a crucial ingredient: incentive. There is currently no incentive for scientists to share their work in the piecemeal but systematic fashion that I’m arguing for. If anything, there is a disincentive, because your colleagues may think that you are doing so only because you have failed to publish your work in a ‘real’ journal. The proper incentive is obvious. To judge the quality of someone’s research, you look at how central their work is in the scientific network. You can think of each piece of scientific output as a node, and the entire database of scientific output as a graph. Therefore, you can use techniques from graph theory, for example, to determine a centrality index for a given scientific work. We are scientists, aren’t we? Surely we can do better than h-indices or impact factors as measures of scientific impact? (See also this commentary by Björn Brembs and colleagues.)
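To give an idea of what such a centrality index might look like, here is a small Python sketch that computes PageRank over a citation graph. PageRank is just one possible choice that I picked for illustration; the function name, the damping factor of 0.85, the number of iterations, and the tiny example graph are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Compute PageRank scores for a citation graph.

    `edges` is a collection of (citing, cited) pairs.
    Returns a dict that maps each node to a score; scores sum to 1.
    """
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by objects that cite nothing is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Every object passes its rank on, in equal shares, to what it cites
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Made-up graph: two papers use the same dataset, and a comment cites one paper.
EDGES = [
    ("paper-b", "dataset-a"),
    ("paper-c", "dataset-a"),
    ("comment-d", "paper-b"),
]
scores = pagerank(EDGES)
most_central = max(scores, key=scores.get)
```

In this toy graph the dataset comes out as the most central object, even though it is not a paper. That is the kind of credit that the current paper-centric system does not give.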

Depending on how you feel about these things, all of this may sound like old news or like a call for revolution. In fact it’s neither. A revolution implies a sudden shift from one system to another, whereas I think that academic publishing should be reformed in a way that is backwards compatible. In the future, we can and should get away from journals and papers, at least as the sole format for scientific output. But this can be done gradually, without making second-class citizens out of older studies and scientists who prefer not to change their workflow. In the future, for the sake of searchability, citations should be accompanied by descriptive keywords. But citations without keywords are still valid citations. In the future, scientists should make their work publicly available and stop signing over copyright to publishers. But the fact of the matter is that a lot of high-quality research from the past is hidden behind paywalls. More generally, in the future, scientific output should be archived openly and systematically. But we’ll have to deal with the legacy of the past for many years to come.

So will the future pan out in the way that I’ve outlined here? Certainly not exactly, but I’m confident that a shift in this general direction will happen. The signs are there. For example, as I mentioned already, the newly launched journal eLife attaches a DOI to each figure, thus recognizing that a figure is a valid scientific object on its own. PeerJ reduces the cost of publishing open access, thus making the transition away from subscription journals easier. The Journal of Open Psychology Data recognizes that data is valuable scientific output, although it hasn’t (openly) recognized that the term ‘journal’ is a bit odd in this context. And although I’m not aware of any direct efforts to improve searchability in the way that I’ve described (or otherwise), we’ll see what the future brings …