Let’s imagine for a second that we’re editing a research journal. Researchers submit papers to us; we peer-review them to check for flaws, accept the sound papers, and reject the rest.
There are two kinds of mistake we can make (the short sketch after this list shows how the corresponding error rates would be computed):
- Type I: we reject a paper that should have been accepted.
- Type II: we accept a paper that should have been rejected.
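To keep the two straight, here is a minimal sketch of how each error rate would be computed from a journal’s decisions. All of the counts are hypothetical, and in practice the true quality of a submission is exactly the thing nobody gets to observe directly:

```python
# Hypothetical decision counts for one journal, split by the true quality
# of each submission (which, in reality, nobody observes directly).
good_submitted, good_rejected = 200, 30  # sound papers received / turned away
bad_submitted, bad_accepted = 100, 20    # flawed papers received / let through

# Type I rate: how often we reject papers that should have been accepted.
type_i_rate = good_rejected / good_submitted   # 0.15

# Type II rate: how often we accept papers that should have been rejected.
type_ii_rate = bad_accepted / bad_submitted    # 0.20

print(f"Type I rate:  {type_i_rate:.0%}")   # Type I rate:  15%
print(f"Type II rate: {type_ii_rate:.0%}")  # Type II rate: 20%
```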
It’s important to distinguish between these two kinds of error if we want to evaluate journals properly.
It’s customary to use citation metrics, or citation distributions, to evaluate journals. But citations are noisy, biased, and varied in meaning, and for a very large proportion of articles they are zero. This makes them a somewhat unhelpful proxy for quality.
Citations have also been used to measure the quality of the articles in a journal, and even the quality of the authors who publish in it.
- If that makes no sense to you, then it shouldn’t, because it doesn’t.
Type I errors
Let’s say that we publish a journal and it is very highly cited on average. This means that we have not made too many Type I errors; that is, we didn’t reject all of the highly cited papers that were sent to us. That’s good, because, while highly cited papers can and do get rejected, they are usually easy to identify, and it would take a staggering level of incompetence to reject all of them.
- So, what does a high impact factor tell us about our peer review? It tells us that we are not completely incompetent. Strangely, this phrase does not often appear in journal marketing literature.
Type II errors
What about the Type II error rate, though? How do we know whether a journal tends to accept bad papers? You see, this is the difficult part of managing peer review: accepting good papers is easy; finding flaws is hard.
Unfortunately, citations don’t help much here. Imagine we want to compare two journals. We could count the zero-cited articles in each; that might be better than nothing, but it wouldn’t be a useful measurement because:
- We don’t know which articles were submitted to each journal. Maybe one made twice as many Type II errors as the other, but if it received ten times as many low-quality submissions, it would actually have the better Type II error rate (the worked numbers after this list make this concrete).
- On top of that, citations are biased in a number of ways, so a like-for-like comparison between two journals is impossible.
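Here is that first point as a toy calculation. The journals and all the counts are invented purely for illustration; the point is that the raw number of bad acceptances and the error rate can rank two journals in opposite orders:

```python
def type_ii_error_rate(bad_accepted: int, bad_submitted: int) -> float:
    """Share of low-quality submissions that slip through peer review."""
    return bad_accepted / bad_submitted

# Invented numbers: Journal A accepts twice as many bad papers as Journal B,
# but it also received ten times as many low-quality submissions.
journal_a = type_ii_error_rate(bad_accepted=20, bad_submitted=1000)  # 0.02
journal_b = type_ii_error_rate(bad_accepted=10, bad_submitted=100)   # 0.10

# Counting accepted duds makes A look worse (20 vs 10), yet A's error
# rate is five times lower, so its peer review is actually stricter.
print(f"A: {journal_a:.0%}, B: {journal_b:.0%}")  # A: 2%, B: 10%
```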
Worse still, accepting low-quality (zero-cited) papers doesn’t have a big effect on your impact factor, so the incentive to enforce high standards is low — and this is particularly true for high impact-factor journals.
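To see why the incentive is weak, here is a back-of-the-envelope calculation with invented citation counts, assuming (as is typical) that a small number of very highly cited papers carry the average:

```python
# Invented citation counts: a journal whose average is carried by a few hits.
citations = [200] * 10 + [5] * 90          # 10 hit papers, 90 ordinary papers
impact = sum(citations) / len(citations)   # 2450 / 100 = 24.5

# Now suppose the journal also accepts 10 papers that are never cited.
citations += [0] * 10
diluted = sum(citations) / len(citations)  # 2450 / 110 = ~22.3

# Publishing 10% more papers with zero citations costs less than 10% of the
# average, and the journal still looks 'high impact'.
print(round(impact, 1), round(diluted, 1))  # 24.5 22.3
```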
- So, what do citations tell us about Type II error rates in peer review? Nothing. Nothing at all. In fact, there is no accepted metric for how good a journal’s peer review is. Isn’t that odd? Peer review is the de facto qualifier for all of science, and we have no way to measure its failure rate.
Assessing assessment
But imagine if you could measure this. It would actually be a useful metric for evaluating articles and authors. Given a journal with a low rate of Type II errors, a reference reading ‘your name in journal name’ would be an indicator of the quality of that work, or at least a lower bound on it. Conveniently, you wouldn’t have to wait years for the metric to accumulate (as you do with citation counts).
For decades now, we have relied on an opaque method of research assessment. Peer review is carried out in secret, and there has been little in the way of data, research, or innovation to understand and improve the system. That’s odd, because there’s obvious value for everyone in doing so, and the system is a costly one to run.
Some journals are now publishing referee reports, in some cases even for articles that receive negative reviews, which could allow some measurement of those error rates. And some organisations are gathering data on peer review. This is encouraging if we want to evaluate the review process properly.
Change is hard. Authors are often put off submitting to these journals by the risk of a negative review, or by the low impact factors of new entrants with new ideas.
This is a pity, because I think there’s a lot more we can do to improve.