ArXangel: How to select referees for peer-review

Adam Day
Apr 26, 2020


Edit: ArXangel has been shut down for the time being. Many thanks to everyone who helped out with advice and feedback. I was thrilled to see ArXangel being used regularly by a global audience, but sadly it was never enough to make it sustainable long-term.

In a recent post, I introduced ArXangel, a hobby project of mine that recommends referees for arXiv preprints. It’s a very simple application. All it does is take an arXiv preprint, find similar published papers, and list the authors of those papers as potential referees. This is what might be called a ‘high accuracy’ approach to recommendation, in that it is only supposed to find people with the right expertise and ignores other considerations.
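As a rough illustration, the three steps (take a preprint, find similar published papers, list their authors) could be sketched in Python as follows. This is not ArXangel's actual code: the bag-of-words cosine similarity, the `recommend_referees` function, and the example papers and author names are all assumptions made up for the sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_referees(preprint_abstract, published_papers, top_n=3):
    """List the authors of the top_n most similar published papers."""
    query = Counter(tokenize(preprint_abstract))
    scored = [
        (cosine(query, Counter(tokenize(paper["abstract"]))), paper)
        for paper in published_papers
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    referees = []
    for score, paper in scored[:top_n]:
        if score == 0:
            continue  # skip papers with no overlap at all
        for author in paper["authors"]:
            if author not in referees:
                referees.append(author)
    return referees


# Hypothetical published papers, invented for this example.
papers = [
    {"abstract": "Dark matter halo simulations of galaxy clusters",
     "authors": ["A. Researcher", "B. Scholar"]},
    {"abstract": "Graphene transistor fabrication at low temperature",
     "authors": ["C. Engineer"]},
    {"abstract": "Galaxy cluster mass estimates from weak lensing",
     "authors": ["D. Astronomer"]},
]

suggested = recommend_referees(
    "Simulations of dark matter in galaxy clusters", papers, top_n=2
)
print(suggested)
```

Note that nothing in this kind of ranking looks at anything other than textual similarity, which is exactly the 'high accuracy' limitation discussed in the rest of this post.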

That’s what we want, right? The right expertise?

This might sound like a stupid question to anyone who hasn’t spent a lot of time managing peer-review. Presumably, all you need to do is find the people with the most relevant expertise in that particular field and ask them to review, right? So, if ArXangel is finding people with the right expertise, it should be giving us an ideal list of reviewers, shouldn’t it?

Cracked pottery

Let’s imagine that I am a crackpot and I invent Adam’s Grand Theory (AGT).

A cracked pot. Image from Flickr by user momentcaptured1 reproduced here under the terms of the CC-BY 2.0 license
  • AGT is a work of unreproducible hogwash, but somehow I get it published somewhere.
  • Then let’s say that another crackpot sees this work and decides to write their own paper on AGT.
  • They send this paper to you for review, so you go looking for a specialist in AGT to review it.
  • You find me, because I’m the only other person with a paper on AGT in their history.

Now you have a crackpot reviewing another crackpot’s paper. That’s not good, is it?

The above example might sound contrived, but there are numerous controversial fringe topics in science. Homeopathy? Antivax? Climate change denialism? All of the researchers in those niche fields believe that the theory they are working on is worthy of research (and some might have an incentive to keep the number of publications in their field up). However, it might be that the broader community does not agree that the topic is worthwhile. If we take a high-accuracy approach to referee selection, it’s not hard to get a small group of niche researchers all reviewing each other’s papers without any outside scrutiny.

So it might sound like an odd conclusion to reach, but sometimes you actually want people without highly relevant expertise to review papers, particularly on niche topics. This is why we can’t rely on high accuracy alone as a metric for success in referee recommendation.

One shortcoming of ArXangel’s approach to referee recommendation is that there are limits to how much accuracy we can get due to noise in the data. Interestingly, this potentially mitigates the problem above because niche fields will sometimes get close-but-not-perfect reviewers suggested. So, there’s a chance ArXangel will recommend people who have a valid outsider’s view. It turns out that noise in recommendations can improve their diversity, which is a valuable thing in itself.

Limited accuracy is clearly not a solution to the problem with controversial topics I described above (and if ArXangel addresses it, it does so entirely by accident), but it’s one reason why improving the accuracy of ArXangel might not actually make it better at recommending referees… OK… But what would make it better?


Most of the gripes I hear about using referee recommenders in the wild revolve around the recommender not actually saving time, because the suggested referees tend to decline or ignore invitations. From an editorial point of view, this looks like poor accuracy. However, I suspect that one reason this happens is that the recommender finds referees who haven’t refereed for that journal before and therefore aren’t in the habit of receiving invitations from it, so they decline or ignore them. A solution to this is to give the algorithm feedback data on referees’ responses. It can then improve quickly. If our metric for success becomes ‘getting something peer-reviewed quickly’, then feedback on who accepts refereeing tasks from a journal and who returns reports quickly would allow significant improvement in results.

Unfortunately, while this is a tempting metric in the short term, it’s obviously not a good idea to use it long term. It is foreseeable that this metric would result in a recommender that recommends a small number of referees who tend to write undemanding reports quickly and are happy to report on research outside their expertise. That would maximise success according to that metric, but it’s not a recipe for good peer review.
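To make that failure mode concrete, here is a toy sketch of a ranking that optimises only for quick turnaround. Everything in it is an assumption for illustration: the `rank_by_speed` function, the scoring formula (acceptance rate divided by mean days to report), and the referee records are invented, not taken from any real recommender.

```python
def rank_by_speed(referees, top_n=2):
    """Rank purely on past responsiveness, ignoring topical relevance."""
    def speed_score(referee):
        # A higher acceptance rate and fewer days to report both raise the score.
        return referee["accept_rate"] / referee["mean_days_to_report"]

    ranked = sorted(referees, key=speed_score, reverse=True)
    return [referee["name"] for referee in ranked[:top_n]]


# Hypothetical feedback data; "relevance" is relevance to one given manuscript.
referees = [
    {"name": "Fast Generalist", "accept_rate": 0.9,
     "mean_days_to_report": 5, "relevance": 0.2},
    {"name": "Careful Expert", "accept_rate": 0.4,
     "mean_days_to_report": 30, "relevance": 0.9},
    {"name": "Quick Reviewer", "accept_rate": 0.8,
     "mean_days_to_report": 8, "relevance": 0.3},
]

fastest = rank_by_speed(referees)
print(fastest)
```

Because the score never looks at `relevance`, the same two fast referees would be suggested for every manuscript, and the most relevant expert never makes the list.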

Performance measures

So, long-term, we need a better metric for success than accuracy — or at least something that balances accuracy with other metrics on the quality of peer-review. Consider that some of these metrics might be at odds — high speed peer-review is probably not high-quality peer-review.
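One simple way to picture such a balance is a weighted sum of several metrics. The sketch below is only an illustration under assumptions: the `rank` function, the metric names, the weights, and the candidate data are hypothetical, and a real system would need to decide how each metric is measured and normalised.

```python
def rank(candidates, weights):
    """Order candidates by a weighted sum of their (0 to 1) metric values."""
    def combined(candidate):
        return sum(weight * candidate[metric] for metric, weight in weights.items())

    ranked = sorted(candidates, key=combined, reverse=True)
    return [candidate["name"] for candidate in ranked]


# Hypothetical candidates with made-up metric values.
candidates = [
    {"name": "Careful Expert", "relevance": 0.9, "responsiveness": 0.3},
    {"name": "Fast Generalist", "relevance": 0.3, "responsiveness": 0.9},
]

accuracy_first = rank(candidates, {"relevance": 0.8, "responsiveness": 0.2})
speed_first = rank(candidates, {"relevance": 0.2, "responsiveness": 0.8})
print(accuracy_first)
print(speed_first)
```

The two weightings put different referees first, which shows how the choice of weights (an editorial decision, not a technical one) determines what the recommender treats as success.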

There are a number of well-known performance measures that we can consider in building a recommender.

There are also some issues that are not easily solved with machine learning. For example, should we allow competitors to review each other’s papers, knowing they might be biased? Or should we let people from the same institution review each other’s papers?
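Whatever the answer, rules like these are usually easier to apply as an explicit filter on top of the recommendations than to learn from data. A minimal sketch, assuming each candidate record carries an `institution` field (the function name, the field, and the data are hypothetical):

```python
def exclude_same_institution(candidates, author_institutions):
    """Drop candidates who share an institution with any of the paper's authors."""
    blocked = {name.strip().lower() for name in author_institutions}
    return [
        candidate for candidate in candidates
        if candidate["institution"].strip().lower() not in blocked
    ]


# Hypothetical candidate referees.
candidates = [
    {"name": "A. Researcher", "institution": "University of Somewhere"},
    {"name": "B. Scholar", "institution": "Institute of Elsewhere"},
]

allowed = exclude_same_institution(candidates, ["university of somewhere"])
print([candidate["name"] for candidate in allowed])
```

Whether to apply such a filter at all remains a policy question for the journal. The code only enforces a decision that people have already made.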

It bothers me that we have a very time-consuming, costly, and imperfect system for qualifying scientific research. As time goes on, the volume of research is increasing and it’s becoming harder and harder to ensure quality and timeliness in review. Our use of automation tools in addressing this problem is unavoidable and it’s important that we set out with a clear understanding of what we can realistically achieve and where the pitfalls are.



Adam Day

Creator of Clear Skies, the Papermill Alarm and other tools #python #machinelearning #ai #researchintegrity