Papermills in peer-review

Adam Day
Sep 12, 2022

Dissecting a joke is a bit like dissecting a frog. No one needs to know what’s inside and the frog dies. — Jimmy Carr

Last week, a paper I wrote on the subject of peer-review fraud was published in the journal Scientometrics (free link here, preprint here).

It was an interesting project to work on. I found a lot of examples where one referee would write a report during peer-review and then another referee would write an identical report in some other peer-review of some other paper. (I wasn’t the first person to observe this behaviour, but I think this was the first method to detect it automatically.)
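To give a flavour of how such matching might be automated, here is a minimal sketch. It is not necessarily the method from the paper: it assumes exact matches after light normalisation, treats each report as a single block of text, and uses a hypothetical record layout of (referee_id, manuscript_id, text).

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def normalise(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a copy."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def find_duplicate_pairs(reports):
    """Return pairs of distinct referees who submitted identical reports.

    `reports` is a list of (referee_id, manuscript_id, text) tuples.
    This record layout is hypothetical, chosen for illustration only.
    """
    by_hash = defaultdict(set)
    for referee_id, _manuscript_id, text in reports:
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        by_hash[digest].add(referee_id)

    pairs = set()
    for referees in by_hash.values():
        # Only text shared by two or more different accounts produces a pair.
        for a, b in combinations(sorted(referees), 2):
            pairs.add((a, b))
    return pairs


# Made-up example data.
reports = [
    ("ref_A", "ms_1", "The paper is well written.  Please add more references."),
    ("ref_B", "ms_2", "the paper is well written. please add more references."),
    ("ref_C", "ms_3", "The methodology needs a clearer description."),
]
duplicate_pairs = find_duplicate_pairs(reports)
```

Hashing keeps the comparison cheap across a large set of reports, at the cost of missing anything that has been paraphrased.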

If you look at all the data, you start to see networks of these individuals copying peer-review comments. At least one of these clusters is definitely the work of a papermill (see if you can guess which one!).

Each triangular node represents a referee account. Each red link shows that two referees have, at some point, written identical comments in peer-review.
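For anyone who wants to reproduce this kind of picture from their own data, here is a rough sketch of how links like these could be grouped into clusters. It assumes the links arrive as pairs of referee IDs (a hypothetical format) and uses a plain breadth-first search to find connected components. The paper may do this differently.

```python
from collections import defaultdict, deque


def find_clusters(links):
    """Group referee accounts into connected clusters, largest first.

    `links` is an iterable of (referee_a, referee_b) pairs, one per
    duplicated comment. The input format is an assumption.
    """
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)


# Made-up example links.
links = [("ref_A", "ref_B"), ("ref_B", "ref_C"), ("ref_D", "ref_E")]
found = find_clusters(links)
```

Here ref_A and ref_C land in the same cluster even though they never copied each other directly, which is what makes the clusters larger than any single pair of matching reports.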

I think of this as ‘tip-of-the-iceberg’ analysis because only a subset of paper-mill accounts and papers will ever show up when we use this method.

Be that as it may, there is a huge bonus here. Say we find one referee account duplicating a single comment from that big papermill cluster. In that case, we can investigate that referee’s publication history, the other papers they’ve reviewed, the authors who have recommended them to review, their co-authors and so on. So a single result can be the first domino.

With a bit of coaxing, we get a vast number of user accounts to investigate. This is the same diagram as above once we add some more search methods and connect author and co-author accounts. Authors are green nodes, co-authors are grey. There’s some noise here, so a little bit of simple graph analysis lets us prioritise the accounts with the most connections. Details in the paper.
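As an illustration of what "simple graph analysis" could look like, the sketch below ranks accounts by how many distinct accounts they connect to. Counting connections this way is my assumption about the simplest reasonable approach, and the account IDs are made up. The paper has the actual details.

```python
from collections import Counter


def rank_by_connections(links):
    """Rank accounts by their number of distinct neighbours, highest first.

    `links` is an iterable of (account_a, account_b) pairs. Repeated or
    reversed pairs are counted once.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    degree = Counter({account: len(others) for account, others in neighbours.items()})
    return degree.most_common()


# Made-up example links, including one repeated pair in reverse order.
links = [
    ("acct_1", "acct_2"),
    ("acct_1", "acct_3"),
    ("acct_1", "acct_4"),
    ("acct_2", "acct_3"),
    ("acct_2", "acct_1"),
]
ranked = rank_by_connections(links)
```

Accounts with only one connection are the most likely to be noise, so starting an investigation at the top of this list is a cheap way to spend effort where it counts.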

When to share

I may have laboured this point excessively in past blog posts, but one thing that bothered me a lot in working on this paper was whether to share the results of the work at all. Simply put: sharing any misconduct investigation details publicly has the potential to educate organised research fraudsters on how to circumvent detection. It’s therefore possible that sharing the method undermines it. Methods are like frogs.

Publishers often advertise their plagiarism detection tools on their submission pages. I’m sure that this is an effective deterrent some of the time, but I wonder if it is also partly responsible for the rise of paraphrasing in plagiarism (because plagiarists learned that plagiarism detectors can’t detect paraphrasing terribly well).

On the other hand, one might argue that bad actors will inevitably learn about and find their way around any detection method eventually. It’s not clear what difference advertising it makes.

And there are benefits to sharing:

  • Even if fraudsters work around this detection method, doing so is harder than simply copy/pasting review comments as they were doing before.
  • Sharing the method is one of the few ways to get it used broadly. It won’t have much effect if only one publisher uses it.
  • One thing that can’t be worked around is the data already in publishers’ systems. This analysis can be run retrospectively and suspect accounts can be flagged and investigated.
  • Even if a method is well-known, there’s always a future, naïve, new generation of fraudsters who don’t know about it.

What’s the ‘right’ thing to do? As with all ethical matters: it’s debatable. But our decision was that, considering the above, publishing the method would move the needle in the right direction.

This brings me to a somewhat bigger question.

I spent many years working as a journal editor and, when I found a case of suspected misconduct, I would follow a well-defined procedure to investigate it.

  • First, discuss the case carefully with various stakeholders.
  • Then, as a rule, contact the authors and show them the evidence that they had committed that misconduct.

We would never accuse in that first message. We would simply show the authors the evidence and ask for an explanation. I’ll tell you now that, if I ever presented an author with evidence like that, I was 100% sure that they had done something wrong and that the evidence proved it clearly. (Nevertheless, I would receive some wonderfully creative excuses!)

But here’s the problem: we can all agree that it’s right and fair to show someone the evidence if they are being accused of wrongdoing. But, what if you are not dealing with an author? What if you are dealing with an organisation that mills thousands of fake papers every year? In that case, the evidence you send them is just teaching them how to improve their business and you are not communicating with the ‘author’ at all.

So, there’s the dilemma.

  • We don’t always know if we are dealing with a real author, or a fake account controlled by a papermill.
  • So do we present the evidence and risk harming our chances of detecting fraud in the future?
  • Or do we withhold the evidence and risk labelling someone as ‘guilty’ without giving them a fair opportunity to see or respond to it?

Having thought about this a lot, I think the only practical option is to aim to never share any information with suspected papermills. How that’s done (and particularly how to do it ethically) is, as I say: debatable.

As for methods: it depends on how fragile the method is. If it’s easy to work around a detection method, I’m certainly not going to share it. If it’s hard to work around, then we might move the needle by sharing the method, particularly if it’s easy to use.

Right now, methods exist to detect the overwhelming majority of papermill products.

  • Some of those methods should never see the light of day.
  • Some of those methods might be published.
  • Others might simply be shared privately.

Here are some examples of methods I think should be shared!

My recent paper (mentioned above):

Some recent blog posts

Any questions, get in touch!



Adam Day

Creator of Clear Skies, the Papermill Alarm and other tools #python #machinelearning #ai #researchintegrity