How to find all the papermills (step 1)

Adam Day
Jul 25, 2022


The first time I flew over the North Atlantic was quite an experience. Through the clouds, I could see some little white boats out sailing in the sea. It was puzzling: from 30,000 feet, those boats must have been huge for me to be able to see them at all. Suddenly, the clouds were gone and, through clear skies, I could see a really big ‘little white boat’ and I understood at once what I had been looking at.

Image CC-BY: wiki commons

With papermills, people often say that they only see ‘the tip of the iceberg’. (It’s almost a cliché!) If fake papers were easy to detect, then forgers would just learn to get better at writing them. I guess that’s what actually happened over time and it’s why we need to be super-careful about sharing detection methods.

This is why I’ve thought very carefully about making this post (and why I’ve no intention of publishing steps 2, 3 etc).

Here’s what we know:

With this in mind, it’s actually very simple for publishers to see the iceberg.

  1. Make your submission dates visible in Crossref.

It might not be trivial to do, but it shouldn’t be hard or controversial. Most journals already show their submission dates in article XML, so this wouldn’t expose any data that isn’t already visible; it just makes the data easier to work with. In fact, several publishers are doing this already.
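To show why having the dates in Crossref makes them easier to work with, here is a minimal Python sketch that reads a submission date from a Crossref-style record. It assumes the date is deposited as an assertion named `received` with an ISO-formatted value; that is one convention, and other publishers may deposit it differently. The function name and the sample record are made up for illustration.

```python
from datetime import date, datetime
from typing import Optional


def received_date(work: dict) -> Optional[date]:
    """Return the 'received' date from a Crossref-style work record, if any.

    Assumption: the date is an assertion whose name is 'received' and
    whose value is an ISO date (YYYY-MM-DD). Real deposits may differ.
    """
    for assertion in work.get("assertion", []):
        if assertion.get("name", "").lower() == "received":
            try:
                return datetime.strptime(assertion.get("value", ""), "%Y-%m-%d").date()
            except ValueError:
                return None  # a date is present, but not in the assumed format
    return None


# Hypothetical record, shaped like one item returned by the Crossref REST API.
example_work = {
    "DOI": "10.5555/example",
    "assertion": [
        {"name": "received", "label": "Received", "value": "2022-01-15"},
        {"name": "accepted", "label": "Accepted", "value": "2022-03-02"},
    ],
}

print(received_date(example_work))
```

Records without a usable date simply return `None`, so they can be set aside for a manual look rather than silently dropped.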

  2. Search for your submissions in Crossref.

This is also not hard to do. The results will show you when your submissions were published by others and (with the above step having been done) when they were submitted to those other publishers. Conveniently, there are some pre-built options that will make doing this trivial (see ‘Article Tracking Options’ below!).
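For a sense of what such a search involves, here is a rough Python sketch against the public Crossref REST API, using its `query.bibliographic` parameter. Matching on title similarity and the 0.9 threshold are my assumptions for illustration only; this is not a description of how either of the trackers below works, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from difflib import SequenceMatcher

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title: str, rows: int = 5) -> str:
    """Build a Crossref works query for a manuscript title."""
    params = {"query.bibliographic": title, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def search_crossref(title: str, rows: int = 5) -> list:
    """Fetch candidate records for a title (this makes a network request)."""
    with urllib.request.urlopen(build_query_url(title, rows), timeout=30) as resp:
        return json.load(resp)["message"]["items"]


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def likely_matches(title: str, items: list, threshold: float = 0.9) -> list:
    """Keep only the records whose first title closely matches ours."""
    matches = []
    for item in items:
        candidate = (item.get("title") or [""])[0]  # Crossref titles are lists
        if title_similarity(title, candidate) >= threshold:
            matches.append(item)
    return matches


# Offline demonstration with made-up records instead of a live query.
sample_items = [
    {"DOI": "10.5555/a", "title": ["A Study of Paper Mills"]},
    {"DOI": "10.5555/b", "title": ["Unrelated work on glaciers"]},
    {"DOI": "10.5555/c"},  # record with no title at all
]
for match in likely_matches("A study of paper mills", sample_items):
    print(match["DOI"])
```

In real use, `search_crossref(title)` would supply the `items` list. A title-only comparison will miss retitled papers, which is one reason a purpose-built tracker is the more practical option.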

  3. Finally, if you find a duplicate submission, contact the publisher of the duplicate and show them the submission and decision dates as well as the full text. To confirm the duplicate submission, they need only make a quick manual comparison of the manuscripts. It’s much, much quicker than a full misconduct investigation.
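The date comparison behind the steps above can be sketched in a few lines of Python. This assumes that "duplicate submission" means the same paper was under consideration at two publishers at the same time, so the check is a simple overlap of two date ranges. The function name and the dates are invented for illustration.

```python
from datetime import date


def under_review_concurrently(
    our_submitted: date,
    our_decided: date,
    their_submitted: date,
    their_decided: date,
) -> bool:
    """True if the two review periods overlap by at least one day."""
    return our_submitted <= their_decided and their_submitted <= our_decided


# Made-up example: we received the paper on 10 Jan and rejected it on 20 Feb.
ours = (date(2022, 1, 10), date(2022, 2, 20))

# The other publisher received it on 1 Feb, while we were still reviewing it.
concurrent = under_review_concurrently(*ours, date(2022, 2, 1), date(2022, 3, 15))

# Here it was only sent elsewhere after our decision: an ordinary resubmission.
sequential = under_review_concurrently(*ours, date(2022, 3, 1), date(2022, 4, 10))

print(concurrent, sequential)
```

An overlap only flags a pair for the manual comparison; it is the dates plus the matching full text that make the case.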

What does this give us?

Well, there are 2 things:

  • All of the papermill submissions (at least all of those that have been using duplicate submission — there are exceptions).
  • Any other authors using duplicate submission as a hedge.

Isn’t this great? No fancy AI, no need to actually detect and prove the fakery. Here is all the evidence you need to prove a paper has breached your terms of service. Furthermore, if we take this to its logical conclusion (which might be an up-front cross-publisher full-text check made at the submission of any article), you can see that there really isn’t any way around it. The papermills will have to stop sending the same paper to different publishers concurrently. That will slow them down considerably. It’s also why I’m comfortable sharing this publicly.

But how to tell the two apart — papermills and ‘other’ duplicates? Well, I’m not going to share that here, but what I will say is that it isn’t hard. The iceberg is obvious once you see it through clear skies ;)

At this point, it’s up to you how you proceed. You might want to analyse the data to learn more about papermills. You might want to conduct misconduct investigations. You might see opportunities to collaborate with other publishers. After all, this method does require that we all work together.

Article tracking options

  1. The SAGE Rejected article tracker.

This was a project that started out as a blog post that I wrote some years ago in my spare time. It later became an open source project from SAGE Publishing with its own peer-reviewed paper. It’s free and it will get the job done (if a little slowly). Note that, in this case, it’s important to track all articles: accepted articles, new submissions and (importantly!) abandoned submissions as well as rejected articles. Remember how I said that we sometimes see the same paper published in 2 places? This is how to find those cases.

  2. My own “Article Tracker” API.

This is part of a side-project (not related to my work at SAGE) and I hadn’t intended to make it publicly available, but after seeing the value of this tool in detecting papermills, here it is. It’s the best tool of its kind as far as I know. I hope that it helps with this problem.

The Article Tracker has the following features:

  • Fast: in testing, I’ve been able to track around 500 articles per minute.
  • Accurate: the current accuracy is around 96–98% on test data. There is a planned upgrade which will push accuracy even higher.
  • Does duplicate submission checking (as described above) out of the box.
  • Affordable: I’ve set pricing at a low level where I think any publishing house should easily be able to afford to search for all of their current and historic submissions without running into limits.

This blog post is actually the first time I’ve advertised the Article Tracker’s existence, so you might consider this to be a soft-launch of the service. Any questions, comments or feedback would be very welcome.



Adam Day

Creator of Clear Skies, the Papermill Alarm and other tools #python #machinelearning #ai #researchintegrity