Patterns and Evidence (and how to find papermills before peer-review)

Adam Day
3 min read · Aug 5, 2022


I once saw a brilliant presentation about how simple data analysis can detect credit card fraud.** The presentation showed a pattern in how people use their credit cards: among a large number of people who had been victims of credit card fraud, there was just one store in particular where they had all used their cards. There was no observational evidence of someone at that store stealing card details, no camera footage, no eyewitnesses, but the pattern was obvious in the data.

The pattern is circumstantial evidence: a useful lead that tells law-enforcement where to look for the observational evidence they need.
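This kind of "common point of purchase" analysis is, at its core, a set intersection. A minimal sketch, using entirely made-up victims and merchant names:

```python
from functools import reduce

# Hypothetical transaction histories for three fraud victims.
# Merchant names are invented for illustration only.
victim_transactions = [
    {"GroceryMart", "PetroStop", "CornerCafe"},
    {"CornerCafe", "BookBarn", "PetroStop"},
    {"PetroStop", "ShoeShack"},
]

# The common point of purchase: merchants that appear in
# every victim's transaction history.
common_merchants = reduce(set.intersection, victim_transactions)
print(common_merchants)  # {'PetroStop'}
```

In practice you would allow for noise (a merchant used by *most* victims, and rarely by non-victims), but the idea is the same: the shared store is the lead, not the proof.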


With misconduct detection, we are also often looking at patterns rather than direct evidence of misconduct. Take plagiarism detection:

  • We detect a pattern: an overlap in text between two documents.
  • There are contexts in which it’s fine to copy text with proper attribution, so simply finding overlapping text between two documents isn’t evidence of plagiarism on its own (but I still think it’s a good place to start!).
  • Finding the pattern saves a lot of time by telling us when to look much more carefully at a pair of documents to get evidence of misconduct.
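A common way to measure that kind of text overlap is to compare the documents’ word n-grams. This is a generic sketch of the idea, not the method any particular plagiarism checker uses:

```python
def ngrams(text, n=3):
    """Return the set of word n-grams in a document, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(doc_a, doc_b, n=3):
    """Jaccard similarity of the two documents' n-gram sets:
    shared n-grams divided by total distinct n-grams."""
    a, b = ngrams(doc_a, n), ngrams(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "a quick brown fox jumps over a sleeping dog"
print(overlap_score(doc1, doc2))  # 3 shared trigrams of 11 distinct, ~0.27
```

A high score doesn’t prove plagiarism; it flags the pair for the careful human reading that can.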

But where to start with papermills?

Recently, I shared a simple method to identify most of the existing published papermill content. This method is particularly nice because, while we are again detecting a pattern (text duplication), the investigation required to confirm misconduct is as simple as it gets.

But, from a publisher’s point of view, it would be much better to identify papermills early on, before peer review. So I recently released a simple API that does just that. The Papermill Alarm identifies unusual patterns in the text of new submissions and issues a warning if those patterns appear consistent with papermills.

Here it is: https://rapidapi.com/clear-skies-clear-skies-default/api/papermill-alarm/

For example, you may have seen Elisabeth Bik’s excellent work on the ‘tadpole’ papermill. This shows a very obvious pattern in the titles of several papers which appeared to come from a papermill. You can see similar patterns in the many papers highlighted by Smut Clyde’s insightful work.

The Papermill Alarm

  • receives article metadata (from you)
  • detects patterns like those described above (and more)
  • and returns a warning when it finds them.
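The workflow above amounts to sending article metadata to an HTTP endpoint and reading back a warning. Here is a hedged sketch of what a client might look like. The endpoint path and payload field names (`title`, `abstract`) are illustrative assumptions on my part, not the documented schema; check the RapidAPI listing for the real ones. `X-RapidAPI-Key` is the standard RapidAPI authentication header.

```python
def build_request(title, abstract, api_key):
    """Assemble an HTTP request for a metadata-screening API
    such as the Papermill Alarm. URL and payload fields are
    hypothetical; consult the API's own documentation."""
    url = "https://papermill-alarm.p.rapidapi.com/check"  # assumed path
    headers = {
        "X-RapidAPI-Key": api_key,          # RapidAPI auth header
        "Content-Type": "application/json",
    }
    payload = {"title": title, "abstract": abstract}  # assumed fields
    return url, headers, payload

url, headers, payload = build_request(
    "An example article title", "An example abstract.", "YOUR_KEY")
# Sending it would then be, e.g.:
#   requests.post(url, json=payload, headers=headers)
```

The point is the shape of the integration: one metadata submission in, one pattern-based warning out, so it can sit in front of an existing submission system.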

Checking your papers is simple. The Papermill Alarm highlights papers that you might want to check in detail and greatly reduces the amount of time you might have to spend looking for such cases.

That being said, I don’t want to overstate what a tool like this can do.

  • It doesn’t find observational evidence, it detects patterns that are consistent with papermills.
  • While the results are already extremely useful, it’s still an early iteration of something with a lot of potential for development.

I’ll say more about testing the Papermill Alarm in a future post. In the meantime, if you have any questions at all, get in touch!

** P.S. I found the presentation I mentioned at the start. It’s possible that one of the balding heads in the audience is mine! https://youtu.be/6eOr7lp4MPg?t=805 https://www.slideshare.net/tdunning/cheap-learningdunning9182015


Adam Day

Creator of Clear Skies, the Papermill Alarm and other tools clear-skies.co.uk #python #machinelearning #ai #researchintegrity