Patterns and Evidence (and how to find papermills before peer-review)

Adam Day
Aug 5, 2022

I once saw a brilliant presentation about how simple data analysis can detect credit card fraud**. The presentation showed a pattern in how people use their credit cards. Among a large number of people who had been victims of credit card fraud, there was just one store in particular where they had all used their cards. There was no observational evidence of someone at that store stealing card details: no camera footage, no eyewitnesses. But the pattern was obvious in the data.

The pattern is circumstantial evidence: a useful lead that tells law-enforcement where to look for the observational evidence they need.
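To make the idea concrete, here is a minimal sketch of that kind of analysis. It is not the method from the presentation; the data format (pairs of cardholder and store) and the function name `common_stores` are assumptions made up for illustration.

```python
from collections import Counter

def common_stores(transactions, victims):
    """Rank stores by the share of fraud victims who shopped there.

    transactions: iterable of (cardholder, store) pairs (hypothetical format).
    victims: set of cardholders who reported fraud.
    """
    if not victims:
        return []

    # Collect the distinct stores each victim visited, so repeat visits
    # by one person do not inflate a store's count.
    stores_by_victim = {}
    for cardholder, store in transactions:
        if cardholder in victims:
            stores_by_victim.setdefault(cardholder, set()).add(store)

    # Count how many victims visited each store.
    counts = Counter()
    for stores in stores_by_victim.values():
        counts.update(stores)

    # Convert counts to a fraction of all victims, highest first.
    return [(store, n / len(victims)) for store, n in counts.most_common()]

# Invented toy data: every victim used their card at the cafe.
transactions = [
    ("ann", "grocer"), ("ann", "cafe"),
    ("bob", "cafe"), ("bob", "garage"),
    ("cat", "cafe"), ("cat", "grocer"),
    ("dan", "grocer"),
]
victims = {"ann", "bob", "cat"}
ranking = common_stores(transactions, victims)
```

A store that tops this ranking is only a lead. A very popular store would rank highly for any group of shoppers, so a real analysis would also compare against people who were not victims.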

With misconduct detection, we are also often looking at patterns rather than direct evidence of misconduct. Take plagiarism detection:

  • We detect a pattern (which is an overlap in text between 2 documents)
  • Copying text with proper attribution is acceptable to varying degrees, so simply finding overlapping text between 2 documents isn’t evidence of plagiarism on its own (but I still think it’s a good place to start!).
  • Finding the pattern saves a lot of time by telling us when to look much more carefully at a pair of documents to get evidence of misconduct.
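As a rough illustration of the pattern-finding step in the list above, text overlap between two documents can be scored by comparing their word n-grams ("shingles"). This is a sketch under my own assumptions, not how any particular plagiarism checker works; the function names and the choice of 5-word shingles are illustrative.

```python
import re

def shingles(text, n=5):
    """Return the set of n-word sequences that appear in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(doc_a, doc_b, n=5):
    """Overlap coefficient: shared shingles divided by the smaller set.

    Returns 0.0 when either document is too short to produce a shingle.
    """
    a, b = shingles(doc_a, n), shingles(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

A high score says nothing about attribution. It only tells a human which pair of documents is worth reading side by side.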

But where to start with papermills?

Recently, I shared a simple method to identify most of the existing published papermill content. This method is particularly nice because, while we are still detecting a pattern (text duplication again), the investigation required to confirm misconduct is as simple as it gets.

But, from a publisher’s point of view, it would be much better to identify papermills early on — before peer-review. So I recently released a simple API that does just that. The Papermill Alarm identifies unusual patterns in the text of new submissions and issues a warning if those patterns appear consistent with papermills.

For example, you may have seen Elisabeth Bik’s excellent work on the ‘tadpole’ papermill. This shows a very obvious pattern in the titles of several papers which appeared to come from a papermill. You can see similar patterns in the many papers highlighted by Smut Clyde’s insightful work.

The Papermill Alarm

  • receives article metadata (from you)
  • detects patterns like those described above (and more)
  • and returns a warning when it finds them.
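The receive, detect, warn flow in the list above can be pictured with a toy example. To be clear, this is not the Papermill Alarm's API or its detection method, neither of which is described here. The `screen` function, the `"title"` field, the similarity measure and the threshold are all assumptions chosen for illustration, and the titles are invented.

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Word-level similarity between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def screen(submission, suspect_titles, threshold=0.6):
    """Return a warning if the submission's title resembles a suspect title.

    submission: dict of article metadata with a "title" key (assumed format).
    suspect_titles: titles already linked to a suspected template.
    """
    title = submission.get("title", "")
    best = max(
        (title_similarity(title, t) for t in suspect_titles),
        default=0.0,
    )
    return {"warning": best >= threshold, "score": best}

# Invented titles that share an obvious template.
suspect_titles = [
    "MicroRNA-1 inhibits proliferation of lung cancer cells by targeting GENE1",
]
templated = {
    "title": "MicroRNA-2 inhibits proliferation of liver cancer cells by targeting GENE2"
}
unrelated = {"title": "A survey of graph algorithms"}

templated_result = screen(templated, suspect_titles)
unrelated_result = screen(unrelated, suspect_titles)
```

In this toy setup the templated title triggers a warning and the unrelated one does not. As with the other patterns, a warning is a prompt to look closer, not a finding of misconduct.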

Checking your papers is simple. The Papermill Alarm highlights papers that you might want to check in detail and greatly reduces the time you would otherwise spend looking for such cases.

That being said, I don’t want to overstate what a tool like this can do.

  • It doesn’t find observational evidence; it detects patterns that are consistent with papermills.
  • While results are already extremely useful, it’s still an early iteration of something with a lot of potential for development.

I’ll say more about testing the Papermill Alarm in a future post. In the meantime, if you have any questions at all, get in touch!

** P.S. I found the presentation I mentioned at the start. It’s possible that one of the balding heads in the audience is mine!



Adam Day is the creator of Clear Skies, the Papermill Alarm and other tools.