Looking back, there’s a recurring theme in a lot of misconduct cases: duplication. Again and again, we see the same thing in 2 different places, and it’s either evidence of misconduct or, at the very least, a hint of where to look for it.
I’ve seen pairs of papers, with different author-lists, which were clearly written from the same starting template. Sometimes that’s OK… but searching through the literature, I’ve also seen the exact same paper published in 2 different places, and I’ve even seen the exact same paper published in 2 different places with 2 different author-lists! I’m sorry, but that is just careless paper-milling. Surely they could see that, looking back, someone would spot it?
Naïvely, I’d like to think that we can detect all cases of science-fraud early in the submission process — perhaps even before the start of peer-review.
But I’ve learned that, while it’s definitely possible to detect a lot of misconduct early on in a manuscript’s lifecycle, the reality is that retrospective analysis — where we re-analyse articles some time later, either after rejection or after publication — is a necessity.
But why?
The first part of the answer is obvious: you need to know what you are looking for, so there need to be examples of past fakes to base your analysis on. For a data scientist like me, data on those past examples is necessary to build effective detection systems. And it’s these kinds of systems that we can use to detect fakes at the submission stage. (Again: you still need humans to check every case.)
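To make that concrete, here’s a minimal sketch of what I mean by a detection system built from past examples. Everything in it is a placeholder: the labelled texts, the TF-IDF features and the logistic regression model (via scikit-learn) are just one plausible set-up, not a description of any real system, and the output is a prompt for human review rather than a verdict.

```python
# Illustrative only: a toy "detector" trained on labelled past examples.
# The texts and labels below are placeholders, not real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: full text of past papers,
# labelled 1 = known fake, 0 = legitimate.
past_papers = [
    "We investigated the role of miR-XXX in tumour progression ...",
    "In this study we examined patient outcomes following ...",
    # ... many more examples would be needed in practice ...
]
labels = [1, 0]

# Simple text features + a linear classifier; a real system would use
# richer signals (metadata, images, references) and careful validation.
detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
detector.fit(past_papers, labels)

# Score a new submission: a high score is a reason for a human to look,
# not a conclusion about misconduct.
new_submission = ["We investigated the role of miR-YYY in tumour progression ..."]
print(detector.predict_proba(new_submission)[:, 1])
```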
The other part of the answer might not be so obvious. What happens when we already have all those systems to identify misconduct?
Peer-review is fundamentally reliant on trust. Editors and referees must be able to assume that the manuscripts they receive represent an honest record of an author’s work, because checking everything is impossible at the scale of modern scientific production. So, even with the best editorial processes and the best automated checks, a well-fabricated fake paper can make it through peer-review. But what happens when several papers have gone through that same fabrication process? Can we detect them at that point, retrospectively?
I’ll give you a slightly contrived example**:
- A Forger writes a dozen fakes using a template and then passes them to an Agent (or to several Agents).
- The Agent(s) add a dozen different author-lists to these templates.
- At this stage, even a casual observer could look at these papers (all highly similar, all ostensibly written at the same time, and all with different author-lists) and know that something peculiar is afoot.
- Instead, however, all of those template fakes are submitted to journals. None of the journal editors has the rest of the set for comparison, so the templating isn’t obvious. Nevertheless, they might put each paper through a plagiarism detector.
- Plagiarism detectors look for similarities between new submissions and already-published papers. But since nothing similar has been published yet, the detectors can’t compare against the other fakes, and the use of templating remains hidden.
What isn’t contrived is that we do see template papers like this. So it might be worth running some papers through plagiarism detectors post-publication!
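As a sketch of what a post-publication check could look like, here’s a toy all-pairs comparison across a set of already-published papers, flagging pairs whose text is suspiciously similar. The corpus, the TF-IDF cosine similarity and the 0.8 threshold are all assumptions for illustration; real plagiarism detectors are far more sophisticated, and any flagged pair would still need a human to look at it.

```python
# Illustrative only: flag suspiciously similar pairs within a published corpus.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus of already-published papers (id -> full text).
published = {
    "paper_A": "We investigated the role of gene X in disease Y using method Z ...",
    "paper_B": "We investigated the role of gene X in disease Y using method Z ...",
    "paper_C": "A completely unrelated study of something else entirely ...",
}

ids = list(published)
tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(published.values())
sims = cosine_similarity(tfidf)

# Any pair above this (arbitrary) threshold goes to a human for inspection.
THRESHOLD = 0.8
for i, j in combinations(range(len(ids)), 2):
    if sims[i, j] >= THRESHOLD:
        print(f"{ids[i]} vs {ids[j]}: similarity {sims[i, j]:.2f} - review needed")
```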
In summary:
- Sometimes you can’t detect a fake because the data doesn’t yet exist to allow you to do so. In those cases, we have to accept that retrospective analysis is the only way.
- The good news is that there are a lot of opportunities to do this effectively.
**I’m deliberately glossing over some details, for the reasons given in last week’s post. For one thing, a lot of the templates we see now are quite sophisticated: you can tell it’s a template, but there isn’t the overlap in text that you’d hope for if you’re relying on a plagiarism detector. So, much of the time, detection is more challenging than my example implies. It’s also worth remembering that there are sometimes legitimate reasons to use templates. All that being said: there definitely are cases where plagiarism detection post-publication shows something that wouldn’t have been visible earlier.
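For what it’s worth, here is one hedged illustration of why low text overlap doesn’t have to be the end of the story: you can compare the structure of papers rather than their wording. The heading lists below are invented, extracting them reliably is its own problem, and a sequence ratio is only one possible measure; the point is simply that papers built from the same template tend to share a skeleton even when the sentences differ.

```python
# Illustrative only: compare paper *structure* (section headings) rather than raw text.
from difflib import SequenceMatcher

# Hypothetical heading sequences extracted from two papers built on the same template.
paper_1 = ["Introduction", "Materials and Methods", "Cell culture", "Western blot",
           "Statistical analysis", "Results", "Discussion", "Conclusion"]
paper_2 = ["Introduction", "Materials and Methods", "Cell culture", "Western blot",
           "Statistical analysis", "Results", "Discussion", "Conclusions"]

# SequenceMatcher works on any sequence, so it can score heading lists directly.
structural_similarity = SequenceMatcher(None, paper_1, paper_2).ratio()
print(f"Structural similarity: {structural_similarity:.2f}")
```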
As I said last week:
I’ve got a few plans for blog posts about papermills. I’m being very careful about what I post — I think it’s important to avoid posting anything that could be used by papermills to avoid getting caught.
However, if I think of something valuable to share, I’ll do that.
- If you’re a publisher reading this, tell your friends.
- Also… find the person in your organisation who uploads your data to Crossref. Find them and tell them to follow this blog.
- You might also particularly recommend this blog to your data scientists, developers, engineers and any other technical folks with an interest in research integrity. There will be things for them here.