The Papermill Alarm: Patterns in PubMed

Adam Day
Aug 23, 2022

This post is about The Papermill Alarm: an API for detecting potential papermill-products.

There’s a field of study called ‘stylometry’ where we look at the statistical properties of someone’s writing and use that to model their ‘style’. People write in idiosyncratic ways. So, for example, I often start sentences with ‘so, for example’ when I could more succinctly say ‘for example’. The stylistic properties of written work can therefore, potentially, give away the identity of the author. Famously, it’s been shown that J K Rowling shares stylistic traits with Robert Galbraith, something which is confirmed by the fact that both of them, without exception, are the same person.
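To make 'statistical properties' a bit more concrete, here is a minimal sketch of one classic stylometric feature: how often a writer uses common function words. The word list, the function names and the distance measure are all illustrative assumptions, not anything taken from the Papermill Alarm.

```python
import re
from collections import Counter

# Assumption: a tiny, hand-picked list of function words for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "to", "so", "for", "that", "which")


def style_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in vocabulary]


def profile_distance(profile_a, profile_b):
    """Sum of absolute differences between two profiles (smaller = more alike)."""
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))


sample_a = "So, for example, the model is simple."
sample_b = "The results of the study, which were significant, are shown."
print(profile_distance(style_profile(sample_a), style_profile(sample_b)))
```

Two texts by the same author would be expected to give a smaller distance than texts by different authors, although on samples this short the numbers mean very little.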

Stylometry methods might seem dated these days, but don’t discount them. They have their uses!

I recently made a blog post about 2 of the main roles I see in papermilling — Agents and Forgers. Forging fake research papers is a skill. You have to know enough of the science to be able to write a believable fake. (It’s depressing. Think about it: the people forging those fakes must have had real scientific ambitions at some point.)

The thing is, though, that if you can write a believable fake paper about cancer genetics, you probably can’t write a believable fake about geology, or about engineering. So, what do you do? (Keep writing fakes about cancer genetics, of course!) The result is that we find unusual clusters of samey papers.

Let’s take a look at PubMed

What you’re seeing here is a map of PubMed. More specifically, this is the titles of PubMed papers put through 2 processes:

  • The Universal Sentence Encoder is used to turn each title into a 1024-dimensional vector. That’s too many dimensions for me. If I try to visualise 1024 dimensions in my head, it makes my eyes water.
  • A process called ‘UMAP’ is used to compress those vectors into simple 2-dimensional x and y coordinates. Much better!
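Those two steps can be sketched roughly as follows. This is a minimal sketch, not the code behind the map: it assumes the `tensorflow_hub`, `numpy` and `umap-learn` packages are installed, the model URL is an assumption (the post does not say which encoder variant was used, and the vector size depends on the variant loaded), and `batched`, `embed_titles` and `project_to_2d` are hypothetical helper names.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_titles(titles: Sequence[str], batch_size: int = 256):
    """Step 1: turn each title into a fixed-length vector."""
    import numpy as np
    import tensorflow_hub as hub  # assumption: tensorflow + tensorflow_hub installed

    # Assumption: a publicly hosted encoder; the exact variant is not stated in the post.
    encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")
    chunks = [encoder(chunk).numpy() for chunk in batched(titles, batch_size)]
    return np.vstack(chunks)


def project_to_2d(vectors, seed: int = 42):
    """Step 2: compress the vectors to x and y coordinates with UMAP."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(n_components=2, random_state=seed)
    return reducer.fit_transform(vectors)
```

Calling `project_to_2d(embed_titles(titles))` returns one (x, y) pair per title, ready for a scatter plot. UMAP builds a neighbour graph (15 neighbours by default), so it needs a reasonable number of titles to produce anything useful.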

Maps like this are easy to make and a lot of fun to explore. They do lose a lot of data in that compression stage and so, while the map can be used for this kind of rough visual exploration, it shouldn’t be thought of as a tool for precision science. Also, keep in mind that we are just looking at titles here. The Papermill Alarm models abstracts as well.

You might recognise some of the features of PubMed on this map.

The bright green data-points are the titles of papers that have been identified as suspicious in various lists of papers by Elisabeth Bik, Smut Clyde and others. Note that these cases are concentrated in specific regions of PubMed.

There are 2 new clusters here at the foot of the map. One concerns polymers and the other one mathematics. These 2 don’t appear to overlap PubMed.

I really can’t overstate the importance of the kind of work that people like Elisabeth Bik, Smut Clyde, several anonymous PubPeer commenters (and countless referees and journal editors behind the scenes) do in carefully reviewing these papers for flaws!

Wouldn’t it be nice to flag papers which share the stylistic features of those groups? So that, if a paper fitting one of those specific regions comes into your peer-review system, you can be alerted that it looks a bit like a papermill? That would be good, wouldn’t it? That’s what The Papermill Alarm does!
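As a rough illustration of the idea of flagging by resemblance, here is a toy sketch that raises an alert when a new paper's vector sits close to any vector from a known suspicious group. This is an assumption made purely for illustration: the post does not describe how the Papermill Alarm models papers, and the function names, the cosine measure and the 0.9 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def looks_like_known_cluster(candidate, flagged_vectors, threshold=0.9):
    """True if the candidate is at least `threshold`-similar to any flagged vector."""
    return any(
        cosine_similarity(candidate, flagged) >= threshold
        for flagged in flagged_vectors
    )


# Toy 2-dimensional vectors standing in for real title embeddings.
flagged = [[1.0, 0.0], [0.0, 1.0]]
print(looks_like_known_cluster([0.9, 0.1], flagged))  # close to the first flagged vector
print(looks_like_known_cluster([1.0, 1.0], flagged))  # halfway between both, close to neither
```

A real system would need a carefully chosen threshold and human review of every alert, since resembling a cluster is not evidence of misconduct on its own.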

Here is a map showing a random sample of what The Papermill Alarm thinks look like papermills (in red). Note the clear correlation with those green dots.

The key takeaways

You can see now that there is clear agreement between the red alerts and the papers that have already been picked out by hand.

There’s also definite scope to expand the Papermill Alarm to fields outside of PubMed. E.g. those green clusters at the bottom.

Finally, the Papermill Alarm is also detecting growth in the number of papers that trigger red alerts. That’s why I’m confident that the Papermill Alarm will flag more papermill papers in the future. (I should be clear that I don’t intend this as a quantitative estimate of the scale of papermilling.)

Any questions? Get in touch!

Appendix: Some caveats about the visualisations

Reminder: this is just for illustrative purposes.

  • The 50,000 blue dots, which represent all of PubMed, are only a small sample of the total. There are millions of papers in PubMed.
  • Red alerts make up only around 1% of the total PubMed output. This might not be obvious from looking at the above because everything is subsampled. Red dots are also deliberately brighter than blues.
  • The 2,000 red alerts (red dots) you’re seeing are only a small (randomly sampled) percentage of all the papers picked up by the Papermill Alarm.
  • There are around 3,000 green dots.
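A quick back-of-the-envelope check shows why the subsampling matters. It uses only the figures in the list above and assumes the red and blue samples were drawn separately.

```python
# Figures quoted in the caveats above.
blue_dots = 50_000          # sample representing all of PubMed
red_dots = 2_000            # sample of red alerts
true_red_percent = 1        # red alerts are around 1% of PubMed output

# Red dots drawn per 100 blue dots on the map.
apparent_red_percent = 100 * red_dots / blue_dots

# How many times denser red looks on the map than it is in PubMed.
overrepresentation = apparent_red_percent / true_red_percent

print(apparent_red_percent, overrepresentation)
```

So, on these numbers, red alerts are drawn at roughly four times their real-world density relative to the blue background, before even accounting for the brighter colour.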

Adam Day

Creator of Clear Skies, the Papermill Alarm and other tools clear-skies.co.uk #python #machinelearning #ai #researchintegrity