Cast a wide net

Adam Day
Apr 12, 2023

TL;DR: There’s been a recent major upgrade to the private Papermill Alarm. It looks like we can predict the inclusion of journals on the Chinese Academy of Sciences (CAS) lists.

Image from https://www.pxfuel.com/en/free-photo-xwmxj

You know how data scientists make predictions? I’ll let you in on a secret: no they don’t. No one can predict the future. All data scientists can do is study the past and assume that patterns that have been observed in the past will continue into the future. It turns out that’s a reasonable assumption much of the time. So, there’s no crystal ball, instead there is a sound understanding of the data we have and reasonable assumptions about the data we don’t have.

Also, ‘prediction’ in this context isn’t necessarily future-tense. Sometimes we are looking at some new data from the past and we want to gain some insights into it. Those insights might get called ‘predictions’. So, the Papermill Alarm might ‘predict’ a paper came from a papermill, but that’s obviously something that has already happened. It’s still a useful piece of information to uncover: when we look at a portfolio and see which titles have been targeted, we can predict which journals will receive problematic papers tomorrow.

I was looking at these lists from the Chinese Academy of Sciences (CAS) recently and I wondered: without seeing the lists, could we predict which journals would appear on them?

Having recently completed a major upgrade to the private Papermill Alarm, I thought it would be a good test to drop the journals on the CAS lists into it and see what we see. Just to be clear: I don’t know what criteria CAS use to pick journals for this list; papermills might not even be among them. (There is actually a description of their process at the link above, but a Google-translate of it didn’t give me much clarity.)

Keep in mind that the latest Papermill Alarm upgrades both

  • improve the coverage of the Papermill Alarm’s high recall ‘dragnet’ approach. Think of the upgrades as being like casting a wider net.
  • increase the number of cases where we can have high certainty of detection.

But running the Papermill Alarm on publicly available data like this limits what we can do, so this is just the dragnet methods we’re seeing — not the other, more precise, methods.
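
To make the trade-off between a high recall 'dragnet' and a more precise method concrete, here is a minimal sketch of the two rates involved. The function names and all of the counts are made up for illustration; none of this is taken from the Papermill Alarm itself.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Share of real papermill-products that a method flags."""
    return true_pos / (true_pos + false_neg)


def false_positive_rate(false_pos: int, true_neg: int) -> float:
    """Share of legitimate papers that a method flags by mistake."""
    return false_pos / (false_pos + true_neg)


# Hypothetical counts, invented purely to show the shape of the trade-off.
dragnet = {"true_pos": 90, "false_neg": 10, "false_pos": 200, "true_neg": 800}
precise = {"true_pos": 60, "false_neg": 40, "false_pos": 5, "true_neg": 995}

for name, c in (("dragnet", dragnet), ("precise", precise)):
    r = recall(c["true_pos"], c["false_neg"])
    fpr = false_positive_rate(c["false_pos"], c["true_neg"])
    print(f"{name}: recall={r:.2f}, false-positive rate={fpr:.3f}")
```

In this toy setup the dragnet catches more of the real cases but also flags more legitimate papers, which is the pattern a wide net would be expected to show.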

Take this example. This journal (I won’t say which one) appeared on the 2020 CAS list. That list was released on 31 December 2020 and you can see the effect.

Submissions appear to have dropped abruptly at that point. Publications are in freefall throughout 2021, and the journal appears to be almost completely abandoned by early 2022. Other journals on the list show a similar pattern.

Now might be a good time for a quick reminder of what the Clear Skies Papermill Alarm’s predictions mean:

  • Red alert: this paper has high similarity to past papermill-products. There is a low false-positive rate on red alerts.

Recommendation: check the paper carefully before peer-review.

  • Orange alert: this paper has some similarity to past papermill-products. Orange alerts have a higher false-positive rate than reds, but they are included to ensure high recall of known papermill-products.

Recommendation: just get good referees. Since it’s probably not a papermill-product, you can look into it if you have time, but it may be enough to pick referees who know about the papermill problem, know what to look for and can be relied upon to give the paper thorough treatment.

  • Green: this paper has no significant similarity to past papermill-products that we are currently aware of.

Recommendation: peer-review as normal.
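
The three alert levels and their recommendations can be sketched as a simple lookup. This is only an illustration of the list above: the names `Alert`, `RECOMMENDATION` and `triage` are hypothetical and are not part of any Clear Skies API.

```python
from enum import Enum


class Alert(Enum):
    """Alert levels as described in the list above."""
    RED = "red"
    ORANGE = "orange"
    GREEN = "green"


# Recommendations paraphrased from the text; the mapping itself is a sketch.
RECOMMENDATION = {
    Alert.RED: "check the paper carefully before peer-review",
    Alert.ORANGE: "get good referees",
    Alert.GREEN: "peer-review as normal",
}


def triage(alert: Alert) -> str:
    """Return the suggested editorial action for an alert level."""
    return RECOMMENDATION[alert]


print(triage(Alert.ORANGE))
```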

There are some cases where we see a very rapid rise in output from a journal. A big surge in submissions might seem like a stroke of luck to a journal editor, but the Papermill Alarm tells a different story.

This journal made it onto the 2023 CAS list (the 2023 list was published in December 2022). The effects are, again, quite clear. We see this pattern in a few journals.
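
A trend like this comes down to counting alerts per year for a journal. The sketch below shows one way to do that tally, assuming each paper is available as a (year, alert) pair. The helper names and the toy data are invented for illustration and do not describe any real journal or how the Clear Skies webapp works.

```python
from collections import Counter, defaultdict


def alerts_by_year(papers):
    """Tally alert levels per publication year.

    `papers` is an iterable of (year, alert) tuples, where alert is
    "red", "orange" or "green".
    """
    counts = defaultdict(Counter)
    for year, alert in papers:
        counts[year][alert] += 1
    return dict(counts)


def flagged_share(counter):
    """Fraction of a year's papers with a red or orange alert."""
    total = sum(counter.values())
    flagged = counter["red"] + counter["orange"]
    return flagged / total if total else 0.0


# Toy data, made up for this example only.
toy = [
    (2021, "green"), (2021, "green"),
    (2022, "red"), (2022, "orange"), (2022, "green"), (2022, "red"),
]

by_year = alerts_by_year(toy)
for year in sorted(by_year):
    print(year, dict(by_year[year]), f"flagged={flagged_share(by_year[year]):.2f}")
```

A sudden jump in the flagged share from one year to the next is the kind of signal described above.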

Some of the trends go back a long way. This seems to be particularly the case in biomedical science. I also see something quite strange: ‘green’ articles seem to follow their own trend much of the time. I’m speculating, but it makes me think that authors are never really going to be looking carefully at the other articles in a journal to check for signs of fraud. So there could be a ‘real’ journal hidden under the problematic papers.

This particular biomedical science journal seems to have had a problem going at least back to 2010. Note that, while this journal was on the 2020 CAS list, it seems like problematic content reduced before that time.

Another prediction: which journals are least likely to have papermill problems?

Here’s the funny thing. You might have already noticed it in the above images. When a journal gets included on a CAS list, the number of alerts drops precipitously. So, once a journal is on a CAS list, it is actually highly unlikely to be publishing papermill-products.

It’s clear that there’s some correlation between what the Papermill Alarm is finding and the CAS lists, but I am still not sure what the selection criteria are for those lists. Also, I can understand that papermills serving Chinese authors might not want to publish in journals on the warning lists, but papermills aren’t unique to China — they are an increasingly global problem.

Looking at the rapid growth of a number of journals like those above, I imagine this big wave of these garbage papers sloshing around. As soon as a barrier comes up in one place, that wave has to go somewhere and it looks like it can hit a journal very suddenly.

The Clear Skies Standard Report includes two things:

  1. A webapp showing trends in papermill activity in your journals (like the images above). This tells you where the papermills will be tomorrow.
  2. Email alerts which let you know when problematic content is submitted. This tells you where the papermills are today.

Subscribers benefit from the private Papermill Alarm: a suite of detection pipelines for papermills trained on our subscribers’ data.

The public Papermill Alarm is also available on RapidAPI and provides a low cost introduction to Clear Skies’ services.

Contact us for more details.

I’d like to thank everyone who has helped with development of the Papermill Alarm: subscribers and other publishers who have sent in training examples; OpenAlex, Crossref, PubMed and arXiv for providing vast, clean, accessible data; sleuths like Cheshire and Smut Clyde who have examined the Papermill Alarm’s predictions up close and provided helpful feedback; and everyone else who has helped out in one way or another.

Finally, it’s worth pointing out a few caveats:

  • Some data is missing. Most of the journals on the CAS list had enough data to perform the analysis, but some didn’t. (I think there’s a huge opportunity to improve industry metadata to make papermilling harder!)
  • As I said above, there are now a number of pipelines in the Papermill Alarm and I didn’t use all of them for this test.
  • The Papermill Alarm will change regularly as new pipelines are added and models are retrained, so we should expect to see different results if we repeat this in the future.

Adam Day

Creator of Clear Skies, the Papermill Alarm and other tools clear-skies.co.uk #python #machinelearning #ai #researchintegrity