Where to draw the line? Precision and recall

Adam Day
4 min read · Jun 15, 2023


The Papermill Alarm draws a line between papers which have the characteristics of papermill-products and those that don’t. Given data like that shown below, where red dots represent papermill-products and green dots represent normal papers, where should we draw that line?

Some dummy data to illustrate the point. Interesting aside: we actually use dummy data like this all the time as we develop processes at Clear Skies. It prevents the sensitive data we work with from being used or seen outside of the context where it is necessary.
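To make the 'drawing a line' idea concrete, here is a minimal sketch of that kind of dummy-data exercise in Python. It isn't the Papermill Alarm itself: it just generates two made-up clusters of points and fits a simple linear classifier (scikit-learn's LogisticRegression, chosen purely for illustration) so that there is a line to talk about.

```python
# A minimal sketch (not the Papermill Alarm itself): two clusters of dummy
# points and a simple linear classifier to "draw the line" between them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Dummy 2D features: "normal" papers (label 0) and papermill-like papers (label 1)
normal = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))
papermill = rng.normal(loc=[2.5, 2.5], scale=1.0, size=(50, 2))

X = np.vstack([normal, papermill])
y = np.array([0] * len(normal) + [1] * len(papermill))

clf = LogisticRegression().fit(X, y)

# The decision boundary (the "line") is where w·x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print(f"line: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```

Everything else in this post is about where exactly that line should sit.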

Limit the spread

There’s a well-known (and fun!) thought experiment in statistics that goes something like this:

There is a rare disease affecting one in ten thousand people. Someone creates a diagnostic test for this disease with 99% accuracy. You take the test and it comes up positive. What is the probability that you have the disease?

  • 99%
  • 98%
  • 50%
  • 0.98%

The obvious answer seems to be 99%. The test is 99% accurate after all. But let’s think about it a different way.

If we have 1,000,000 people (and the disease affects 1 in 10,000 of those people), then 100 of those people have the disease and 999,900 don’t have the disease. Let’s consider those groups separately.

The test is 99% accurate. This means:

  • When we run the test on the 100 people who do have the disease, 99 of them will be correctly diagnosed and 1 person will be misdiagnosed as not having the disease: a false negative.
  • However, when we run the test on the other 999,900 people, who do not have the disease, 99% of them are correctly diagnosed as not having the disease, but 1% of them (9,999 people) will be ‘false positives’, i.e. they will be incorrectly diagnosed as having the disease.

So, of the 10,098 people who test positive, only 99 actually have the disease. That means that if you test positive, even with 99% accuracy, the chance that you have this disease turns out to be less than 1%!

Isn’t that strange?
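If you want to check the arithmetic, it works out like this in a few lines of Python. The only assumption is the usual reading of the puzzle: that ‘99% accurate’ applies both to the people who have the disease and to the people who don’t.

```python
# Base-rate puzzle: a 1-in-10,000 disease and a test that is 99% "accurate".
population = 1_000_000
prevalence = 1 / 10_000
accuracy = 0.99  # assumed to apply to both groups

sick = population * prevalence                # 100 people
healthy = population - sick                   # 999,900 people

true_positives = sick * accuracy              # 99 correctly diagnosed
false_negatives = sick * (1 - accuracy)       # 1 missed
false_positives = healthy * (1 - accuracy)    # 9,999 wrongly flagged

positives = true_positives + false_positives  # 10,098 positive tests in total
p_disease_given_positive = true_positives / positives

print(f"P(disease | positive test) = {p_disease_given_positive:.2%}")  # ~0.98%
```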

Precision and recall

The above is a common example in Bayesian statistics. What it means for Clear Skies is that, if we want to build detection algorithms for papermill-products, we have to control for imbalance in the data in order to be able to make useful predictions. We also have to be a bit more careful about what we mean when we say ‘accuracy’. There are 2 very handy metrics here.

Precision

What % of the detections that you make are correct?

  • High precision means that we can have a lot of confidence in each detection, but the usual trade-off is that we miss more of what we are looking for: false negatives.
  • In the above example, precision is 99/10,098 ≈ 0.98%

Recall

What % of the things you want to detect are you detecting?

  • High recall means that we can expect to find a high percentage of what we are looking for, but the downside is that we will also see a lot of false positives.
  • In the above example, recall is 99/100 = 99% (there’s a short worked example of both numbers just below)
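Here are both numbers written out as code, using the counts from the thought experiment. These are the standard textbook formulas, nothing specific to our tools.

```python
# Precision and recall for the diagnostic-test example above.
true_positives = 99       # sick people the test correctly flags
false_positives = 9_999   # healthy people the test wrongly flags
false_negatives = 1       # sick people the test misses

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"precision = {precision:.2%}")  # ~0.98%: most positive tests are false alarms
print(f"recall    = {recall:.2%}")     # 99%: almost everyone with the disease is caught
```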

So, our imaginary diagnostic test has high recall and low precision. I know that medical researchers sometimes read this blog. Your terminology is different to ours, but hopefully the concepts are familiar!

(Also, I’m not a doctor, but I assume that high recall is a good choice for medical diagnostics. If you get misdiagnosed as having a disease, that’s not good. But you might be able to run other tests for confirmation. It seems far worse if you were misdiagnosed as not having a disease that you need to be treated for.)

When it comes to papermill detection, high recall would mean a situation like the one described above, where we flag or ‘diagnose’ a very high percentage of papers, including a lot of non-papermill-products, in order to make sure we catch all of the papermill-products. The next stage of the process is to run further tests (one reason why we have multiple pipelines!) or for Research Integrity specialists to sift through those flagged papers and apply their own checks. In principle, this is quite a nice 2-stage process which can maximise the number of papermill-products we detect correctly.

However, there’s a downside here. We don’t want to waste Research Integrity specialists’ time by flagging lots of papers which are likely to be ok.

But here’s the cool thing: we actually get to choose whether we want high precision or high recall. That’s nice, isn’t it? Which should we pick?

In this example, we classify everything above the line as a ‘red’ alert. If we go for high precision, we miss a lot of papermills, but if we go for high recall, we might flag too many non-papermill-products.

High precision would mean accepting that we miss a few papermill-products, but that a very high percentage of detections are what we are looking for.
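In practice, choosing comes down to where we put the threshold (the ‘line’) on a classifier’s scores. Here is a small sketch of that trade-off using invented scores rather than our real pipeline: as the threshold moves up, precision rises and recall falls.

```python
# Sketch: moving the decision threshold trades precision against recall.
# The scores are invented; in practice they would come from a trained classifier.
import numpy as np

rng = np.random.default_rng(1)
papermill_scores = rng.normal(0.7, 0.15, size=50)  # scores for papermill-products
normal_scores = rng.normal(0.3, 0.15, size=950)    # scores for normal papers

scores = np.concatenate([papermill_scores, normal_scores])
labels = np.concatenate([np.ones(50), np.zeros(950)])  # 1 = papermill-product

for threshold in (0.4, 0.5, 0.6, 0.7):
    flagged = scores >= threshold
    tp = np.sum(flagged & (labels == 1))
    fp = np.sum(flagged & (labels == 0))
    fn = np.sum(~flagged & (labels == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold {threshold:.1f}: precision {precision:.0%}, recall {recall:.0%}")
```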

We could try to find a balance between the two. But we can do better than that.

We know that papermills test the waters with journals all the time. When they find one that has a high acceptance rate for their products, they target it.

The monthly output of one journal by Papermill Alarm alert-rating (real data this time!). It looks like this journal had an ongoing problem for a while, but something caused the alert-rate to shoot up. I can’t say exactly what that thing was in this case, but I can say that - while most of the world’s journals have no alerts - this is a common pattern in the ones that do.

So the effect of using a high-precision tool would be to reduce the chances of being targeted, with a minimal amount of work (compared with a high-recall tool). When you think about it, we don’t need 100% recall. We just need to catch enough that it isn’t worth the mill’s time to keep submitting.

So there’s value in both. We want to have a high precision option and a high recall option. That’s why the Papermill Alarm issues red alerts and orange alerts. Red alerts have high precision and orange alerts have high recall.
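One way to picture the red/orange distinction is as two cut-offs on the same score: a strict one for red alerts and a looser one for orange. This is a toy sketch, not the actual alert logic, and the threshold values are invented.

```python
# Toy two-tier alert: a strict cut-off for "red" (high precision) and a
# looser one for "orange" (high recall). The numbers here are invented.
RED_THRESHOLD = 0.9
ORANGE_THRESHOLD = 0.5

def alert_level(score: float) -> str:
    """Map a classifier score to an alert level."""
    if score >= RED_THRESHOLD:
        return "red"
    if score >= ORANGE_THRESHOLD:
        return "orange"
    return "none"

for score in (0.95, 0.6, 0.2):
    print(score, "->", alert_level(score))
```

The red band gives the high-confidence detections; the orange band casts the wider net that feeds the kind of two-stage review described above.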

So you get the best of both worlds.

Contact us for more details


Adam Day

Creator of Clear Skies, the Papermill Alarm and other tools clear-skies.co.uk #python #machinelearning #ai #researchintegrity