This is Ralph. How tall is Ralph?
It seems simple: you could just hold a ruler up to the screen. But when you look at the ruler and use it to measure Ralph, are you actually measuring Ralph, or are you measuring the ruler and using that as a proxy? How accurate is your measurement? Are you including fur in the measurement? What if Ralph were to stand on his hind legs, like a mighty bear? How tall would he be then?
This is a well-known problem in measurement science: often you can only measure a proxy of the thing you actually want to measure. We see this all the time, e.g. with citation metrics being used as proxies for research quality.
Proxies for misconduct
Plagiarism is no exception. A student recently asked me “what percentage of plagiarism is ok?”. There was clearly some confusion. Plagiarism isn’t ok. It’s like asking “what percentage of crime is ok?” or “how many Brussels sprouts would you like?”. The answer is zero. I think that confusion stems from the fact that we use copying of text to measure plagiarism. So people think that copying text is plagiarism. But it isn’t; it’s a proxy.
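To see where the “percentage” comes from, here is a minimal sketch of the kind of number a text-similarity checker reports. This is a toy illustration under my own assumptions (the function names and the word-trigram measure are mine; real similarity checkers are far more sophisticated): it measures how much text two documents share, and nothing more.

```python
# Toy sketch of a text-overlap score, the kind of "percentage" a
# similarity checker reports. Not how any real plagiarism tool works.

def trigrams(text: str) -> set:
    """Return the set of 3-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def overlap_percentage(doc_a: str, doc_b: str) -> float:
    """Jaccard overlap of word trigrams, as a percentage."""
    a, b = trigrams(doc_a), trigrams(doc_b)
    if not (a | b):
        return 0.0
    return 100 * len(a & b) / len(a | b)

submission = "the cat sat on the mat and looked at the dog"
source = "the cat sat on the mat and stared at the dog"
print(f"{overlap_percentage(submission, source):.0f}% trigram overlap")
# -> 50% trigram overlap
```

The score tells you only that two texts overlap. Whether that overlap amounts to plagiarism still takes human judgment about attribution, permission and intent: exactly the issues below.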
When dealing with re-use of text, there are 3 issues that journal editors are concerned with:
- Copyright infringement. This is a legal matter. For any piece of written work, there is a rights-holder. That might be the original author, their employer or a publisher. If someone copies text without the rights-holder’s permission, that is copyright infringement, and it’s illegal. Copying text is a great proxy for copyright infringement because it’s very easy to find and prove, but it is still not exactly the same thing.
- Plagiarism. This is an ethical matter. Plagiarism, put simply, is theft of ideas, or presenting someone else’s ideas as if they were your own. This is a far more thorny concept and it’s much harder to prove directly. What if two people just happened to have the same idea by pure coincidence? Then there are cases where it’s accepted practice to use other people’s ideas. E.g. it’s ok to copy StackOverflow solutions because that’s what they are for. But it isn’t ok to copy other people’s scientific work without attribution. And that’s the key point: to avoid plagiarising, always cite your sources.
- Quality. Let’s imagine a situation where you have the rights-holder’s permission to reproduce their work. Let’s also say that you’ve cited your source properly and it’s clear that you are not plagiarising anyone. In that situation, can you re-use their work? It still depends. If an editor sees someone re-using so much text that it becomes hard to see the novelty in the paper, they might choose to reject on that basis, i.e. simply because repetition is boring!
So there are 3 issues that journal editors might be concerned with when authors re-use the work of others. If you think about it, they are all connected, but plagiarism is the only one that is a research integrity issue.
To answer the question: plagiarism can’t be measured. There is no unit of measurement for it, so there is no percentage that is ok. It’s like trying to measure Ralph with a ruler.
Generative AI
Computers are machines that copy. Fundamentally, that’s what they do. When I type, the letters on the keys are copied onto the screen, copied to a server, and then copied to your computer screen. (Here they are!) This is why plagiarism is so easy with a computer: copy and paste is a piece of cake. Generative AI is much more complex, but fundamentally it is just taking human input, processing it, and showing it to other humans.
Generative AI has come under some fire for copyright infringement. But there’s a separate issue: there’s no doubt that generative AI presents people’s ideas without attribution. I don’t think it would be fair to say that generative AI is just automated plagiarism, but if it doesn’t cite its sources, that is certainly part of what it is.
Keep in mind that the most permissive license typically found on any research paper is the CC-BY license. That’s a license under which the author gives up almost every right they have *except* the requirement that they be cited. (The “BY” stands for attribution.)
The problem gets worse when you consider that, even if AI does cite its sources, there’s nothing to stop users removing the citations and then passing off the generated text as their own work.
I also think there’s quite a distinction here between generative AI and other kinds of machine learning. You can train models for lots of tasks (including generative tasks) on someone’s work without passing that work off as your own. Elicit.org, which is already a few years old, deserves a shout-out for its AI-based scientific question answering (which cites its sources!).
It took a few weeks for me to tire of seeing examples of [whatever] generated by ChatGPT. Now when I see new examples, I quickly scroll past the ChatGPT output and look for something written by a human.
That’s why I’m making it a rule never to use generative AI to make content for this blog: partly because it’s potential plagiarism, but also because it’s boring.
There is no AI-generated content in this post. Except Ralph. Ralph was fake.