Is Artificial Intelligence (AI) plagiarizing?

From this post @puzzlegal suggested a new thread about this so…here it is.

Artificial Intelligence is often accused of plagiarism. The notion being that it cannot “create” on its own but must, instead, draw on what has already been created (which seems like plagiarism).

But, as posted in the other thread, isn’t AI doing what we do in school and here all the time? Read some sources, distill the info and then present that to others?

Is that plagiarism?

The coming ubiquity of AI will change the very notion of plagiarism. It won’t be nearly as much of an issue as it was when text was generated exclusively by people.

As it stands now, AI uses the life work of millions of creators without giving anything back to the creators. Kinda like the platform economy in general, only worse.

Absolutely not. You’re forgetting two massively important parts: quoting and attributing others’ words you’ve used, and citing your sources. These text-generators are terrible at both of those: forgetting or failing to do them, and totally fabricating the attributions/citations when they do.

In general, it’s not the AI itself which is plagiarism. It’s the unwanted use of other people’s data in the training set. Just because someone has said it’s okay for a human to read it or view it doesn’t mean they think it’s okay to use in an AI training set.

Sure, there is also the issue when it does quote things, which it can do without attribution. And there is the argument about citing sources, but that isn’t inherently plagiarism. I’m not citing the sources I read to come to this conclusion, for example.

So while I agree those are also problems, the fundamental issue is with the training data, and is what I expect will wind up regulated.

Yes to the first question and, at least according to my English Lit teachers in school, no to the second, as long as you yourself are distilling the information from several sources and quoting/footnoting any exact reproductions in your work. For example, you can’t expect high school students to come up with something brand new on a subject written about by authors with doctoral depth of knowledge.

Plagiarizing

When AI does that it’s plagiarizing. Just like when people do that. Otherwise, no.

Thought experiment:

You want to paint a picture of an elephant, but there are no elephants where you live and you can’t remember what they look like. You do a Google image search and look at a whole load of pictures of elephants; some of these images are public domain, others are subject to copyright, some of them are just thumbnails of an image you would need to pay for if you wanted it full size or if you wanted to own a hard copy, some of them are subject to other restrictions I can’t think of right now, but all of them pass before your eyes.
You didn’t seek or require permission to put any of those images into your brain, via your eyes, but in any case you spend a while looking at and studying various pictures of elephants, understanding what they look like, how they work, etc.
Then you shut down your computer and get out your art materials. You paint a lovely picture of an elephant, that undoubtedly was derived from, or inspired by, one or more of the images that passed before you in the Google image search.

This happens all the time and I don’t think anyone has a problem with it. Even if you also studied the paintings of Van Gogh and ended up attempting to paint your own picture of an elephant, in the style of Van Gogh, I don’t think anyone has a problem with it. Try to sell it or otherwise pass it off as an authentic Van Gogh painting and sure, that’s a problem.

And, although there are differences between the human eyes and brain vs the inner workings of an AI based image generation program, the principle is rather similar; it looks at a bunch of stuff, learns about it, formulates something equivalent to an understanding of the material and styles, then generates images that are derived not from the source material itself, but from what it has learned about that source material.

The key difference between a machine doing this vs a human doing it is that the machine can do it far more rapidly or prolifically and in some cases, can produce results that, whilst they are not fine art, are above the level of the average self-trained human, and potentially it can memorise details etc more reliably than a human.

The problem (and I agree it’s a problem) appears to be not what they’re doing, but that they can do it wholesale with great ease.

I have heard these arguments, and while I am not convinced that what it is doing is “distilling” in the sense of the abstract critical analysis and understanding that we do (even if the result looks the same), let’s accept it for the moment:

Using AI would then count as plagiarism. The AI has done the work.

Otherwise, thinking of the AI as the student presenting work? Sure, with the caveats already mentioned: clear, real citations that can be followed to their sources, just like I’d expect from any other work. Using sources without that clear attribution would be plagiarism.

I’d like to throw out here the notion of AI image generators creating bizarre pictures of hands with too many fingers. This happens because the AI is very much NOT COPYING existing pictures of hands.

They’re creating a process for drawing hands by looking at a million pictures of hands. No human would ever get accused of plagiarism for creating their own process for drawing, or writing, because they developed that process by researching other people’s work.

When I play with ChatGPT4 through Microsoft Copilot, it cites its sources. It will have links at the bottom of the output that go to the websites it based what it wrote on.

How is this any different than what any other author, musician, etc. does? Humans and LLM AIs have heard or read many other works, and generate new works that are influenced by the old ones but are not the same as the old ones.

Once something is published, should the original author be able to say what is done with it?

Copyright limits making reproductions. I’m not allowed to photocopy and redistribute your book. I’m probably also not allowed to make a stage play based on the book. AI is not doing either of these things.

So the author can allow a book to be read and interpreted, but not used to create statistical models about which words tend to go together? That is a big extension of copyright.

I think a big part of the problem is that current words like “plagiarizing” and “copyright violation” do not fit what LLM AI is doing.

The answer to the original question is no, LLM AI is not plagiarizing except when it outputs copied information as if it were its own, exactly the same as when a human plagiarizes.

It’s still ok to not like what LLM AI does, even if it is not plagiarism. It is ok to not want certain works to be part of a training set, even if that is not a copyright violation.

Clearly, yes. The problem is that there wasn’t (and maybe still isn’t) a clear definition of what the use is. Specifically, I’m thinking of the typical way that things may be published with a specific license applied, such as one of the Creative Commons suite of licenses. CC licenses apply to all kinds of documents, but for the purposes of, say, photos, they’ve been used on the web a LOT for creators to be able to say “you may use this picture for non-commercial or personal use, with attribution.” E.g. you can grab it and put it in your slide deck for a conference presentation, or even as a background on your personal website or whatever, as long as you credit the creator. But you can’t use it as the logo for your restaurant, or in a TV commercial or print advertisement, or as the cover of a magazine, etc. But lots of these images have likely (and, I believe, confirmed) been scooped up for use in these training sets, and at least some of these AI projects/products are decidedly commercial enterprises.

I think that is the crux of it.

Which also suggests other problems (e.g. people are needed to produce new data…the AI can only ever steal that work and not create on its own).

That all fits under the rules of copyright though. I can’t take your picture and use it as a logo in commercial trade, because that would be a copyright violation; I need your permission to do that.

The various Creative Commons licenses are licenses to do things that would normally violate copyright. By putting something under CC BY-NC, you’re saying that I can distribute and reuse your work in ways that would normally violate copyright, as long as I’m not doing it for commercial purposes.

If an image generation AI is not actually making a copy or redistributing your logo, then it is not violating copyright. Why does it need special permission to analyze your work to create a statistical model of how the pixels are arranged?

Derivative work is a part of copyright law.

Yes, but whether or not something is a derivative has to be decided on a case by case basis by a court. Despite what Blurred Lines suggests, merely having heard something does not make all future things a derivative work. If I take your logo, add my company’s name to it, and change some colors, I don’t need to wait for a court decision to tell me that is derivative. But what if I just like the way you arranged things, and I create completely new art and words, merely inspired by your layout? Is that derivative?

If I create a statistical model of the word relationships in lots of things, and I then generate a unique block of text based on that statistical model, is that a derivative work?
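To make the question concrete, here is a toy sketch of what “a statistical model of word relationships” can mean. Real LLMs use neural networks over tokens rather than simple bigram counts, and every name in this snippet is illustrative, but the principle is the same: the output words all come from the training text, yet the generated sequence itself may never appear in it.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """For each word, record every word observed to follow it.
    Repeats in the list make frequent successors more likely."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model

def generate(model, start, length=8, seed=0):
    """Walk the model from a start word, sampling an observed
    next word at each step (frequency-weighted by repetition)."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        choices = model.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_bigram_model(corpus)
print(generate(model, "the"))
```

Every word the generator emits was seen in the corpus, and the transition statistics come entirely from it, yet the resulting sentence is a new arrangement rather than a copy of any source passage. Whether that arrangement is a “derivative work” is exactly the open question.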

But as to your last sentence, how is that different from what I do when I try to, e.g., write a short story in the classic boy-meets-girl form?

My writing will necessarily draw heavily on my knowledge of other literary works. That’s how I know what “short story form” is, and also how I know at least something about the basic dynamics of storytelling and of boy meets girl …

A critical difference between me and an AI is that I also have been a boy who has met a girl. And I know lots of other humans who’ve done the same and talked with them about that. So my work might (might) be “closer to the original experience” than someone who has only ever learned of these things from books.

I’ve never been to China. If I write a short travelogue, the adventures of so-and-so navigating Beijing and the sights and sounds there, 100% of my facts come from others’ books and writings. Or are flat fabrications. How is that different from what the AI does now?

As to art, here’s a comment I made when my late first wife was learning to draw & paint. I said “You’re not trying to draw that particular tree there in that photo you took this morning. You’re trying to draw a realistic tree that may resemble that particular tree.”

That’s what all humans do: draw realistic trees or write realistic fictions. With widely varying skill and fidelity to their original source materials.

That’s true, but also “a work must incorporate enough of the original work that it obviously stems from the original.”

I admit I got hung up on the semantics of “using the work”, and it’s correct that the license stipulates how one can use the work but rarely how one can merely view it. E.g. “you can download this image for non-commercial use, but you can only look at it on Tuesdays” doesn’t work.

I sympathize with creators who feel their life’s work is being used to train their replacements. But it’s a complex topic I’m still thinking through. Good discussion.

Yes, this is the main thing. What AI is doing doesn’t fit into our current laws very well at all. Trying to stretch those laws to allow or forbid things will probably work out worse than developing a new set of laws specifically designed to do what we want. Of course anything new can run into other problems, like First Amendment issues.

I think one huge balancing factor in play is that what is produced by AI does not fall under copyright. Sure, you can use AI to write that movie script, but now nobody owns the copyright. Still in question is how much human modification of that AI-generated script must be performed for copyright to attach.

Think of it as if everything that comes out of AI is under public domain. Clearly people make movies and such from public domain books, and those movies are under copyright. What if I take a public domain book, and just do a search and replace for some words, do I get copyright? What if I punch up Ishmael’s dialogue a bit, or change the whale from white to grey?