Is Artificial Intelligence (AI) plagiarizing?

I think this is really the crux.

There have been earlier revolutions in copyright caused by automation.

Way back in the 1950s, lists of things like a telephone directory could not be copyrighted. The theory then was that there’s no creative work there, just a mindless compilation of dry facts.

Then computers arrived, and other companies could, as a practical matter, copy those lists and re-sell them. At which point the original compilers (the telephone companies) complained about the loss of potential revenue. The end result, after a decade of wrangling and court cases deciding all over the map, was that compiled lists of bare facts became copyrightable, even though no single fact within the list is copyrightable.

The rationale was that what was protected was the “effort to compile” which was embodied in the collective result. It’s essentially a variation on the age-old philosophical question of “how many grains of sand comprise a ‘pile’?”. No single grain is copyrightable, but the pile surely is.

But what drove the need to change this was that automation suddenly made it practical to copy (plagiarize?) a list of 100,000 names and addresses, to the economic detriment of the compiler. Sheer impracticality had been barrier enough before. Until suddenly it wasn’t.


Similar issues have come up in privacy contexts. Before the widespread computerization of small business, many state driver’s license bureaus considered the roster of drivers, addresses, and license numbers to be matters of public record. Anyone could go visit the office, fill out a paper form, and pull the record on anyone else. What prevented most of the abuses of this open data was simply the difficulty of doing so unless one had a darn good reason to want to get info on Joe Schmutz specifically.

Then bureaus started offering the whole file for sale to marketers, in an easy-to-use computer-to-computer format. Or even gratis, not for sale at all. After all, it’s all public records, right?

Suddenly the practical difficulties of trawling the whole database and e.g. junk-mailing everyone simply evaporated.

Of course this got worse as the early WWW got off the ground and some states made that roster freely available for download to anyone. As a result, many states have revised their laws and policies such that the protections on getting the bulk data are greater than the protections on getting any single individual’s data. And in at least some states, even single individual data is protected.

What changed was the ease of bulk harvesting. Nothing else. But it was a quantitative change which had a huge qualitative impact. And laws were changed to reduce that qualitative impact.


I believe the use of copyrighted works as part of training data sets will eventually end up being decided via an analogous process. LLM generative AI doesn’t alter what humans have always done with copyrighted works: read and absorb / digest them, then use them to indirectly inform the production of new works.

What LLM/AI does is alter the scale, scope, and speed. Which amounts to the “commercial convenience” of doing something. What used to be practically impossible has very suddenly become practically costless / trivial. That is another strictly quantitative change which has vast qualitative implications.

Our society will be thrashing this out for another decade or two.

I think that the two primary wrinkles are:

  1. A human could come up with a new idea and a new style, creating some whole new area of art. He won’t, necessarily, but he could.
  2. A human will almost always narrow his output, based on his personal biases and sensibilities. Even if he’s trying to copy another person’s work, he’ll almost invariably end up creating something else, just because he’s his own person with his own tastes.

There are some avenues for an AI to approach the above.

For the first, you could, for example, tell it to calculate a diff between two art styles, convert that to a vector, extend the line, and then tell it to produce a work in that new space. That’s really the human giving the guidance on how to come up with a new idea, though. It’s been forced to do it, based on a mathematical formula. The AI didn’t, itself, have the idea to do that.
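As a side note, here is a minimal sketch of that “diff and extend the line” idea in Python. Everything in it is hypothetical: the four-dimensional vectors are made up, standing in for whatever style embeddings a real trained encoder would produce.

import numpy as np

def extrapolated_style(style_a: np.ndarray, style_b: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Treat the diff between two style vectors as a direction and extend it.

    k = 0 is style A, k = 1 lands exactly on style B, and k > 1 keeps going
    along the same line, into a region of style space neither artist occupies.
    """
    direction = style_b - style_a   # the "diff" between the two styles
    return style_a + k * direction  # extend the line past B

van_gogh = np.array([0.9, 0.1, 0.3, 0.7])  # hypothetical style embedding
monet = np.array([0.4, 0.8, 0.6, 0.2])     # hypothetical style embedding

new_style = extrapolated_style(van_gogh, monet, k=2.0)
print(new_style)  # [-0.1  1.5  0.9 -0.3] -- past Monet on the Van Gogh-to-Monet line

Note that every creative decision in that sketch - which two styles to diff, which direction to extend, how far - was made by the human who wrote it.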

For the latter, we might say that the AI is using the knowledge that it’s gleaned from its entire training set to “fill in the gaps” when it’s taking one artist’s works and trying to create something new in that style. There’s not enough information in just the one artist’s portfolio to create any concept, so it will necessarily have to use training from other sources to help. That underlying training will influence - to some extent - how it approaches the matter and, perhaps, add a unique spin to the output. Likewise, you could hand-pick a training set that has a particular spin on it, in addition to the target artist’s works, to gain that impression of a “personal style”. But that style is the direct result of the hand-selecting that went into the training set. Again, it’s really the human maker that caused this to happen, not the AI.

If you made a camera that swirls the image and reverses the colors in the photos that it takes, you might get some sort of artsy output. That manipulation is the result of a human having chosen to apply those particular distortions. The machine is just a machine, doing what it was built to do.

For what it’s worth, here are two recent legal stances on the issue (separate judges tossing out parts of lawsuits):

https://www.reuters.com/legal/litigation/judge-pares-down-artists-ai-copyright-lawsuit-against-midjourney-stability-ai-2023-10-30/

https://www.reuters.com/legal/litigation/us-judge-trims-ai-copyright-lawsuit-against-meta-2023-11-09/

Funny you should mention van Gogh (and painting animals) because that’s an example I was thinking of bringing up in this thread. I like to stress-test image-generating AIs with complex and unusual requests mixing more than one subject and/or more than one style. One of my early test prompts for Dall-E 3 was “A photo of a chimpanzee dressed like Vincent van Gogh holding a painting of a giraffe by Monet”. No result managed to depict all of those elements. But one surprised me when it produced a full (and at first glance fully accurate) copy of Starry Night. But when you compare the reproduction and the original side by side, you see that, while the copy is extraordinarily similar to the original, there are differences in details that show that DE3 did not paste in a copy of Starry Night stored in a file but simply (or not so simply) had a very clear idea of what Starry Night should look like and reproduced it on the fly.

I then tried asking both Dall-E 3 and SDXL for “Starry Night by Vincent Van Gogh”. Some of the images produced are remarkably similar to Starry Night, but none are identical to Starry Night and none are identical to each other. No copy/pasting, just a surprisingly detailed concept of the painting.

[Images omitted: Dall-E 3 and SDXL attempts at Starry Night]

I think I tried a few other famous paintings at the time and none of those had a similar level of recognition, but earlier this week I tried for an “ELO on a shelf” image (as a pun on Elf on the Shelf) using Dall-E 3. I was hoping for a tiny band playing on a shelf, but what I got was four versions of a specific album cover in a very 1970s stereo-setup image. (In Bing, ChatGPT adds details to the prompt before it is passed along to the Dall-E renderer. Even though I didn’t ask for those kinds of details, ChatGPT probably did.)

Again, the generated album covers share a lot of details with the real album cover, but none are identical to the cover or to each other.

So do/should any of these similar but not directly copied images (both van Gogh and ELO) rise to the legal standard of plagiarism or copyright/trademark infringement?

Use probably comes into play here. What are the images being used for?

If you took one of those Starry Night images by itself and tried to sell it as a Van Gogh print, obviously that’s a real problem. If you used that fake ELO cover as a real album cover for your band ‘ELP’, trying to fool people into thinking it was ELO or otherwise trade on their reputation, that’d probably get you into trouble.

But all the examples you showed, in the way you showed them, are fair use. Derivative works, clearly not meant to be interpreted as works of the original author, not being used for financial gain or claimed to be the work of someone else, and not a substitute for the original creator’s work.

My naive, non-expert opinion is that they should count as “derivative works,” like a popular song that is a cover of, a parody of, or heavily samples from, an existing song. In such a case IIUC the original songwriter(s) would get songwriting credits and royalties.

Wait. How is that “fair use”? Copyright protection, in the US at least, protects against derivative works.

Starry Night is public domain, so all of those are safe, though I understand that is not the point you are trying to make! When I asked GPT4 about this, it said, “it is advisable to consult a legal expert before using any images of Starry Night for commercial purposes”.

Is having a flying saucer on an album cover copyrightable? As you say, it will depend on context, and like almost all copyright questions, the unsatisfying answer is that there is no clear answer until all appeals have been exhausted.

Parody and sampling may be acceptable as fair use, so no permission is necessary. Again though, you don’t know for certain until a judge gives a final ruling. Covering a song may require an ASCAP license or similar.

Just to make sure things are extremely confusing, the details of copyright depend on the medium (written text, performed music, etc.), who is involved (author, performer, work for hire, etc.), when the work was created, and where the work was created.

Exactly so.

I hope so, but one thing that is different about this is the balance of power; that time, it was corporations vs other corporations who wanted to profit from their efforts. This time it is corporations profiting from the efforts of individual artists (well, and other corporations, I suppose, but not exclusively), at the same time as exerting significant competitive pressure so as to reduce the financial resources of the individual artists. Maybe class action stuff will prevail, but in some ways, this feels like a case where the corporations might prevail by sheer brute force advantage.

Assuming the AIs don’t decide that the problem most pressingly requiring a solution is our existence.

I almost closed with one last sentence eerily like your own. Great minds and all that. :slight_smile:

The AI may not be making a copy, but the people putting the data into the AI for training are. That’s my point—to sidestep the issue of whether the AI itself is copying by focusing on what the humans are doing.

And copyright is all about being able to control what is done with published works. That’s its purpose. It exists to prevent people from exploiting the published work of others for their own gain. There are certain exceptions, often called “fair use,” but those are made based on human rights and human capabilities. Remember, copyright is about maintaining the public good.

What you did in your last paragraph is exactly the issue I see. Why should I treat the human right to read and interpret information as meaning that it’s okay to copy data for an AI to “interpret” in its statistical models? Sure, the current law allows it, or AI designers would be sued or in jail. But the OP’s question is clearly a moral one about what should be, not a legal question about what is.

You can narrowly define the term “plagiarism”, if you want, and say it doesn’t apply here. But I think that’s missing the point. “Plagiarism” is being used here to link it to the underlying moral idea. The core of plagiarism is about the unauthorized use of other people’s work to benefit oneself. That’s why we consider it wrong.

If we do what you say and treat all this like something completely different, then we don’t have the moral framework in which to argue if it is wrong.

I think this needs a closer look.

What you write is some information (data) and words to link it together.

Your words are yours and you can do a lot with them to express your idea(s).

But data is data and, I think, AI is just hoovering up data and not parroting your expression.

For example, cooking recipes cannot be copyrighted. What can be copyrighted is the story you tell about your grandma’s meatball recipe. The AI does not care about your grandma and just nabs the recipe for meatballs…which is legal.

And this gets real tricky real fast. When I go to a web page my browser is making a copy of all the copyright-protected data on that page. It has to; that’s how it works. My browser may even make decisions about which pieces to retain as local copies, and which to delete when I navigate away from the page. This is why digital things have convoluted licenses: strictly speaking, violating copyright may be necessary for any use at all.

My point is, a law is going to have to be very narrowly crafted to make copying online data illegal when only software will view it, while keeping it legal when software copies it so that humans can view it. Also, try not to ban search engines at the same time.

Morally, I have very conflicting views. I’m definitely a copyright minimalist. I don’t think copyright should be abolished, but I would rein it way back, just to make my biases clear.

I think a creator should have control over what happens with their work, but I also think that some decisions in that regard are going to have consequences. If a work is made public, it doesn’t need to lose copyright, but the creator is going to give up a lot of control over what can be done with it. This may include things like being indexed in a search engine, or used to train an LLM. It also might mean inspiring other creators, or spawning discussions that portray the author in ways the author doesn’t like.

Right now, the search engines and bots that are better citizens do obey the creator’s wishes as set out in robots.txt files. It wasn’t always this way, and it is purely an honor system, but if a creator doesn’t want their website used in training data, then

# GPTBot is OpenAI's training-data crawler
User-agent: GPTBot
Disallow: /

# Google-Extended is the token Google checks before using pages for AI training
User-agent: Google-Extended
Disallow: /

will take care of the two largest.

So here’s a related issue that I find somewhat troubling.

When OpenAI announced “GPTs” (personal GPT instances you can spin off, fine-tune with your own data, and control with your own system prompts) I had an idea: Why not download the entire archive of everything I’ve posted on the SDMB, train an AI on it, and use it to query things like where I’ve changed my mind, whether the complexity of my writing has changed, what things I’ve said in the past that subsequent events showed to be wrong, etc.? In short, a tool for examining yourself, the quality of your thinking, and the changes your thinking has undergone since joining.

I’ve got several risk factors for Alzheimer’s, so I worry about losing my mojo and not realizing it. An AI trained on my historical corpus of writing could be used to alert me to symptoms of early mental degradation.

But once I had the data, I stopped. Did I really want to know that stuff? What if it tells me that I’m getting dumber as I age? What other ugliness might I find? What if I find horrible inconsistencies in my logic over time? What if it tells me I’m getting mean as I grow older?

Then I realized that the owners of the Straight Dope could do that now, for everyone. Mods could load up a user and query the user’s history to help make mod decisions. Someone looking to ‘get’ someone on the Straight Dope could have an AI scour everything the person has said here looking for ammunition.

I thought hard about what people could learn about me from having an AI analyze everything I’ve written on here, good and bad. And it’s a LOT. Way more than I’d like, for sure.

So what happens when someone writes a script (or has an AI write a script for them) to scrape every message off the SDMB, then trains an AI on it to look for dirt, or to identify anonymous people, or to evaluate a potential employee or credit applicant, or whatever? That comment you made about skipping a loan payment 20 years ago might come back to haunt you.

We’re probably going to need some clear rules around privacy and AI scraping at some point. Or people will become much more careful about what they post online.

Facebook feeds would be another great way for an AI to tell someone everything they need to know about people. Add in your TikTok, Twitter, and Instagram postings, and you are probably a completely open book.

Another thought I had: If I trained an AI on every post I made here, then told it to respond as if it were me, would I like the output? Could people tell it wasn’t me? I would never carry out that experiment here, but it’s an interesting one.

Tying it to the copyright issue, there are a lot of potential privacy issues that we don’t care about because of security by obscurity. If humans have to go looking, the volume of stuff out there is probably way too big for casual searches of obscure data. But with AI in the mix, a question like “Has Sam Stone ever admitted to a crime?” or “Has Sam Stone ever admitted to failing to make a loan payment?” or “Has Sam Stone ever admitted to using illegal drugs?” would get an instant answer.
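To make “instant answer” concrete, here is a minimal sketch of the retrieval half of such a system, assuming a corpus of scraped posts. The embed function is a toy stand-in (hashed character-trigram counts); a real system would use a trained embedding model, likely with an LLM on top to phrase the final answer.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy text embedding: hashed character-trigram counts, unit-normalized.
    A real system would call a trained embedding model instead."""
    vec = np.zeros(512)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 512] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_index(posts: list[str]) -> np.ndarray:
    # Embed the whole corpus once, up front.
    return np.stack([embed(p) for p in posts])

def query(index: np.ndarray, posts: list[str], question: str, top_k: int = 3):
    # All vectors are unit length, so this dot product is cosine similarity.
    scores = index @ embed(question)
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), posts[i]) for i in best]

posts = [
    "I once missed a loan payment back in 2003.",
    "My favorite Van Gogh painting is Starry Night.",
    "Recipes can't be copyrighted, only their expression.",
]
for score, post in query(build_index(posts), posts, "ever skipped a loan payment?"):
    print(f"{score:.2f}  {post}")

The point of the sketch is the cost structure: embedding the corpus happens once, and every question after that is a single matrix multiply, which is why the answers feel instant no matter how long your posting history is.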

Things we tolerate at small scales can become destructive and unacceptable at scale. Another example of this is surveillance. There are no laws against a cop sitting outside your house watching you. We don’t worry about that too much because cops are expensive, and no one is going to spend money on a fishing expedition just in case you do something wrong. So cops generally only watch people who give them a good reason for doing so.

But when the cop is replaced by a camera and a database, the cost of surveillance drops to almost zero, and now we need to worry about it. AIs are going to lower the cost of data searches to almost zero at some point, and then our long trails of writing on the internet are going to come back to haunt us. Or some of us, anyway.

Seems like a sort of Turing test. It’d be interesting (I think).

A Turing test in the limited environment of these message-board posts and responses, yes. It might pass that, but fail a more general Turing test of open-ended questions.

If you only cite “quotes and words you’ve used” and not summaries or paraphrases, you’d still be guilty of plagiarism. From the MLA Style Manual:

Include a parenthetical citation when you refer to, summarize, paraphrase, or quote from another source…Even if you put information in your own words by summarizing or paraphrasing, you must cite the original author or researcher as well as the page or paragraph number.

That ship has sailed, as have protections for anything that has been published or scanned electronically and made available on the internet. There are no protections that will prevent unscrupulous users from training on copyrighted data or using trademarked images to produce virtually indistinguishable fakes. And not only is the privacy of your attributable data long gone; it will not be a difficult task for a suitably sophisticated AI to find patterns in your data, link anonymized data to you with a high degree of confidence, and potentially even make accurate predictions of your future behavior, or come up with algorithms for how to influence you to do essentially anything.

Stranger

“If you steal from one author, it’s plagiarism; if you steal from many, it’s research.”

–Wilson Mizner, some time before 1933

I try to operate under the assumption that anything I post here, one of my students will stumble upon sooner or later.

Yeah, I do something similar. In fact, in other places I used to post under my own name, on the theory that if I’m embarrassed to post something under my name, I shouldn’t be posting it at all. How naive I was…

But even if you try to make every post as student-friendly as possible, are you sure that if someone had an easy way to query your entire output here, they couldn’t find something embarrassing or incriminating?

It doesn’t even have to be the content. How many people here would have a problem if a future employer queried an SDMB AI and asked how many times a certain poster posted during work hours in their last job? Or who posted at all hours of the night during weekdays?
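That particular query wouldn’t even need an AI; a few lines of code over scraped timestamps would do it. A minimal sketch, with hypothetical data:

from collections import Counter
from datetime import datetime

# Hypothetical scraped data: one ISO timestamp per post by some user.
timestamps = [
    "2023-11-07T10:15:00", "2023-11-07T14:40:00",
    "2023-11-08T02:05:00", "2023-11-09T11:20:00",
]

def posting_profile(stamps):
    """Bucket posts into work hours (Mon-Fri, 9-5), late night, and other."""
    profile = Counter()
    for s in stamps:
        t = datetime.fromisoformat(s)
        if t.weekday() < 5 and 9 <= t.hour < 17:
            profile["work hours"] += 1
        elif t.hour < 6:
            profile["late night"] += 1
        else:
            profile["other"] += 1
    return profile

print(posting_profile(timestamps))
# Counter({'work hours': 3, 'late night': 1})

What an AI adds is removing the last of the friction: no script, no scraping, just a question asked in plain English.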

AIs enable a lot of privacy violations by being able to search massive data almost instantly.