Is Artificial Intelligence (AI) plagiarizing?

Quite right overall there @Sam_Stone. I went on sabbatical from here a while ago, largely concerned not about AI, but just the advertising dossier programs that were scraping the internet to build up a saleable profile on all of us.

This will only get worse with time, and as @Stranger_On_A_Train said, 100% of whatever each of us has ever put on the net has already been slurped more than once. And that can never be undone.

We can each change our future by disappearing from the internet. Which would entail also giving up having a mobile phone personal surveillance device. But our past work is an open book that’s already been read and photocopied for their use. For quite a variety of "they"s of widely varying malevolence.

As to my return here, and elsewhere on the net, and my using customer loyalty programs all over the place for the pittance of compensation they offer for all that sweet sweet data, that was simply me realizing resistance was futile. I’m going down in the same ship as all the rest of us.


I do think it’d be really interesting to get a personal GPT and train it on my corpus and turn it loose writing practice posts. I’m sure I’d be fascinated by some and horrified by others.


As to your comment about Alzheimer's, we had a thread 4 months ago sharing tales of our cognitive changes. I came away from that feeling a lot more normal about the changes I see in myself as I pass age 65.

At the same time, now that I’m retired I did just get a full neuro- and memory work-up. As my PCP put it:

You’re smarter than the average bear and could lose a lot more than some other folks could before it got obvious. I think you’re fine, but if you don’t think you’re fine, let’s get a baseline now and check again in a couple years. Can’t hurt, might help.

You, Sam (and plenty of other Dopers in this thread), are also smarter-than-average bears and might benefit from the same process. You can’t fight what you can’t see and don’t know about.

Maybe not a great idea taking medical advice from phenylcyclohexyl piperidine.

While there is some truth to this, the threat is probably larger from mostly-programmed software, like what search engines use, than from the current generation of programs that get called “AI” like ChatGPT and the like. A search like you’re describing does require a little bit of cleverness, for things like recognizing phrases that have similar meanings, but it’s mostly just brute force on large datasets, which is not what LLMs are best at.

As my PCP said in another vein one day:

Tell me your drug of abuse and I’ll tell you what problem you’re treating.

IOW, that’s just the drugs talking about other drugs. Whoa, dude, that’s so meta!!

The answers you will get will be very different between those who create for recreation, those who create to earn income, and those who don’t think of themselves as creators.

I illustrate books, so I am sensitive to being ripped off by someone photocopying, stealing my jokes, or using AI to save them drawing it themselves or paying an artist.

OTOH I am casual about using ChatGPT to help with arcane programming tasks ( my other job).
I used to pay expert coders to help, but I have ChatGPT now…
I do credit open-source software where used.

There are far more mission-critical issues for non-creators to worry about with AI, like world insecurity, child labour, politics, loss of truth, educational decline, all the human failings which will be magnified ruthlessly in AI and in its unintended consequences.

Let the creatives worry on their own account. You can just enjoy the movies, as usual.

I’m an editor on several StackExchanges, and this is a big issue over there. They have disallowed any AI-generated content, because people started flooding the questions with AI answers. Some are wrong, most are correct, but the whole point of StackExchange is to connect humans up to help each other, and they felt that the process of humans writing the answers is important. I understand that.

On the other hand, it may also have been self-protection. Allowing AI answers would likely kill StackExchange, as people might as well just go directly to the AI and ask and not have to wait for a human to answer. And that would destroy a lot of value because most people using StackExchange do not post, but they get a lot out of reading many questions and answers.

So whether to allow AI or not really depends on the mission. It’s perfectly valid to say “No AI content at all” if you think that allowing it would harm the SDMB over time, or that it wouldn’t be fulfilling its mission.

A colleague of mine wrote two books in the Asimov universe. The first one was not obviously so, but the second one was. Is that fair use? What he did was change all the names. So Hari Seldon became the Psychohistorian and the planet Trantor was renamed Splendid Wisdom. And so on. He told me it was to avoid questions of copyright. I am unsure whether using another’s universe like that violates copyright. I am also unsure if changing all the names avoids the violation.

That’s more or less what E.L. James did when she rewrote her own Twilight fan fiction as Fifty Shades of Grey.

Fan fiction does generally come under fair use if it’s not being used to make money; however, that is going to be decided on a case-by-case basis, as mentioned above. Some authors have harassed websites containing fan fiction with DMCAs to the point that it was easier to just stop hosting it.

This is why Archive of Our Own exists, and it ensures that people are not trying to make money with it - for instance, no links to patreon or ko-fi, and no conveying that the author will write similar for money.

This was not fan fiction. It was published by Tor, a major SF publisher: Psychohistorical Crisis by Donald Kingsbury.

As an aside…my GF was once an editor for Tor (one of her first jobs after college).

If you didn’t (or did) get published it might have been her doing (at least in part).

A new wrinkle regarding this:

The New York Times sued OpenAI and Microsoft for copyright infringement on Wednesday, opening a new front in the increasingly intense legal battle over the unauthorized use of published work to train artificial intelligence technologies.

Thinking on this some more…

As long as parts are not copied verbatim from the NYT can it be plagiarism? I mean, if you or I read the NYT and form some new ideas can the NYT sue us because we learned by reading their articles?

Seems to me that is what the AI is doing. Learning as we would learn…by reading stuff. It just does it on a scale that is far beyond us mere humans.

Yeah, New York Times articles were also used to train every single human writer currently employed by the New York Times.

makes sense, what you say …

BUT (IANAL)

… there might be a catch-all phrase on the NYT website saying that any commercial use of their content needs to be authorized in writing by the NYT company (pending funds)…

might (or might not) be enough to sue … I’d be interested in hearing a more authoritative voice on that.

That has nothing to do with plagiarism, though. Maybe they could try to play up some putative license violation, or Aaron Swartz-style “illegal” mass downloading.

The article says nothing about plagiarism - there are other ways of violating copyright. They are complaining about the unauthorized copying of their copyrighted material for training. If there were a plagiarism charge, you’d think they’d give an example and demand the plagiarized material be deleted. They do not; the request is for deletion of the knowledge base built on their material.
Is this really a violation of copyright? That’s what the court case will determine.

This is an area where we may need new laws, or an update to existing ones.

Mostly I agree that ‘learning’ shouldn’t be punished. But there are some gray areas. For example, a reviewer spends a week testing equipment and writing a ‘best of’ article, and an AI scrapes it and uses it to answer questions about best products. No attribution is made, the original author gets no credit, and the AI company profits from work it didn’t do.

The counter to this is that there are lots of web sites that summarize reviews or compile top ten lists from other reviews. The difference is that they almost always link back to the original review. The AI may not even tell you the source of the data.

I don’t know how to solve that, but it’s certainly an issue. There is value in people reviewing products, and if we don’t fix that, we may stop getting in-depth product reviews at all. That tells you something is wrong. The same goes for other journalism that requires time and money to create. Why send a reporter to a war zone if 90% of people will just read an AI summary of the work and no one gets paid except the AI company?

Really asking here (I do not know):

If you make a list of the five best steak knives and I also make a list of the five best steak knives which happens to be the same as yours am I legally obliged to mention you (assuming I never used your words)?

No, I don’t think so. But if you copy the test data someone else constructed and use it? I’m not sure. I’m not a lawyer, and this seems like a grey area. But if the practice is allowed, why would anyone do any original testing?