I think I screwed up by using ChatGPT too vigorously at my freelance work

If you were responding to that, why were you quoting from this?

And it isn’t the use case here. ChatGPT is not being trained on those sermons. They are being loaded into context memory for translation, and that memory is destroyed at the end of the session. ChatGPT learns nothing from it and has no memory of it after the session ends.

…because of this:

There is nothing in the privacy policy that states that “memory is destroyed at the end of the session.” Session isn’t even defined in either the privacy policy or the terms of use.

What it does say is that the content that you enter might be used to train the AI. This is explicit here in the privacy policy.

When you use OpenAI’s services, and if you happen to enter the sermons as content, you agree that OpenAI can use those sermons for training. And my position (which remains unchanged) is that doing that without the permission of the owners of that intellectual property is both legally dubious and ethically wrong.

And the thing is: this doesn’t even matter, because OpenAI reserves the right to use any “information that is publicly available on the internet” to “train” its AI. And that information isn’t “destroyed”: the AI can use it to create plagiarized works, as I’ve demonstrated.

But my comment was in response to this:

That post would apply to all content uploaded to youtube (and publicly available), not just these sermons. Are you saying that you’d feel differently if we were talking about someone else’s content as opposed to the church’s?

People do not need AI to create plagiarized works: all they have to do is copy some stuff and put their name on it. However, a valid concern is the essentially 100% certainty that OpenAI harvests every single user interaction with its services. I would have thought that went without saying, but they say it anyway, just in case. NB: if something was on the Internet, “they” can see it anyway and do not need some user to pipe it to their API; it is really the mining of interactive sessions that is the gold.

Various people, coders for instance, do need to use these AI tools on a daily basis, but it is possible to run them yourself where privacy is a concern.
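
If privacy is the concern, here’s a rough sketch of what “running them yourself” can look like (my own example, assuming the Hugging Face transformers package; “gpt2” is just a small stand-in model):

```python
# A minimal local-inference sketch (assumption: the `transformers` package is installed
# and "gpt2" is used purely as a small example model). After the one-time download of
# the weights, prompts and outputs stay on your own machine instead of going to an API.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Draft a short summary of this meeting:", max_new_tokens=40)
print(result[0]["generated_text"])
```

Bigger open models work the same way; the point is simply that nothing you type leaves your own hardware.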

…but the door is open to unintentionally plagiarize works. Because it was only a few weeks ago that people were saying it was impossible for AI to plagiarize, but that clearly isn’t the case. This remains a valid concern.

And to bring it back on topic: this is relevant to the situation the OP finds themselves in.

The AI was certainly trained on a dataset that included that Getty image and the Harry Potter books. We can’t tell from those examples whether it was given those up front or by people playing with it. That is, those images don’t prove whether or not anything you give it during a session is retained.

That being said, my employer is now running an in-house instance of ChatGPT, because employees had been using the general instance and had given it confidential information. Someone in my employer’s IT department thinks our confidential information is safer if employees don’t share it with the general instance of ChatGPT.

That being said, if the sermons were publicly available on YouTube, I think ChatGPT already had access to them, and the copyright issues around whether it used them existed independently of this transcription project.

My guess is the church knows he’s transcribing faster than would normally be possible by hand, knows something is different, and wants to understand what’s up so they can judge whether or not they are okay with it.

…whether it is “retained” or not is beside the point. The output is still plagiarized.

What the privacy policy states is that they can use our inputs (“content”) to further train the AI.

I asked ChatGPT if the inputs were “destroyed” and what it said was that, up until October 2021 (the free version’s knowledge cut-off date), inputs were retained for thirty days and weren’t used as training data. But it also said that the privacy policy may have changed and to rely on the most updated version.

So at one stage the data was probably “destroyed”, and wasn’t added to the dataset. But my reading of the revised privacy policy suggests this is no longer the case.

And I say, Judge not, that ye be not judged.
For with what judgment ye judge, ye shall be judged: and with what measure ye mete, it shall be measured to you again.
And why beholdest thou the mote that is in thy brother’s eye, but considerest not the beam that is in thine own eye?

ETA: all rights reserved by Copyright.

Fwiw, my employer’s experts don’t want us to tell chatGPT anything confidential.

There are undoubtedly large numbers of Getty images in the LAION datasets. Low resolution, watermarked thumbnails like anyone can see on their browsers, not the full-resolution ones you purchase. But seeing an attempt at a Getty watermark somewhere on the AI generated image doesn’t mean that the AI is recreating a specific Getty image, just that the AI learned to associate a certain pattern of pixels with certain types of image descriptions. I’ve never got a Getty watermark, but I get various other nonsense squiggles attempting to represent watermarks, signatures, bugs, chyrons, and meme captions. (There are at least two or three different text-to-video algorithms that, thanks to the training set, think that all video they produce should have a Shutterstock watermark, as can be seen in the various “eating things” videos from a couple of months ago.)

The “copies Harry Potter” AI that Banquet_Bear keeps bringing up is a customized system built on top of GPT with lots of scripts added. Standard ChatGPT does not create the second paragraph of Harry Potter and the Philosorcerer’s Stone when fed the first paragraph. As I have pointed out to him before, that is an example of overfitting (What is Overfitting? - Overfitting in Machine Learning Explained - AWS), a flaw that well-designed AI engines attempt to minimize. His loved-to-hate script generator is an example of an AI engine that is not well designed.
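
For anyone who wants to see what overfitting means in miniature, here is a toy sketch of my own (plain NumPy, nothing to do with GPT’s internals): a model with too much capacity reproduces its noisy training points almost exactly, which is the memorization-style failure being described.

```python
# Toy overfitting demo (illustration only): the high-capacity fit "memorizes" the noisy
# training data and typically does worse on points it has never seen.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 10)
y_train = x_train + rng.normal(0, 0.3, 10)     # true relationship is just y = x, plus noise

simple = np.polyfit(x_train, y_train, 1)       # low-capacity model
flexible = np.polyfit(x_train, y_train, 9)     # enough capacity to hit every training point

x_test = rng.uniform(-1, 1, 200)               # fresh points; noise-free truth is y = x
for name, coeffs in [("degree 1", simple), ("degree 9", flexible)]:
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - x_test) ** 2)
    print(f"{name}: train error {train_err:.3f}, test error {test_err:.3f}")
```

Roughly speaking, well-designed training tries to keep the model in the “degree 1” regime with respect to any individual document.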

Oh, and The Pile contains 3.73 GB of YouTube captions.

Incidentally, here is a gift link to a New York Times article about a lawsuit in which one attorney relied on ChatGPT to generate a legal brief, one that contained half a dozen citations to relevant cases. Except that all of the cites were bullshit, which soon became apparent when no one, not opposing counsel nor the judge, was able to find any of the cases mentioned.

Yes, that would be a misuse of the tech. A big problem with AI is that the general public just can’t get their heads around what’s going on. They insist on trying to use it as a truth engine, and then when it fails at that they declare that AI is useless.

Take the current conversation. There are a lot of misconceptions. For example, there seems to be a big confusion between ‘training’ an AI and ‘fine tuning it’ or using its context to analyze material that you give it.

In the first case (which ended in Sept or Oct. ’21 for GPT 3.5 and 4.0), material that is fed into it winds up modifying an unspecified number of parameters. Anywhere from none to billions. The training material is not retained, but having read it ChatGPT can use what it learned, much in the way a human does. It’s not memorization and verbatim repetition. It’s learning. For example, it’s possible for a chatbot to read something, make a zillion parameter changes, then read something else that invalidates what it read, overwriting many of the changes it made before. Or it could read something small that pushes it over a tipping point and causes a massive reorganization of its parameter structure. We don’t know. But there’s no verbatim copying of anything really going on.
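
To make that concrete with a toy sketch of my own (plain PyTorch, nothing to do with OpenAI’s actual stack): a training step nudges numeric parameters, and afterwards the model holds only those numbers, not the text it was trained on.

```python
# Toy illustration: one training step changes the weights, but the training example is
# not stored anywhere in the model afterwards; the model contains only numeric tensors.
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                          # a handful of parameters standing in for billions
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

encoded_text = torch.randn(1, 4)                 # stand-in for one encoded training example
target = torch.tensor([[1.0]])

weights_before = model.weight.detach().clone()
loss = nn.functional.mse_loss(model(encoded_text), target)
loss.backward()
optimizer.step()

print("weights changed:", not torch.equal(weights_before, model.weight))
print("what the model stores:", list(model.state_dict().keys()))  # tensors only, no text
```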

In the case of ‘fine tuning’, new layers are added to the model, and these layers can be modified by what is read. But the original layers of the trained model remain intact.
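
That description roughly matches adapter-style fine-tuning. Here’s a minimal sketch, assuming the Hugging Face transformers and peft packages (LoRA is one such technique; I’m not claiming it’s what OpenAI uses internally):

```python
# Adapter-style fine-tuning sketch: small new trainable matrices are attached while the
# base model's original weights stay frozen, so the pretrained layers remain intact.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")             # example base model
adapter_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"])
model = get_peft_model(base_model, adapter_cfg)

model.print_trainable_parameters()   # only the adapter parameters are marked trainable
```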

When using ChatGPT as a user, you have access to the ‘context’, which is the chunk of text used as ChatGPT’s prompt. You can use this for a kind of lightweight customization (it doesn’t modify the main model at all), for instance by feeding it a transcript and asking it to format it. When the session is done, that context is discarded and the original model was never modified, so nothing is changed afterwards and no IP is violated.
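
For what it’s worth, here’s what that looks like through the API (a sketch assuming the official openai Python package; the model name is just an example). The model keeps no state between calls, so the only “memory” is the message list the client chooses to resend.

```python
# Minimal sketch with the official `openai` package (API key assumed in the environment).
# The model is stateless across calls: the "context" is just whatever messages the client
# sends with each request, and nothing from the exchange is written into the weights.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "user", "content": "Here is a transcript: ... Please tidy the formatting."}]

reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
history.append({"role": "assistant", "content": reply.choices[0].message.content})

# To continue the conversation, the client has to resend the whole history; drop the
# list and the "session" is gone as far as the model is concerned.
```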

Like humans, it can repeat small passages verbatim. Just like I can repeat the entire poem ‘Jabberwocky’ on command. If someone asked me to write a nonsense poem, no doubt my attempt would lean heavily on my memory of ‘Jabberwocky’. But it’s not plagiarism to have read it, or even to have memorized it word for word, or even to use the knowledge of it to inform new work. It would only be plagiarism if I posted it essentially intact and claimed it was my own work, or tried to sell it for profit without permission. I don’t see why an AI should be held to a different standard.

Now, it’s possible (and likely) that a record of your session is recorded. But that is true of ANYTHING on the Internet. Do you keep proprietary info in Google Docs? Guess what? When you delete something in Google Docs, it is not deleted. It’s just marked deleted so you don’t see it. Google has the ability to look through everything you upload to them. Ever send a private message on Twitter? Twitter employees can and do read them. As does the government.

Even if we assume that the pastor would give permission for his sermons, that’s not really relevant. The sermon would be an input, not training data. The training data would still consist of a lot of text that they may or may not have permission for.

But, also, there’s no reason to assume that someone who wants a human to be able to access their content would also want a LLM to do so. A preacher is a pretty good example: they would obviously want a human to see their sermon and possibly save their soul. But that doesn’t mean they approve of their sermon being altered to say something else by an LLM, or even used with an AI at all.

There’s an easy illustration: most people don’t have a problem if you memorize their content. Disney isn’t suing ProZD on YouTube because he has the entirety of Peter Pan memorized. But they very much do care if you make a computer copy of the movie.

They’re not the same. What a human will do with a text and what a computer does with it are different. It makes sense that people might not treat them the same.

Furthermore, humans are legally people, and have legal restrictions that people have. LLMs are not people, and do not have those restrictions.

Plus, according to current US law, content created by an AI cannot be copyrighted. So there’s every reason for the people who want to use AI to get the laws changed.

(Whether this applies in this case, with the sermons, I don’t know. I would doubt it, because the text was written by the preacher first, and only transcribed and rearranged by the computer. But if he used ChatGPT to write the sermon, on the other hand, a lot of what it contains could be ineligible for copyright.)

My employer had a rule that we couldn’t put anything confidential on a cloud service not our own. It pays to be safe, even without evidence that the information was likely to get stolen.

I’m going to a presentation by a copyright attorney in a few hours, so I’m going to ask him about ChatGPT and copyright. But information put on the web is going to be input to both human and now machine content creators. If original content is created from it, there is no plagiarism. If chunks are copied directly or with minor changes, then there could be plagiarism. If ChatGPT regurgitated chunks of content there could be a problem.
But I’ll see what the lawyer says. This is for a writer’s group, so the question will be on topic.

It was trained on a bunch of stock photos and, from that, ‘learned’ that stock photo watermarks are expected in a certain style of photograph some or much of the time. It’s not a matter of being trained on a specific image but rather it probably has tens of thousands or more images in the dataset that included a Getty Images watermark.

Same thing as asking an AI to make an oil painting will usually deliver a fake signature in the corner. Not because of any specific painting it’s trying to copy but the thousands of oil paintings it’s seen that all have a signature so it assumes that an approximation of a signature is expected on an oil painting.
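
If you want to poke at this yourself, here’s a sketch with an open text-to-image model (assuming the Hugging Face diffusers package and a GPU; the checkpoint id is only an example and may need to be swapped for one you can actually download). The output is generated from learned associations with the prompt, not retrieved from a stored source image, which is why watermark-like squiggles can show up without any specific photo being copied.

```python
# Sketch only: generate an image from a text prompt with an open latent-diffusion model.
# Assumptions: `diffusers` and `torch` installed, a CUDA GPU available, and the model id
# below accessible (substitute any Stable Diffusion checkpoint you have access to).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("editorial stock photo of a football match, telephoto lens").images[0]
image.save("generated.png")   # any watermark-ish marks come from learned pixel statistics
```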

If you are confused about exactly how latent diffusion works, read this detailed explanation and you will still be confused about exactly how latent diffusion works.

If the lawyer is not an expert in AI, he may give you a bad answer. A lot of people seem to have based their opinions on the legality of AI around the notion that all the AI is doing is copying and paraphrasing other works. “AI just copies and paraphrases other people’s work. It does nothing original on its own.” is a common statement, which is also wrong.

…there are undoubtedly large numbers of Getty images that Stability AI unlawfully copied and that were protected by copyright and the associated metadata owned or represented by Getty Images absent a license to benefit Stability AI’s commercial interests and to the detriment of the content creators.

Yes. This is true.

So now the goalposts have shifted, yet again.

So allegedly standard ChatGPT will not plagiarize, because “reasons.” It’s the fault of the customized system built on top of GPT.

The thing is: we know that standard GPT is absolutely CRAP at creative writing. It can’t write a script past a couple of pages. It is a TERRIBLE creative writer. It can’t even remember that it “killed the cat.”

So the only hope that AI will ever get to the point of doing any of the things people are imagining it can do is to do things like “add lots of scripts.” Create customized systems. If people don’t do that, it will never have a hope of ever “replacing the writers room.”

You don’t get to pick and choose what is and isn’t AI. And I’ve stopped believing any of the hype, especially claims that “it doesn’t steal” or “it doesn’t plagiarize.”

And you shouldn’t be entering sermons that aren’t your intellectual property into ChatGPT without permission.