Do AI-powered writing assistants like ChatGPT or Grammarly "scrape" their clients' creative content?

I’ve resorted to ChatGPT for simple fact-checking now and then (with uneven results), but I have major concerns about feeding any of my creative or professional content into generative AI writing assistants such as GPT, Jasper, HIX, or Grammarly, mainly because I suspect the companies that produce these AI products scan their clients’ creative content to further improve their products and services. All AI models were created and refined by being fed enormous amounts of original content. Who is to say these companies maintain a firewall between their clients’ creative content and their product development teams? It simply defies reason and life experience.

During work sessions, ChatGPT tells me it’s safe to ask it any questions or feed it any of my creative content, since the company does not retain or store anything generated during our user sessions. Yet a separate disclaimer cautions me now and then that content produced or fed into GPT during our work sessions is indeed retained and may later be used to train their AI models. When I seek clarification on that issue, GPT obfuscates, then finally admits that, yes, it may indeed scrape user content to further refine its products. Looking at the larger picture, in December 2023 the New York Times sued OpenAI and Microsoft for using millions of the Times’ articles to train their chatbots without permission or compensation. As things currently stand, all that AI users have is a pledge from these generative AI companies that their creative content will not be stolen, and I do not believe these bland assurances.

On a related note, is creative content sitting in a company’s cloud storage subject to being scraped and sold to an AI company? Some have alleged that Google steals/scrapes creative content stored on its cloud, but Google has hotly denied these allegations.

I hope this can be kept in General Questions, as I’m looking for factual answers.

[Moderating]
Before anyone posts a quote of what ChatGPT itself says on the subject, I’d like to remind everyone that the chatbot itself is not a reliable or authoritative source. Quotes from human OpenAI executives or programmers, on the other hand, would be authoritative (though they could still be unreliable).

From the “What is ChatGPT?” Frequently Asked Questions:

Will you use my conversations for training?

When you use our services for individuals such as ChatGPT, we may use your content to train our models. You can opt out of training through our privacy portal by clicking on “do not train on my content,” or to turn off training for your ChatGPT conversations, follow the instructions in our Data Controls FAQ. Once you opt out, new conversations will not be used to train our models.

We don’t use content from our business offerings such as ChatGPT Team or ChatGPT Enterprise to train our models. Please see our Enterprise Privacy page for information on how we handle business data.

Whether you trust OpenAI (a company that has been anything but transparent about its acquisition and use of training data) to follow its own claims that users can opt out is beyond a question of fact. Keep in mind, though, that even though the development arm of OpenAI is ostensibly a non-profit research foundation, it is funded by and works in support of what is intended to be a profit-earning ‘subsidiary’ that is highly motivated to make a return on the massive investment that has gone into developing and training its GPT models, as is its main investor, Microsoft, a company with a long history of ethically and legally dubious business practices. The same is obviously true of Google and other large language model (LLM) developers, who ultimately hope to turn their models into multibillion-dollar profit machines through business cases that remain ambiguous and (in the opinions of many experts) doubtful. All of these developers are becoming increasingly hungry for data to expand their training sets because of the massive amount of information needed to make these models capable of such sophisticated responses, so they are highly motivated to find novel data, particularly data that pertains to their users.

Using an LLM to perform “fact-checking” or to provide any information that cannot be readily verified by independent means is ill-advised. It should be understood that what these machines do is essentially sophisticated pattern matching, using trained statistical models to predict an appropriate string of words in response to a prompt. That they can do so in such an impressively diverse way is not because of some rich symbolic model of the world but because of brute-force training on truly massive amounts of text, far more than a human being could read and comprehend in thousands of lifetimes. They do not have any inherent mechanism to distinguish fact from fabrication, and in fact their entire purpose is to synthesize a response that aligns with the prompt regardless of whether the question is well formed or legitimate.
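To make the pattern-matching point concrete, here is a deliberately toy sketch in Python. The tokens and probabilities are invented for illustration; a real LLM scores tens of thousands of candidate tokens with a trained neural network rather than a hand-written table. The point is that the model only “knows” which continuation is statistically likely, not which one is true, and it will occasionally sample a wrong continuation simply because that token also carries some probability.

```python
import random

# Invented probabilities for the token that follows the prompt below.
# A real model learns weights like these from massive amounts of text.
next_token_probs = {
    " Paris": 0.92,      # statistically dominant continuation
    " Lyon": 0.03,
    " beautiful": 0.03,
    " not": 0.02,        # a wrong answer still has nonzero probability
}

def sample_next_token(probs):
    """Pick the next token in proportion to its learned probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = "The capital of France is"
print(prompt + sample_next_token(next_token_probs))
```

Run it enough times and it will eventually print something false, not because anything “broke,” but because sampling a plausible-looking continuation is all the procedure ever does.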

Despite claims that the industry is working on mechanisms to give LLMs greater information integrity, there is no general way to ensure that they provide fundamentally factual information without narrowly confining their subject matter and responses, because they literally have no understanding of what is ‘true’. While developers have made good strides in making the models less prone to obvious ‘hallucinations’ (a misapplied term that masks what is really causing mangled responses) and have worked to develop filtering algorithms to catch obvious mistruths, any prompt that requires syntactical interpretation or involves complex semantics carries a significant likelihood of producing a misleading or incorrect response. In short, it is objectively inappropriate to use such machines for ‘fact-checking’ or to provide information where accuracy and truthfulness are crucial.

Stranger

“Whether you trust OpenAI (a company that has been anything but transparent about its acquisition and use of training data) to follow its own claims that users can opt out is beyond a question of fact. Keep in mind, though, that even though the development arm of OpenAI is ostensibly a non-profit research foundation, it is funded by and works in support of what is intended to be a profit-earning ‘subsidiary’ that is highly motivated to make a return on the massive investment that has gone into developing and training its GPT models, as is its main investor, Microsoft, a company with a long history of ethically and legally dubious business practices. [snip] All of these developers are becoming increasingly hungry for data to expand their training sets because of the massive amount of information needed to make these models capable of such sophisticated responses, so they are highly motivated to find novel data, particularly data that pertains to their users.”

Thank you, Stranger. You’ve addressed my concerns clearly and eloquently. Major privacy breaches happen frequently in the business world, so the so-called accidental migration of terabytes of private content over to a company’s AI training team isn’t hard to imagine, especially given the financial incentives of such a leak and the difficulty a user would face in proving that a leak occurred and that actual damages resulted.

In general, you will need to check the TOS and then continue to check it every time it asks you to review it again (which companies should do whenever they change anything or when they don’t retain your answer between sessions).