Do commercial LLMs own the data you input?

A strangely hard thing to google / chatGPT-verb, because so many people want to know whether the output of LLMs is copyrightable.

That’s a good question, but right now I’m concerned with the input, and just what happens to it.

One thing I use AI for a lot is throwing around my ideas and checking for feasibility and whether something similar has been done before. Some of them are patentable ideas (after some development). It’s unclear to me from the ToS and from its own responses whether this is a bad idea. At the least, it seems I am adding to its training data.

This is a major issue for companies and professions where confidentiality is a core requirement, such as lawyers. Some LLM providers offer subscriptions to services that are said to provide greater confidentiality protections, but data scraping is the whole point of LLMs. How that tension will resolve itself is an open question.

It depends on the LLM plan you subscribe to. Often the more expensive team or enterprise plans will say “won’t train on your data”. Otherwise, assume they are taking all of it and using it. They were trained ignoring all copyright to begin with and they’d happily do it again unless you pay them enough not to.

If you’re really worried, run a local LLM instead on your own computer, completely detached from the internet.
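To make that concrete, here’s a rough sketch of what a fully local setup can look like, using the llama-cpp-python library with a GGUF model file you’ve already downloaded (the model path and prompt below are just placeholders, not anything from this thread). Once the model file is on disk, the prompt and the output never leave your machine:

```python
# Minimal sketch of a fully local LLM, assuming llama-cpp-python is installed
# (pip install llama-cpp-python) and a GGUF model has already been downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder local file
    n_ctx=2048,      # context window size
    verbose=False,
)

# Inference runs entirely on your own CPU/GPU -- nothing is sent to a vendor.
result = llm(
    "Q: What are the confidentiality risks of cloud-hosted AI tools?\nA:",
    max_tokens=256,
    stop=["Q:"],
)

print(result["choices"][0]["text"].strip())
```

Tools like this won’t match the big hosted models for quality, but for confidential material the trade-off is that there’s no third party holding your data at all.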

I know I can place my confidence in these massive companies promising untold intellectual fortunes and scraping every bit of intellectual property they can find on the internet without attribution or transparency to protect my proprietary concepts and personal data.

Stranger

I really don’t think your particular company secrets are THAT valuable to them. Generally they are statistical models that ingest training data and output possibilities. Their value isn’t so much in industrial espionage (for now) as in their ability to summarize mountains of data. All that training has tremendous value en masse; any one specific piece of data, not so much. It’s easier for them to just exclude a certain thing from the next round of training than to spend lawyer time dealing with it.

That said, though, accidents do happen and we don’t entirely know how they work. It’s totally possible for them to inadvertently spit out some company secrets, even with no nefarious intent.

If you want to be sure, running a local version is the way to go.

For me, it’s not the value of the data; it’s my professional obligation of confidentiality.

If I were to hand a client file to some schmo off the street and say “Here, you can read all through this and make copies”, I’m in breach of my professional obligations and could face discipline, even if schmo-off-the-street doesn’t actually do anything with the data.

What’s the difference between that and using AI to generate the first draft of a letter to my client?

I think it depends on the level of confidentiality required. Is your industry one where the vendor’s word isn’t sufficient proof?

These are many of the same questions people asked when Gmail and Gdocs first came out (and cloud computing as a whole). And by and large the world has accepted them, despite the occasional security breaches and data leaks. There’s always risk when someone else physically holds your data. For many companies and industries, it’s an acceptable level of risk. I don’t know what is acceptable for your clients or you, of course.

A lot of software companies struggle with that same thing: whether to give LLMs access to their whole codebase. Sometimes the LLM company is also a competitor. Some think it’s an acceptable risk, others don’t. Totally up to each company, I guess.

I do hate that Apple and Microsoft and Google are all trying to remove that choice and forcing it down everyone’s throats, though. I disable the OS AI stuff wherever I can but it won’t be long before it’s mandatory, I bet.

I’m a lawyer. We have strict confidentiality requirements, and to the best of my knowledge, the typical law firm would never use Gmail or Gdocs, but would have negotiated its IT work with a company that is specifically aware of the confidentiality requirements the firm needs to comply with.

The business model of most IT companies recognises and implements confidentiality requirements to meet their clients’ needs (again, to the best of my knowledge). It’s part of their general approach.

LLMs, by contrast, are designed to not respect privacy, as far as I can tell, but want to scrape every bit of data they can. That business model is fundamentally incompatible with confidentiality requirements.

Makes sense!

Stranger

You should assume that anything you input into an LLM is being scraped, unless you have specifically subscribed to a bespoke system which promises firewall protection.

Above and beyond that, there are other issues that I’ve seen.

With online server backups, which are supposed to be kept confidential because of their content for professions like law, the additional problem is whose jurisdiction applies. If the servers are in the USA, for example, American courts can subpoena and take that data regardless of what Canadian law says. A corporation using a cloud server for its accounting needs to ensure its data is within the reach of Canadian law. (IIRC it’s probably the same in any jurisdiction: you can’t have the company’s books in another country, out of reach of subpoenas.)

Then of course there’s the whole issue of whether smart devices like voice-activated TVs or home assistants are storing your conversations on their servers, not to mention your phone if you have Siri enabled… etc.