I think this is really the crux.
There have been earlier revolutions in copyright caused by automation.
Way back in the 1950s, lists of things like telephone directories could not be copyrighted. The theory then was that there was no creative work involved, just a mindless compilation of dry facts.
Then computers arrived, and other companies could practically copy those lists and re-sell them, at which point the original compilers (the telephone companies) complained about the loss of potential revenue. The end result, after a decade of wrangling and court cases that came down all over the map, was that compiled lists of bare facts became copyrightable, even though no single fact within the list is copyrightable.
The rationale was that what was protected was the “effort to compile,” which was embodied in the collective result. It’s essentially a variation on the age-old philosophical question of “how many grains of sand comprise a ‘pile’?”. No single grain is copyrightable, but the pile surely is.
But what drove the need to change this was that automation suddenly made it practical to copy (plagiarize?) a list of 100,000 names and addresses, to the economic detriment of the compiler. Sheer impracticality had been barrier enough before, until suddenly it wasn’t.
Similar issues have come up in privacy contexts. Before the widespread computerization of small business, many state driver’s license bureaus considered the roster of drivers, addresses, and license numbers to be a matter of public record. Anyone could visit the office, fill out a paper form, and pull the record on anyone else. What prevented most abuses of this open data was simply the difficulty of doing so unless one had a darn good reason to want information on Joe Schmutz specifically.
Then bureaus started offering the whole file to marketers in an easy-to-use, computer-to-computer format, sometimes for sale and sometimes even gratis. After all, it’s all public records, right?
Suddenly the practical difficulties of trawling the whole database and e.g. junk-mailing everyone simply evaporated.
Of course this got worse as the early WWW got off the ground and some states made that roster freely available for download to anyone. As a result, many states have revised their laws and policies such that the protections on getting the bulk data are greater than the protections on getting any single individual’s data. And in at least some states, even single individual data is protected.
What changed was the ease of bulk harvesting. Nothing else. But it was a quantitative change which had a huge qualitative impact. And laws were changed to reduce that qualitative impact.
I believe the use of copyrighted works as part of training data sets will eventually end up being decided via an analogous process. LLM generative AI doesn’t alter what humans have always done with copyrighted works: read and absorb / digest them, then use them to indirectly inform the production of new works.
What LLM/AI does is alter the scale, scope, and speed, which amounts to the “commercial convenience” of doing something. What used to be practically impossible has very suddenly become practically costless / trivial. That is another strictly quantitative change with vast qualitative implications.
Our society will be thrashing this out for another decade or two.