AI affecting Stack Overflow and Reddit

If you click on “Tools” you can set a date range. As you say, it’s easy to find other references, like this one:

It’s almost indistinguishable from something that would have been published yesterday. Like:

One thing that would help would be clarity about whether using copyrighted data to build A.I. systems should be considered fair use and therefore not a violation of copyright law. That’s an issue that hasn’t been litigated in the courts yet.

Well, it still hasn’t been resolved.

…why would they think that?

What search engines do and what monetized AI does are not the same things.

“Processes content” and “monetizes content” are not the same thing.

You can search by date-range.


Nvidia’s famous face generator, for example,

was trained on a dataset of 70,000 photos culled from Flickr.

This is explicitly mentioned in the PDF linked in the article.

So are you okay with scraping as long as it is in free, open-source, non-monetized models?

They are both taking publicly available data, processing it in some fashion (that doesn’t involve direct local storage), and then making money off of that transformation (by advertising or otherwise).

Last I checked, Google had a $1.5T market cap. You think they aren’t making money by scraping content?

…as I said earlier in the thread, I’m unsure. I’m conflicted.

BB might have a point if all these examples were buried in academic papers. But they aren’t; they’ve been in mainstream news since forever. Like:

In 2012, Google researchers fed their computer “brain” millions of images from YouTube videos to see what it could recognise.

Or:

DeepText was built on top of Facebook’s AI backbone, FbLearner and uses Facebook pages, in part, as training data.

It goes on and on. All this information has been out there for many years in the mainstream media.

Right. The people surprised by it just haven’t been paying attention. It’s like me not knowing that soccer fields come in different sizes until I learned it from Ted Lasso. It isn’t something that was kept secret just because I didn’t know about it.

…the legality of what search engines do has largely been settled in law.

What AI does has not been. That distinction is important here. We really need to stop with the overly simplistic analogies. They just don’t hold up under scrutiny.

I think I can no-index my website to block automated crawling so that Google will be unable to index it. (In fact, I once accidentally did just that, and my website dropped off the face of the earth.)

I can’t no-index my images that might be hosted somewhere else with my permission.
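For context, the “no-index” mechanism described here is really two separate, well-documented conventions: a robots.txt file, which asks crawlers not to fetch pages, and a noindex directive, which asks search engines not to index them. A minimal sketch of each:

```
# robots.txt, served at the site root — asks all compliant crawlers
# not to fetch anything on this domain
User-agent: *
Disallow: /
```

```html
<!-- per-page alternative: a noindex directive in the page's <head> -->
<meta name="robots" content="noindex">
```

Both only apply to pages on a domain you control, which is exactly the limitation noted above: they do nothing for copies of your images hosted elsewhere.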

…you’ve cited one example that is buried in an article about “racist robots”, and another that is buried in a puff piece about Facebook’s “DeepText” (which practically nobody has even heard of or cares about), and your examples seem to be proving my point, not yours.

(Which doesn’t cover the older versions, of course.)

I certainly agree that things have not been settled. But courts tend to work by analogy with previous law. And as you note, search engines have been found to be non-infringing due to their transformative nature. It seems very likely that AI will prove to be the same. Because, contrary to popular belief, they are not simply storing the original content.

There should probably be an AI equivalent of robots.txt. But let’s not pretend that it has any real effect. It’s just a suggestion to the most polite search engines. “Malicious” clients will just ignore it.
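For concreteness, an AI-training opt-out could simply reuse the existing robots.txt syntax with per-crawler user-agent tokens. The token below is purely illustrative (not a real crawler), and, as the post says, honoring it would be entirely voluntary:

```
# robots.txt — hypothetical opt-out aimed at an AI training crawler
# "ExampleAITrainer" is an illustrative token, not a real user agent
User-agent: ExampleAITrainer
Disallow: /
```

Nothing in the protocol enforces this; a scraper that ignores the file sees exactly the same content as one that obeys it.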

And if you post your data to social media, well, they just own all of it anyway. You can’t deny access to your own posts except by not using the service.

I see. No matter how many examples you’re provided with, in fact they somehow prove that you were right all along. How convenient for you.

…it’s interesting that you’ve linked to an article about Stability AI’s plans to offer an opt-out, and not to the opt-out page on the Stability AI website itself. Because when I click on the Tweet that the article cites, it leads me to this third-party website that doesn’t allow me to opt out.

https://haveibeentrained.com/

So I google “stability ai opt out”; your article was the first thing I found, and this was the second:

You may well be able to opt out of Stability AI. But they sure don’t make it easy to do so.

That isn’t how courts tend to work.

This doesn’t seem likely to me. But YMMV.

And this ignores the fact that the law isn’t standing still here. Lawmakers are actively working on this. Because the concerns of people that create content are real.

We were using your example of “search engines.” Do malicious search engines exist? Sure. But they aren’t really relevant to the conversation.

This is factually incorrect. You still own the rights to any work you post on any reputable social media network. You typically grant that network the right to upload and share your work.

Facebook

Which, in layman’s terms, means:

This isn’t a transfer of ownership, or even a sharing of ownership. Facebook doesn’t own anything you post or share on their platform.

Random news stories about tangentially related AI-things do not demonstrate that “most people did know that things they may have posted on the internet have ended up being part of a dataset that is being monetized.”

This is what you were intending to prove. A story about racist AI or Facebook’s “DeepText” doesn’t prove it.

You start a hell of a lot of posts with this “it’s interesting” crap, like you caught someone trying to sneak something past you. I had no nefarious intent behind the link I posted not being the one you think I should have posted.

…didn’t mean to imply nefarious intent, my apologies.

Accepted.

(ETA) And the opt-out option is another thing that wasn’t obscure for people paying attention. It was all over the news at the time (with “the news” being “searching Stable Diffusion mentions in Google News”).

Sure. But they wouldn’t need additional laws if they thought existing copyright law was sufficient. It’s basically an explicit recognition that this stuff probably is legal for the time being.

It’s more relevant for AI than search. As we’ve seen, it doesn’t actually take that many resources to train your own AI. And I’d fully expect that non-commercial users (whether driven by malice, ideology, or just wanting their own porn generator) will ignore these.

Search is different in that robots.txt files aren’t really excluding anyone from the “good stuff”, and also the scale makes it almost impossible for third parties to build their own engines.

So while robots.txt files are semi-pointless, no one is particularly motivated to bypass them. The same won’t be true of an AI equivalent.

What a useless nitpick. The T&Cs give Facebook the ability to do almost whatever they want with your work, including re-selling a license to others without your permission.

Ownership implies a measure of control. You hand over every bit of that when you put your work on Facebook. About all you have left is the ability to regain control if you delete your account. Unless they already used it in their AI training, in which case it’s too late. Then you’re shit outta luck.

…except:

There are lawsuits happening right now under existing laws. We don’t know how they will play out. There is uncertainty at present. But there isn’t an “explicit recognition that this stuff probably is legal.” Many people think it currently isn’t legal, and are taking the appropriate action.

It most certainly is not. It isn’t even a nitpick. The terms don’t give Facebook the ability to do almost whatever they want with your work, they certainly don’t have the right to re-sell licences to your images without your permission, and I have absolutely no idea where you got that impression.

This isn’t how any of this works.

Ownership is ownership. And the ownership of intellectual property rights has been settled for a very long time. You don’t get to redefine the word here.

The limits of Facebook’s ability to use the work you post are quite explicit in the terms and conditions. You do have control. And you don’t need to delete your account to revoke the licence. Removing the content is enough.

That’s what “sub-licensable” means:

This means that Facebook can license your content to others for free without obtaining any other approval from you!

Fair enough. I was mistaken on that point. However, it doesn’t change the fact that if Facebook has already done something with your content, like used it for AI training, you have no means of withdrawing permission after the fact. All you can do is prevent them from using it in some future training round.

…that’s a 2012 opinion. We’ve had over 13 years of these terms being in place on Facebook, Instagram, and Twitter, and to my knowledge they haven’t re-sold licences to your images without your permission a single time. That’s because the intent of that clause is to allow them to sublicense to different entities, i.e. Facebook cross-posting to Instagram.

You retain ownership of anything you post to social media. That isn’t something up for debate. It isn’t a nitpick. It’s just a fact.