AI affecting Stack Overflow and Reddit

A few things that came to my attention this past week:

  • Stack Overflow / Stack Exchange observed that moderators were rejecting some posts as having been written by AI tools, and some users were not happy about their posts (which they claimed to have written themselves) being rejected. So the company instructed its (volunteer) moderators to stop using their judgement in determining if a post was AI-generated.
    So now the moderators are unhappy that their own competence is in doubt, and are basically on strike.

  • Reddit, which is planning an IPO in 2023, claims to be worried that its content is / will be pilfered by companies like OpenAI in creating their LLMs. So Reddit will make the APIs external entities can use to consume Reddit content much, much more expensive. (Sort of like Twitter’s API pricing increase a few months ago, but ostensibly for other reasons.) This new price policy will basically kill third-party Reddit readers such as Apollo – which is better loved than the native Reddit app. Many subreddits are now planning a partial shutdown next week in protest.

I’ve seen some ChatGPT answers on scifi.stackexchange - answers that make up stories that don’t exist, with bogus plot summaries, etc. Those are pretty easy to delete/downvote, anyway.

The Reddit one, at least, isn’t really about AI. Reddit didn’t even initially claim it was. Yes, they did eventually mention that AI developers were using their API to scrape data. But they initially just said the problem was that some companies were very much abusing the API limits, costing them money, and that they wanted to recoup the money they were losing.

They also, however, said that they would work with the bot developers and third party app developers to minimize the effects of this on them, since they are not the ones who were abusing the service.

To put it simply, they lied. They offer the same rates to the app developers, and pick and choose which bots to care for. Thing is, the moderators use third party apps and bots to be able to moderate their subs.

Not only that, but the actual first party app for Reddit has a lot of shortcomings. It is often buggy, has little to no accessibility, and is just much slower than the alternatives. It lacks features beyond moderation that the other apps have.

That accessibility problem is a big one. For example, Reddit’s blind users rely on a third party app to be able to use Reddit at all, because the official app doesn’t work with screen readers.

The issue thus isn’t AI. It’s a poor quality mobile app and lack of first party tools for mods and disabled users. Third party apps fill in the gaps.

Plus, frankly, the AI in question already uses web scrapers, and so would still be able to read all the data on Reddit’s website anyways. It’s third party apps that need the API more, because they need data beyond just what a human has typed. They need to be able to post, to see votes, rankings, etc. in order to function.

Paging @Banquet_Bear.

Trying to prevent crawlers from stealing the content for AI use will essentially prevent the human users who are habituated to free content from using it.

We asked for advertising and got AI instead. Oops.

That’s the thing. Reddit’s decision doesn’t affect crawlers at all. Only those who were actually trying to use the Reddit API.

…I’m sorry, why are you randomly paging me about this?

Thanks for the explanation! I keep running into stuff about it but never really saw the specifics.

ISTM your overall point in several threads (with which I agree almost completely) is that random AIs should not be sucking up the publicly available human-to-human internet for their own future profit without paying for the value of the content they slurp.

As I understood the OP (which may be an erroneous understanding), these content providers were attempting to add a charging mechanism so AIs would pay for the training data they slurp. IOW, in accordance with your desires.

Which in turn, assuming I understood correctly, seems to be having a chilling effect more on human use than AI use of the content in question.

Which, if accurate, seems to be an example of an unforeseen consequence. I thought you might be interested in the anecdote and might have something thoughtful to say.

I fear we are past the point of grafting fair payment for incremental incidental commercial use onto the content on the internet. But I agree 100% with your moral quest that we ought to be doing that.

Google should pay wiki for every page they crawl. How we get there is a hard road. But I am on your team, at least conceptually.

Seems to me that it’s almost entirely about advertising. The third-party apps bypass the advertising you get with the web page or standard app.

I was thinking earlier about how Cory Doctorow’s thesis about enshittification gets the emphasis wrong. Why is it that platforms are ever not shitty? It’s because early on, they’re funded by VC or other investor money, and can afford to prop up an unsustainable business model.

Then the free money dries up, the platform has to either charge a subscription or crank up the advertising. Both cause enshittification (particularly the latter).

Why do we ever perceive some platforms as sustainably non-shitty? It’s solely because they charged users from the beginning. They never acquired that vocal set of users dependent on the subsidized version of the platform, and so there was never an actual decline, just maintenance of the status quo. They’re still kinda shitty, but things never got worse on the platform, and they’re inevitably smaller and more vulnerable than ones that offer a free service.

Aside from that, Doctorow is on target about what happens when an investor-subsidized platform turns into a subscription or advertising-funded one. Though it’s so obvious why that it’s barely worth articulating.

Doctorow also gets the emphasis wrong when he says enshittification is just “seemingly” inevitable. It is absolutely inevitable, for the reasons above. It’s just inescapable that your platform will suffer when it goes from lots of free money to no free money. Whether the platform itself dies is a different question; certainly, they can go for a long time in zombie mode. But there’s no way around getting worse.

The problem with that theory is that the third party developers were willing to make up that difference. But, by their calculations, Reddit is charging them 10 times as much as they would make with advertising on the site.

It’s not like the third party developers expected to keep using the API for free. They just expected much more reasonable costs. Because they were promised reasonable costs.

They were promised that Reddit would work with third party apps and not try to eliminate them. But everything they are doing suggests their goal is to eliminate them.

I haven’t seen any truly detailed breakdowns, but the limited amount I’ve seen suggests that the app developers are being a tad optimistic in their comparison. Taking the total amount of advertising income and dividing by the amount of traffic doesn’t give you a realistic equivalent number, for example. Or taking their revenue and dividing by number of users, etc.

10x doesn’t really seem like that crazy a multiplier given that the app users are probably among Reddit’s most valuable client base–the power users, not just the random browsers. The third-party apps also likely return less data about user behavior, which would imply increasing rates to make up for that loss. I obviously don’t know anything specific about what went into Reddit’s calculation, but the rates don’t seem too out there. From what I’ve seen, a few bucks per month per user would cover the costs.

And yeah, probably some of that is to make third-party apps less favorable than their own. I just don’t see any strong evidence that they are deliberately trying to kill them. They could just turn off the API completely, after all. Most social sites don’t have an API at all.

As I understand it, the rates are based on equivalent usage. So that’s not it.

And the whole point is that they seem to be trying to kill off third party apps without directly saying that’s what they’re doing. They said they would work with them, but then refused. They charge them what they are charging everyone else.

What’s more, the accusation is out there, and they have not tried to defend against it. They’ve not given any sort of defense of their actions, in fact.

Plus, frankly, I’m not on board with trying to find a defense for a company that won’t defend itself. Don’t make excuses they themselves have not offered. Make them do the work of arguing.

I assume your numbers come from this post:
https://old.reddit.com/r/apolloapp/comments/13ws4w3/had_a_call_with_reddit_to_discuss_pricing_bad/

I think the calculation is pretty silly. The 430 million is “monthly active users”; basically the set of people that visited the site at least once in the past month. That’s a minimum of course, and we aren’t getting a more detailed breakdown, but Apollo users in comparison average 344 requests daily. It would be extremely easy for the average Apollo user to be 20x more valuable than the average “monthly active user”. And unsurprising as well, given that Apollo users are people that deliberately sought out an app to improve their ongoing experience, as compared to people that just got there because it was a Google search result.

And, well, Reddit is as far as I know still losing money. If charging for API access is to improve the situation, it needs to bring in more money than their costs. So one would expect a multiplier just from that.
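As a rough back-of-envelope sketch of the numbers above: the 344 daily requests figure is quoted from the linked Apollo post, the per-call rate (roughly $0.24 per 1,000 API calls) is the pricing reported in that same post, and the 30-day month is my own simplifying assumption. The arithmetic lands right around the "few bucks per month per user" mentioned earlier in the thread:

```python
# Back-of-envelope estimate of the new API cost per Apollo user.
# Figures from the linked r/apolloapp post: ~$0.24 per 1,000 API calls
# and ~344 requests per user per day. 30-day month is an assumption.

RATE_PER_1000_CALLS = 0.24   # USD, reported pricing
REQUESTS_PER_DAY = 344       # average Apollo user, reported
DAYS_PER_MONTH = 30          # simplifying assumption

monthly_requests = REQUESTS_PER_DAY * DAYS_PER_MONTH
monthly_cost = (monthly_requests / 1000) * RATE_PER_1000_CALLS

print(f"{monthly_requests} requests/month -> ${monthly_cost:.2f} per user per month")
# -> 10320 requests/month -> $2.48 per user per month
```

Whether $2.48/user/month is "10x advertising revenue" or a fair power-user rate depends entirely on the per-user value estimate being argued about above.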

…I don’t think that there is anything happening here that is in “accordance with my desires.” Reddit appears to just be jumping on the same Elon Musk bandwagon and is raising API charges essentially “because they can.”

The Stack Overflow story, meanwhile, is something that every single industry is going to have to deal with now, from education to photography competitions. It’s going to be a pain in the butt. It will be a few years of intense disruption, followed by the fallout from any AI lawsuits and new laws, followed by another settling-in period.

I’m not entirely sure what my position is at the moment :smiley:

My baseline position is that people should be able to control how their intellectual property is used by others, largely in accordance with the Berne Convention and other relevant laws and agreements. That using other people’s IP in a training dataset is not the same thing as “a human looking at an image” or a “human reading a book”, and that people should be able to say “you need my permission before you train your AI on my work.”

I don’t think this is unreasonable. The problem here is that the people running the AI companies built the training datasets largely before people knew what they were doing, and now AI is the new frontline in the NFT/Crypto/Venture Capital marketplace and they are in a rush to monetize it while it’s “hot.”

But I’m not sure that payment for incremental incidental commercial use is something that is desirable or practical. It’s either “opt in” or “opt out.” I think this is the direction regulations are heading internationally. Training datasets will be allowed: but they cannot be monetized unless the author/creator has given permission. But it is going to be years before we see any of this played out.

I have said this more than once before and I’m sure you read it–the AI companies didn’t “build the training datasets”. The AI companies used free, public datasets compiled by other companies. Want to train an image generator? Get the data right here:

Want to train a text generator? Here’s your data:

Much of that data comes from Common Crawl

A company doing something that is affirmed to be legal in the US

…irrelevant.

It is absolutely 100% relevant to your claim that they were doing it “largely before people knew what they were doing”, as if it was something nefarious and secret that they were trying to sneak by. The training datasets have been available for years. Open public discussion of training models has been going on for years. Nothing was done “under the radar” because nobody thought there was anything wrong with it. I’ve been reading about it for years, and I never anticipated the “gimmie money!” outcry, though in hindsight I probably should have.

It is hard to search for earlier discussions and papers on public data used for training thanks to the explosion of mentions since mid-2022, but here’s a bit from a policy paper from 2017. Someone being ignorant of how an industry/technology works does not mean they were not open about it.

Data availability: Just over 3 billion people are online with an estimated 17 billion connected devices or sensors. That generates a large amount of data which, combined with decreasing costs of data storage, is easily available for use. Machine learning can use this as training data for learning algorithms, developing new rules to perform increasingly complex tasks.

…but they were doing it largely before anyone knew what they were doing.

Heck: despite the ubiquitous nature of AI at the moment, I’d guess that most people still don’t know that things they may have posted on the internet have ended up being part of a dataset that is being monetized.

This isn’t about ignorance. Even you admit it was hard to search for early discussions on this because discussions on this were largely confined to obscure articles published on “the Internet Society” website.

We never had the chance to opt out of the process. People were creating and selling AI artworks before most people even knew that the datasets existed.

Why wouldn’t they think that?

It’s not like we haven’t been down that road before. Search engines do exactly the same thing: scrape publicly available data and convert the content into some internal form.

One wonders what goes through people’s heads when they make their content available to every networked computer on the planet and then act confused when computers process that content.

You mentioned once being warned about misrepresenting what people said. You have done it to me several times in AI-related posts recently and you are doing it again right now. I said older references were hard to find because they are flooded out by recent references. Google does not let you sort by the year of a web page, so I have to actually plug years into the search query and hope I get links that way. If I were able to filter out 2022 and 2023 I would find many references, not just obscure ones.