How can a captcha both test for bots AND train the bot on the other side?

We all know the “click on all photos with a fire hydrant in them” kind of captcha. The idea behind them, presumably, is to weed out bots trying to access the web page protected with the captcha - a human user is able to correctly identify the fire hydrants, while a bot isn’t. In other words, the captcha tests for fire hydrant-distinguishing abilities, which are taken as a proxy for humanness. So far so good.

But at the same time, we hear that by answering a captcha, you’re contributing (for free) to the training of some AI algorithm - you’re telling the algorithm which photos have fire hydrants in them, and that provides training data for a model that aims to tell photos which have a fire hydrant in them from those that don’t.

I understand both aspects in isolation, but I don’t see how they can both apply at the same time. I would presume that if the captcha wants to test users for their fire hydrant-distinguishing abilities, then the machine behind the captcha must already know the correct answers, so it can assess the user against them. If the machine doesn’t know that, then the answers provided by the user might provide valuable training data, but there’s no way for the machine to know if the user who provided that data is itself capable of identifying fire hydrants.

Is it maybe a fuzzy logic kind of thing - that the machine behind the captcha already has some idea of where the hydrants might be, but with a certainty < 100 %, and if the user provides an answer that is in line with the machine’s current best guess, then this is interpreted both as a human response and as a confirmation of the current best guess? I could see that, but it seems to me it would greatly limit the usefulness of the training opportunity (it could only confirm existing guesses, not lead to improved new guesses), and risk a lot of false negatives where the current best guess is actually wrong and a human user provides the correct (but unexpected, for the machine) answer.

I’m pretty sure that they mostly use us as labelers of data. Millions of pictures don’t help an AI determine if something is a hot dog. Millions of pictures that it knows are a hot dog and millions that it knows are not a hot dog is what helps it learn. If I show a photo to hundreds of people and 95% of them say there is a hot dog in it, it now has a label.
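Purely as an illustration of that vote-counting idea (nothing here is reCAPTCHA’s actual code; the vote counts and the 95% threshold are made up), the aggregation step could be as simple as:

```python
from collections import Counter

def label_from_votes(votes, min_votes=100, threshold=0.95):
    """Assign a label to a photo once enough (presumed-human) users have
    answered 'is there a hot dog in this picture?'.

    votes: list of True/False answers.
    Returns 'hot dog', 'not hot dog', or None if there is no clear consensus yet.
    """
    if len(votes) < min_votes:
        return None                      # not enough answers collected yet
    yes_fraction = Counter(votes)[True] / len(votes)
    if yes_fraction >= threshold:
        return "hot dog"
    if yes_fraction <= 1 - threshold:
        return "not hot dog"
    return None                          # too ambiguous to label

# 95 of 100 people said "hot dog" -> the photo gets the label.
print(label_from_votes([True] * 95 + [False] * 5))
```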

So you’re saying it’s only the AI training aspect that matters (and makes websites use captchas), and that keeping out the bots isn’t what it’s about?

Not saying that at all. I’m saying why not kill two birds with one stone. If you are using identification tasks to determine if it’s a human on the other end, why not take advantage of it? I could have you identify random strings of letters and numbers all jumbled on top of each other, or I could have you help me verify my OCR, label my pictures, or some other useful task. My team also does data science work and I don’t have the luxury of having a built-in audience of people willing to label things for free. When I want to train a model using some supervised learning method (vs. unsupervised methods, which work with unlabeled data), I have to pay people to do the labeling, via Mechanical Turk or the like.

Sure, but my question is precisely how the same captcha can kill those two birds with one stone.

I’m not the CEO of Captcha and I don’t work for them, so I have no insight into exactly who their audience is unless they publicly release it.

I’m saying that in the world of data science, labeled data is highly prized and expensive as hell if you need a data set that doesn’t already exist, and if Captcha is smart, they are selling the resulting labeled data sets and/or are choosing labeling tasks based on the needs of whoever has contracted them to do so.

For example, if I want to train self-driving cars, having labeled traffic lights from all sorts of odd angles and cropping situations would help me build the object detection model that splits a computer’s image of an intersection into its various component parts. After traffic lights, I’d label pedestrians, other cars, stop signs, shopping carts, baby strollers, crosswalks, etc.

Upon rereading this, I think I understand where the confusion is coming from. If the SDMB uses Captcha for user registration, to keep bots from creating millions of spam accounts, they’re using it for exactly that purpose and gain nothing else from it. What I’m saying is that if Captcha chooses to use pictures and have us label them as the method of doing this, they can then give the resulting labeled data to someone who would use it to train models. So, Straight Dope reduces bot issues, AI-r-Us gets labeled data, all from the same process.

But the point is that for the Captcha to work, the system has to know whether the user has answered correctly - that is, it has to already know where the fire hydrants are, in order to know whether the user has properly selected hydrants and should be allowed in. The system only works as a captcha if it already knows where the hydrants are, and the system only works to generate labels if it doesn’t already know where the hydrants are.

True, but the system can also track which fire hydrants get identified within milliseconds repeatedly, which fire hydrants take a little time, and which fire hydrants are often missed. So, it can learn too.
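For example, a toy sketch of that kind of bookkeeping (the image IDs, timings, and the assumption that response time is logged per tile are all invented for illustration):

```python
import statistics
from collections import defaultdict

# Hypothetical per-tile log: (did_the_user_click_it, seconds_taken)
responses = defaultdict(list)

def record(image_id, clicked, seconds):
    responses[image_id].append((clicked, seconds))

def summarize(image_id):
    """Click rate and speed hint at whether a tile is an obvious hydrant,
    a hard-to-spot one, or probably not a hydrant at all."""
    rows = responses[image_id]
    click_rate = sum(1 for clicked, _ in rows if clicked) / len(rows)
    median_seconds = statistics.median(seconds for _, seconds in rows)
    return {"click_rate": click_rate, "median_seconds": median_seconds}

record("img_042", True, 0.4)
record("img_042", True, 0.7)
record("img_042", False, 3.1)
print(summarize("img_042"))   # -> {'click_rate': 0.666..., 'median_seconds': 0.7}
```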

The first time someone is served a specific photo and labels it as a fire hydrant, Captcha can just assume that they are correct and move on. No one is expecting its bot-stopping abilities to be absolutely infallible and no such guarantees are likely made. Unless the majority of people who see the photo in the future don’t click it when asked for fire hydrants, it will retain the label of “fire hydrant” and be passed on to the folks who were asking for the labelled data.

Captcha’s serving of tasks isn’t AI, machine learning, deep learning, or data science of any kind. It’s simply a task factory that serves a useful purpose and ends up producing something useful to others as a result of that service.

When it asks you to “click on every image containing a fire hydrant”, it shows you, what, 16 images? So 8 of them are ones that it already has a pretty good idea about (possibly because a bunch of other humans have already seen those images), and 8 of them are ones that it isn’t sure about yet (because humans haven’t yet labeled those images, or not enough humans for it to be confident). As long as you get the right answer on the 8 it already knows, it takes that as sufficient confirmation that you’re human, and trusts your answers for the other 8. And of course, it doesn’t tell you which ones it’s grading you on.
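That mix of graded and ungraded tiles is the crux, so here is a minimal sketch of how one challenge could work (the 16-tile layout, the pass threshold of 7, and all the names are assumptions for illustration; the real system is surely more sophisticated):

```python
import random

def grade_and_harvest(tiles, clicked_ids, pass_threshold=7):
    """tiles: dicts like {'id': ..., 'known_label': True/False/None}, where
    None means the system doesn't yet know if the tile shows a hydrant.
    clicked_ids: the set of tile ids the user selected.

    The user is graded only on tiles with a known label; answers on the
    unknown tiles are harvested as candidate labels if the user passes."""
    known = [t for t in tiles if t["known_label"] is not None]
    unknown = [t for t in tiles if t["known_label"] is None]

    correct = sum(1 for t in known
                  if (t["id"] in clicked_ids) == t["known_label"])
    is_human = correct >= pass_threshold   # one slip can still pass

    harvested = {}
    if is_human:
        # Trust this user's answers on the tiles that couldn't be graded.
        harvested = {t["id"]: (t["id"] in clicked_ids) for t in unknown}
    return is_human, harvested

# 8 tiles the system already knows about, 8 it wants labels for, shuffled
# together so the user can't tell which is which.
tiles = ([{"id": f"known_{i}", "known_label": i < 4} for i in range(8)] +
         [{"id": f"new_{i}", "known_label": None} for i in range(8)])
random.shuffle(tiles)

user_clicks = {"known_0", "known_1", "known_2", "known_3", "new_2", "new_5"}
print(grade_and_harvest(tiles, user_clicks))
# -> (True, {'new_0': False, ..., 'new_2': True, ..., 'new_5': True, ...})
```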

One explanation I’ve heard is that they pair a known image with a vague/unknown one. If the human correctly identifies the known image, it gives credence to them having identified the vague one correctly as well. So in the process you home in on the correct identification of the unknown through many users.

Short answer: a lot of captchas don’t know the right answer. They judge you a human based on other criteria, such as how fast you solve it and how similar your answer is to others they’ve deemed human. Once deemed human, the program assumes your answers are mostly correct and learns from that.
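Those “other criteria” aren’t public, but the flavor might be something like this toy score (the signals, weights, and thresholds are all invented):

```python
def humanness_score(solve_seconds, agreement_with_consensus, mouse_events):
    """Toy heuristic: humans take a few seconds (not milliseconds, not forever),
    mostly agree with what earlier humans said about the same tiles, and
    generate some cursor activity along the way. Returns a score in [0, 1]."""
    score = 0.0
    if 2.0 <= solve_seconds <= 60.0:
        score += 0.4
    score += 0.4 * agreement_with_consensus   # fraction matching prior answers
    if mouse_events > 5:
        score += 0.2
    return score

print(humanness_score(solve_seconds=8.5,
                      agreement_with_consensus=0.9,
                      mouse_events=40))       # -> 0.96
```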

The training process is separate from the verification process. When you (a member of the general public) are asked to identify the stoplights, the verification process already knows (or thinks it knows) the right answers, and it compares your answers to what it thinks are correct. Before any of that happens, however, the server is trained to learn what stoplights look like, by presenting it with known images and telling it whether it has analyzed them correctly (or it might be a lot simpler; maybe there is just a limited set of canned images).

So in theory, if a robot is trying to impersonate a member of the general public, and it has been trained to recognize stoplights, then it could correctly identify the stoplights. I suppose this is feasible, since I’ve only seen a small number of types of questions (stoplights, fire hydrants, bridges,…).

Oh, by the way, for those of you who don’t like video replies: the Vox piece provided above interviews the inventor of the captcha and goes into how they work, including how they learn. It’s worth a view.

And just who do you think is “telling it whether it has analyzed them correctly”?

And yes, in principle, a robot as well-trained as Google’s could defeat Google’s captchas. But just how many such robots do you think there are in the world, and how many of them are controlled by spammers?

What @Chronos said.

Note that the original concept of a captcha didn’t involve the sorts of image recognition problems that current ones do. It was a known computer-generated image that was distorted. So in that case the computer did know what the right answer was.

No, this is wrong. The whole point of the modern “identify the things that you need to identify as a self-driving car” system is to use tiny bits of human labor to classify images as the input for Google’s machine learning. If they already had a classifier that could tell if you got it right, they wouldn’t gain anything from showing you those images.

There was an old generation of captchas where it was obvious which were the images it already had a pretty good idea about and which ones were the unknowns. But it’s not like anyone would deliberately mislead the AI about the new images :wink: :wink: :wink:

I thought this was the answer though: that some images it is certain about and others are adding to its data set.
Of course sometimes humans will click incorrectly out of mischief or just in error, but those will cancel out over the course of asking thousands or millions of people to identify the images.

I can recall plenty of times clicking the wrong image by mistake or where the correct answer was not obvious and still being judged to be human, all suggesting the Captcha doesn’t have an absolute idea of a perfect input.

There still seems to be some confusion about where the machine learning comes into play. It’s not necessary at all to determine if you picked the images correctly. In fact, it would just slow the process down for no reason. I’m pretty sure they are using a machine learning model (or they might stack many of them on top of each other, in the way the Google search engine does) to determine who just has to check the “Not a robot” box and who has to complete the challenge, but that’s the only place I can see it adding any value at run-time.
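In other words, the model’s run-time job would be routing rather than grading, roughly like this (the thresholds and the idea of a single risk score are my assumptions, not anything Google has documented):

```python
def choose_flow(risk_score):
    """Decide, before the user solves anything, which experience to serve.
    risk_score would come from some upstream model (browser signals, IP
    reputation, past behavior); its details aren't public."""
    if risk_score < 0.3:
        return "checkbox only"     # just tick 'I'm not a robot'
    if risk_score < 0.8:
        return "image challenge"   # the fire-hydrant grid
    return "block"                 # too bot-like to bother with

for r in (0.1, 0.5, 0.95):
    print(r, "->", choose_flow(r))
```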

These days, most of the CAPTCHA challenges that you see are from reCAPTCHA, which was a Carnegie Mellon research project that Google acquired a long time ago. It was originally free (still is for up to a million hits a month) as it provided an internal benefit to Google while also combatting bot mischief. As I said earlier, the primary benefit for Google is that it labels their images, which they can then use for other purposes. They’re not using a sophisticated AI to determine if that picture you said has a traffic light really has a traffic light; the fact that most other people said the same thing gives them that information. Now that they know the picture has a traffic light in it (which is the real value for them), their various models built off of this data can identify traffic lights in new images they have never encountered, such as in a self-driving car situation. I have to pay people to label images. Google gets you to do it for free.
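To make that downstream step concrete (a sketch only: the features below are random numbers standing in for real image data, and the “traffic light” labels are synthetic, not actual reCAPTCHA output):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the crowd-labeled data set: each row is an "image" reduced to
# a feature vector, and y is what the captcha-solvers collectively said.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # pretend 1 means "has a traffic light"

# Train on most of the labeled images, then check how well the model
# recognizes "traffic lights" in images it has never seen.
model = LogisticRegression().fit(X[:800], y[:800])
print("accuracy on unseen images:", model.score(X[800:], y[800:]))
```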

In their own words:

reCAPTCHA offers more than just spam protection. Every time our CAPTCHAs are solved, that human effort helps digitize text, annotate images, and build machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.