What's the deal with Captchas?

On NPR today someone was talking about a Captcha project. He said that the smeary, wavy words we are seeing in the Captcha box are words from old scanned books that computers can’t recognize. By typing in the word, we are contributing to the digitization of these old books. Millions of people type in these Captchas per day, so whole books are being converted from old printed pages to digital form in this way.

This makes no sense to me. The purpose of the Captcha is to make sure it is a real person and not a bot trying to access whatever. So the website must compare what I have typed (my interpretation of the smeary words) to what the words actually are. If I got it right, then I’ve passed the human test and I get access. Which means that the correct words of the smeary image already exist in the program’s memory. Which means somebody had to type them there. Which means my typing them again adds nothing in terms of providing digitized words to convert old books.

I’m missing something - but what?

No ordinary captcha, but reCAPTCHA. You get two words: one is a word being digitized, and the other is one for which the answer is already known. The primary purpose is still to catch autologins, but as long as they’re making a legitimate user recognize digitized words, they might as well get something useful out of it.

And the way they know the known word is that they have previously given it to many people who have interpreted it the same way, giving a high degree of confidence as to what it is.

The person doing the reCAPTCHA doesn’t know which word is known and which is unknown. When he enters the known word correctly, the system gains some confidence that his guess for the unknown word is also correct. If many people enter the same interpretation for the unknown word, then that word is considered solved and is thereafter used as a known word.
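For anyone who finds it easier to follow as code, here is a rough sketch of the scheme described in the posts above. It is only an illustration, not reCAPTCHA's actual implementation: the `WordPool` class, the `normalize` helper and the `VOTES_NEEDED` threshold of 3 are all made up for the example.

```python
from collections import Counter

# Hypothetical threshold: matching guesses needed before an unknown
# word is treated as solved. The real value is not stated in the thread.
VOTES_NEEDED = 3


def normalize(text):
    """Compare guesses case-insensitively and ignore stray spaces."""
    return text.strip().lower()


class WordPool:
    def __init__(self, known):
        self.known = dict(known)  # image id -> accepted text
        self.votes = {}           # image id -> Counter of guesses so far

    def submit(self, known_id, known_guess, unknown_id, unknown_guess):
        """Return True if the user passes the human test."""
        if normalize(known_guess) != normalize(self.known[known_id]):
            # Failed the control word: reject, and do not trust the
            # guess for the unknown word either.
            return False
        tally = self.votes.setdefault(unknown_id, Counter())
        tally[normalize(unknown_guess)] += 1
        text, count = tally.most_common(1)[0]
        if count >= VOTES_NEEDED:
            # Enough people agree: the word is solved and can serve
            # as a known word from now on.
            self.known[unknown_id] = text
            del self.votes[unknown_id]
        return True


pool = WordPool({"img1": "morning"})
pool.submit("img1", "Morning", "img2", "upon")
pool.submit("img1", "morning", "img2", "asdfgh")  # keyboard mash still passes
pool.submit("img1", "morning", "img2", "upon")
pool.submit("img1", "morning", "img2", "upon")
print(pool.known)
```

Note that only the known word decides pass or fail. A single junk guess for the unknown word gets through, but it is outvoted once enough other people agree on a reading.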

And, of course, if the spammers manage to come up with an algorithm that can defeat them, then Google can appropriate that algorithm and use it to improve their own automated OCR methods. It’s a win-win.

Sometimes you do. And if the other word is too hard you can just keyboard mash to save time. Rather than mucking about squinting and guessing.

Actually you can tell the difference between the known word and the unknown one, at least unless it has been changed. The known word always looks a certain way; it has a particular (for lack of a better word) font.

On a side note, there have been movements by groups of people on the internet to replace the unknown word with a certain racial slur.

You can also just press the recycle button to get a new one, if you can’t read it.

That seems implausible, given the way the system is designed to work. The known and unknown words come from the same sources. If you can provide an example of this I’d like to see it.

It’s not as implausible as you make it sound. It makes sense that the algorithm might have more trouble with certain fonts.

My observation, though, is that other differences are more common. One word always looks a lot more smeared than the other, for example. The letters are less distinct. Also, one word often isn’t one that’s in a normal dictionary (in fact, I’ve seen quite a few that were obvious typos). I think it is quite likely that these are the words that are hard to determine by OCR software. I know the software I’ve used before has problems with such text.

Interesting - that explains why sometimes I get by the captcha even though I’m pretty sure I typed one of the words wrong (hey, they can be hard).

What makes you think the spammers are just going to hand over their source code to Google?

I had to do a recaptcha after reading this thread, and it did seem I could pick the real word. It was similar to the example in the link, where one word was a dictionary word and was much straighter than the other.

You seem to be suggesting that the known word is one that’s been created by the organization, while the unknown word is one that’s from a genuine scan. My understanding is similar to friedo’s; that the known word is also from a genuine scan but is one that’s already been identified by several users.

Is there a reason for this, other than some people’s endless capacity to be assholes? A “principled” objection to being “forced” to do a couple of seconds of “work” solving a puzzle not of their own choosing? :rolleyes:

You’ve clearly never been to 4chan.

That was subsumed into “some people’s endless capacity to be assholes.” :smiley:

Aside from psychonaut’s point, the algorithm currently used by spammers is to hire people in the Third World to solve them at a few pennies per word.

They’re mainly there as a first defense against spambots. I encounter many captchas that only require a vague approximation of what’s in the box: the same number of characters, give or take one, and at least two matching characters. Some are very strict, others are not. You may see some that have an algebraic equation in them. That’s a sure sign that it’s looking for a loose interpretation; just the numbers will do.
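To make the loose check concrete, here is a minimal sketch of the rule as described in the post above. It is a guess at the logic, not any particular site's code: the `loose_match` function and its parameters are invented for the example, and it assumes "matching characters" means the same character in the same position, ignoring case.

```python
def loose_match(expected, typed, max_len_diff=1, min_matches=2):
    """Accept an answer that is only roughly right.

    Assumed rule: length within max_len_diff of the expected text, and
    at least min_matches characters agreeing position by position.
    """
    if abs(len(expected) - len(typed)) > max_len_diff:
        return False
    matches = sum(
        1 for a, b in zip(expected.lower(), typed.lower()) if a == b
    )
    return matches >= min_matches


print(loose_match("x7Kp2", "x7zzz"))  # right length, two characters agree
print(loose_match("x7Kp2", "zzzzz"))  # right length, nothing agrees
print(loose_match("x7Kp2", "x7"))     # too short
```

A strict captcha would simply compare the two strings for equality instead.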

I’ve never heard of this project to clarify digitized content but it seems feasible.

This is true. Also, several major spambots advertise the ability to break CAPTCHAs (including recaptcha) using OCR techniques.

Did a short bit of searching and found the picture I was referring to. Link* it contains an image which describes what I posted.

The image I just linked provides an example. A few times I entered incorrect words to see if I could distinguish between the fonts used, but I didn’t do it enough to rule out the possibility that I got lucky.

Interesting. That explains why many times I don’t have high confidence in my interpretation of one of the words but it still accepts my answer most of the time. I would think I would be wrong more in trying to interpret the very distorted letters.