What's the deal with Captchas?

randwill · April 9, 2011, 2:36am

On NPR today someone was talking about a Captcha project. He said that the smeary, wavy words we are seeing in the Captcha box are words from old scanned books that computers can’t recognize. By typing in the word, we are contributing to the digitization of these old books. Millions of people type in these Captchas per day, so whole books are being converted from old printed pages to digital form in this way.

This makes no sense to me. The purpose of the Captcha is to make sure it is a real person and not a bot trying to access whatever. So the website must compare what I have typed (my interpretation of the smeary words) to what the words actually are. If I got it right, then I’ve passed the human test and I get access. Which means that the correct words of the smeary image already exist in the program’s memory. Which means somebody had to type it there. Which means, me typing it again doesn’t add up to anything in terms of providing digitized words to convert old books.

I’m missing something - but what?

Terminus_Est · April 9, 2011, 2:43am

No ordinary captcha, but reCAPTCHA. You get two words, one which corresponds to a word being digitized and the other for which the answer is already known. The primary purpose is still to catch autologins, but as long as they’re making a legitimate user recognize digitized words, they might as well get something useful out of it.

friedo · April 9, 2011, 2:55am

And the way they know the known word is that they have previously given it to many people who have interpreted it the same way, giving a high degree of confidence as to what it is.

The person doing the reCAPTCHA doesn’t know which word is known and which is unknown. When he enters the known word correctly, the system gains some confidence that his guess for the unknown word is also correct. If many people enter the same interpretation for the unknown word, then that word is considered solved and is thereafter used as a known word.
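To make the scheme described above concrete, here is a minimal sketch in Python. It is not reCAPTCHA's actual code: the class name `WordPool`, the image ids, and the threshold `VOTES_NEEDED = 3` are all assumptions made for illustration (the thread only says "many people" must agree).

```python
from collections import Counter, defaultdict

# Assumed threshold; the real number of agreeing users is not given in the thread.
VOTES_NEEDED = 3


class WordPool:
    """Hypothetical model of the two-word scheme: one control word, one unknown word."""

    def __init__(self, known):
        # image id -> accepted text, stored lowercase for case-insensitive comparison
        self.known = {img: text.lower() for img, text in known.items()}
        # image id -> tally of guesses from users who passed the control word
        self.votes = defaultdict(Counter)

    def submit(self, control_id, control_guess, unknown_id, unknown_guess):
        """Return True if the user passes the human test."""
        if control_guess.strip().lower() != self.known[control_id]:
            # Failed the control word: reject, and ignore the other guess.
            return False
        guess = unknown_guess.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= VOTES_NEEDED:
            # Enough users agree: the word is "solved" and becomes a control word.
            self.known[unknown_id] = guess
            del self.votes[unknown_id]
        return True


pool = WordPool({"img1": "morning"})
for _ in range(VOTES_NEEDED):
    pool.submit("img1", "morning", "img2", "overlooks")
print(pool.known)
```

Note that in this sketch only the control word decides whether the user gets through; the guess for the unknown word is merely counted as a vote.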

Chronos · April 9, 2011, 7:07pm

And, of course, if the spammers manage to come up with an algorithm that can defeat them, then Google can appropriate that algorithm and use it to improve their own automated OCR methods. It’s a win-win.

Cugel · April 10, 2011, 1:29am

Sometimes you do. And if the other word is too hard you can just keyboard mash to save time. Rather than mucking about squinting and guessing.

Aversin · April 10, 2011, 2:38am

Actually you can tell the difference between the known word and the unknown one, at least unless it has been changed. The known word always looks a certain way, it has a particular (for lack of a better word) font.

On a side note, there have been movements by groups of people on the internet to replace the unknown word with a certain racial slur.

friedo · April 10, 2011, 3:10am

You can also just press the recycle button to get a new one, if you can’t read it.

That seems implausible, given the way the system is designed to work. The known and unknown words come from the same sources. If you can provide an example of this I’d like to see it.

BigT · April 10, 2011, 6:36am

It’s not as implausible as you make it sound. It makes sense that the algorithm might have more trouble with certain fonts.

My observation, though, is that other differences are more common. One word always looks a lot more smeared than the other, for example. The letters are less distinct. Also, one word often isn’t one that’s in a normal dictionary (in fact, I’ve seen quite a few that were obvious typos). I think it is quite likely that these are the words that are hard to determine by OCR software. I know the software I’ve used before has problems with such text.

Rigamarole · April 10, 2011, 7:29am

Interesting - that explains why sometimes I get by the captcha even though I’m pretty sure I typed one of the words wrong (hey, they can be hard).

psychonaut · April 10, 2011, 8:46am

What makes you think the spammers are just going to hand over their source code to Google?

ZenBeam · April 10, 2011, 1:26pm

I had to do a recaptcha after reading this thread, and it did seem I could pick the real word. It was similar to the example in the link, where one word was a dictionary word and was much straighter than the other.

Dewey_Finn · April 10, 2011, 2:26pm

You seem to be suggesting that the known word is one that’s been created by the organization, while the unknown word is one that’s from a genuine scan. My understanding is similar to friedo’s; that the known word is also from a genuine scan but is one that’s already been identified by several users.

John_Bredin · April 10, 2011, 2:50pm

Is there a reason for this, other than some people’s endless capacity to be assholes? A “principled” objection to being “forced” to do a couple of seconds of “work” solving a puzzle not of their own choosing? :rolleyes:

Interconnected_Series_of_Tubes · April 10, 2011, 4:21pm

You’ve clearly never been to 4chan.

John_Bredin · April 10, 2011, 5:04pm

That was subsumed into “some people’s endless capacity to be assholes.”

Derleth · April 10, 2011, 5:20pm

Aside from psychonaut’s point, the algorithm currently used by spammers is to hire people in the Third World to solve them at a few pennies per word.

Nunzio_Tavulari · April 11, 2011, 3:26am

They’re mainly there as a first defense against spambots. I encounter many captchas that only require a vague approximation of what’s in the box. The same number of characters within one, and at least two matching characters. Some are very strict, others are not. You may see some that have an algebraic equation in them. That’s a sure sign that it’s looking for a loose interpretation, just the numbers will do.

I’ve never heard of this project to clarify digitized content but it seems feasible.
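The loose check described in this post (length within one character, at least two matching characters) could look something like the sketch below. This is only a reading of the poster's observation, not any real captcha's code: the function name `loose_match` is hypothetical, and "matching characters" is assumed to mean the same character at the same position.

```python
def loose_match(answer: str, typed: str) -> bool:
    """Accept a typed response that only roughly resembles the expected answer."""
    # Rule 1: the lengths may differ by at most one character.
    if abs(len(answer) - len(typed)) > 1:
        return False
    # Rule 2 (assumed positional): at least two characters line up exactly.
    same_position = sum(a == b for a, b in zip(answer, typed))
    return same_position >= 2


print(loose_match("x7Kq2", "x7abc"))
print(loose_match("x7Kq2", "abcde"))
```

A check this lenient would explain why a response with one or two wrong letters still gets through, though it also makes the captcha easier for a bot to guess.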

tellyworth · April 11, 2011, 5:11am

This is true. Also, several major spambots advertise the ability to break CAPTCHAs (including recaptcha) using OCR techniques.

Aversin · April 17, 2011, 7:25pm

Did a short bit of searching and found the picture I was referring to. Link* it contains an image which describes what I posted.

The image I just linked provides an example. A few times I entered incorrect words to see if I could distinguish between the fonts used, but I didn’t do it enough to rule out the possibility that I got lucky.

control-z · April 18, 2011, 4:56pm

Interesting. That explains why many times I don’t have high confidence in my interpretation of one of the words but it still accepts my answer most of the time. I would think I would be wrong more in trying to interpret the very distorted letters.
