reCAPTCHA: Digitize Books -- One Word at a Time

I think it’s a nifty idea.

Basically, it’s like CAPTCHA, except it uses two words: one “known” word that is used for security, and a second “unknown” word that is scanned from old texts. The answers typed for the unknown words are collected to generate a digitized version of those texts.

details

It seems like a neat idea at first, sorta like SETI@home, where unused CPU cycles are put to good use. But I wonder how it would work – human-entered text needs to be compared with what is expected before it can serve a useful purpose. Here, we don’t know what the correct text is, so how can the entered text be validated? Wouldn’t we need some human super-editor to look over the results to avoid deliberate deception, not to mention unintentional human error? And using a human super-editor would defeat the purpose of saving man hours, right?

Security would be ensured using one of the words, much like a regular CAPTCHA. This word is known beforehand.

The digitizing is done on the second word. If the user entered the known word correctly, then the answer for the second one is considered. To avoid mistakes, deceptions and typos, each unknown word would be shown to multiple users, and a vote taken to select the proper transcription.
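To make the voting idea concrete, here is a minimal sketch in Python. It is only a guess at how such a scheme could work, not how reCAPTCHA is actually implemented: the function names, the case-insensitive comparison, and the thresholds (`min_votes`, `min_share`) are all assumptions made up for illustration.

```python
from collections import Counter


def normalize(word):
    # Assumption: comparisons ignore case and surrounding whitespace.
    return word.strip().lower()


def accept_submission(known_answer, typed_known, typed_unknown, votes):
    """Record a guess for the unknown word only if the known word was typed correctly."""
    if normalize(typed_known) != normalize(known_answer):
        return False  # failed the security check, so the second answer is ignored
    votes.append(normalize(typed_unknown))
    return True


def resolve(votes, min_votes=3, min_share=0.6):
    """Return the winning transcription, or None if there is no clear winner yet.

    min_votes and min_share are hypothetical thresholds chosen for this example.
    """
    if len(votes) < min_votes:
        return None
    word, count = Counter(votes).most_common(1)[0]
    return word if count / len(votes) >= min_share else None


votes = []
accept_submission("morning", "morning", "upon", votes)
accept_submission("morning", "mourning", "xxxx", votes)  # wrong known word, ignored
accept_submission("morning", "Morning", "upon", votes)
accept_submission("morning", "morning", "upcn", votes)   # honest typo
print(resolve(votes))
```

With enough independent users, a single typo or a deliberately wrong answer gets outvoted, so no human super-editor has to review every word. Only words that never reach a clear majority would need a closer look.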