How do spambots get past "human-proving" tests?

Another forum I post on is having a lot of trouble with spambots recently. In order to ward them off, it has a test common to a lot of profile-creating sites; during the sign-up, you get a series of pictures of a code (not the actual text) which you then have to re-enter. That way, a bot can’t simply use the text on the page itself since the code isn’t text.

One way I can think of to get round this would be to look at the filenames of the pictures. For example, if your random code is gj039u23, and the filenames are g.jpg, j.jpg, etc., a bot could be programmed to try entering the filenames of the pictures on the page (something that would take time if there are more images than just the code ones, but would eventually work). I don’t know if something like this is possible, though.

Anyway, does anyone know how spambots can get past this test, and ones like it?

OCR - http://en.wikipedia.org/wiki/Optical_character_recognition

Advanced Details on Methods of Defeating ‘Captchas’ http://sam.zoy.org/pwntcha/

Can I just suggest that site-designers not make these things too much more difficult to decode?

I find myself having difficulty decoding these “CAPTCHAs” about half of the time.

Outsourcing to China perhaps? <- Possible actually

But to your example; well…most of the security check image things I have ever seen have been a single image, not a sequence of different ones. If the site you are talking about does have it as a sequence, and the filenames are as you say then…

With a single image though, probably the best you could do with the filename would be to assume that it is a timestamp, and then assume that the code is derived from the time. But you would need to know what they were doing to the timestamp to convert it into such a string. So…if the site is using an open source engine, and the engine is doing something like that, then it would be economically feasible to build an app that searched the net for any forums using that engine, and then exploit the timestamp → code hashing weakness. But for a non-open source engine, determining the hashing would be pretty hard, and for anything that wasn’t a public engine, figuring out an exploit for just a single board would be a waste of time. So, lots of Chinese people. :cool:
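To make the timestamp idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the scheme (code = first 8 hex characters of an MD5 of the timestamp), the function names, and the idea that the image filename is the raw timestamp. No particular forum engine is known to work this way.

```python
import hashlib

def code_from_timestamp(timestamp: int, length: int = 8) -> str:
    """Hypothetical weak engine: the code is a truncated hash of the timestamp."""
    digest = hashlib.md5(str(timestamp).encode("ascii")).hexdigest()
    return digest[:length]

def attacker_guess(image_filename: str) -> str:
    """If the filename leaks the timestamp, anyone who has read the
    engine's source can recompute the code without looking at the image."""
    timestamp = int(image_filename.rsplit(".", 1)[0])
    return code_from_timestamp(timestamp)

# The server generates a code, then serves the image as "<timestamp>.jpg".
server_code = code_from_timestamp(1136073600)
print(attacker_guess("1136073600.jpg") == server_code)  # True
```

The point is only that a code derived from something visible on the page is no secret once the derivation is public, which is why an open source engine with this flaw would be worth a spammer's time.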

Ah, or if the site isn’t using a proper deformer…

Yes, it would work, but only if the CAPTCHA is so poorly designed that there is a one-to-one correspondence of filenames (or other identifying marks, such as checksums and hashes) to images, and there is a reasonably small number of images in total. There are such poorly designed CAPTCHAs out there, though, and records of this kind of exploit happening.
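A rough sketch of that kind of exploit, assuming the site reuses a small set of byte-identical images. The helper names and the stand-in image bytes are made up for illustration; a real attack would download the actual files and have a human label each one once.

```python
import hashlib
from typing import Dict, List, Optional

def fingerprint(image_bytes: bytes) -> str:
    """Checksum that identifies a byte-identical image, whatever its filename."""
    return hashlib.sha256(image_bytes).hexdigest()

def build_lookup(labelled_images: Dict[str, bytes]) -> Dict[str, str]:
    """Map fingerprint -> character, from images a human has labelled once."""
    return {fingerprint(data): char for char, data in labelled_images.items()}

def solve(page_images: List[bytes], lookup: Dict[str, str]) -> Optional[str]:
    """Return the code if every image on the page is already known, else None."""
    chars = [lookup.get(fingerprint(data)) for data in page_images]
    if any(c is None for c in chars):
        return None
    return "".join(chars)

# Stand-in bytes for the real image files (hypothetical data).
labelled = {"g": b"<bytes of g.jpg>", "j": b"<bytes of j.jpg>", "0": b"<bytes of 0.jpg>"}
lookup = build_lookup(labelled)
print(solve([labelled["g"], labelled["j"], labelled["0"]], lookup))  # gj0
```

This only works while the image set is small and never changes; a CAPTCHA that distorts each image freshly gives a different checksum every time, and the lookup table is useless.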

A popular tactic is for the spammer to present the CAPTCHA to an unwitting human to get him to solve it. For example, the spammer will set up a website offering free porn, but in order to view the porn, you have to solve the CAPTCHA. The trick is that it isn’t really the porn site’s CAPTCHA; it’s another site’s CAPTCHA, from a site that the spammer wants to spam.

Pretty cool link: http://www.pwntcha.net/test.html
If you go to http://linuxfr.org/user_new.html and save the picture it generates, then upload it to that site, you can see it defeated in real time.

One site I frequent, instead of using these image codes, requires the poster to answer a simple math question. For example, it might ask, “What is two plus four?” You have to provide the correct number (“6”) for your post to be accepted.

It’s a little buggy (if you take too long to compose your post, the question changes, and your post is rejected) but it has cut down on spam significantly.
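For what it's worth, a question like that is easy to sketch. This is not how that site does it; the names (`pending`, `new_question`, `check_answer`) are invented, and the dictionary stands in for whatever server-side session storage a real site would use. Keeping the expected answer under a per-form token is one way to avoid the "question changed while I was typing" problem.

```python
import random
import secrets

WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

# token -> expected answer; a stand-in for server-side session storage.
pending = {}

def new_question(rng=random):
    """Return (token, question text); the answer is remembered under the token."""
    a, b = rng.randrange(10), rng.randrange(10)
    token = secrets.token_hex(8)
    pending[token] = a + b
    return token, f"What is {WORDS[a]} plus {WORDS[b]}?"

def check_answer(token, reply):
    """Accept the post only if the reply matches the answer stored for this token.
    Each token can be used once, so a solved question cannot be replayed."""
    expected = pending.pop(token, None)
    return expected is not None and reply.strip() == str(expected)

token, question = new_question()
print(question)
```

The token would travel with the form as a hidden field, so the question a poster sees is always the one their answer is checked against, however long they take.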

I have just had a revelation: The world’s first A.I., the first artificial entity to be able to pass the Turning test, may well be a spambot. :eek: :frowning:

Or they can outsource it to people who are interested in something they have, so interested that they’ll do a captcha to get access. Things such as porn, mp3s, and software cracks.

I’m not sure you’d even need to outsource it. Think of how many captchas a person could do in, say, a half-hour. A spammer might be quite willing to do a half-hour of non-taxing work per day. And for each captcha he decodes, his spambot gets into another free webmail service or message board, and can send thousands of messages on it.

This would be considerably curtailed if a new captcha were required for every e-mail or post (since then the spammer would need to personally review thousands of images to send out thousands of spams), but I haven’t encountered any sites which require this. Usually, it’s just to sign up for an account.

You could have something like requiring a captcha on the 1st, 2nd, 4th, 8th, 16th etc message. A normal user would only ever see about 10 or so captchas but spammers would be quickly caught before they sent out more than a few dozen messages.
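A quick sketch of that schedule, just to check the arithmetic. The function names are made up; the only assumption is that the site keeps a per-account message count starting at 1.

```python
def needs_captcha(message_number: int) -> bool:
    """True on the 1st, 2nd, 4th, 8th, ... message, i.e. on powers of two."""
    # A power of two has exactly one bit set, so n & (n - 1) clears it to zero.
    return message_number >= 1 and (message_number & (message_number - 1)) == 0

def captchas_seen(total_messages: int) -> int:
    """How many captchas an account has solved after this many messages."""
    return sum(needs_captcha(n) for n in range(1, total_messages + 1))

print([n for n in range(1, 40) if needs_captcha(n)])  # [1, 2, 4, 8, 16, 32]
print(captchas_seen(1000))  # 10
```

So an account that posts 1000 messages sees 10 captchas, which fits the "about 10 or so" estimate. It does not stop spam on its own, though: it only forces a human (or solver) back into the loop at each doubling.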

eBay requires that a CAPTCHA be solved for every inter-member e-mail.

To every bot, turn, turn, turn!

(The Turing Test.)

BTW, how do those CAPTCHA sites deal with blind people?

I think some sites provide audio captchas as well where you download an MP3 of a person reading a set of numbers.

I’ve heard of another interesting way to defeat the bots: use pictures.
The site displays, in random order, about 10 small photos of similar items (like bicycles and motorcycles), and tells the human to click on the 6 pictures of the motorcycles.

It sounds like a good idea… does anybody know if it works?

That is a solution, but it isn’t a very good one for two reasons:
[ul]
[li]Not everyone can play audio files. This goes double for patent-encumbered formats like MP3.[/li][li]I think voice recognition is actually better than OCR at this point. I’ve run into more than a few automated telephone trees based on the computer’s ability to recognize the words coming outta my mouth.[/li][/ul]

Yes, but not very well. There are a number of problems:
[ol]
[li]The instructions change every time the CAPTCHA is presented. This is a problem if the human doesn’t speak the language very well or at all. (For example, I might not know what the word “motorcycle” means, so I wouldn’t know which pictures to click on.) Text-based CAPTCHAs don’t have this problem, since the instructions are the same no matter what image is displayed. I could solve a CAPTCHA even if the instructions were in Slovenian or Polish.[/li][li]The number of possible combinations of puzzles is smaller with photos, as there are fewer photos than there are words. Spammers could eventually build a database of all photos used by a particular system and tag them with the appropriate keyword(s).[/li][li]Photo CAPTCHAs need to be painstakingly human-generated—that is, the human writing the CAPTCHA puzzle has to take or find the photos, crop them and resize them, upload them to a server, assign them the appropriate keywords (“motorcycle”, “kitten”, etc.), and then enter all this information in a database. On the other hand, text-based CAPTCHA software can generate puzzles at random on the fly by simply choosing words from a dictionary or even picking letters at random.[/li][li]Photo CAPTCHAs are more bandwidth-intensive. Instead of just sending the human one image, you have to send them a whole bunch. A lot of the world still uses slow dialup Internet access, particularly in third-world countries.[/li][/ol]
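For a sense of scale on the photo scheme described earlier (about 10 photos, click the 6 motorcycles), here is a back-of-the-envelope check of how often a bot would pass by clicking at random. It assumes the bot knows it must pick exactly 6 of the 10 and that only one selection is correct; the variable names are just for illustration.

```python
from math import comb

photos_shown = 10   # photos on the page, from the example above
targets = 6         # photos the human is told to click

# Number of distinct ways to pick 6 photos out of 10.
ways = comb(photos_shown, targets)
print(ways)                       # 210
print(f"{100 / ways:.2f}%")       # 0.48%
```

Roughly one blind attempt in 210 gets through, before the bot does any image recognition or database lookup at all. That is not much of a barrier for something that can retry all day, unless failed attempts are rate-limited.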

Actually, I think people will have more trouble with the non-patent-encumbered formats. All the computers I have used (Windows, Linux, etc.) have come with an MP3 player without any additional programs to install. Until very recently, I had to download Ogg codecs separately.

And more importantly, there are deaf people surfing on the net…