What’s the purpose of the weird text at the bottom of spam email? Like this I received today at the bottom of an email advertising cheap Viagra:
seldom necessaries, but then they cost so little,girl will come out at once, as she did not before"oh, march. whose tender little heart had known him in temper, i’ll say no more. talk it over with the such restraint, and looked like a blissful
Just to get past screening algorithms, so it looks like the ad is part of a genuine email.
This has been covered before, I know, so if you search (I can’t as a guest) you will probably find more comments.
Many spam blockers rely on contextual analysis and word frequency to determine if an email is spam. Adding the nonsense–note it isn’t a completely random selection of words, but a variety of random clauses made to look like an English sentence–fools the spam reader into thinking this is a legitimate part of the message, and so it counts the words as part of the “word frequency” check.
An email with too high a percentage of words being “penis”, for example, would get caught, but adding the nonsense text at the bottom increases the denominator enough that the percentage falls below the threshold. Making it look something like a legitimate English sentence fools the more sophisticated readers that are able to recognize random patterns of words–say, from a portion of the dictionary. Finally, spammers can’t just use the same text every time, because once an email is ID’ed as spam by a reader it will block any similar-looking email in the future.
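A rough sketch of the "increase the denominator" arithmetic described above. The flagged-word list and the 20% threshold are made-up values for illustration, not taken from any real filter, and real filters weigh words far more carefully than a plain ratio.

```python
def flagged_fraction(text, flagged_words):
    """Return the share of words in `text` that appear in `flagged_words`."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip('.,!?"') in flagged_words)
    return hits / len(words)


# Hypothetical values, chosen only to show the effect of padding.
FLAGGED = {"cheap", "viagra", "pills"}
THRESHOLD = 0.20

ad = "cheap viagra pills buy now"
padding = ("seldom necessaries but then they cost so little girl will come "
           "out at once as she did not before")

print(flagged_fraction(ad, FLAGGED))                   # 0.6   (3 of 5 words)
print(flagged_fraction(ad + " " + padding, FLAGGED))   # 0.125 (3 of 24 words)
```

The ad on its own is well over the assumed threshold; with the filler appended, the same three flagged words fall below it.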
There’s also the literary quotation variety; I used to get spam with quotes from Pride and Prejudice and The Lord of the Rings. It made for some amusing reading, I can assure you. I don’t have any examples here, but I’m sure others will chime in. It’s all in aid of getting the spam past filters, as CJJ* explained.
I was at a computer expo a few years ago, and I mentioned to the anti-spam guys that I had a pretty good way to detect spam: just see how many words were not in the dictionary! Since the vast majority of spam I get is filled with w0rd5 L1ke t4is, it should be easy to filter based on a certain percentage of non-words. Their software guy thought it was a pretty good idea, but nobody seems to be doing it yet…
That wouldn’t work for our company – we send lots of mail, both intra-company and to our clients, that includes words not in any dictionary: technical terms and acronyms we invented for our own use. We’d have to provide an updated dictionary for the spam-detection software to every company we send email to. Not to mention that our emails often include very weird file names.
That is a pretty good idea. I wonder if it’s part of the algorithms used by any filters already out there. You might get false positives if the email uses a lot of acronyms and industry jargon, but every system gets some false positives. But… cool idea, beowulff. I wish I were a programmer, because then I’d have the skill to steal it.
Really, what you’d want to do is construct a script to translate leetspeak to letters (not at all hard; I could do it in an hour, or in five minutes if I didn’t have to refresh my memory on regular expressions), and see how many of the “non-English” words get converted to real words by it. And, of course, how many of those real words are “penis”, “pills”, “replica watches”, etc. So when you’re talking about your new model R73-a bingbongifier, it wouldn’t trigger the filter, but “en14rg3 your m@nh00d” would.
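Something along these lines is one way that script could look. The substitution table, the tiny stand-in dictionary and the spam-word list are all assumptions made up for the example. A one-to-one table is also a simplification, since a character like "1" can stand for either "l" or "i".

```python
import re

# Hypothetical substitution table; real spam uses many more variants.
LEET = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

# Stand-ins for a real dictionary file and a real spam-word list.
DICTIONARY = {"enlarge", "your", "manhood", "new", "model"}
SPAM_WORDS = {"enlarge", "manhood"}

TOKEN = re.compile(r"[a-z0-9@$]+")


def score(text):
    """Return (converted, spammy): how many non-dictionary tokens become
    dictionary words after translation, and how many of those are spam words."""
    converted = spammy = 0
    for token in TOKEN.findall(text.lower()):
        if token in DICTIONARY:
            continue                      # already a real word, ignore it
        plain = token.translate(LEET)
        if plain in DICTIONARY:
            converted += 1
            if plain in SPAM_WORDS:
                spammy += 1
    return converted, spammy


print(score("en14rg3 your m@nh00d"))                 # (2, 2)
print(score("your new model R73-a bingbongifier"))   # (0, 0)
```

The jargon-heavy sentence scores zero because "r73" and "bingbongifier" do not turn into dictionary words, which is the distinction the post is after.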
Another aspect of the random gibberish is that some server-level spam filters will pick up on many identical messages being sent to different users on the server. If every recipient gets a different random passage, then this test won’t be triggered.
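To illustrate why a different random passage per recipient defeats this kind of test, here is a minimal sketch that assumes the server simply counts exact (whitespace- and case-normalized) duplicates of a message body. The function names and the `min_copies` cutoff are hypothetical; actual server-side filters presumably use fuzzier comparisons than an exact hash.

```python
import hashlib
from collections import Counter


def body_digest(body):
    """Hash a message body after collapsing whitespace and case."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def bulk_digests(bodies, min_copies=3):
    """Return digests seen at least `min_copies` times across all mailboxes."""
    counts = Counter(body_digest(b) for b in bodies)
    return {digest for digest, n in counts.items() if n >= min_copies}


identical = ["Cheap pills here"] * 3
padded = ["Cheap pills here " + tail
          for tail in ("alpha beta", "gamma delta", "epsilon zeta")]

print(len(bulk_digests(identical)))   # 1: the repeated body is caught
print(len(bulk_digests(padded)))      # 0: every copy hashes differently
```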
I think this would snag a lot of them, but you’d be hard pressed to keep on top of all the possible combinations of letters and symbols that could be transliterated into a real word. This is a pretty interesting article showing the ridiculously large number of ways to spell “Viagra” with symbols and letters.
I think the key in beowulff’s plan is the percentage threshold. I’m in IT, but even in my most acronym-ized, jargon-ized, tipo-fillde email, I might hit 5% non-words. The spam I see is usually way higher. And, you’d probably want to look for both percentage and clustering of non-words. If the spammer puts the complete works of Shakespeare at the end of the email, the percentage of non-words would be low, but they would be highly clustered in the beginning. Say, if any block of 200 words contains 75 non-words, you flag it. Something like that might work.
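The clustering check could be sketched as a sliding window over the message, using the 200-word block and 75 non-word cutoff from the post as defaults. The function name and the toy dictionary in the usage lines are assumptions for the example.

```python
def has_nonword_cluster(words, dictionary, window=200, limit=75):
    """Return True if any block of `window` consecutive words contains
    at least `limit` words that are not in `dictionary`."""
    flags = [0 if w.lower() in dictionary else 1 for w in words]
    count = sum(flags[:window])           # non-words in the first block
    if count >= limit:
        return True
    for i in range(window, len(flags)):
        # Slide the block one word: add the new word, drop the oldest.
        count += flags[i] - flags[i - window]
        if count >= limit:
            return True
    return False


dictionary = {"the"}                       # toy stand-in for a real word list

# 80 non-words up front, then a long run of real words (the Shakespeare trick).
spam = ["v1agra"] * 80 + ["the"] * 5000
# 100 non-words spread evenly through 5000 words (jargon in normal mail).
normal = ["xq7" if i % 50 == 0 else "the" for i in range(5000)]

print(has_nonword_cluster(spam, dictionary))     # True
print(has_nonword_cluster(normal, dictionary))   # False
```

The padded spam is under 2% non-words overall, so a percentage test alone would miss it, but the opening block trips the cluster check. The evenly scattered jargon never puts more than a handful of non-words in any one block.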
One of the comments that the spam guys made was that this would be hard for non-English (and Unicode) emails. Frankly, I don’t give a damn. Those folks can use whatever spam filtering they want, but I think this method would work quite well for me.
Of course, the spammers would figure out a way around it after a while.
This brings me to the crux of the biscuit, which is that the current email system is fundamentally broken. There are so many spam filters up that people don’t receive valid email and **never know it**. As an example, I NEVER get emails from my GF, probably because she’s on msn.com. I have no idea what happens to them, they just never make it to any of my mailboxes. sigh
That website says that there are 600,426,974,379,824,381,952 ways to spell Viagra, which may sound like a lot. But it’s still really easy to write a script that catches every last one of them. How do I know this? Because he gave a number. The only way anyone can quote a number that big is if they have in mind an algorithm for generating Viagra-spellings, and I could just use a variation of that same algorithm for recognizing Viagra-spellings. I don’t need to have the list of all 600+ quintillion spellings in order to recognize one, any more than the author of the webpage needed a list of all 600+ quintillion to count them.
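The point about counting versus recognizing can be shown with one small table of per-letter alternatives. The table below is invented for the example and is much smaller than whatever the linked page used, so the count is correspondingly tiny. The optional filler characters between letters are likewise an assumption.

```python
import re

# Hypothetical per-letter alternatives (literal strings, escaped below).
VARIANTS = {
    "v": ["v", "\\/"],
    "i": ["i", "1", "!", "|"],
    "a": ["a", "@", "4"],
    "g": ["g", "9"],
    "r": ["r"],
}


def count_spellings(word, variants):
    """Count spellings the table can generate (ignoring filler characters)."""
    total = 1
    for ch in word:
        total *= len(variants[ch])
    return total


def build_pattern(word, variants, filler=r"[\s._*-]?"):
    """Build one regex that recognizes every spelling the table generates,
    allowing one optional filler character between letters."""
    parts = ["(?:" + "|".join(re.escape(alt) for alt in variants[ch]) + ")"
             for ch in word]
    return re.compile(filler.join(parts), re.IGNORECASE)


pattern = build_pattern("viagra", VARIANTS)

print(count_spellings("viagra", VARIANTS))            # 144
print(bool(pattern.search("buy V!@gr4 today")))       # True
print(bool(pattern.search("v.i.a.g.r.a")))            # True
print(bool(pattern.search("crossing the viaduct")))   # False
```

Both functions walk the same table: one multiplies the number of choices per letter, the other turns those choices into alternations. Neither ever lists the individual spellings, and growing the table grows the count multiplicatively while the pattern only gets a little longer.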
Incidentally, I presume that the major spam filters already do use techniques like this, and they manage to block out the vast majority of spams. The only reason any get through at all is because of measures put in place to prevent false positives. Nobody wants to miss an important e-mail, so most filters will include a whitelist of some sort, such that any message that meets certain criteria will be allowed through, no matter what red flags it has. And sometimes, a spam message will randomly happen to meet one of those criteria.
Chronos, I see your point, and it makes good sense. I was just now playing around with writing a script that checks words found in an email against a dictionary. I just parsed through the source file and searched for the resulting strings in a dictionary.txt file. That would seem to me (I am not a programmer) to be an easier approach than trying to duplicate the spammer-lingo algorithm and then checking the dictionary. Whether it’s easier, of course, doesn’t have a lot of bearing on its relative usefulness.
In any case, I also would imagine that the things we’re talking about are in use in some combination in existing spam filters.