In an attempt to stem the never-ending flow of spam, I’ve started wondering if there’s a list of letter combinations that do not occur in the English language. I thought that if I filtered on them, it might be a weapon against those spammers that use random letters in the subject or body.
For example, can it be said that the letter “X” is never followed by the letter “P”?
Depending on your email use (are you getting program info, e.g. XP, or is it just email between friends?), I would say set your email up to filter the messages of people you know into specific boxes, and filter everything else away. Also, forward the spam to the FTC. If the spam breaks the rules, the “Feds” should bear down on them and it will help to curb the spam. Until then, never reply to spam or click any links in it. Doing so confirms that yours is a valid account and that they should send you more. Instead copy the link and paste it in a new window. It seems to work for me in keeping the spam down.
And as for the English letter combinations, you will probably have to get to 4 characters before you start to really limit the possibilities. The only one that I can think of in common language is q_. The blank should typically be a U, but there are exceptions, especially if you include abbreviations: QA or QTY.
I’m rather surprised I got this long a list. I also now have a list of the frequencies for the other 558 pairs. Of course, in addition to odd, seldom-used words or proper names not in that spelling list, we will have a lot of those combinations occurring in abbreviations and acronyms: mg for classic sports cars and metric measurements, for instance. And any combination you hit upon as highly unlikely now may well become an important abbreviation in the future.
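For anyone who wants to repeat the exercise, here is a rough sketch of how such a list could be built. It is not the exact method used above, just one way to do it: count every adjacent letter pair in a word list and report the pairs that never show up. The function names and the four-word sample are made up for illustration; in practice you would feed in a real spelling list.

```python
from itertools import product
from string import ascii_lowercase


def pair_counts(words):
    """Count adjacent letter pairs (a-z only) across all the given words."""
    # Start every one of the 26 * 26 = 676 possible pairs at zero.
    counts = {a + b: 0 for a, b in product(ascii_lowercase, repeat=2)}
    for word in words:
        w = word.lower()
        for first, second in zip(w, w[1:]):
            pair = first + second
            if pair in counts:  # ignores pairs containing apostrophes, digits, etc.
                counts[pair] += 1
    return counts


def unseen_pairs(words):
    """Return the sorted list of pairs that never occur in the word list."""
    return sorted(p for p, n in pair_counts(words).items() if n == 0)


# Tiny stand-in for a real word list (assumption: yours will be far larger).
sample = ["queen", "expect", "subquery", "buzzbomb"]
missing = unseen_pairs(sample)
print(len(missing), "of 676 pairs never occur in the sample")
```

The result depends entirely on the word list you use, which is the same caveat as above: a pair missing from a spelling dictionary can still turn up in abbreviations and proper names.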
Sorry,
“you will probably have to get to 4 characters”
Had my logic backwards there, was thinking that you wanted to limit to English by proving that it was. Much different than exclusions.
As for the list above, I take that down to 97 with just the common abbreviations that occur either in “netspeak” or in standard language, e.g. VT for Vermont.
And what is the likelihood that these are used in the spam? What percentage will you hit by writing this code? 5% (not worth it by my estimation compared to what you can exclude by easier means), 15%, 50%? What is it worth to you?
or “buzzbomb” or “subquery”. There will be a lot of words not included in the words list for a simple spelling program. I was just performing the exercise out of curiosity. If we worked on it, we could probably find legitimate occurrences of each of those combinations.
TK is pretty uncommon. Which is probably one reason why it’s hung around so long in the publishing world, where it’s used to indicate something that will be inserted later (TK=To Come), e.g., “The USA exports TK tons of wheat every year.” In modern times, a quick word search for “TK” usually brings up any offending “yet to be inserted” instances.
Some spammers are even resorting to leet-speak to get around the filters. Because it substitutes numbers and punctuation for letters, it would be even harder to trap than rare letter pairings. I get things like:
Enlar6e y0ur p3n!$ t0d@Y!
L0w3r m0rtga93 ra+3s n0w @/a!ila8le.
They just won’t give up, no matter how much we hate them and try to thwart their efforts.
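One way to take a swing at these, sketched here as an idea rather than a tested filter: map the common substitutions back to letters before checking the subject against a word list. The substitution table and the blocked words below are my own guesses based on the two examples above, not a complete or authoritative list.

```python
# Assumed substitution table, built only from the sample subjects above.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s",
    "6": "g", "7": "t", "8": "b", "9": "g",
    "@": "a", "$": "s", "!": "i", "+": "t",
})

# Hypothetical block list; you would choose your own words.
BLOCKED_WORDS = {"enlarge", "mortgage"}


def normalize_leet(text):
    """Lower-case the text and swap look-alike characters back to letters."""
    return text.lower().translate(LEET_MAP)


def looks_like_spam(subject):
    """True if any normalized word in the subject is on the block list."""
    words = normalize_leet(subject).split()
    return any(word.strip(".,?") in BLOCKED_WORDS for word in words)


print(normalize_leet("L0w3r m0rtga93 ra+3s"))
print(looks_like_spam("Enlar6e y0ur p3n!$ t0d@Y!"))
```

The mapping is blind, so a genuine exclamation mark also turns into an "i" and real digits turn into letters. That is harmless when you are only looking for a few blocked words, but it means the normalized text is not fit for anything else.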
Do you have a cite for that? My impression was that it was to fool the spam filters, nothing more. And I frankly can’t imagine how a subject header could do anything for flash or java.
I find these pretty easy to filter on, when you get a few key words set up. There’s no chance I can see that “p3n” and “v1a” are going to be anything other than spam so I feel really secure about pitching them.
All captured mail goes to a junkmail folder that I review before discarding, so the risk from any false positives is low.
As an aside, I also have to question the “Java” or “flash” stuff. Most of the random characters are there to keep the incoming spam detectors from triggering on a thousand identical incoming messages. Every message received by a specific mail server is slightly different from every other one.
The ones I’m finding impossible to filter are those that have the message as an image. When “Penis” is part of a gif, nothing short of OCR will catch it. These ones, though, to keep the spam detectors from triggering, will often still have a string of random letters at the bottom. It’s these that I’m hoping to catch with the letter-frequency method.
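Here is roughly what I have in mind, as a sketch only: score a block of text by the share of its letter pairs that fall in a "rare" set, and flag it when the share passes a threshold. The pairs in RARE_PAIRS and the 0.2 threshold are placeholders I picked for illustration, not measured values; the real set would come from a frequency list like the one discussed earlier in the thread.

```python
# Placeholder set of rare pairs (assumption, not taken from a real frequency list).
RARE_PAIRS = {"jq", "qx", "qz", "vq", "xj", "zx"}


def rare_pair_ratio(text, rare_pairs=RARE_PAIRS):
    """Fraction of adjacent letter pairs in the text that are in rare_pairs."""
    # Replace anything that is not a letter with a space so pairs never span words.
    letters_only = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    total = 0
    rare = 0
    for word in letters_only.split():
        for first, second in zip(word, word[1:]):
            total += 1
            if first + second in rare_pairs:
                rare += 1
    return rare / total if total else 0.0


def is_suspicious(text, threshold=0.2):
    """Flag text whose rare-pair share reaches the (arbitrary) threshold."""
    return rare_pair_ratio(text) >= threshold


print(is_suspicious("qxzx jqvq"))
print(is_suspicious("See you at lunch"))
```

Using a ratio instead of rejecting on a single hit should let the odd abbreviation like QA or VT through, since one rare pair in a normal message barely moves the score.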
How difficult would it be to ask all my correspondents to include a certain string in the subject line or body of the message? That would make filtering out any emails which do not have that string fairly easy. If I include it as part of my signature then anyone who includes my message in their reply will have it. I am just afraid some people might not pay attention and would not include it. How effective would this technique be?
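The filtering side of this is trivial, for what it's worth. A minimal sketch, assuming a made-up token and made-up folder names (your mail client's rule syntax will differ):

```python
# Hypothetical token that correspondents would carry in the subject or body.
REQUIRED_TOKEN = "[zq-friend]"


def sort_message(subject, body, token=REQUIRED_TOKEN):
    """Return the folder a message should go to based on the token's presence."""
    haystack = (subject + "\n" + body).lower()
    # Case-insensitive match, so a reply that re-cases the quoted text still passes.
    return "inbox" if token.lower() in haystack else "junk-review"


print(sort_message("Re: dinner", "Sounds good!\n> see you then [zq-friend]"))
print(sort_message("L0w3r m0rtga93 ra+3s", "click here"))
```

Sending the misses to a review folder rather than deleting them covers the people who forget to include the string.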