English letter order - what's never seen?

In an attempt to stem the never-ending flow of spam, I’ve started wondering if there’s a list of letter combinations that do not occur in the English language. I thought that if I filtered on them, it might be a weapon against those spammers who use random letters in the subject or body.

For example, can it be said that the letter “X” is never followed by the letter “P”?

Is there a definitive list somewhere?

-B

…obviously a bad example in this world of “Windows XP” but you get the idea, I hope…

I expended much thought on this, but I expect that I don’t have enough experience to expound.

As in “experience”? Sorry I can’t come up with a working example.

Depending on your email use (Are you getting program info, i.e. XP, or is it just email between friends?) I would say set your email up to filter the messages of people you know into specific boxes, and everything else away. Also, forward the spam to the FTC. If the spam breaks the rules, the “Feds” should bear down on them and it will help to curb the spam. Until then, never reply to spam or use any links in it. That confirms that you are a valid account and that they should send you more. Instead copy the link and paste it in a new window. It seems to work for me in keeping the spam down.

And as for the English letter combinations, you will probably have to get to 4 characters before you start to really limit the possibilities. The only one that I can think of in common language is q_. Typically it should be a U, but there are exceptions, especially if you include abbreviations: QA or QTY

I got curious, and messed about with the word list for a spelling program. The following 118 pairs never appeared in the list:

bq bx bz cb cf cg cj cv cw cx dx fq fv fx fz
gq gv gx hj hx hz jb jc jd jf jg jh jj jk jl
jm jn jp jq js jt jv jw jx jy jz kq kx kz mg
mj mx mz pq pv px pz qa qb qc qd qe qf qg qh
qi qj qk ql qm qn qo qp qq qr qs qt qv qw qx
qy qz sx sz tq tx uw vb vc vd vf vg vh vj vk
vl vm vn vp vq vt vw vx wj wq wv wx xj xk xv
xx xz yy zb zc zd zf zj zm zp zq zr zx

I’m rather surprised I got this long a list. I also now have a list of the frequencies for the other 558 pairs. Of course, in addition to odd, seldom-used words or proper names not in that spelling list, we will have a lot of those combinations occurring in abbreviations and acronyms - mg for classic sports cars and metric measurements, for instance. And any combination you hit upon as highly unlikely now may well become an important abbreviation in the future.
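For anyone who wants to repeat the exercise, here is a minimal sketch of how such a tally could be done. It is not the code used for the list above; the function names and the tiny sample word list are made up for illustration, and a real run would read a spelling dictionary instead.

```python
from collections import Counter
from itertools import product
from string import ascii_lowercase


def pair_counts(words):
    """Count every adjacent letter pair inside each word."""
    counts = Counter()
    for word in words:
        letters = word.lower()
        for a, b in zip(letters, letters[1:]):
            # Skip apostrophes, hyphens, digits and non-ASCII letters.
            if a in ascii_lowercase and b in ascii_lowercase:
                counts[a + b] += 1
    return counts


def unseen_pairs(counts):
    """Return the pairs, out of the 26 * 26 = 676 possible, that never occur."""
    all_pairs = ("".join(p) for p in product(ascii_lowercase, repeat=2))
    return sorted(p for p in all_pairs if counts[p] == 0)


# Tiny stand-in word list (an assumption, not the real spelling list).
sample_words = ["experience", "expound", "subzero", "hajj", "queen"]
counts = pair_counts(sample_words)
missing = unseen_pairs(counts)
print(len(counts), "pairs seen,", len(missing), "never seen")
```

The result depends entirely on the word list: a short list leaves many pairs unseen, and a list that includes abbreviations and proper names leaves far fewer.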

Sorry,
“you will probably have to get to 4 characters”
Had my logic backwards there, was thinking that you wanted to limit to English by proving that it was. Much different than exclusions.

As for the list above, I take that down to 97 with just the common abbreviations that either occur in “netspeak” or in standard language, e.g. VT - Vermont.

And what is the likelihood that these are used by the spam? What percentage will you hit by writing this code? 5% (not worth it by my estimation compared to what you can exclude by easier means), 15%, 50%? What is it worth to you?

Yabob–take out
bz “subzero”.
jj “hajj”
zm “hazmat”

or “buzzbomb” or “subquery”. There will be a lot of words not included in the words list for a simple spelling program. I was just performing the exercise out of curiosity. If we worked on it, we could probably find legitimate occurrences of each of those combinations.

TK is pretty uncommon. Which is probably one reason why it’s hung around so long in the publishing world, where it’s used to indicate something that will be inserted later (TK=To Come), e.g., “The USA exports TK tons of wheat every year.” In modern times, a quick word search for “TK” usually brings up any offending “yet to be inserted” instances.

Some spammers are even resorting to leet-speak to get around the filters. Because it substitutes numbers and punctuation for letters, it would be even harder to trap than rare letter pairings. I get things like:

Enlar6e y0ur p3n!$ t0d@Y!

L0w3r m0rtga93 ra+3s n0w @/a!ila8le.

They just won’t give up, no matter how much we hate them and try to thwart their efforts.
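One possible countermeasure is to undo the most common substitutions before matching keywords. The sketch below assumes a hand-picked substitution table (the mapping and the function names are my own guesses, not from any real filter), and it will miss anything the table does not cover, such as the “@/” standing in for “v” in the second example.

```python
# Assumed character substitutions; a real table would need tuning.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s",
    "6": "g", "7": "t", "8": "b", "9": "g",
    "@": "a", "$": "s", "!": "i", "+": "t",
})


def normalize_leet(text):
    """Lower-case the text and map look-alike characters back to letters."""
    return text.lower().translate(LEET_MAP)


def contains_keyword(text, keywords):
    """True if any keyword appears in the normalized text."""
    normalized = normalize_leet(text)
    return any(keyword in normalized for keyword in keywords)


subject = "L0w3r m0rtga93 ra+3s n0w"
print(normalize_leet(subject))
print(contains_keyword(subject, ["mortgage"]))
```

Since “!” is always turned into “i”, ordinary punctuation gets mangled as well, so this is only suitable for substring matching, not for displaying the text.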

I can come up with 19, including yabob’s, that aren’t usable. What others?

bq subquery
cg centigram
cv curriculum vitae
gq gentlemen’s quarterly
jc jesus christ
jd jack daniels
jj hajj
jp japan
jv junior varsity
mg milligram
mx mx missile
qb quarter-back
qt quick time
vd venereal disease
vt vermont
wv west virginia
xx strikeout,xxx
zb buzzbomb
zm hazmat

You do realize that these “series of letters” aren’t random. They’re setting up code to start flash or java or some other routine.

Do you have a cite for that? My impression was that it was to fool the spam filters, nothing more. And I frankly can’t imagine how a subject header could do anything for flash or java.

I find these pretty easy to filter on, when you get a few key words set up. There’s no chance I can see that “p3n” and “v1a” are going to be anything other than spam so I feel really secure about pitching them.

All captured mail goes to a junkmail folder that I review before discarding so risk is low for any false positives.

As an aside, I also have to question the “Java” or “flash” stuff. Most of the random characters are there to keep the incoming spam detectors from triggering on a thousand identical incoming messages. Every message received by a specific mail server is slightly different from every other one.

The ones I’m finding impossible to filter are those that have the message as an image. When “Penis” is part of a gif, nothing short of OCR will catch it. These ones, though, to keep the spam detectors down, will often still have a string of random letters on the bottom. It’s these that I’m hoping to catch with the letter-frequency method.

Of course, Tk is a programming language, so that letter combination has a valid usage.

How difficult would it be to ask all my correspondents to include a certain string in the subject line or body of the message? That would make filtering out any emails which do not have that string fairly easy. If I include it as part of my signature then anyone who includes my message in their reply will have it. I am just afraid some people might not pay attention and would not include it. How effective would this technique be?
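Mechanically, the check itself is easy. Here is a minimal sketch using Python’s standard email module; the token value and the function names are invented for illustration, and only the subject and plain-text parts are searched.

```python
from email import message_from_string

TOKEN = "[pass-7Q]"  # hypothetical string you would ask correspondents to include


def text_parts(msg):
    """Yield the decoded text/plain parts of a parsed message."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                yield payload.decode(charset, errors="replace")


def carries_token(raw_message, token=TOKEN):
    """True if the token appears in the subject or any plain-text part."""
    msg = message_from_string(raw_message)
    haystack = [msg.get("Subject", "")] + list(text_parts(msg))
    return any(token.lower() in chunk.lower() for chunk in haystack)


raw = "From: friend@example.com\nSubject: Re: lunch [pass-7Q]\n\nSee you at noon.\n"
print(carries_token(raw))
```

The code is the easy part; whether correspondents remember to include the string is the real question, so messages without it are probably better routed to a review folder than discarded.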

cw is used in “cwm,” which is a Welsh word that’s been adopted into English. (It’s pronounced “koom”.)

I really doubt the headers can have any effect regarding java, but they certainly cannot be used to trigger Flash.