Spammimic: Fun, but how is the message encoded?

Musicat · April 5, 2004, 12:08am

Good idea. As long as the spam-mimic site is in operation, or if we can figure out the algorithm as a backup!

I wonder how spam-heuristic programs treat this kind of email? It seems like it would trigger most any of them to dump it unceremoniously. Personally, I go “bareback” myself, not using a filter, so I can’t try it out. I would hesitate to send this kind of encoded message to friends without warning them first.

Musicat · April 5, 2004, 5:41pm

<bump>

Bletchley Park we may not be, but with all us smart Dopers around just hungering for intellectual exercises, can’t we figure out how this works?

Taran · April 5, 2004, 7:29pm

sibyl had the right idea, but to test all short strings would surely drive us mad. I’m trying to figure out how to write a script that automatically asks spammimic for all combinations and stores the result in some text files. Then we could ‘diff’ all pairs, and at least have plenty of raw data to go on.
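A rough sketch of that idea in Python, for anyone who wants to try it. It does not talk to the site: `encode` is a placeholder for whatever function you write to fetch a spamtext (hypothetical, not anything spammimic provides), and `short_strings`, `collect`, and `diff_all_pairs` are made-up helper names.

```python
import difflib
import itertools


def short_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 1..max_len."""
    for n in range(1, max_len + 1):
        for combo in itertools.product(alphabet, repeat=n):
            yield "".join(combo)


def collect(encode, alphabet, max_len):
    """Map each short plaintext to its spamtext.

    `encode` is supplied by the caller (assumption: it returns the
    encoded text as a string). Nothing here contacts any website.
    """
    return {s: encode(s) for s in short_strings(alphabet, max_len)}


def diff_all_pairs(outputs):
    """Return a line-level unified diff for every pair of spamtexts."""
    diffs = {}
    for a, b in itertools.combinations(sorted(outputs), 2):
        delta = difflib.unified_diff(
            outputs[a].splitlines(),
            outputs[b].splitlines(),
            fromfile=a,
            tofile=b,
            lineterm="",
            n=0,  # no context lines, only what changed
        )
        diffs[(a, b)] = list(delta)
    return diffs


if __name__ == "__main__":
    # Stand-in encoder so the sketch runs without the real site.
    def fake(s):
        return "Dear Friend ,\nSenate bill " + str(len(s))

    for pair, lines in diff_all_pairs(collect(fake, "ab", 2)).items():
        print(pair, lines)
```

Writing each spamtext to its own text file and running a command-line `diff` would work just as well; the dictionary only keeps the example in one piece.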

As for me: a lone “;” produces a spamtext ~800 characters long, which is the shortest I’ve found. It exhibits one other interesting property: when the line “…become rich inside 30 DAYS !” is changed to “…become rich inside 31 DAYS !”, the spamtext decodes to &?.That exact thing. C&P’ing it has made the cursor behave strangely; it was leading my actual text by 2 characters. After I passed a linebreak, everything returned to normal. Also, when I try to insert spaces before “That exact…”, the final untranslatable character “sticks” to the period, which “sticks” to the T, so the spaces get inserted two characters to the left of the cursor. Anything I do on that line to the right of the weirdness is affected.

For “…become rich inside XX DAYS !”, when XX is changed to any one-digit value, the result is undecodable, even when I insert a leading zero. Similarly, all the 3+ digit numbers I tried came up empty. However, all the two-digit numbers I tried had valid, if peculiar, decodings:



Value of XX    Decoded Result
32             &?V
33             &?
34             &?U
35             &?T
36             &?S
37             &?R
38             &?P
39             &?Q
40             &?
41             &?^
42             &?]
43             &?\
44             &?X
45             &?Y
46             &?[
47             &?Z
48             &?K

There’s an obvious pattern, but it keeps breaking down. Could someone test the untranslatable characters, and see what hex numbers they correspond to?
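For the characters that did display, the hex values can be listed with a few lines of Python. This only covers the final printable character of each entry in the table above; 33 and 40 are left out because their last character did not survive, and the name `observed` is just for this sketch.

```python
# Last printable character of each decoded result, copied from the table.
# 33 and 40 are omitted: their final character was untranslatable.
observed = {
    32: "V", 34: "U", 35: "T", 36: "S", 37: "R", 38: "P", 39: "Q",
    41: "^", 42: "]", 43: "\\", 44: "X", 45: "Y", 46: "[", 47: "Z",
    48: "K",
}

# ASCII code of each character, keyed by the XX value that produced it.
codes = {xx: ord(ch) for xx, ch in observed.items()}

print("XX  char  hex   binary")
for xx, ch in observed.items():
    print(f"{xx}  {ch!r:5} 0x{codes[xx]:02X}  {codes[xx]:07b}")
```

Seeing the binary column next to XX makes it easier to judge whether the pattern is a countdown, a bit flip, or something else.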

Incidentally, those spaces between the untranslatable characters only appear after I’ve pressed the enter key in the near vicinity of the characters in question. That’s why all the entries but K have them.

Taran · April 5, 2004, 7:32pm

Hmm. Turns out those extra spaces only appear in the text window. What the heck are those, anyway?

Oh, and I tried messing with the other numbers in the text, but my lack of success was total.

Musicat · April 5, 2004, 9:04pm

I found a Perl script that purports to mimic the spam mimic program. But it seems to have some differences as well. You can download the PL text file here. Scroll down to “mimic.zip.” The test site it links to is no longer there.

Taran, what is your source text? A single semicolon? Then, after encoding, you are altering the 30 DAYS to 31 DAYS?

Those chars that don’t display properly are in the high 7-bit ASCII set, I believe. I think you are forcing the translation engine to work outside of its range, and that returns values not anticipated. Then they are further mangled by vB code since they are outside the standard font set. That makes me believe that the numbers are only “modifiers” of other parameters, not pure substitution code.

Mangetout · April 5, 2004, 9:07pm

I mean the terms within the text that are expressed in numeric characters, rather than alphabetic ones.

Musicat · April 5, 2004, 9:14pm

Taran, I just fed “;” into the encoder, changed the 30 DAYS to 31 DAYS, then fed it back thru the decoder. I got 5 bytes:

Hex: 3B 3F 7F 7F 7F

The 3B and 3F display as “;?”, but 7F is the ASCII code for DEL (delete), inherited from teletype days. It is rarely used in text strings nowadays, and your experience with the funny cursor movement shows why. It is also the highest value possible in 7-bit ASCII: 0111 1111B.

As I said before, this is probably because we are feeding the decoder something outside of the expected range of input values; GIGO.

I found the Perl script interesting, but in spite of the author’s claim, I don’t think it mimics the mimic program as much as we would like.

Mangetout · April 5, 2004, 9:19pm

It appears to use the entire message; encoding something short, then making minor alterations (deletions, additions, substitutions, transpositions) to the encoded text always seems to result in something that cannot be decoded. Except alterations to the numeric terms, which result in a mangled result. I think (vast oversimplification) that each character is encoded as something like a pair of phrases, one of which has a numeric expression that unlocks the other phrase.

There’s other clever stuff going on too that is similar to compression.

Musicat · April 5, 2004, 9:42pm

Mangetout, I think you’re close. But not all alterations to the numbers result in undecodable text. A while ago I changed the Senate Bill 1621 to 1622 and the unencoded message changed from “abc” to “abd”. Yet there is a limited range of values for these numbers. I was guessing something to do with the LSb; that is, phrase #N can be decoded two ways according to the LSb of some number.

I’m not sure about the pseudo-compression idea, tho. I think that very short plaintext causes dummy words to be used at the end for padding. When the plaintext gets longer, the dummy area is used for good, then the output gets longer proportionally from then on. Just a hunch; I can’t prove it.

I was hoping the Perl script example would lead us to water, but I can’t find anything there other than simple letter-to-phrase substitution. Maybe I’m missing something; I’ll look at it again. Unfortunately, I don’t have an easy way of running the script to test it and see what the output looks like. Do you?

Taran · April 5, 2004, 9:52pm

Thank you Musicat. How’d you get the actual byte values?

Mangetout · April 5, 2004, 9:57pm

You may be right; I was basing the compression idea on very short repeating segments, but it looks like there is a minimum message length anyway.

Mangetout · April 5, 2004, 10:00pm

It could be, of course, that altering the numbers only coincidentally results in a decodable (but altered) message - if the algorithm contains lookup tables with various phrases in there, it may be that changing, say, the number of days in a phrase still results in something that can be found in the lookup table.
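To make the lookup-table guess concrete, here is a toy version in Python. Everything in it is invented for illustration: the `SLOTS` table, its phrases, and the bits-per-slot rule are assumptions, and nothing here is known to match what spammimic really does. It only shows why swapping a number for another value that happens to be in the table would still decode (to something different), while a value outside the table would fail.

```python
# Hypothetical phrase table. Each slot has a power-of-two number of
# options, so picking option i encodes log2(len(options)) bits.
SLOTS = [
    ["Dear Friend ,", "Dear Friend ;",
     "Dear Colleague ,", "Dear Colleague ;"],
    ["rich inside 30 DAYS !", "rich inside 31 DAYS !",
     "rich within 32 weeks !", "rich as few as 98 DAYS ."],
]


def bits_per_slot(options):
    """Number of bits one choice from `options` can carry."""
    return len(options).bit_length() - 1


def encode(bits):
    """Turn a string of '0'/'1' into one phrase per slot."""
    phrases, pos = [], 0
    for options in SLOTS:
        width = bits_per_slot(options)
        # Short input is padded with zero bits.
        chunk = bits[pos:pos + width].ljust(width, "0")
        phrases.append(options[int(chunk, 2)])
        pos += width
    return phrases


def decode(phrases):
    """Recover the bit string; raises ValueError on an unknown phrase."""
    bits = ""
    for options, phrase in zip(SLOTS, phrases):
        index = options.index(phrase)  # ValueError if not in the table
        bits += format(index, f"0{bits_per_slot(options)}b")
    return bits


if __name__ == "__main__":
    msg = encode("0100")
    print(msg, "->", decode(msg))
    msg[1] = "rich inside 31 DAYS !"   # edited, but still in the table
    print(msg, "->", decode(msg))
```

In this toy, changing 30 to 31 flips one decoded bit, and changing it to a value that is not in the table raises an error, which is at least the same shape as what people are seeing.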

Musicat · April 5, 2004, 10:09pm

It seems spaces are needed to separate words and punctuation (that explains the extra spaces before periods and exclamation marks). I am able to add any number of extra spaces anywhere there already are single spaces without affecting the decode. My guess is the punctuation separates phrases for parsing and each phrase so delimited equals N bits of plaintext.

To sum up: Taking out spaces to run words together returns a (cannot decode) error, but adding spaces has no effect.
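That behaviour is what you would expect if the decoder splits its input on runs of whitespace before looking anything up. That is only a guess about the real decoder; the Python below just shows the splitting rule itself, with `tokens` as a made-up helper name.

```python
def tokens(text):
    """Split on any run of whitespace, as str.split() does by default."""
    return text.split()


original = "Why work for somebody else when you can become rich inside 30 DAYS !"
padded = "Why   work for  somebody else when you can become rich inside   30 DAYS   !"
joined = "Why work for somebody else when you can become rich inside 30 DAYS!"

# Extra spaces vanish during splitting, so the token list is unchanged.
print(tokens(original) == tokens(padded))

# Removing a space fuses two tokens ("DAYS" and "!") into one new token.
print(tokens(original) == tokens(joined))
```

Under this assumption a fused token like `DAYS!` would simply not be found, which would explain the (cannot decode) error.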

Taran, I copied the site’s decoded text, pasted it into Notepad, saved the 5-byte file, and examined it with DOS Debug. Hey, it’s old, but it works! Any hex editor would work, too – you can get these at freeware sites.

Importing to Word is not advised; Word adds & alters chars.
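If you have Python handy, a few lines will do the same job as Debug or a hex editor. This is only a small sketch; `hex_dump` and `hex_dump_file` are names made up here, and the sample bytes are the five reported earlier in the thread.

```python
from pathlib import Path


def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated two-digit uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


def hex_dump_file(path) -> str:
    """Read a file as raw bytes (no text decoding) and dump it."""
    return hex_dump(Path(path).read_bytes())


if __name__ == "__main__":
    sample = b";?\x7f\x7f\x7f"
    print(hex_dump(sample))
    # 0x7F has all seven low bits set, the top of the 7-bit range.
    print(0x7F == 0b1111111)
```

Reading the file as bytes matters: opening it as text can silently translate or drop control characters, much like the Word problem above.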

vB probably interprets 7FH as non-displayable, and wisely converts those to the generic box you see in your code block.

Or something found OUTSIDE the lookup table, like a buffer overflow. Programmers rarely try to exclude what they never expect to get in the first place!

Or the number of days is not a part of the static phrase, but a generated value inserted (like a macro) inside a stock phrase to either add variety or provide additional crypto info.

TJdude825 · April 6, 2004, 1:34am

This thread is a bit over my head. But has anyone tried looking at the actual program code itself? Is there something that’s the opposite of a compiler that expresses the actual coding program in some human-readable language, like Java or Perl or whatever? I hope that made sense.

I doubt this would work, or someone would have done it, but I just thought I’d ask.

Taran · April 6, 2004, 2:35am

Two things:

1. Decompiling code is impossible in general, and really hard even in special cases.
2. The way the page is designed, the thing we type gets sent to the server, it does its magic, and a result is sent back. So there’s never a time when we have a chance to look at the script that’s doing the work, even in compiled form.

xt23 · February 1, 2015, 12:03pm

I know it’s an old post…but any other thoughts on how to figure out the algorithm behind Spam mimic?
I am thinking of writing a small program that checks the differences among encoded messages, to figure out the pattern. But really, it seems impossible…

Saint_Cad · February 1, 2015, 3:21pm

Anyone try numbers yet?
123 becomes

Dear Colleague , Thank-you for your interest in our
newsletter ! If you no longer wish to receive our publications
simply reply with a Subject: of “REMOVE” and you will
immediately be removed from our club . This mail is
being sent in compliance with Senate bill 2516 , Title
1 ; Section 307 . This is a ligitimate business proposal
! Why work for somebody else when you can become rich
as few as 98 DAYS . Have you ever noticed most everyone
has a cellphone and more people than ever are surfing
the web . Well, now is your chance to capitalize on
this ! WE will help YOU decrease perceived waiting
time by 200% and turn your business into an E-BUSINESS
! The best thing about our system is that it is absolutely
risk free for you ! But don’t believe us . Mr Ames
of Massachusetts tried us and says “My only problem
now is where to park all my cars” ! We are licensed
to operate in all states ! We beseech you - act now
. Sign up a friend and your friend will be rich too
! Thank-you for your serious consideration of our offer
!

and 321 becomes

Dear Colleague , This letter was specially selected
to be sent to you ! We will comply with all removal
requests . This mail is being sent in compliance with
Senate bill 1624 ; Title 4 ; Section 304 . THIS IS
NOT MULTI-LEVEL MARKETING . Why work for somebody else
when you can become rich inside 30 DAYS ! Have you
ever noticed people love convenience & most everyone
has a cellphone ! Well, now is your chance to capitalize
on this . We will help you use credit cards on your
website and decrease perceived waiting time by 200%
. You can begin at absolutely no cost to you ! But
don’t believe us ! Prof Ames of Florida tried us and
says “Now I’m rich many more things are possible” !
This offer is 100% legal ! We BESEECH you - act now
! Sign up a friend and you’ll get a discount of 50%
. Thanks .

Looking at the 123 code, I changed 98 days to 94 days and got a decode of 123(hex0F)(hex7f)m. 95 days gave me the same decode (so we know the algorithm is not 1-1), 96 days changed the m to a }, and 97 days changed the } into a | (pipe).
And notice now I have “Colleague”

Now here’s what is interesting: I took the 321 code, swapped its “Prof Ames” line for the Ames line from the 123 code, and got
321(hex0F)(hex7f)(hex7f)(hex7f)(hex7f)(hex7f)w]S
Hmmmm, seems all of the content is above that.

Starting with “Well now is the chance” and substituting 123 into the 321 I got
321MB3ni

Let’s try Saint

Dear Friend ; We know you are interested in receiving
cutting-edge news . If you no longer wish to receive
our publications simply reply with a Subject: of “REMOVE”
and you will immediately be removed from our club !
This mail is being sent in compliance with Senate bill
2716 , Title 5 ; Section 305 ! This is NOT unsolicited
bulk mail . Why work for somebody else when you can
become rich within 32 weeks ! Have you ever noticed
how many people you know are on the Internet and nobody
is getting any younger ! Well, now is your chance to
capitalize on this . WE will help YOU process your
orders within seconds and turn your business into an
E-BUSINESS ! The best thing about our system is that
it is absolutely risk free for you ! But don’t believe
us . Mr Ames of Massachusetts tried us and says “My
only problem now is where to park all my cars” ! We
are licensed to operate in all states ! We beseech
you - act now . Sign up a friend and your friend will
be rich too ! Thank-you for your serious consideration
of our offer !

And now Cad

Dear Friend , We know you are interested in receiving
cutting-edge news . If you no longer wish to receive
our publications simply reply with a Subject: of “REMOVE”
and you will immediately be removed from our club !
This mail is being sent in compliance with Senate bill
1626 ; Title 9 ; Section 306 ! This is a ligitimate
business proposal . Why work for somebody else when
you can become rich as few as 98 weeks ! Have you ever
noticed how many people you know are on the Internet
and nobody is getting any younger ! Well, now is your
chance to capitalize on this . WE will help YOU process
your orders within seconds and turn your business into
an E-BUSINESS ! The best thing about our system is
that it is absolutely risk free for you ! But don’t
believe us . Mr Ames of Massachusetts tried us and
says “My only problem now is where to park all my cars”
! We are licensed to operate in all states ! We beseech
you - act now . Sign up a friend and your friend will
be rich too ! Thank-you for your serious consideration
of our offer !

Wow! The Ames paragraph is exactly the same! And they both have that “Have you ever noticed…”

OK, switching “This is a ligitimate [sic] business proposal .” for “This is NOT unsolicited bulk mail .” we get Sain|, with the | 8 positions after t in the ASCII table. Switching the other way, we decode it as Cad{, which has added a terminal character from the upper end of the ASCII set; interestingly, the { is just before the | in ASCII.
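Those two ASCII observations are easy to check, and looking at the bits adds one detail. The Python below is only arithmetic on the character codes; `compare` is a made-up helper, and nothing here says why the decoder produced these characters.

```python
def compare(a, b):
    """Return (numeric distance, XOR, number of differing bits)."""
    xor = ord(a) ^ ord(b)
    return ord(b) - ord(a), xor, bin(xor).count("1")


for a, b in [("t", "|"), ("{", "|")]:
    dist, xor, nbits = compare(a, b)
    print(f"{a!r}=0x{ord(a):02X} {b!r}=0x{ord(b):02X} "
          f"distance={dist} xor=0x{xor:02X} bits_changed={nbits}")
```

The t to | change is a distance of 8 and also a single flipped bit (0x74 versus 0x7C), which would fit the idea that swapping one phrase toggles one bit of the message. The { and | pair is adjacent numerically but differs in three bits.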

Musicat · February 1, 2015, 4:29pm

So, Saint Cad, what is your conclusion? We’ve waited 10 years to find out!

xt23 · February 1, 2015, 10:29pm

I tried 312 and got

Dear Colleague , This letter was specially selected
to be sent to you ! We will comply with all removal
requests . This mail is being sent in compliance with
Senate bill 1621 ; Title 4 , Section 308 ! THIS IS
NOT MULTI-LEVEL MARKETING . Why work for somebody else
when you can become rich inside 30 DAYS ! Have you
ever noticed people love convenience & most everyone
has a cellphone ! Well, now is your chance to capitalize
on this . We will help you use credit cards on your
website and decrease perceived waiting time by 200%
. You can begin at absolutely no cost to you ! But
don’t believe us ! Prof Ames of Florida tried us and
says “Now I’m rich many more things are possible” !
This offer is 100% legal ! We BESEECH you - act now
! Sign up a friend and you’ll get a discount of 50%
. Thanks .

Comparing with 321, it seems that “This letter was specially selected to be sent to you !” has something to do with 3.
Substituting the second sentence “Especially for you - this red-hot news .” in the text generated by 213 with “This letter was specially selected to be sent to you !”, I didn’t get 313 but “3<$V>FB3ni”. 3 appears but 1 is gone…

and 313 actually gets:

Dear Colleague , This letter was specially selected
to be sent to you ! We will comply with all removal
requests . This mail is being sent in compliance with
Senate bill 1621 ; Title 4 , Section 304 . THIS IS
NOT MULTI-LEVEL MARKETING . Why work for somebody else
when you can become rich inside 30 DAYS ! Have you
ever noticed people love convenience & most everyone
has a cellphone ! Well, now is your chance to capitalize
on this . We will help you use credit cards on your
website and decrease perceived waiting time by 200%
. You can begin at absolutely no cost to you ! But
don’t believe us ! Prof Ames of Florida tried us and
says “Now I’m rich many more things are possible” !
This offer is 100% legal ! We BESEECH you - act now
! Sign up a friend and you’ll get a discount of 50%
. Thanks .
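Going by the three spamtexts as pasted in this thread, 321, 312 and 313 are identical except for the “Senate bill” line, so a token-by-token comparison of just that line shows everything that changed. A small Python sketch (the `samples` dictionary and `token_diff` helper are names invented here):

```python
# The only line that differs between the three pasted spamtexts.
samples = {
    "321": "Senate bill 1624 ; Title 4 ; Section 304 . THIS IS",
    "312": "Senate bill 1621 ; Title 4 , Section 308 ! THIS IS",
    "313": "Senate bill 1621 ; Title 4 , Section 304 . THIS IS",
}


def token_diff(a, b):
    """List (position, token_a, token_b) wherever two lines disagree."""
    ta, tb = a.split(), b.split()
    # This simple comparison assumes both lines have the same token count.
    assert len(ta) == len(tb)
    return [(i, x, y) for i, (x, y) in enumerate(zip(ta, tb)) if x != y]


for p, q in [("321", "312"), ("312", "313"), ("321", "313")]:
    print(p, "vs", q, token_diff(samples[p], samples[q]))
```

One thing this makes visible: between 312 and 313 only the Section number and the punctuation after it change, and between 321 and 313 only the bill number and one separator change. So the punctuation marks seem to carry information too, not just the numbers.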

So yes, as said above, it’s not simple substitution… I suspect that the numbers in Senate bill/Title/Section indicate where to find the corresponding sentences in a “spam sentences pool”?

Why did sentences like “THIS IS NOT MULTI-LEVEL MARKETING” appear in the 312 and 313 text but not in 213…

But going by what Saint Cad said, the numbers in Senate bill/Title/Section might indicate ASCII codes?

Musicat, this post reminds me of a time capsule

Musicat · February 2, 2015, 6:46pm

There are so many possibilities. Some phrases might be randomly inserted and have no decoding value. Numbers might mean a lot of things, including nothing.

My gut feeling is there is some simple rule (simple if you know how), perhaps recursive, not a complicated truth table. Just like amateur magicians looking for chemicals, pulleys, smoke and mirrors instead of simple misdirection.

If we let this thread run for a few more decades, maybe the code designer will show up and let us in on the secret? Or Bletchley Park 2.0 will tackle it?
