Good idea, as long as the spam-mimic site stays in operation, or we can figure out the algorithm as a backup!
I wonder how spam-heuristic programs treat this kind of email? It seems like it would trigger most any of them to dump it unceremoniously. Personally, I go “bareback” myself, not using a filter, so I can’t try it out. I would hesitate to send this kind of encoded message to friends without warning them first.
sibyl had the right idea, but to test all short strings would surely drive us mad. I’m trying to figure out how to write a script that automatically asks spammimic for all combinations and stores the result in some text files. Then we could ‘diff’ all pairs, and at least have plenty of raw data to go on.
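The “diff all pairs” step needs nothing from the site once the outputs are saved. Here is a minimal sketch in Python, assuming the encoded texts have already been collected (by hand or otherwise) into a dict keyed by plaintext. The function name `diff_all_pairs` and the sample strings are made up for illustration; they are not real spammimic output.

```python
import difflib
from itertools import combinations

def diff_all_pairs(encodings):
    """Return {(a, b): [changed lines]} for every pair of stored encodings.

    `encodings` maps each plaintext to the spam text saved for it.
    """
    results = {}
    for a, b in combinations(sorted(encodings), 2):
        lines_a = encodings[a].splitlines()
        lines_b = encodings[b].splitlines()
        # Keep only lines that differ: "- " is only in a, "+ " is only in b.
        changed = [
            line for line in difflib.ndiff(lines_a, lines_b)
            if line.startswith(("- ", "+ "))
        ]
        results[(a, b)] = changed
    return results

# Placeholder data for illustration only; not real spammimic output.
samples = {
    "a": "Dear Friend ,\nThis is line one .\nShared closing line .",
    "b": "Dear Friend ,\nThis is line two .\nShared closing line .",
}

for pair, changed in diff_all_pairs(samples).items():
    print(pair)
    for line in changed:
        print("   ", line)
```

With real data, the interesting pairs would be plaintexts that differ by a single character, since any line that changes between them is a candidate for carrying that character.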
As for me: ; produces a spamtext ~800 characters long, which is the shortest I’ve found. It exhibits one other interesting property: when the line “…become rich inside 30 DAYS !” is changed to “…become rich inside 31 DAYS !”, the spamtext decodes to &?.That exact thing. C&P’ing it has made the cursor behave strangely; it was leading my actual text by 2 characters. After I passed a linebreak, everything returned to normal. Also, when I try to insert spaces before “That exact…”, the final untranslatable character “sticks” to the period, which “sticks” to the T, so the spaces get inserted two characters to the left of the cursor. Anything I do on that line to the right of the weirdness is affected.
For “…become rich inside XX DAYS !”, when XX is changed to any one-digit value, the result is undecodable, even when I insert a leading zero. Similarly, all the 3+ digit numbers I tried came up empty. However, all the two-digit numbers I tried had valid, if peculiar, decodings:
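| Value of XX | Decoded result |
|---|---|
| 32 | &?V |
| 33 | &? |
| 34 | &?U |
| 35 | &?T |
| 36 | &?S |
| 37 | &?R |
| 38 | &?P |
| 39 | &?Q |
| 40 | &? |
| 41 | &?^ |
| 42 | &?] |
| 43 | &?\ |
| 44 | &?X |
| 45 | &?Y |
| 46 | &?[ |
| 47 | &?Z |
| 48 | &?K |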
There’s an obvious pattern, but it keeps breaking down. Could someone test the untranslatable characters, and see what hex numbers they correspond to?
Incidentally, those spaces between the untranslatable characters only appear after I’ve pressed the enter key in the near vicinity of the characters in question. That’s why all the entries but K have them.
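To hunt for the pattern it may help to see the visible characters as numbers. The sketch below only converts what is printed in the table to hex and binary. It assumes the last visible character of each row is the one that varies, and it leaves out rows 33 and 40 because nothing displayable appears there. The names `observed` and `code_point_rows` are just for illustration.

```python
# Visible final characters from the table above, keyed by the value of XX.
# Rows 33 and 40 are omitted because their final character did not display.
observed = {
    32: "V", 34: "U", 35: "T", 36: "S", 37: "R", 38: "P", 39: "Q",
    41: "^", 42: "]", 43: "\\", 44: "X", 45: "Y", 46: "[", 47: "Z", 48: "K",
}

def code_point_rows(table):
    """Return (xx, char, hex code, binary code) for each observed row."""
    return [
        (xx, ch, f"{ord(ch):02X}", f"{ord(ch):08b}")
        for xx, ch in sorted(table.items())
    ]

for xx, ch, hex_code, bits in code_point_rows(observed):
    print(f"{xx:>3}  {ch}  0x{hex_code}  {bits}")
```

Lining up the binary column against XX makes it easier to spot which bits move together, without committing to any theory about what the encoder does.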
I found a Perl script that purports to mimic the spam mimic program. But it seems to have some differences as well. You can download the PL text file here. Scroll down to “mimic.zip.” The test site it links to is no longer there.
Taran, what is your source text? A single semicolon? Then, after encoding, you are altering the 30 DAYS to 31 DAYS?
Those chars that don’t display properly are in the high 7-bit ASCII set, I believe. I think you are forcing the translation engine to work outside of its range, and that returns values not anticipated. Then they are further mangled by vB code since they are outside the standard font set. That makes me believe that the numbers are only “modifiers” of other parameters, not pure substitution code.
Taran, I just fed “;” into the encoder, changed the 30 DAYS to 31 DAYS, then fed it back thru the decoder. I got 5 bytes:
Hex: 3B 3F 7F 7F 7F
The 3B and 3F display as “;?”, but 7F is the ASCII code for DEL (delete), a control character rather than a printable one, inherited from teletype days. It is rarely used in text strings nowadays, and your experience with the funny cursor movement shows why. It is also the highest value possible in 7-bit ASCII: 0111 1111B.
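For anyone without a hex editor handy, the same inspection can be done in a few lines of Python. This is a small sketch that labels each of the five bytes quoted above. The `CONTROL_NAMES` table is a hand-picked subset of ASCII control codes, where 7F is named DEL, and `describe_bytes` is a made-up helper name.

```python
# A few ASCII control codes and their standard names (subset only).
CONTROL_NAMES = {
    0x08: "BS (backspace)",
    0x0F: "SI (shift in)",
    0x7F: "DEL (delete)",
}

def describe_bytes(data):
    """Return (hex, binary, label) for each byte in `data`."""
    rows = []
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            label = chr(byte)  # printable 7-bit ASCII
        else:
            label = CONTROL_NAMES.get(byte, "non-printable")
        rows.append((f"{byte:02X}", f"{byte:08b}", label))
    return rows

decoded = bytes.fromhex("3B 3F 7F 7F 7F")
for row in describe_bytes(decoded):
    print(*row)
```

To check a saved file instead, read it in binary mode, for example `describe_bytes(open(path, "rb").read())`, where `path` points at wherever you saved the decoded text.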
As I said before, this is probably because we are feeding the decoder something outside of the expected range of input values; GIGO.
I found the Perl script interesting, but in spite of the author’s claim, I don’t think it mimics the mimic program as much as we would like.
It appears to use the entire message; encoding something short, then making minor alterations (deletions, additions, substitutions, transpositions) to the encoded text always seems to result in something that cannot be decoded. Except alterations to the numeric terms, which result in a mangled result. I think (vast oversimplification) that each character is encoded as something like a pair of phrases, one of which has a numeric expression that unlocks the other phrase.
There’s other clever stuff going on too that is similar to compression.
Mangetout, I think you’re close. But not all alterations to the numbers result in undecodable text. A while ago I changed the Senate Bill 1621 to 1622 and the decoded message changed from “abc” to “abd”. Yet there is a limited range of values for these numbers. I was guessing something to do with the LSB (least significant bit); that is, phrase #N can be decoded two ways according to the LSB of some number.
I’m not sure about the pseudo-compression idea, tho. I think that very short plaintext causes dummy words to be used at the end for padding. When the plaintext gets longer, the dummy area is used for good, then the output gets longer proportionally from then on. Just a hunch; I can’t prove it.
I was hoping the Perl script example would lead us to water, but I can’t find anything there other than simple letter-to-phrase substitution. Maybe I’m missing something; I’ll look at it again. Unfortunately, I don’t have an easy way of running the script to test it and see what the output looks like. Do you?
It could be, of course, that altering the numbers only coincidentally results in a decodable (but altered) message - if the algorithm contains lookup tables with various phrases in there, it may be that changing, say, the number of days in a phrase still results in something that can be found in the lookup table.
It seems spaces are needed to separate words and punctuation (that explains the extra spaces before periods and exclamation marks). I am able to add any number of extra spaces anywhere there already are single spaces without affecting the decode. My guess is the punctuation separates phrases for parsing and each phrase so delimited equals N bits of plaintext.
To sum up: Taking out spaces to run words together returns a (cannot decode) error, but adding spaces has no effect.
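One way to picture that behaviour, purely as a guess about how the decoder might read its input (nothing here is taken from spammimic itself): if the text is split on runs of whitespace, extra spaces vanish, but a removed space fuses two tokens into one that would not be in any lookup table. The `tokenize` function is hypothetical.

```python
def tokenize(text):
    """Split on any run of whitespace, as a hypothetical decoder might."""
    return text.split()

original = "become rich inside 30 DAYS !"
extra_spaces = "become   rich inside  30 DAYS   !"
missing_space = "become rich inside 30 DAYS!"

# Extra spaces leave the token list unchanged.
print(tokenize(original) == tokenize(extra_spaces))   # True
# Removing a space creates the unknown token "DAYS!".
print(tokenize(original) == tokenize(missing_space))  # False
```

This would also be consistent with the odd spacing before periods and exclamation marks in the generated text, though it proves nothing about the real implementation.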
Taran, I copied from the site’s encoded text, pasted into Notepad, saved the 5-byte file, and examined it with DOS Debug. Hey, it’s old, but it works! Any hex editor would work, too – you can get these at freeware sites.
Importing to Word is not advised; Word adds & alters chars.
vB probably interprets 7FH as non-displayable, and wisely converts those to the generic box you see in your code block.
Or something found OUTSIDE the lookup table, like a buffer overflow. Programmers rarely try to exclude what they never expect to get in the first place!
Or the number of days is not a part of the static phrase, but a generated value inserted (like a macro) inside a stock phrase to either add variety or provide additional crypto info.
This thread is a bit over my head. But has anyone tried looking at the actual program code itself? Is there something that’s the opposite of a compiler, that expresses the actual coding program in some human-readable language, like Java or Perl or whatever? I hope that made sense.
I doubt this would work, or someone would have done it, but I just thought I’d ask.
Decompiling code is impossible in general, and really hard even in special cases.
The way the page is designed, the thing we type gets sent to the server, it does its magic, and a result is sent back. So there’s never a time when we have a chance to look at the script that’s doing the work, even in compiled form.
I know it’s an old post…but any other thoughts on how to figure out the algorithm behind Spam mimic?
I am thinking of writing a small program that checks the differences among encoded messages, to figure out the pattern. But really, it seems impossible…
Looking at the 123 code, I changed 98 days to 94 days and got a decode of 123(hex0F)(hex7f)m. 95 days gave me the same decode (so we know the algorithm is not 1-1), 96 days changed the m to a }, and 97 days changed the } into a | (pipe).
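Those three characters are easier to compare bit by bit. The sketch below XORs the ASCII codes of the decoded characters for consecutive day values, so each set bit in the result marks a bit that changed. The data is just the four observations above, and `bit_changes` is a made-up helper name.

```python
# Final decoded character for each "days" value reported above.
observed = {94: "m", 95: "m", 96: "}", 97: "|"}

def bit_changes(table):
    """For consecutive keys, return (k1, k2, XOR of the two character codes)."""
    keys = sorted(table)
    return [
        (k1, k2, ord(table[k1]) ^ ord(table[k2]))
        for k1, k2 in zip(keys, keys[1:])
    ]

for k1, k2, diff in bit_changes(observed):
    print(f"{k1} -> {k2}: changed bits {diff:08b}")
```

A result of zero means the decode did not change, and a result with a single set bit means exactly one bit of the character flipped. Four data points are far too few to say what rule produces this.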
And notice now I have “Colleague”
Now here’s what is interesting: I took the 321 code and substituted the “Prof Ames” line from the 123 code and got
321(hex0F)(hex7f)(hex7f)(hex7f)(hex7f)(hex7f)w]S
Hmmmm, seems all of the content is above that.
Starting with “Well now is the chance” and substituting 123 into the 321 I got
321MB3ni
Let’s try Saint
And now Cad
WoW! The Ames paragraph is exactly the same! And they both have that “Have you ever noticed…”
OK, switching “This is a ligitimate [sic] business proposal .” for “This is NOT unsolicited bulk mail .” we get Sain| with the | 8 positions ahead of t in the ASCII code. And switching the other way we decode it as Cad{ which has added a terminal character from the extended ASCII set, but interestingly the { is just before the | in ASCII.
Comparing with 321, it seems that “This letter was specially selected to be sent to you !” has something to do with 3.
Substituting the second sentence “Especially for you - this red-hot news .” in text generated by 213 with “This letter was specially selected to be sent to you !”, I didn’t get 313 but “3<$V>FB3ni”. 3 appears but 1 is gone…
and 313 actually gets:
So yes, as said above, it’s not simple substitution… I suspect that the numbers in Senate bill / Title / Sections indicate where to find the corresponding sentences in a “spam sentences pool”?
Why sentences like “THIS IS NOT MULTI-LEVEL MARKETING” appeared in 312 and 313 text but not in 213…
But as **Saint Cad** said
, so the numbers in Senate bill /Title/ Sections might indicate ASCII code?
There are so many possibilities. Some phrases might be randomly inserted and have no decoding value. Numbers might mean a lot of things, including nothing.
My gut feeling is there is some simple rule (simple if you know how), perhaps recursive, not a complicated truth table. Just like amateur magicians looking for chemicals, pulleys, smoke and mirrors instead of simple misdirection.
If we let this thread run for a few more decades, maybe the code designer will show up and let us in on the secret? Or Bletchley Park 2.0 will tackle it?