Spammimic: Fun, but how is the message encoded?

Ever heard of Spam Mimic?

Researching Steganography recently, I ran across this interesting site, spammimic.com. Anyone can type a short message and have it encoded into spam-like text. For example, “Now is the time” encodes to (I have left the CRLF’s in, since they might be significant)

This is a 100% lossless encoding that can be reversed, that is, decoded. Cute, eh?

For kicks, copy & paste this exact message to their “decode” site. It should decode to my original.

Now why would anyone want to deliberately make their message into spam? Privacy and secrecy come to mind. I am unsure as to how serious this site is (only short messages can be handled) but my factual question is: What is the algorithm used to encode & decode?

A Google search turned up this:

But that doesn’t seem to be the case, unless a translation table is used. Put “zzzz” in as the original message, and you won’t see “z’s” in the 2nd vertical column of the output.

So far, my guesses seem to be wrong. The encoded message has all hi bits set to zero (so it’s 7-bit ASCII), there are no control or invisible characters embedded (nothing below 20H); it is 100% pure, plain text and there is no appended checksum. Experiments with the numbers (“Senate Bill 1622…”) suggest that they might be significant, but the entire message is not encoded in them alone. I think there is a lot of dummy stuff included. Changing single chars at random suggests that some are significant, others, not so much.
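For anyone who wants to repeat those checks, here is a rough Python sketch of them. The function name describe_text and the sample string are made up for illustration. CR and LF are exempted from the control-character check, since the line breaks are known to be present.

```python
def describe_text(text: str) -> dict:
    """Summarise byte-level properties of a candidate cover text."""
    data = text.encode("utf-8")
    return {
        # True when every byte has its high bit clear (7-bit ASCII).
        "seven_bit": all(b < 0x80 for b in data),
        # Bytes below 20H other than CR (0DH) and LF (0AH).
        "control_chars": sorted({b for b in data if b < 0x20 and b not in (0x0D, 0x0A)}),
        "length": len(data),
    }


# Hypothetical sample, not real output from the site.
sample = "Dear Professional ; This letter was sent\r\nto you !"
print(describe_text(sample))
```

A clean result only rules out the obvious hiding places (high bits, invisible characters); it says nothing about where the message actually lives.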

So what is the principle behind this scheme?

Please use this link for spammimic.

I wish the hamsters would stop nibbling on the edges of my posts. The first line of the OP was supposed to read, “While researching Steganography, I ran across this interesting site, spammimic.com…”

What happens if you try to decode text that is not code? Can it tell the difference, or does it output garbage?

-FrL-

Thanks, fishcheer15. You read and type faster than I do. :slight_smile:

It has an error message: “(can’t decode)” – some changes in a sample encoded text cause this, others cause a readable and recognizable decoded message. That’s why I think some chars are significant and others are just padding.

Looks like it can tell the difference between coded and non-coded text.

Interestingly, it seems like every single little letter is important. I changed the word “of” in the “mr so-and-so of such-and-such says” to random two letter groups, thinking if anything in the coded text is gratuitous it’s the “of” in those expressions. But nope–changing it makes it “undecodable.”

-FrL-

Thanks for the correction, Merricat. What text can be changed without messing up the decodability?

I guess the changes I made probably screwed with some parsing routine that looks for an “of” sequence after the names.

-FrL-

I don’t think all chars in all positions are significant. I’ve tried changing some of the numbers one digit at a time: some result in error, some change the decoded text (from “a” to “b”, for example), and some result in no change, no error.

Starting with “abc” as the plaintext, I removed only the first period from the encoded version. It would not decode. I changed one “than” to “then,” and it would not decode. I changed a single “!” near the end to a period and I got a decoded message of “abc…” (that is the exact decode, including the 9 dots!). So changing a single char resulted in a different length in the decode!

One theory was that the numbers encoded the entire message – that is, remove all the text as dummies and just use the remaining digits, perhaps in groups of 2. I don’t think so.

Frylock, try just changing a single char at one time so you don’t have too many factors to consider at once. And remember that position might be as important as char value. Also I have a hunch there is something going on with odd/even ASCII values – that is, the LSb may be ignored, or be a special factor.
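The one-change-at-a-time experiment could be organized along these lines. This is only a sketch: single_char_variants and classify_positions are names I made up, and toy_decode is a stand-in with invented rules, since the real decoder is only available through the site. In practice each variant would have to be pasted into the decode page by hand.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def single_char_variants(text: str, replacement: str = "x") -> Iterator[Tuple[int, str]]:
    """Yield (position, variant) pairs that differ from text in exactly one char."""
    for i, ch in enumerate(text):
        if ch != replacement:
            yield i, text[:i] + replacement + text[i + 1:]


def classify_positions(encoded: str, decode: Callable[[str], str],
                       replacement: str = "x") -> Dict[str, List[int]]:
    """Sort positions by what a one-char change does to the decoded result."""
    baseline = decode(encoded)
    report: Dict[str, List[int]] = {"error": [], "changed": [], "unchanged": []}
    for pos, variant in single_char_variants(encoded, replacement):
        try:
            result = decode(variant)
        except ValueError:          # the stand-in for "(can't decode)"
            report["error"].append(pos)
            continue
        report["unchanged" if result == baseline else "changed"].append(pos)
    return report


def toy_decode(text: str) -> str:
    """Invented stand-in: needs a leading '!' and returns only the digits."""
    if not text.startswith("!"):
        raise ValueError("can't decode")
    return "".join(c for c in text if c.isdigit())


print(classify_positions("!a1b", toy_decode))
```

The three buckets match the three outcomes seen so far: an error, a different decode, or no change at all.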

Underneath all this, the algorithm must have been designed so that the spam text is easy to generate and modify – pick one from Col A, one from Col B, and concatenate – that’s why I think much of the text is dummy, or else the text fragments stand for something by themselves.

I rarely ask for such favors, but I would be eternally grateful if a Mod would fix the first screwed-up link in my OP so Dopers won’t be put off by a bad link when they first read this thread. And throw the hamsters an extra ration on me and maybe they’ll leave my posts alone.

I know nothing about this scheme in particular, but I know how it could be done. Do you happen to know anything about formal grammars like those used to describe the syntax of programming languages? You can create such a grammar for spam messages. Note that we needn’t be able to create every possible spam message. Then you choose rules from your grammar according to the bits in your message. If your grammar is unambiguous (and you can easily ensure that if you create the grammar yourself) this process is reversible.

This is a very simple example:

<spam> ::= <greeting><body>
<greeting> ::= “Hi!” (0)
<greeting> ::= “Hello!” (1)
<body> ::= “You have won!” (0)
<body> ::= “Free viagra for everyone!” (1)

If your message is 01, you first choose a 0-rule, then a 1-rule.
This results in “Hi! Free viagra for everyone!”. Note how it can still be reverted.
If you encounter the spam: “Hello! You have won!” you know that the message was 10.
Of course your grammar can be a lot more complicated than that, allowing different numbers of sentences and structures that are harder to recognize.
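A minimal Python sketch of this toy grammar, for anyone who wants to play with it. The names SLOTS, encode, and decode are made up for illustration, and nothing here is claimed to be how spammimic itself is implemented. Each slot offers two phrases, so each slot carries one bit.

```python
# One list per nonterminal; the index of the chosen phrase is the bit it encodes.
SLOTS = [
    ["Hi!", "Hello!"],                               # <greeting>
    ["You have won!", "Free viagra for everyone!"],  # <body>
]


def encode(bits: str) -> str:
    """Turn a bit string such as '01' into spam by picking one phrase per slot."""
    if len(bits) != len(SLOTS) or set(bits) - {"0", "1"}:
        raise ValueError("need exactly one 0/1 bit per slot")
    return " ".join(slot[int(bit)] for slot, bit in zip(SLOTS, bits))


def decode(spam: str) -> str:
    """Recover the bit string by matching each slot's phrase in order."""
    bits = []
    rest = spam
    for slot in SLOTS:
        for index, phrase in enumerate(slot):
            if rest.startswith(phrase):
                bits.append(str(index))
                rest = rest[len(phrase):].lstrip()
                break
        else:
            raise ValueError("can't decode")
    if rest:
        raise ValueError("can't decode")
    return "".join(bits)


print(encode("01"))                    # Hi! Free viagra for everyone!
print(decode("Hello! You have won!"))  # 10
```

Decoding works because no phrase in a slot is a prefix of another phrase in the same slot, which is the unambiguity condition in miniature. Any text that does not fit the grammar is rejected rather than decoded to garbage.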

Kellner, if I understand you right, a particular phrase translates (decodes) to a single digit (or bit). When concatenated with other phrases/bits, this can be interpreted as ASCII text. 100 0001B (7-bit) would then be ASCII “A”.
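To make the bit arithmetic concrete, here is a small Python sketch of packing text into 7-bit groups and back. The function names text_to_bits and bits_to_text are my own, and the 7-bit width is just the assumption stated above.

```python
def text_to_bits(text: str, width: int = 7) -> str:
    """Render each character as a fixed-width binary group."""
    groups = []
    for ch in text:
        code = ord(ch)
        if code >= 1 << width:
            raise ValueError(f"{ch!r} does not fit in {width} bits")
        groups.append(format(code, f"0{width}b"))
    return " ".join(groups)


def bits_to_text(bits: str, width: int = 7) -> str:
    """Inverse of text_to_bits; spaces between groups are ignored."""
    stream = bits.replace(" ", "")
    if len(stream) % width:
        raise ValueError("bit count is not a multiple of the width")
    return "".join(chr(int(stream[i:i + width], 2))
                   for i in range(0, len(stream), width))


print(text_to_bits("A"))         # 1000001
print(bits_to_text("100 0001"))  # A
```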

A reasonable proposal; it would mean that a single char would generate 7 (assuming 7-bit) phrases. However, in my experiments, adding a single char to a test message (adding “z” to “abc”, making “abcz”) resulted in only 4 chars added to the encoded text, where 7 * (phrase length) would be expected.

And if you pass several source texts thru this site, you will notice considerable repetition in phraseology. (I don’t know what that means; just thought I’d mention it.)

I think something else is going on here. Perhaps we are all blind to the obvious.

(Yes, I am familiar with formal grammars used to describe programming language syntax, having written several compilers/assemblers. :slight_smile: )

It appears to be tied to certain phrases. I put each lowercase letter in, alone, from a-z, and got the following data for the very beginning of the generated email. The email goes on for several more sentences after these though. There does seem to be a correlation between length of input and length of output as well - if you put in a sentence you get a significantly longer output email.

a Dear Professional , Thank-you…
b Dear Professional , Especially…
c Dear Professional , This…
d Dear Professional ; You…
e Dear Professional ; Thank-you…
f Dear Professional ; Especially…
g Dear Professional ; This…
h Dear Decision maker , You…
i Dear Decision maker , Thank-you…
j Dear Decision maker , Especially…
k Dear Decision maker , This…
l Dear Decision maker ; You…
m Dear Decision maker ; Thank-you…
n Dear Decision maker ; Especially…
o Dear Decision maker ; This…
p Dear Business person , You…
q Dear Business person , Thank-you…
r Dear Business person , Especially…
s Dear Business person , This…
t Dear Business person ; You…
u Dear Business person ; Thank-you…
v Dear Business person ; Especially…
w Dear Business person ; This…
x Dear E-Commerce professional , You…
y Dear E-Commerce professional , Thank-you…
z Dear E-Commerce professional , Especially…

The first part cycles through sequentially, with the second part being either a comma or a semicolon and one of four phrases.
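One reading that reproduces all 26 rows: take the low five bits of the letter's ASCII code (a = 1 through z = 26), then use the top two bits to pick the greeting, the next bit to pick the punctuation, and the last two bits to pick the opening word. The Python sketch below is only a fit to the table above. The names are mine, and whether the site actually works this way is a guess.

```python
GREETINGS = ["Professional", "Decision maker", "Business person",
             "E-Commerce professional"]
PUNCTUATION = [",", ";"]
OPENERS = ["You", "Thank-you", "Especially", "This"]


def opening_for(letter: str) -> str:
    """Predict the start of the email for a single lowercase letter."""
    value = ord(letter) & 0x1F            # a -> 1, b -> 2, ..., z -> 26
    greeting = GREETINGS[value >> 3]      # top two of the five bits
    mark = PUNCTUATION[(value >> 2) & 1]  # middle bit
    opener = OPENERS[value & 3]           # low two bits
    return f"Dear {greeting} {mark} {opener}"


for letter in "adhoz":
    print(letter, opening_for(letter))
```

If this is right, the first three choices in the email carry five bits between them, which is the pick-a-phrase-per-bit idea with four-way and two-way choices instead of only two-way ones.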

Let me try a few two-letter combinations and see if I can figure something out.

Shouldn’t you be able to overcome this by using a more general grammar? You can keep the number of phrases down by generating the structure within the phrases. In addition to that you can encode whole strings of bits into single lexical tokens. The output is always a lot longer than the input, and I think nothing but a (not directly visible) lower boundary for the output length has to grow with the input length. If you don’t have to remain context-free many additional tricks become possible but reversibility might turn really ugly.

You may be on to something, sibyl. We need more like you. I’ll volunteer to pay your $4.95 if need be. :slight_smile:

It seems likely that there is a minimum length for the output, so entering “a” for the input might generate a minimum output with junk padding. Longer strings might not add to that until a threshold is reached, then the padding is no longer needed. So far, that theory doesn’t contradict your experiments and could explain some of mine.

Tried some letter combinations with a and got the following. Definitely some sort of pattern emerging here, but I still have no idea how it works. The rest of the emails following all of these are typically similar but there are small differences and added sentences in those too.

ab Dear Professional , Your… ! If you no longer wish to receive our … will immediately be removed from our club ! This…
ac Dear Professional , Your… ! If you no longer wish to receive our … will immediately be removed from our mailing list ! This…
ad Dear Professional , Your… ! This is a… !
ae Dear Professional , Your… ! This is a… .
af Dear Professional , Your… ! If you are not… !
ag Dear Professional , Your… ! If you are not… .
ah Dear Professional , Your… ! We will comply… !
ai Dear Professional , Your… ! We will comply… .

a, aa and aaa code to different messages of the same length, but with different embedded numeric terms.

Embedded numeric terms? The only embedded terms I see are the line breaks.

Not exactly the same length:


source data   length of encoded text
a 1027
aa

Whoops. I must be hitting the wrong keys today. Apologies.

Mangetout, I get similar but not always identical lengths to these source tests:


source len of result
a      1027
aa      977
aaa     977
aaaa    971
aaaaa   973
aaaaaa  976

sibyl, I think the embedded numbers referred to are the ones following “senate bill…” and other phrases. Note that the X%, “Section X”, “Title X” phrases have different values for X. This may be padding or it may be significant, dunno just yet.

Perhaps it is a combined code…

Line breaks do not seem to make a difference. I removed them and the message was decoded correctly.

This is a good way to hide a password or other short info. Just email it to yourself and keep it with other emails.