Spammimic: Fun, but how is the message encoded?

Musicat · April 4, 2004, 8:53pm

Ever heard of Spam Mimic?

Researching Steganography recently, I ran across this interesting site, spammimic.com. Anyone can type a short message and have it encoded into spam-like text. For example, “Now is the time” encodes to (I have left the CRLF’s in, since they might be significant)

Dear Friend , This letter was specially selected to
be sent to you . This is a one time mailing there is
no need to request removal if you won’t want any more
! This mail is being sent in compliance with Senate
bill 1622 ; Title 6 ; Section 306 . This is NOT unsolicited
bulk mail ! Why work for somebody else when you can
become rich within 12 DAYS . Have you ever noticed
people are much more likely to BUY with a credit card
than cash plus how many people you know are on the
Internet ! Well, now is your chance to capitalize on
this . We will help you sell more plus increase customer
response by 140% . You can begin at absolutely no cost
to you . But don’t believe us . Mrs Anderson of New
Mexico tried us and says “I’ve been poor and I’ve been
rich - rich is better” ! We are licensed to operate
in all states ! So make yourself rich now by ordering
immediately . Sign up a friend and you’ll get a discount
of 60% . Thank-you for your serious consideration of
our offer . Dear Cybercitizen , You made the right
decision when you signed up for our mailing list .
This is a one time mailing there is no need to request
removal if you won’t want any more . This mail is being
sent in compliance with Senate bill 2316 ; Title 9
; Section 302 . This is different than anything else
you’ve seen . Why work for somebody else when you can
become rich inside 40 days ! Have you ever noticed
more people than ever are surfing the web plus people
will do almost anything to avoid mailing their bills
. Well, now is your chance to capitalize on this !
WE will help YOU deliver goods right to the customer’s
doorstep & SELL MORE . You can begin at absolutely
no cost to you ! But don’t believe us . Mrs Jones of
New Jersey tried us and says “Now I’m rich, Rich, RICH”
! We assure you that we operate within all applicable
laws ! We beseech you - act now ! Sign up a friend
and you’ll get a discount of 10% ! God Bless .

This is a 100% lossless encoding that can be reversed, that is, decoded. Cute, eh?

For kicks, copy & paste this exact message to their “decode” site. It should decode to my original.

Now why would anyone want to deliberately make their message into spam? Privacy and secrecy come to mind. I am unsure as to how serious this site is (only short messages can be handled) but my factual question is: What is the algorithm used to encode & decode?

A google search turned up this:

But that doesn’t seem to be the case, unless a translation table is used. Put “zzzz” in as the original message, and you won’t see “z’s” in the 2nd vertical column of the output.

So far, my guesses seem to be wrong. The encoded message has all hi bits set to zero (so it’s 7-bit ASCII), there are no control or invisible characters embedded (nothing below 20H); it is 100% pure, plain text and there is no appended checksum. Experiments with the numbers (“Senate Bill 1622…”) suggest that they might be significant, but the entire message is not encoded in them alone. I think there is a lot of dummy stuff included. Changing single chars at random suggests that some are significant, others, not so much.
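For anyone who wants to repeat the character checks described above, here is a minimal Python sketch. The function name `looks_like_plain_ascii` is made up for illustration, and CR/LF are allowed through on the assumption that the line breaks are kept in the sample.

```python
def looks_like_plain_ascii(text: str) -> bool:
    """Return True if text holds only printable 7-bit ASCII (20H-7EH) plus CR/LF."""
    for ch in text:
        if ch in "\r\n":
            # Line breaks are kept in the sample, so they are not counted
            # as hidden control characters here (an assumption).
            continue
        code = ord(ch)
        if code < 0x20 or code > 0x7E:
            # Control character, DEL, or something with the high bit set.
            return False
    return True
```

A `False` result would point to hidden control characters or 8-bit data; a `True` result says nothing about where in the visible text the message is carried.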

So what is the principle behind this scheme?

Fish_Cheer · April 4, 2004, 8:56pm

Please use this link for spammimic.

Musicat · April 4, 2004, 8:57pm

I wish the hamsters would stop nibbling on the edges of my posts. The first line of the OP was supposed to read, “While researching Steganography, I ran across this interesting site, spammimic.com…”

Frylock · April 4, 2004, 8:58pm

What happens if you try to decode text that is not code? Can it tell the difference, or does it output garbage?

-FrL-

Musicat · April 4, 2004, 9:03pm

Thanks, fishcheer15. You read and type faster than I do.

It has an error message: “(can’t decode)” – some changes in a sample encoded text cause this, others cause a readable and recognizable decoded message. That’s why I think some chars are significant and others are just padding.

Frylock · April 4, 2004, 9:04pm

Looks like it can tell the difference between coded and non-coded text.

Interestingly, it seems like every single little letter is important. I changed the word “of” in the “mr so-and-so of such-and-such says” to random two letter groups, thinking if anything in the coded text is gratuitous it’s the “of” in those expressions. But nope–changing it makes it “undecodable.”

-FrL-

Frylock · April 4, 2004, 9:07pm

Thanks for the correction, Merricat. What text can be changed without messing up the decodability?

I guess the changes I made probably screwed with some parsing routine that looks for an “of” sequence after the names.

-FrL-

Musicat · April 4, 2004, 9:25pm

I don’t think all chars in all positions are significant. I’ve tried changing some of the numbers one digit at a time: some result in error, some change the decoded text (from “a” to “b”, for example), and some result in no change, no error.

Starting with “abc” as the plaintext, I removed only the first period from the encoded version. It would not decode. I changed one “than” to “then,” and it would not decode. I changed a single “!” near the end to a period and I got a decoded message of “abc…” (that is the exact decode, including the 9 dots!). So changing a single char resulted in a different length in the decode!

One theory was that the numbers encoded the entire message – that is, remove all the text as dummies and just use the remaining digits, perhaps in groups of 2. I don’t think so.

Frylock, try just changing a single char at one time so you don’t have too many factors to consider at once. And remember that position might be as important as char value. Also I have a hunch there is something going on with odd/even ASCII values – that is, the LSb may be ignored, or be a special factor.

Underneath all this, the algorithm must have been designed to be easy to generate and modify the spam text – pick one from col A, one from Col B, and concatenate – that’s why I think much of the text is dummy or the text fragments stand for something by themselves.

Musicat · April 4, 2004, 9:35pm

I rarely ask for such favors, but I would be eternally grateful if a Mod would fix the first screwed-up link in my OP so Dopers won’t be put off by a bad link when they first read this thread. And throw the hamsters an extra ration on me and maybe they’ll leave my posts alone.

kellner · April 4, 2004, 9:38pm

I know nothing about this scheme in particular, but I know how it could be done. Do you happen to know anything about formal grammars like those used to describe the syntax of programming languages? You can create such a grammar for spam messages. Note that we needn’t be able to create every possible spam message. Then you choose rules from your grammar according to the bits in your message. If your grammar is unambiguous (and you can easily ensure that if you create the grammar yourself) this process is reversible.

This is a very simple example:

<spam> ::= <greeting><body>
<greeting> ::= “Hi!” (0)
<greeting> ::= “Hello!” (1)
<body> ::= “You have won!” (0)
<body> ::= “Free viagra for everyone!” (1)

If your message is 01, you first choose a 0-rule, then a 1-rule.
This results in “Hi! Free viagra for everyone!”. Note how it can still be reverted.
If you encounter the spam: “Hello! You have won!” you know that the message was 10.
Of course your grammar can be a lot more complicated than that, allowing different numbers of sentences and structures that are harder to recognize.
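The toy grammar above can be sketched in Python as follows. This covers only the two-rule example, not spammimic's actual scheme; the names `encode` and `decode` and the single space joining the two phrases are assumptions made for illustration.

```python
# Each list index is the bit that selects that alternative.
GREETINGS = ["Hi!", "Hello!"]
BODIES = ["You have won!", "Free viagra for everyone!"]


def encode(bits: str) -> str:
    """Turn a two-bit string such as '01' into a spam sentence."""
    if len(bits) != 2 or set(bits) - {"0", "1"}:
        raise ValueError("this toy grammar encodes exactly two bits")
    return f"{GREETINGS[int(bits[0])]} {BODIES[int(bits[1])]}"


def decode(spam: str) -> str:
    """Recover the two bits by finding which rule produced each part."""
    for g_bit, greeting in enumerate(GREETINGS):
        prefix = greeting + " "
        if spam.startswith(prefix):
            rest = spam[len(prefix):]
            if rest in BODIES:
                return f"{g_bit}{BODIES.index(rest)}"
    # Text that no rule sequence could have produced.
    raise ValueError("can't decode")
```

Because no alternative is a prefix of another, every valid output has exactly one parse, which is what makes the process reversible. Text that matches no rule fails outright instead of decoding to garbage.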

Musicat · April 4, 2004, 10:01pm

Kellner, if I understand you right, a particular phrase translates (decodes) to a single digit (or bit). When concatenated with other phrases/bits, this can be interpreted as ASCII text. 100 0001B (7-bit) would then be ASCII “A”.

A reasonable proposal; it would mean that a single char would generate 7 (assuming 7-bit) phrases. However, in my experiments, adding a single char to a test message (adding “z” to “abc”, making “abcz”) resulted in only 4 chars added to the encoded text, where 7 * (phrase length) would be expected.
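To make the arithmetic concrete, here is a small Python sketch of the one-phrase-per-bit reading. The helper name `to_7bit` is hypothetical; it only shows how many binary choices a message would need under that assumption.

```python
def to_7bit(text: str) -> str:
    """Render each character as a 7-bit binary group, separated by spaces."""
    groups = []
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not 7-bit ASCII")
        groups.append(format(code, "07b"))
    return " ".join(groups)


def phrases_needed(text: str) -> int:
    """Number of one-bit phrase choices needed if every bit picks one phrase."""
    return 7 * len(text)
```

Under this reading, going from “abc” to “abcz” would add seven more phrase choices (21 to 28), which is far more than the four characters observed.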

And if you pass several source texts thru this site, you will notice considerable repetition in phraseology. (I don’t know what that means; just thought I’d mention it.)

I think something else is going on here. Perhaps we are all blind to the obvious.

(Yes, I am familiar with formal grammars used to describe programming language syntax, having written several compilers/assemblers.)

sibyl · April 4, 2004, 10:34pm

It appears to be tied to certain phrases. I put each lowercase letter in, alone, from a-z, and got the following data for the very beginning of the generated email. The email goes on for several more sentences after these though. There does seem to be a correlation between length of input and length of output as well - if you put in a sentence you get a significantly longer output email.

a Dear Professional , Thank-you…
b Dear Professional , Especially…
c Dear Professional , This…
d Dear Professional ; You…
e Dear Professional ; Thank-you…
f Dear Professional ; Especially…
g Dear Professional ; This…
h Dear Decision maker , You…
i Dear Decision maker , Thank-you…
j Dear Decision maker , Especially…
k Dear Decision maker , This…
l Dear Decision maker ; You…
m Dear Decision maker ; Thank-you…
n Dear Decision maker ; Especially…
o Dear Decision maker ; This…
p Dear Business person , You…
q Dear Business person , Thank-you…
r Dear Business person , Especially…
s Dear Business person , This…
t Dear Business person ; You…
u Dear Business person ; Thank-you…
v Dear Business person ; Especially…
w Dear Business person ; This…
x Dear E-Commerce professional , You…
y Dear E-Commerce professional , Thank-you…
z Dear E-Commerce professional , Especially…

The first part cycles through sequentially, with the second part being either a comma or a semicolon and one of four phrases.
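The pattern in the table can be reproduced by splitting the low five bits of each letter's ASCII code (a=1 … z=26) into three fields. This is a sketch that merely fits the 26 observations above; the function name `opening_for` and the bit layout are assumptions, not anything confirmed about how spammimic works.

```python
GREETINGS = ["Professional", "Decision maker", "Business person",
             "E-Commerce professional"]
PUNCTUATION = [",", ";"]
OPENERS = ["You", "Thank-you", "Especially", "This"]


def opening_for(letter: str) -> str:
    """Rebuild the start of the generated email for one lowercase letter."""
    n = ord(letter) & 0x1F                 # low five bits: a=1 ... z=26
    opener = OPENERS[n & 0b11]             # bits 0-1 pick one of four phrases
    punct = PUNCTUATION[(n >> 2) & 1]      # bit 2 picks comma or semicolon
    greeting = GREETINGS[(n >> 3) & 0b11]  # bits 3-4 pick the salutation
    return f"Dear {greeting} {punct} {opener}"
```

For example, `opening_for("d")` gives `Dear Professional ; You`, matching the row for d. Reading left to right, the text uses the higher bits first, which would fit a scheme that consumes the message bit by bit from the most significant end.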

Let me try a few two-letter combinations and see if I can figure something out.

kellner · April 4, 2004, 10:40pm

Shouldn’t you be able to overcome this by using a more general grammar? You can keep the number of phrases down by generating the structure within the phrases. In addition to that you can encode whole strings of bits into single lexical tokens. The output is always a lot longer than the input, and I think nothing but a (not directly visible) lower bound for the output length has to grow with the input length. If you don’t have to remain context-free, many additional tricks become possible but reversibility might turn really ugly.
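As an illustration of packing several bits into one lexical token, here is a hedged Python sketch. The table of discount figures is hypothetical (four interchangeable values, so each token carries two bits), and `bits_to_token` / `token_to_bits` are names invented for this example.

```python
# Hypothetical table: four interchangeable figures means two bits per token.
DISCOUNTS = ["10%", "20%", "40%", "60%"]
BITS_PER_TOKEN = len(DISCOUNTS).bit_length() - 1  # 4 alternatives -> 2 bits


def bits_to_token(bits: str) -> str:
    """Pick the phrase whose number encodes the given bit string."""
    if len(bits) != BITS_PER_TOKEN or set(bits) - {"0", "1"}:
        raise ValueError(f"expected {BITS_PER_TOKEN} bits")
    return f"a discount of {DISCOUNTS[int(bits, 2)]}"


def token_to_bits(token: str) -> str:
    """Read the bits back from the number at the end of the phrase."""
    value = token.rsplit(" ", 1)[-1]
    # list.index raises ValueError if the figure is not in the table.
    return format(DISCOUNTS.index(value), f"0{BITS_PER_TOKEN}b")
```

A rule with 2^k alternatives carries k bits, so a handful of numeric slots like this could absorb a fair amount of the message without adding any sentences.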

Musicat · April 4, 2004, 10:47pm

You may be on to something, sibyl. We need more like you. I’ll volunteer to pay your $4.95 if need be.

It seems likely that there is a minimum length for the output, so entering “a” for the input might generate a minimum output with junk padding. Longer strings might not add to that until a threshold is reached, then the padding is no longer needed. So far, that theory doesn’t contradict your experiments and could explain some of mine.

sibyl · April 4, 2004, 10:50pm

Tried some letter combinations with a and got the following. Definitely some sort of pattern emerging here, but I still have no idea how it works. The rest of the emails following all of these are typically similar but there are small differences and added sentences in those too.

ab Dear Professional , Your… ! If you no longer wish to receive our … will immediately be removed from our club ! This…
ac Dear Professional , Your… ! If you no longer wish to receive our … will immediately be removed from our mailing list ! This…
ad Dear Professional , Your… ! This is a… !
ae Dear Professional , Your… ! This is a… .
af Dear Professional , Your… ! If you are not… !
ag Dear Professional , Your… ! If you are not… .
ah Dear Professional , Your… ! We will comply… !
ai Dear Professional , Your… ! We will comply… .

Mangetout · April 4, 2004, 10:56pm

a, aa and aaa code to different messages of the same length, but with different embedded numeric terms.

sibyl · April 4, 2004, 11:02pm

Embedded numeric terms? The only embedded terms I see are the line breaks.

Musicat · April 4, 2004, 11:10pm

Not exactly the same length:


source data   length of encoded text
a 1027
aa

Musicat · April 4, 2004, 11:18pm

Whoops. I must be hitting the wrong keys today. Apologies.

Mangetout, I get similar but not always identical lengths to these source tests:


source len of result
a      1027
aa      977
aaa     977
aaaa    971
aaaaa   973
aaaaaa  976

sibyl, I think the embedded numbers referred to are the ones following “senate bill…” and other phrases. Note that the X%, “Section X”, “Title X” phrases have different values for X. This may be padding or it may be significant, dunno just yet.

Perhaps it is a combined code…

sailor · April 4, 2004, 11:52pm

Line breaks do not seem to make a difference. I removed them and the message was decoded correctly.

This is a good way to hide a password or other short info. Just email it to yourself and keep it with other emails.
