How important are "spaces" to cryptanalysis?

Senggum · May 8, 2004, 10:17pm

Are “spaces” the most important element in cryptanalysis? Here is a simple cryptogram of a famous quote with just 2 spaces eliminated. Can it be deciphered?

“nxcpoh kpqwboq xlomtqpbc phepdx hpeyxf pct rpcx.”

Mathochist · May 8, 2004, 10:41pm

Spaces are, if anything, the least important element. If I cared to decipher your quote I’d work on letter frequency charts (and digrams and trigrams). Eliminating spaces might give a few false hits to the digram or trigram frequencies, but not enough to seriously skew the attack. Spaces are most useful in cracking simple letter-substitution codes by hand.

Modern codes don’t even notice spaces. They turn the whole message into one big number (or a sequence of big numbers) and encode those. Spaces don’t even affect the algorithm as much as they do in substitution ciphers.
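To make the "one big number" point concrete, here is a minimal sketch. It assumes the message is UTF-8 text read as a single big-endian integer; real ciphers such as RSA add padding and block handling that are not shown, and the helper names `message_to_int` and `int_to_message` are made up for illustration.

```python
def message_to_int(message: str) -> int:
    """Read the UTF-8 bytes of the message as one big-endian integer."""
    return int.from_bytes(message.encode("utf-8"), "big")


def int_to_message(number: int, length: int) -> str:
    """Inverse of message_to_int, given the original byte length."""
    return number.to_bytes(length, "big").decode("utf-8")


with_space = "is like"
without_space = "islike"

n_with = message_to_int(with_space)
n_without = message_to_int(without_space)

# The space is just byte 0x20 inside the number; nothing treats it specially.
print(hex(n_with))
print(hex(n_without))
print(int_to_message(n_with, len(with_space)))
```

Dropping the space simply gives a different, shorter number; the algorithm that later encrypts that number never looks at word boundaries.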

KP · May 9, 2004, 12:34am

I know you are more knowledgeable than I in matters of math, but I beg to differ.

Spaces are as important to modern cryptanalysis as any character because they are characters, and at least as important to the meaning of the message as any other. Punctuation, too, must be preserved and accurately reconstituted. For a government, military or business to encode only the letters, and then guess at the spacing and punctuation afterwards, would be begging for eventual disaster. While it’s possible to incorporate gimmicks that make spaces take up less space than most characters, one could use most of those same tricks to reduce the space taken by the letter “e” instead. In modern crypto, there’s no real point. As you said, the whole message becomes one number: “bits are bits.”

Even for the simple substitution ciphers of centuries gone by, the actual frequency tables I’ve seen counted the space as a letter. Languages differ in average word length, as well as letter frequencies (and dem dam furriners refuse to talk English).
Actual substitution cyphers have long used standard-size units. Breaking a coded message into groups (e.g. five-character groups for ease of accurate transmission by writing, radio or telegraph) was explicitly done to deprive a codebreaker of the valuable linguistic information contained in, and implied by, word spacing.
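The five-character grouping described above can be sketched in a few lines. This is only an illustration: the function name `to_groups` and the choice to drop punctuation and upper-case the letters are assumptions, not a historical standard.

```python
def to_groups(text: str, size: int = 5) -> str:
    """Strip everything but letters, then re-space into fixed-size groups."""
    letters = "".join(ch for ch in text.upper() if ch.isalpha())
    return " ".join(letters[i:i + size] for i in range(0, len(letters), size))


cryptogram = "nxcpoh kpqwboq xlomtqpbc phepdx hpeyxf pct rpcx."
print(to_groups(cryptogram))
print(to_groups("attack at dawn"))
```

After regrouping, the blanks in the output carry no information about where the original words began or ended.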

Frequencies of letter combinations were used as much as basic letter frequencies: professional codebreaking is a high-volume biz, and even kids playing at spies quickly learn them by experience. Fragments of words, letter positions, and other hints are what make breaking a straight English message literally child’s play for many kids. “TH_” corresponds to only one English word: “the”. “T_E” could be “tie” or “toe”, quickly resolved by comparison to the rest of the message. Common one-letter words are usually “A” or “I” (though it could be a numeral). Only one English word in common use ends in “q” (Iraq), etc. Common letter pairs and triplets are also more readily discerned if the word spacing is preserved, and were listed on the old codebreaker tables I’ve seen.

Try actually solving a substitution cypher with the word spacing obscured. Though substitution cyphers are easy to break (their primary advantages were ease of training, portability and flexibility), you’ll find it much harder.

You can improve security by encrypting a code, which eliminates inherent linguistic clues. Even if an outsider breaks the encryption, they won’t know what “Tora! Tora! Tora!” or “Peccavi” means to you. Codes can also be remarkably concise: one bit can encode an entire message, e.g. “one if by land, two if by sea”, a flower on a windowsill means “the coast is clear”, etc.

[Though I cited “peccavi”, it’s a pun, not a code. When Gen. Napier captured Sindh, he is said to have sent this one-word message, which is Latin for “I have sinned.” The pun actually came from a cartoon in Punch, not a military message, but I couldn’t resist the familiar example.]

I should clarify that a “code” means (crudely put) “assigning an arbitrary meaning to something”, while a cypher converts the original text, intact, to another form. Cyphers can be considered pure algorithms, codes are more like secret languages. ASCII is technically a cypher, as are most schoolkid “codes” including pig latin.

Codes strike at the heart of one of the basic problems of defeating modern encryption: how do you know when your decryption is correct? Any sequence of n bits can be decrypted to any other n bits, i.e. any message of the same length. By brute force, “Attack the fortress” and “Withdraw our forces” can be equally likely.

In a sense, the one-time pad (the ultimate cypher, unbreakable even in principle) can be considered a constantly changing code, where each page of the codebook is discarded as soon as it is used. (OTPs can only be broken if protocol is violated and a page is re-used. This can reduce it, mathematically, to a simple substitution.) It also shares many of the weaknesses of a code, such as the need to somehow convey a codebook (the pad), which is vulnerable to capture and copying.
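The "equally likely" claim is easy to demonstrate for a one-time pad. The sketch below assumes the pad is combined with the message by byte-wise XOR (one common way to describe an OTP, not the only one); `xor_bytes` is a helper written for this example.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


plaintext = b"Attack the fortress"
pad = secrets.token_bytes(len(plaintext))   # the real, random pad
ciphertext = xor_bytes(plaintext, pad)

# For any other message of the same length, some pad "decrypts" to it.
decoy = b"Withdraw our forces"
fake_pad = xor_bytes(ciphertext, decoy)

print(xor_bytes(ciphertext, pad))       # the real message
print(xor_bytes(ciphertext, fake_pad))  # the decoy, from the same ciphertext
```

Since every pad is equally probable, someone trying pads by brute force has no way to prefer the real message over the decoy.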

Mathochist · May 9, 2004, 12:49am

Yes, so what does it matter to IDEA or RSA whether the spaces are removed or not?

Frequencies of letter combinations were used as much as basic letter frequencies: professional codebreaking is a high-volume biz, and even kids playing at spies quickly learn them by experience. Fragments of words, letter positions, and other hints are what make breaking a straight English message literally child’s play for many kids. “TH_” corresponds to only one English word: “the”. “T_E” could be “tie” or “toe”, quickly resolved by comparison to the rest of the message. Common one-letter words are usually “A” or “I” (though it could be a numeral). Only one English word in common use ends in “q” (Iraq), etc. Common letter pairs and triplets are also more readily discerned if the word spacing is preserved, and were listed on the old codebreaker tables I’ve seen.

More easily, but it’s really not that significant. Let’s say you’re trying to find which digraph corresponds to “th” (the most common digraph in English). Yes, occasionally there is a word that ends in ‘h’ followed by a word that starts with ‘e’, which would make a fake ‘he’ (the second most common) if the space were removed, but in practice it doesn’t happen often enough for this to make a difference. The top digraphs and trigraphs are separated by such wide frequency margins that it takes a huge number of false digraphs to throw the ranking off.
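A rough way to see how many false digraphs appear when spaces are dropped is to count digraphs both ways and subtract. This is a sketch only: the sample sentence is invented for the example and is far too short to say anything about real English frequencies, and `digram_counts` is a hypothetical helper.

```python
from collections import Counter


def digram_counts(text: str, keep_spaces: bool) -> Counter:
    """Count letter pairs, either within words only or across the run-together text."""
    text = text.lower()
    if keep_spaces:
        # Treat every non-letter as a word break.
        words = "".join(ch if ch.isalpha() else " " for ch in text).split()
    else:
        # Squeeze out everything but letters, so pairs span word boundaries.
        words = ["".join(ch for ch in text if ch.isalpha())]
    counts = Counter()
    for word in words:
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


sample = "the path ends here and the other path begins there"
within_words = digram_counts(sample, keep_spaces=True)
run_together = digram_counts(sample, keep_spaces=False)

# Counter subtraction keeps only the pairs that removing spaces created.
false_hits = run_together - within_words
print(false_hits)
print(within_words["th"], run_together["th"])
print(within_words["he"], run_together["he"])
```

Each removed space adds exactly one extra digraph, so the number of false hits is bounded by the number of word breaks, and they are spread over many different pairs.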

By hand, yes. I even said as much in my response. When you have a computer running frequency tables, though, it makes very little difference.

The rest of your post reads more like an encyclopedia entry than an answer to the OP. I say again: spaces in letter-substitution codes make deciphering easier for solving by hand, but computer frequency tables make the whole point moot anyhow. Modern computer encryption is based on strings of bits, which assign no inherent meaning to a space any more than to any other letter. “Cracking” one of these codes consists of finding the “key” used as an input to the algorithm, which will get you the plaintext out whether the spaces are part of the plaintext or not.

Peter_Morris · May 9, 2004, 1:15am

Really? I’ll have to stop eating coq au vin, then.

(Yeah, it’s a foreign word - but isn’t Iraq?)

ftg · May 9, 2004, 1:32am

As noted, any message sent today is just a sequence of characters, and blanks are just another character. You will never see true blanks in any modern crypto system. There may be blanks in the coded message, but they don’t represent blanks in the plain text (no more than an “s” represents an “s”), or they are used for blocking purposes. E.g., the encrypted text is groups of 5 symbols separated by blanks. The text is the symbols; the blanks would be there only for humans who easily get lost transcribing a bunch of gibberish. (You see things like this in product activation codes; note how your MS-Windows product key is broken up into blocks.)

Every TV and movie display of code breaking is ridiculously wrong. Take “Sneakers”: when they are breaking a page of code, you see blanks, and the rest are converted into letters as it progresses. That is so wrong that words fail me. You would see gibberish. Some of the gibberish might be blanks. But the blanks do not encode blanks, or any single symbol at all. There would be no smooth transition from encoded text to plain text. It makes no sense to even overwrite the old with new symbols.

zut · May 9, 2004, 2:06pm

Genius without education is like silver in a mine.

And I suck at cryptograms.
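For anyone who wants to check that answer against the original cryptogram, here is a small sketch that pairs up the letters (ignoring spaces and punctuation) and confirms they form a consistent one-to-one substitution. It only verifies a proposed solution; it does not find one, and `build_key` is a name chosen for this example.

```python
def build_key(ciphertext: str, plaintext: str) -> dict[str, str]:
    """Map cipher letters to plain letters, failing on any inconsistency."""
    cipher_letters = [c for c in ciphertext.lower() if c.isalpha()]
    plain_letters = [p for p in plaintext.lower() if p.isalpha()]
    if len(cipher_letters) != len(plain_letters):
        raise ValueError("letter counts differ")
    key: dict[str, str] = {}      # cipher letter -> plain letter
    reverse: dict[str, str] = {}  # plain letter -> cipher letter
    for c, p in zip(cipher_letters, plain_letters):
        if key.setdefault(c, p) != p or reverse.setdefault(p, c) != c:
            raise ValueError(f"inconsistent mapping at {c!r} -> {p!r}")
    return key


cryptogram = "nxcpoh kpqwboq xlomtqpbc phepdx hpeyxf pct rpcx."
solution = "Genius without education is like silver in a mine."

key = build_key(cryptogram, solution)
# Decode in place, keeping the cryptogram's own spacing.
decoded = "".join(key.get(ch, ch) for ch in cryptogram)
print(decoded)
```

The decoded text shows where the two spaces went: “is like” and “in a” were each run together.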

Senggum · May 9, 2004, 2:13pm

Kudos, zut! Could you explain the process you used to decipher it with the 2 spaces missing?
