This is a code I once saw (or heard about) described somewhere, so it must be a known method, and it seems impenetrable to mathematical methods to me, but what do I know.
You have a list of 50 books, numbered 01 to 50. Each book has hundreds of pages. The 7th through 11th words in each message identify the book and page (you would use letters standing in for the numbers so as not to make this obvious). Once you have that, you have your cipher. The way the code works is that every letter corresponds to a number. Let’s say “a” is 1, “b” is 2, and so on. In decoding the message, you advance the first letter of the coded message by a number of positions corresponding to the first letter on the selected page, the second letter by the second letter on the page, and so on.
So let’s say the first word on the selected page happened to be “cat”. That means the first three letters in the coded message are advanced by 3, 1, and 20 positions, respectively (if you get to the end of the alphabet you start again from the beginning, so that if you have a “j” and the key letter is a “t”, calling for an advance of 20, it becomes a “d”). And so on, down the page, and on to the next page, if necessary.
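For concreteness, here is a minimal Python sketch of the scheme as described (a = 1 through z = 26, with shifts wrapping past “z”); the function names are just illustrative, and it handles lowercase letters only.

```python
# Minimal sketch of the scheme as described: a = 1 ... z = 26,
# decoding advances each letter by the value of the matching key letter,
# wrapping around past 'z'. Lowercase letters only, for simplicity.
import string

ALPHA = string.ascii_lowercase

def decode(ciphertext: str, key_text: str) -> str:
    out = []
    for c, k in zip(ciphertext, key_text):
        shift = ALPHA.index(k) + 1                       # 'c' -> 3, 'a' -> 1, 't' -> 20
        out.append(ALPHA[(ALPHA.index(c) + shift) % 26])
    return "".join(out)

def encode(plaintext: str, key_text: str) -> str:        # the reverse operation
    out = []
    for p, k in zip(plaintext, key_text):
        shift = ALPHA.index(k) + 1
        out.append(ALPHA[(ALPHA.index(p) - shift) % 26])
    return "".join(out)

# The example from the post: a 'j' in the coded message, with key letter 't'
# (advance 20), decodes to 'd'.
assert decode("j", "t") == "d"
```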
The point of all this is that not only would you easily have access to 20,000 unique ciphers (assuming 50 books of 400 pages each), but the key changes with every letter. ISTM that this would undermine all sorts of frequency analyses and such.
But it’s a simple thing, and again not something I myself have thought of, so I assume Great Minds have come up with something. But what?
(Actually, I’m wondering now if such a code might actually be unbreakable but too unwieldy. I see from Wikipedia (Cryptanalysis of the Enigma) that the Enigma would have been unbreakable if properly executed, so perhaps being unbreakable is not all that.)
Are the books actual books written in some language, or are they random gibberish? If the former, then the frequency of the advancements would have the same distribution as letters in the language, which is an exploitable weakness. If the latter, then you have what amounts to a one-time pad, which is considered unbreakable so long as the pads are truly random. It would, however, be simpler to just look up the next entry in the pad rather than the somewhat convoluted procedure you outline. Both methods suffer the drawback that the books/pads must never fall into the wrong hands and should not be reused, as that will open the way to analysis. Creating and distributing the books securely then becomes a problem if the method is going to be used for any volume of traffic, as you will soon exhaust your 50 books.
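For comparison, a bare-bones one-time-pad sketch, assuming a truly random pad at least as long as the message (the XOR step here stands in for whatever combining rule the pad uses):

```python
# Bare-bones one-time-pad sketch: the pad must be truly random, at least as
# long as the message, kept secret, and never reused.
import secrets

def xor(data: bytes, pad: bytes) -> bytes:
    return bytes(d ^ p for d, p in zip(data, pad))

message = b"meet at the usual place"
pad = secrets.token_bytes(len(message))   # the shared random pad
ciphertext = xor(message, pad)            # encrypt
assert xor(ciphertext, pad) == message    # decrypt with the same pad
```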
As @QMatic points out, unless the books used as keys are actually random pads, this doesn’t stand up to modern cryptanalysis. In addition to statistical methods, there simply aren’t that many generally available books. For this cipher to be more logistically manageable than random pads, the book needs to be something that can be found in the same edition by both ends. Such books generally also exist in electronic form. Given enough ciphertext, you can just brute-force test against every known book text in a reasonable amount of time.
Harry Potter, War and Peace, etc. ISTM that the issues you raise don’t apply in such cases. (Though you would make sure you have the same edition of works like W&P).
Which is actually pretty much unbreakable if done right. The gotcha is that only two copies of the books should exist in the entire world (one with the encoder, one with the decoder), they must never be reused, and their contents should be completely random.
The Venona decrypts were possible because the Soviet OTP system started reusing pages during WW2 due to resource constraints. They later stopped doing that, and the traffic became completely unbreakable to the US (even though by that point the US had a pretty complete picture of the system and how it worked).
As well as the OTP (which was only used for high-level traffic, such as communications between embassies and Moscow, since it was such a complicated system), the Soviets had an Enigma-type machine that didn’t have the flaws of Enigma, and it was unbreakable until the explosion of computing power in the ’70s and ’80s allowed brute-force decryption.
That is my thinking, too. It is pretty straightforward to come up with a series of operations so convoluted that it would be infeasible to decode given only a small number of messages passed. The difference with modern crypto is the ability to be unbreakable at scale without requiring a lot of cleverness from the parties involved.
One time pad cyphers are theoretically unbreakable in the middle (provided you have a sufficiently random source to generate the pad) but require that both sender and receiver have the same pad, which obviously has to be secured lest it be compromised, and should not be reused. Public key encryption (PKE), and in particular the Rivest-Shamir-Adleman (RSA) cryptosystem, has come to dominate commercial encryption because it doesn’t require agents to share entire cypher pads but just the “public” half of a public-private key pair: you can encrypt a message using the receiver’s public key and the sender’s private key, the receiver can decrypt it using the receiver’s private key and the sender’s public key, and each side only has to swap public keys, which can be done in open text. The system is theoretically breakable by just going through every combination of primes, but that is an NP-complete problem, so as long as you select suitably large keys it will be computationally prohibitive to do in finite time with a digital computer. A great book explaining this at an implementation level is Paar and Pelzl’s Understanding Cryptography; note that it isn’t for the casual reader, it is a textbook on applied cryptography, but it is reasonably light on discrete math and heavy on algorithms. (I don’t think it has any proofs, so you don’t have to be Will Hunting to work it out.)
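To make the public/private mechanics concrete, here is a toy RSA sketch with deliberately tiny textbook primes; real systems use primes hundreds of digits long plus padding schemes, so this only shows the shape of the operations.

```python
# Toy RSA sketch with deliberately tiny primes, just to show the mechanics.
from math import gcd

p, q = 61, 53                 # two (far too small) primes
n = p * q                     # 3233, the public modulus
phi = (p - 1) * (q - 1)       # 3120
e = 17                        # public exponent, coprime to phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)           # private exponent (2753), Python 3.8+

m = 65                        # a "message" encoded as a number < n
c = pow(m, e, n)              # encrypt with the public key (e, n)
assert pow(c, d, n) == m      # decrypt with the private key (d, n)
```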
As for the scheme of the o.p., it would certainly be breakable by brute force if you had digitized versions of the books in question, and you’d have to do this to decode a smaller message; but once you have sufficient message traffic you can use letter frequency analysis to back out the encryption, especially if you reuse the books, which you will have to if you have enough traffic. Since you are essentially combining the frequency distribution of one English letter with another, there is a huge amount of bias, and of course if you can get even a small number of matches you can then go back and do a rapid search of digitized texts to figure out the source, since the sequence of letters in a text is highly distinctive. I suspect that if you tried this method and sent a few tens of thousands of words of material, a decent cryptographer who knew that this scheme was being employed, with access to a library of digitized texts, could break it in days if not hours on even a small computing cluster. From a cryptographic standpoint, this is just a couple of steps up from simple substitution in terms of complexity.
If I’m understanding the proposal correctly, this is a book cipher, and the ciphertext is basically the SUM of two English messages, the plaintext and some text from a book. This would be fairly easy to break. One approach would be to subtract some common English strings, such as THE or AND from the ciphertext in every possible position. Since THE is likely to appear in the part of the book being used as a key, this subtraction will likely yield some three-character substrings of the plaintext. Repeat this with a large number of common English strings and some parts of the plaintext will appear. Combined with some more sophisticated data about the frequency of various digraphs, trigraphs, etc. in English text, the security would be pretty poor.
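A rough sketch of that crib-dragging idea against the o.p.’s scheme (the crib list and the filtering of candidates are left to the analyst):

```python
# Rough sketch of crib dragging: slide a common word (a "crib", guessed to be
# part of the key text) across the ciphertext, undo the shift at each
# position, and keep candidates that look like English.
import string

ALPHA = string.ascii_lowercase

def unshift(cipher_chunk: str, crib: str) -> str:
    """Candidate plaintext if the crib really is the key text at this spot."""
    return "".join(
        ALPHA[(ALPHA.index(c) + ALPHA.index(k) + 1) % 26]
        for c, k in zip(cipher_chunk, crib)
    )

def drag(ciphertext: str, crib: str):
    for i in range(len(ciphertext) - len(crib) + 1):
        yield i, unshift(ciphertext[i:i + len(crib)], crib)

# A human (or a scoring function, see the later posts) filters the candidates
# for fragments of readable English.
```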
One particularly weak feature of this cipher is that patterns in English text might be used to reveal not only parts of the plaintext, but parts of the key. If you discover that part of the key reads GxNDAxF SAxD, you may be able to guess not only the rest of the key but the book that is being used, a catastrophic leak of information not only for the current message but for others encrypted with the same book.
A basic idea in applied cryptography is that you should assume that the cipher algorithm is either known or will be discovered. As described above, statistical analysis is capable of providing clues that a book cipher is in use. Code breakers would perform such analysis as a matter of course when analysing any ciphertext; it is just too well known a cipher algorithm not to check for.
As noted above, a one time pad is provably unbreakable, but only if it is never reused. If it is reused, it reduces to a trivially breakable cipher. Cryptanalysts will, as a matter of course, run tests across messages to see whether a slip has occurred and an OTP has been reused. OTPs have the significant disadvantage that they are used up at the rate of one character of pad for each character of text encrypted, which significantly limits their utility.
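A small illustration of why reuse is fatal: combining two ciphertexts made with the same pad cancels the pad entirely, leaving only the two plaintexts combined, which crib dragging can then pick apart.

```python
# Pad reuse: XOR-ing two ciphertexts made with the same pad removes the pad.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

pad = bytes(range(16))                       # stand-in for a (reused) pad
c1 = xor(b"attack at dawn..", pad)
c2 = xor(b"retreat at once.", pad)
assert xor(c1, c2) == xor(b"attack at dawn..", b"retreat at once.")
```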
Both of the above run into the core problem of cryptography: the key exchange problem. For a book cipher, the identity of the book is the key, and there is depressingly little entropy in that; you would be struggling to get 16 bits’ worth. It can at least be memorised, so each side of the communication could be expected to obtain their copy of the book locally, and thus avoid transmitting the key outside of the agent’s head.
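For a sense of scale, using the o.p.’s own numbers (50 books of 400 pages), the book-and-page choice carries only about 14 bits:

```python
# Entropy of the key choice under the o.p.'s numbers: 50 books x 400 pages.
import math
possible_keys = 50 * 400            # 20,000 book/page combinations
print(math.log2(possible_keys))     # ~14.3 bits, well short of even 16 bits
```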
OTP keys were physical pads of paper, and traditionally were distributed by a diplomatic courier in a briefcase handcuffed to the arm of the courier. Return of the courier with an intact arm was generally proof of safe delivery.
Public key encryption solves the key exchange problem by using the very expensive public/private key operations only to exchange short-lived random session keys.
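A conceptual sketch of that session-key handoff, reusing the toy RSA numbers from the sketch earlier in the thread (the real thing would hand the session key to a symmetric cipher such as AES):

```python
# The expensive public-key step is used only to move a short random key;
# bulk traffic then uses a fast symmetric cipher. The RSA numbers below are
# the toy ones from the earlier sketch, not real parameters.
import secrets

n, e, d = 3233, 17, 2753                  # toy RSA modulus and exponents

session_key = secrets.randbelow(n)        # short-lived random session key
wrapped = pow(session_key, e, n)          # sender wraps it with the public key
assert pow(wrapped, d, n) == session_key  # receiver unwraps with the private key
# Both sides now feed session_key into a fast symmetric cipher for the
# actual message traffic.
```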
The other problem is what is termed key amplification. OTPs eat up the key, and it is generally more useful not to need keys that are as long as the message, so part of most cipher algorithms effectively unfolds the key into a much longer form. This unfolding is itself subject to attack, so care must be taken not to use a session key past the point where it may become vulnerable. A book cipher is performing key amplification with the algorithm that skips and selects letters from the book, but it is quite weak, and the nature of the algorithm leaves it open to brute-force attack.
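A sketch of key amplification in the abstract: a short key unfolded into a much longer keystream by hashing it with a counter. Real designs use vetted key-derivation functions and stream ciphers; this is only to illustrate the stretching idea.

```python
# Unfold a short key into a long keystream by hashing key + counter.
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

ks = keystream(b"short secret", 1000)   # 1000 bytes derived from a 12-byte key
```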
For completeness, for all its hype, quantum cryptography is basically intended to solve the key exchange problem. The channel is tamper-evident, so you don’t need to check the number of limbs still attached to your courier; just check that the entangled bits in your channel show no evidence of tampering during the exchange.
While factoring integers is clearly in NP as well as co-NP, as far as I know it is not known to be NP-complete, and such a result would be rather interesting and surprising. But the relevant property is that since it is not known to be in P, this tried and tested system can still be used today simply by starting with big enough primes whose product can still not be decomposed in a reasonable amount of time.
Nobody today who is serious about cryptography would ever use an algorithm that isn’t fully in the public domain and well known.
If it’s public, it means it can be rigorously tested for weaknesses by thousands of experts over a long period of time, and any successful method of attack will immediately be known.
A secret algorithm is always considered to be weak and suspect. If one guy thought it up – even if that guy is a top cryptographer – it means that it may be subject to methods of attack that didn’t occur to him.
An openly known and tested algorithm will always be far more secure, as well as being necessary for general interoperability between different computer systems.
Most of the internet runs on RSA for public-key encryption, along with symmetric-key ciphers like AES, Blowfish, Twofish, Serpent, etc., and hash algorithms like SHA-256 and the older MD5.
Hard slog, an inquisitive mind and a penchant for solving puzzles. The absence of computing technology also limited the range of encryption algorithms possible. Human error on the part of the people encrypting ciphertext has always been a big factor as well. Laziness and complacency lead to errors that expose weaknesses. But it takes human effort to dig out the specifics. Perhaps the best example is how Bill Tutte reverse engineered the German Lorenz machine from only intercepted messages. Human beats machine. Of course they still needed to develop the Colossus machines to perform the needed code breaking in useful time, but Bill Tutte’s work was the fundamental breakthrough.
But prior to any mechanical systems, ciphers were of limited capability. One time pads are unbreakable, then and now. A cipher needed to be designed to be tractable to both encode and decode in sensible time with limited resources, so there were always limits on what could usefully be thought up and put into practice.
Even if you can subtract some common English strings and decipher some of the plaintext, 1) you won’t necessarily know that you’ve deciphered part of the plaintext (as opposed to getting a random-looking result), and 2) even if you do, you will only know that the key text comes from a book which contains the word THE or AND, which does nothing for you for the rest of the cipher.
The most common method used in the 19th and early 20th century was the codebook. They were widely used for military, diplomatic, and financial purposes.
A codebook was a list of normal English words (or French or German or Russian or whatever) with special meanings corresponding to each word.
e.g.
canoes = have authority to
canopies = on whose authority
cantered = you have no authority
canticle = authorize them to
canvassed = not authorized
capacify = you are authorized to answer
Using ordinary words meant that messages could be sent quickly by normal telegraph with normal operators.
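In software terms, a codebook is just a shared lookup table; a toy sketch using the entries above:

```python
# A codebook as a shared lookup table; entries taken from the examples above.
CODEBOOK = {
    "canoes": "have authority to",
    "canopies": "on whose authority",
    "cantered": "you have no authority",
    "canticle": "authorize them to",
    "canvassed": "not authorized",
    "capacify": "you are authorized to answer",
}
ENCODE = {phrase: word for word, phrase in CODEBOOK.items()}

def decode(telegram: str) -> str:
    """Replace any code word with its phrase, leaving other words alone."""
    return " ".join(CODEBOOK.get(word, word) for word in telegram.split())

print(decode("canvassed"))   # -> "not authorized"
```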
It was reasonably secure… except that if an enemy got hold of a codebook they could decipher the messages immediately.
Hence codebooks were kept in locked safes, and people who worked with them had strict instructions to destroy them if there was any chance of an enemy getting hold of them.
The reason we have modern digital computers is cryptography. As invaluable and ubiquitous as computers are today for doing everything from communications to controlling traction and stability responses in your car, the first digital computers were special-purpose machines designed to break enemy codes, which they could do in hours, compared to the many days that an electromechanical “bombe” would take, or the weeks for hundreds of women “calculators” using mechanical adding machines to perform iterative calculations.
Once you’ve matched even a small amount of plaintext, you can easily match the pattern against digitized texts. The position of a few words or a few hundred letters would be sufficient to uniquely identify a text. The fact that your code would have such frequent and trivial repeats makes it inherently vulnerable to being cracked.
The high redundancy of English text addresses both of these questions. You can guess, with probability much higher than chance, whether a sequence of letters is part of an English message or not, and given a substring of a part of English text, you have a good chance of guessing what precedes and follows it.
If you subtract AND from 3 letters of the ciphertext and get the result XFP, it is almost certainly not a correct decryption. But if you get NER, it is more likely to be correct and you can note that AND in this position is worth more investigation. This also gives you some ideas for further attacks – the word after the AND in the key is likely to be a noun, which reduces the number of trials you need to do in that position. And the known frequency distribution of English letters can tell you what is likely to appear both before and after the NER in the plaintext.
Worse than that - take the example of “the” - if subtracting it yields a string that looks like a fragment of English, then the next step is to try “then”, “there”, “thee”, “other”, “whether”, etc., since that particular sequence of characters is often embedded in a longer word. “And” is also embedded in bigger words. If one of these is the actual key or message, then the partial fragment becomes an even more likely longer fragment. The only challenge then is to determine whether you are deciphering the message or the key; with enough text, you are likely to produce word fragments that make guessing additional characters easy. There are likely concordances of the most common words in English (or Russian, or French, etc.).
Some random book, unless it’s technical, probably draws mainly from a vocabulary of maybe a few thousand words. It should be possible to analyze the text against this list and “score” the output based on how many sequences of characters match common English letter combinations. Process a massive amount of text and you will find that combinations like “ea”, “th”, “re”, “ed” and “qu” are very common. Using that relative frequency to score the attempts at decryption, with extra points for high-probability runs of 3 or 4 in a row (“ent” or “tion”), would direct the cryptographer to likely fragments of the key or text.
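A rough scoring sketch along those lines; the digraph/trigraph lists and weights are illustrative stand-ins, not a real frequency table:

```python
# Score a candidate decryption by counting common English letter combinations.
COMMON_DIGRAPHS = {"th", "he", "ea", "re", "ed", "er", "an", "in", "on", "qu"}
COMMON_TRIGRAPHS = {"the", "and", "ent", "ion", "tio"}

def score(candidate: str) -> int:
    s = sum(candidate[i:i + 2] in COMMON_DIGRAPHS for i in range(len(candidate) - 1))
    s += 2 * sum(candidate[i:i + 3] in COMMON_TRIGRAPHS for i in range(len(candidate) - 2))
    return s

# Higher-scoring candidates from crib dragging get human attention first.
```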