Trying to understand creating my own file encryption program

I am trying to understand file encryption. Basically suppose I created a lookup table that transformed each byte of a file per some lookup table that I developed. Such as


Input Byte	Output Byte
0x00		    0x54
0x01		    0x9A
. . . 
0xFF		    0x43

and then suppose I did this for each file. How could anyone recover the data assuming I applied the same rule to the file name and extension? For this exercise make the assumption that I do this on a personal computer and then move this to a flash drive. Then you are only given the flash drive. Could you uncover the original file?

I can obviously make this much more advanced by merging files (perhaps by merging all my files and creating some kind of key delimiter between the files. Then I shuffle the data according to some other algorithm I keep so I can reorganize it later.

  • Suppose that the 256 byte transform (lookup) table is kept in my head.
  • Suupose that the file delimiter is kept in my head
  • Suppose the the algorithm to shuffle the data is kept in my head

It seems to me that a string of bits is meaningless unless you know how to interpret it. So even if you had this information I told you you would have to figure out how to shuffle the data to get the delimited file back. Then you would have to parse the data to get each file. Then you would have to reverse engineer my lookup table to generate a usable file. Of course that isn’t impossible, but then suppose all you are given is a chunk of data and you have no idea (and suppose I refuse to tell) how to interpret said data? Could you figure out what my data is?

The key idea of encryption is that people know what the algorithm is that you are using to encrypt the data. If you’re keeping the algorithm secret and are confident that statistical analysis of the data wouldn’t reveal patterns of some sort, you’re probably okay. But really it’s a shot in the dark unless you’re wise to how one detects and reverse engineers such patterns. You also need to be certain that no one could ever beat the algorithm out of you.

It’s unlikely that this is a better method than using already available software that uses accepted algorithms. There’s a few free ones that are, supposedly, perfectly good.

Disclaimer: I am in no way qualified to answer this, but I have some experience in programming and I do take an interest in encryption.

  1. A strict byte-swapping scheme has a fairly limited number of encryption keys (byte-to-byte mappings)
  2. Most languages hve a fiaryl hihg amuont of redanduncy - basically: you can reconstruct source text even if, say, 30% of it is lost - or more if it’s just mangled.
  3. If you can apply any kind of statistical data to the source, you can apply it to the encrypted data too, given enough encrypted data and some knowledge of the source language: for instance, the byte with the highest frequency would probably encode the letter “e”.
  4. your scheme has been tried since the classical Greeks. there really is a reason it’s not seriously used anymore.

too late to edit my post, but see this, especially the “Breaking the cipher” section:

http://en.wikipedia.org/wiki/Caesar_cipher

edit and this: http://en.wikipedia.org/wiki/Substitution_cipher

Applying it to the filename and extension might be difficult due to the rules the filesystem imposes on what kind of bytes are allowed in a filename. Besides, it doesn’t make your system any more secure.

But your system is fundamentally insecure as designed at this point. Simple substitution is trivially breakable by looking for patterns in the encrypted text: There are only so many English words that look like XYY, for example, and only so many that are only one letter, and only so many that are seventeen letters. A simple substitution cipher hides the letters but not the patterns.

The shuffling algorithm is a lot more important than the substitution table in this version of the system.

Doesn’t matter, for the reasons I outlined above.

Doesn’t matter.

Then it’s likely insecure, because designing an algorithm you can’t break is easy, but designing one that someone else can’t break is very difficult. (We still don’t know how to do it so nobody can break it, but with a lot of work and a lot of public review we can come close. All secret algorithms have sucked.)

This might be difficult.

These steps would likely be easy.

Actually, a one-time pad is completely, 100% unbreakable, not to mention dead simple. It’s just very awkward to use in practice.

http://en.wikipedia.org/wiki/One-time_pad

This isn’t relevant in the slightest, because a one-time pad is an extremely public algorithm. Only the key is kept private. It’s impractical because the key needs to be at least as long as the total length of every message sent on the channel the one-time pad secures.

And a one-time pad isn’t a substitution cipher: The key is XORed with or added to or otherwise mathematically applied to the messages. This means none of the patterns in the message show through, as they do with a substitution cipher.

Anyway, a substitution cipher is an algorithm; the specific table used is the key.

Thanks for the input guys. I really didn’t give my method more than 5 minutes of thought. But this is real interesting so I will check out all your links when time permits.

A known-plaintext attack on a Caesar cipher? It’ll be done by lunchtime.

Some of the biggest mistakes people have ever made in cryptography came from trying to invent their own algorithms, then keeping the algorithms secret. I can’t repeat this often enough: Cryptographic algorithms and processes must be made public and submitted for analysis by mathematicians and cryptographers so they can tell you what you did wrong. You don’t want to be the next guy to develop something and have it wind up being no more secure than a ROT-26 transposition.

It’s the keys that you keep private.

If you want a good algorithm, the RSA algorithm is pretty good. And pretty simple - so simple that it can be put on a T-shirt.

Xoring is one way of implementing a one time pad. Using a lookup table works as a one-time pad if you meet all these conditions:

  1. The table is truly randomly generated. Not pseudo-random.

  2. Each byte/character is mapped by a different entry in the table. No re-use.

  3. The message is no longer than the table. (Hence a 256 byte table can be used to encode a message at most 256 bytes long.)

  4. The table is never, ever re-used.

Needless to say, the OP’s idea is very, very far from meeting these criteria. It can be broken in a flash.

To the OP: If you don’t know why it is so easy to break you scheme, you have no business trying to create one. It’d be like someone trying to design a bridge who doesn’t understand gravity.

If you learn nothing else from this thread, memorizing this quote will put you ahead of 90% of the people out there. If you want to be ahead of 99% of the people out there, you should understand that in any real system, the encryption algorithm is the strongest link in the chain. We all know what that’s worth.

I would say the OP has every business in the world doing what he/she is doing, it’s fun and you learn.

Here’s an additional question for crypto-types:
If the OP altered the algorithm to continuously be shifting the starting point of the simple translation table, how would that type of change impact the statistical analysis required to break the code? Would it be harder for an experienced crypto-type to break?

For example, for the first character in the message, the translation table starts at index 0, an input character value of 64 would go to index 64 to get the output character. For the 2nd character in the message, an input character value of 64 would go to index 65 in the table to get the output character, etc. etc. (wrapping at 256).

Assuming we’re talking about plain English text, the highest frequency character would most likely be the space. For binary data, I suspect NULLs would be the front-runner.

There was an online variant of the Vigenère cipher that the coder had adapted to include encoding the spaces (to prevent cryptogram-style attacks from guessing words based on length). However, that opened the system up to a new attack: Just decrypt the entire message based on the assumption that the entire plain-text is spaces. Anywhere there’s a space, you’ll get one of the letters in the key, and anywhere else, you’ll get a semi-random letter. Do a frequency count on the results, and the letters in the key will have the highest frequencies.

(Here’s the site: http://www.seabreezecomputers.com/encrypter/)

Assuming that the OP is doing it just for fun, sure. If, on the other hand, the OP is doing it to encrypt anything important, then such behavior is foolish or even dangerous. The only thing worse than no crypto is bad crypto – it has all the disadvantages of no crypto with the added problem that the user is lulled into thinking those disadvantages are gone. And it’s really easy to create bad crypto.

I’m not sure you’re following the result of the tweak.

The common characters (e.g. “e”) would not get xlated to the same output character every time. The character “e” would get xlated to any of the 256 characters depending on how the xlate table is shifted at that point in the message.

It seems quite likely that the most common character in the xlated message has no relation to the most common character in the input message, whether thats an “e” or a space.

Here’s a great book on the subject, if anyone is interested:

http://www.amazon.com/Cryptology-Unlocked-Reinhard-Wobst/dp/0470060646

I’m not sure you’re following the thread of conversation. It was a much earlier message I was replying to that was just focused on the translation table, not any tweaks.

It would be harder to crack but trivially so with today’s computing power. A better bet would be to compress the data first and remove as much redundancy as possible. Redundancy is the Achilles Heel of most cryptographic techniques. Of course, as soon as the hacker realized the data was compressed then cracking becomes a whole lot easier.

Yes, my bad. I saw a response in the thread and assumed it was related to my post.