File compression: the general case

Well, sorta general. I’m about to go hunt down some information on Google about it, but I feel like I’ve thought up an ingenious method for file compression that will work just as well on any file whatsoever. Hopefully I will be able to implement and test it, but first I’d like to know how the popular compression agents work.

Do they all simply look for redundancy in straightforward manners? That seems… well, disingenuous.

I have reasons to doubt that any compression scheme will work well on any file. Nevertheless, not all compression schemes use redundancy. You might want to do some reading on prefix codes. There’s a decent, if rather short, introduction here.
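To give a flavour of what a prefix code buys you, here is a minimal sketch in Java (the four-symbol code table is invented for illustration, not taken from that introduction): because no codeword is a prefix of another, the bit stream decodes unambiguously without any separators.

```java
import java.util.Map;

// A minimal prefix-code sketch. The four-symbol code table is made up for
// illustration: frequent symbols get short codewords, rare ones get long
// codewords, and no codeword is a prefix of another, so the bit string
// decodes unambiguously without separators between symbols.
public class PrefixCodeDemo {
    static final Map<Character, String> CODE =
            Map.of('e', "0", 't', "10", 'a', "110", 'x', "111");

    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            bits.append(CODE.get(c));
        }
        return bits.toString();
    }

    static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        StringBuilder buf = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            buf.append(bit);
            for (Map.Entry<Character, String> e : CODE.entrySet()) {
                // A match is final: no longer codeword can start with this one.
                if (e.getValue().contentEquals(buf)) {
                    out.append(e.getKey());
                    buf.setLength(0);
                    break;
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = encode("teatax");
        System.out.println(bits);          // 10011010110111 (14 bits for 6 symbols)
        System.out.println(decode(bits));  // teatax
    }
}
```

Huffman coding is the usual way to build such a table from actual symbol frequencies; the sketch just hard-codes one.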

Good luck to you. I remember a thread where someone was proposing a “new” method of encryption. I am always amazed that people who have not bothered to learn what is already out there think they have a worthwhile idea. Thousands of people who know a lot about the topic have spent countless hours thinking about it, and someone who has not bothered to study the basics thinks they have a better idea? As I say, good luck. If you care to post your invention I am sure we will have some knowledgeable people (I do not claim to be one of them) pick it apart in no time.

At any rate… compression algorithms are specifically designed for the type of data they will deal with. It is impossible for a general-purpose compression algorithm to be the most efficient in all cases. Also, some of them are “lossy”, like JPG, and some are not, for when you need to keep the integrity of the data. If you compress a computer program you need to reconstruct the exact same thing, not an approximation. But MP3 can lose some data and you won’t notice it.

We have discussed compression algorithms before. JPG is designed for pictures. It separates color from intensity and then compresses them separately. GIF is better suited for images with very few colors and solid areas of the same color. I believe GIF is non-lossy. GIF is more suited to a B&W document than JPG, for example.

Anyway, what did you have in mind?

Well, stating what I have in mind is a lose-lose situation for me. If it is a good idea, it is open to be stolen; if it isn’t, I play the fool. Now, I don’t mind playing the fool, but I wouldn’t want my idea to be stolen. sigh

I know it sounds somewhat crazy to have a compression system which would work well on all file types, but I truly believe I have one. Understand, it is an arithmetic compression agent, not a redundancy detector.

My idea really sounds so fucking simple that I can’t believe no one has thought of it before, so I’m reading up now on several compression methods. None of them even come close, which puts me on perpetual motion or genius level. :frowning: You know you’re on the edge when you are stuck in such a dichotomy LOL

It would be a lossless compression method which could be used on any file, any size. Because of the possible nuances of binary files versus text files, I might need to tweak the algorithm to account for data storage, but I might not.

I would consider a candid conversation with someone through the “high security” of email :stuck_out_tongue: if anyone is willing… My email is open to all, so fire away if you are truly interested. In fact, I could probably use some help with it.

I do have doubts that it would successfully encode text better than the standard text-compression algorithms, but again, I’d need to actually program one to test this, so this is some time in the making. Fucking C++! I should try it with Java first, which is eminently more writable…

Ah, found it. There’s no way that a compression scheme will work equally well on all files. There are 2[sup]n[/sup] files with n bits, and there are 2[sup]n[/sup] - 1 files with fewer than n bits. So no matter what compression scheme you’re using, there just aren’t enough shorter files to go around, and some file will not be compressed by even one bit.
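Spelled out, the counting is just a geometric sum up against the pigeonhole principle:

```latex
\[
  \underbrace{1 + 2 + 4 + \cdots + 2^{\,n-1}}_{\text{bit strings shorter than } n \text{ bits}}
  \;=\; 2^{n} - 1
  \;<\; 2^{n}
  \;=\; \text{bit strings of exactly } n \text{ bits}
\]
```

So a lossless scheme cannot shorten every file: if it shrinks some inputs, it must either expand others or map two different inputs to the same output, and the second option breaks decompression.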

http://www1.ics.uci.edu/~dan/pubs/DataCompression.html

Here is an apparently extensive overview of data compression methodologies. It might be of interest.

I have been involved in R&D on compression algorithms before, so I may have a little to contribute here. If you have an idea that you are worried about having stolen, you are going to have to obtain a patent to protect it. As observed elsewhere on SDMB, that is an expensive process and often not worthwhile. And, since you can’t patent an idea, you are going to have to create an “invention” that expresses that idea (i.e. actually create a compression routine that expresses your idea).

Let me observe that there are literally hundreds of compression algorithms out there, and many millions of dollars and Ph.D.'s have been devoted to the subject. That’s not to suggest that it’s impossible that your idea is both workable and original, but the chances are fairly small.

If you do want some help evaluating your idea, you might try getting a non-disclosure agreement with someone qualified. You’ll have to provide some “consideration” to the other party (e.g. payment up front, or a share of the proceeds of commercialization) to make the contract binding.

Well, I don’t want to be the guy who bursts anyone’s bubble, but you say your algorithm works on any file, so it must be a lossless algorithm. Most research in the field of compression nowadays focuses on lossy algorithms (DCT, wavelets, fractals and such), since they compress pictures, sound and film much, much better, so I doubt that your method is anything of value.

But to answer your question: yes, compression algorithms usually look for redundancy in data (since that is what compression algorithms do ;-), but whether they do it in a straightforward way depends mostly on your definition of straightforward, I think. So, perhaps you could shine some more light on your thoughts about data compression.

You might want to do a search on “information theory” and “entropy”. (Entropy in this context is different from physics entropy). There has been a lot of analysis in this area, and it’s highly unlikely you thought up something credible without any background info. I don’t mean that to be condescending; it’s just the nature of new ingenious things that their discoverers are almost always steeped in knowledge about the old ways of doing things.

http://wol.ra.phy.cam.ac.uk/pub/mackay/info-theory/course.html
http://www.faqs.org/faqs/compression-faq/

To put it bluntly, all forms of compression involve recognizing redundancy and coding for it. I’m not sure what “straightforward” means in this context, but I imagine that the answer is that it is not straightforward, in the sense that you can’t easily do it with pencil and paper in short order.
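About the most literal toy instance of “recognize redundancy and code for it” is run-length encoding. A quick sketch (the count-then-character output format is just one arbitrary choice for illustration):

```java
// A toy run-length encoder, about the most literal form of "find redundancy
// and code for it" there is. The output format (decimal run length, then the
// character) is an arbitrary choice for illustration; note that it shrinks
// input with long runs and expands input without them.
public class RunLength {
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int run = 1;
            while (i + run < s.length() && s.charAt(i + run) == c) {
                run++;
            }
            out.append(run).append(c);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("aaaaaaaabbbbcc"));  // 8a4b2c       (14 chars -> 6)
        System.out.println(encode("abcdef"));          // 1a1b1c1d1e1f (6 chars -> 12)
    }
}
```

It wins on input with long runs and loses on input without them, which is the whole general-purpose dilemma in miniature.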

What led me to investigate this was a thread that appeared here the other day about “illegal primes,” and the thought occurred to me after researching “executable primes.” I initially began wondering about the nuances of executable primes, and then began the thought process that led me to my “idea.”

To further pique anyone’s interest in the matter: the file could just as easily decompress itself, or there could be a static decompressor (like, for example, WinZip). Either would work.

Of course, when I say it would work “equally well” on any file type I am lying through my teeth. The compression method would work best on certain files, not certain file types. It also wouldn’t necessarily represent the best possible compression method (that is, the compressed file might not be as small as theoretically possible), but again, I am going to need to flesh out the algorithm more on paper and then implement it to see.

I am pretty upset about the whole cost of patenting things, and of patenting things that are more of a method than a device. Not that methods don’t get patented (pharmaceutical companies and the people who sell instruments to them are well versed in such affairs) but for something so terribly simple as an algorithm the cost is really prohibitive. So much for capitalism and the “start-up” guy. :stuck_out_tongue:

My earlier post is so badly written I am embarrassed. Anyway, here are some further badly written and disjointed thoughts.

We are dealing with information here and you need to understand the basics of information theory. A file (message, whatever) with no redundancy cannot be compressed. Compressing is just taking out the redundancy (I am referring to lossless compression).
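If you want to put a number on “redundancy,” the usual yardstick is Shannon entropy. A rough sketch that scores a file in bits per byte (treating bytes as independent is itself a simplification, and the file name is taken from the command line):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// A rough sketch of the information-theory bookkeeping: the Shannon entropy
// of a file's byte distribution. Treating the bytes as independent is a
// simplification (real files also have structure between bytes), but it
// already gives a bound that no byte-by-byte lossless coder can beat.
public class ByteEntropy {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Path.of(args[0]));   // file named on the command line
        long[] count = new long[256];
        for (byte b : data) {
            count[b & 0xFF]++;
        }
        double bitsPerByte = 0.0;
        for (long c : count) {
            if (c == 0) continue;
            double p = (double) c / data.length;
            bitsPerByte -= p * Math.log(p) / Math.log(2);     // -sum p log2 p
        }
        System.out.printf("%.3f bits/byte, so roughly %d bytes at best%n",
                bitsPerByte, Math.round(bitsPerByte * data.length / 8));
    }
}
```

A file that scores close to 8 bits/byte has essentially no byte-level redundancy left to remove, which is why files that are already compressed do not compress again.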

As has been said, a file of n bits can have 2^n values. Any compression method will turn some of those into smaller files but some into larger files.

I do not know the details of your idea, but it sounds to me not so much that you have an idea no one has had before, but rather that you have an idea which has been proven mathematically to be impossible.

Just seconding some earlier posts.

Please read the comp.compression FAQ pointed to earlier. It will make it abundantly clear that you are beyond perpetual motion territory. This is Math we are talking about. Proofs mean you can’t do it no matter how clever you are.

Note the standard rules of crackpottery:

  1. Don’t study the area in advance.
  2. Come up with a “brilliant but amazingly simple idea.”
  3. Don’t let the laws of Physics, Math or whatever deter you.
  4. Put down the “so-called experts” in the field who ignore you.

You’ve got at least 2 of these going for you.

I’ve been hearing of “compress any file” garbage since 1977. Sheesh.

>> an arithmetic compression agent, not a redundancy detector.

You really need to study the very basics of information theory and compression. No redundancy means no (lossless) compression is possible. All compression does is remove redundancy. There is no other way. You just cannot transmit n bits of information with fewer than n bits. No way. If you think you can do it you are overlooking something.

This is more of an academic point, but you don’t necessarily have to have redundancy in a file to compress it. Consider an algorithm which reduces the file that is the entire text of the Encyclopedia Britannica (current edition) to the string “#EBC.”

Every other file it prefixes with ‘#’ and leaves as is.

The compression rate for the Encyclopedia Britannica would be phenomenal, but for every other long file, pretty crappy.
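Spelled out as code, with a placeholder string standing in for the encyclopedia text, the whole scheme is this:

```java
// Sketch of the scheme above: one privileged file compresses to a four-char
// token, and every other file grows by one escape character. The placeholder
// string below stands in for the full encyclopedia text. (A file whose entire
// content is literally "EBC" would collide with the token; the scheme as
// described glosses over that corner case.)
public class BritannicaCodec {
    static final String BRITANNICA = "(imagine the entire text of the current edition here)";

    static String compress(String file) {
        if (file.equals(BRITANNICA)) {
            return "#EBC";
        }
        return "#" + file;              // everything else gets one character longer
    }

    static String decompress(String data) {
        if (data.equals("#EBC")) {
            return BRITANNICA;
        }
        return data.substring(1);       // strip the escape character
    }
}
```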

This illustrates the point made before that no algorithm could compress every file ideally, but it also illustrates another point that I think is more interesting – the compression algorithm itself can be thought of as possessing a certain amount of information. Very simple algorithms do not have a lot of this abstract “information”, but they also don’t do a stellar job overall. A program similar to the one above, but which was capable of recognizing every file of up to 1GB in size, would do an excellent job (it would just have to output a certain number corresponding to the file), but the compression/decompression algorithm would be huge. So we have a sort of Conservation of Information when it comes to compression. The more you save on compression, the more bloated your algorithm is going to be. If you look at it a certain way, you can’t really compress information, just hide it inside the algorithm’s data.

I was also always fascinated by the idea that you could represent the complete works of Shakespeare by two real numbers such that the ratio of those two numbers is an ASCII number corresponding to the works. Such a scheme would work miracles for any type of file, but execution of it in traditional “computing” is probably intractable.

Let me take a wild guess. You are proposing to represent the file exactly by crushing the data set down to a mathematical equation or expression, or set of equations, that when calculated or otherwise computed will regenerate the original data string.

An interesting proposition if this is the “general case”, but as JRootabega noted (welcome aboard BTW) this likely would take more computing horsepower than is practically available even if the idea was workable (which I have no clue about).

Well, I must admit that my first pass proved that my method served to expand a file, but that work showed room for improvement in other ways.

Consider it a mild mathematical diversion. :slight_smile:
(and: I was already reading that FAQ when I originated the thread; my initial method was already mentioned in there as being impossible :eek:).

Please go easy on me guys, I am only an “intellectual” in my spare time, and find these things “fun”, not a way of life. Sheesh! I even allowed that I was quite likely on the path to “perpetual motion” in the OP. [grumble damn den of wolves in here grumble]

>> you don’t necessarily have to have redundancy in a file to compress it

Wrong!

>> the compression algorithm itself can be thought of as possessing a certain amount of information

(1) That’s called cheating, and (2) it is quite inefficient, as you (a) have to transmit that information beforehand and (b) are limited in the messages you can transmit. It is the equivalent of the group of people who laughed whenever a guy said a number: when a newcomer asked what was going on, they said that since they all knew the same jokes, there was no need to tell them; they just numbered them and laughed at each joke as told by number. The fact is you still have to transmit the information beforehand.

I will repeat, there is no way to transmit n bits of information with less than n bits. No way. If you can compress it it is because you are taking out redundancy.

If you say “yes, but I only have a limited number of messages to send (say four), each one 1MB long, so I can just number them and send the number, hence I have sent 1MB contained in two bits”.

I will repeat: Wrong! The amount of information contained in those two bits is exactly two bits; they point at a 1MB file which was transmitted beforehand and took 1MB to transmit. Sorry, there is no way around it.
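In code, the numbered-messages scheme looks like this (the four messages are just dummy placeholders); the two-bit index means nothing without the table both sides already hold, and that shared table is where the megabytes actually live:

```java
import java.util.Arrays;

// The "numbered jokes" scheme in miniature: a two-bit index only means
// anything relative to a codebook both sides already share, and that shared
// codebook is where the information actually lives. The four messages here
// are dummy placeholders, distinguished only by their first byte.
public class Codebook {
    static final byte[][] SHARED = new byte[4][1_000_000];   // ~1 MB each, agreed on beforehand
    static {
        for (int i = 0; i < SHARED.length; i++) {
            SHARED[i][0] = (byte) i;
        }
    }

    static int compress(byte[] message) {
        for (int i = 0; i < SHARED.length; i++) {
            if (Arrays.equals(SHARED[i], message)) {
                return i;                                     // fits in two bits
            }
        }
        throw new IllegalArgumentException("not in the codebook, so it cannot be sent at all");
    }

    static byte[] decompress(int index) {
        return SHARED[index];
    }
}
```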

But that is, quite simply, demonstrably false, sailor.

Consider a self-executable prime number. The “compressed” file would simply be an index to that prime number. The index of a prime is at worst the same size (in binary) as the prime number itself, and the larger the prime the greater the compression. Huge compression could be achieved in that scenario, while if the number wasn’t prime the “decompression” method would simply do nothing.

In this particular scenario there is no need to have an indexed array of prime numbers; one would simply have a method for calculating primes and comparing them. And if the index of the prime is itself prime, the output can be run through as input again, so the process can work again.

Admittedly, what I explain here is not an optimal solution to compression, but I find it, personally, rather interesting. Neither the decompressor nor the compressor needs to actually know any primes; each only needs to be able to calculate them.
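Roughly, in code, the idea would look something like this (I’ve added a marker value so the decompressor can tell the two cases apart, which the description above glosses over, and these naive walk-the-primes routines would be hopelessly slow on anything file-sized):

```java
import java.math.BigInteger;

// Sketch of the prime-index idea: if the file, viewed as one big number,
// happens to be prime, store its index among the primes; otherwise store it
// untouched. The leading marker value (0 = raw, 1 = prime index) is an extra
// detail added here so decompression is unambiguous; isProbablePrime and
// nextProbablePrime are standard BigInteger methods, but walking the primes
// one by one like this is hopeless for real file-sized numbers.
public class PrimeIndexSketch {
    static BigInteger[] compress(BigInteger file) {
        if (!file.isProbablePrime(40)) {
            return new BigInteger[] { BigInteger.ZERO, file };   // "decompression does nothing"
        }
        BigInteger index = BigInteger.ONE;                        // 2 is prime #1
        for (BigInteger p = BigInteger.TWO; p.compareTo(file) < 0; p = p.nextProbablePrime()) {
            index = index.add(BigInteger.ONE);
        }
        return new BigInteger[] { BigInteger.ONE, index };
    }

    static BigInteger decompress(BigInteger[] data) {
        if (data[0].equals(BigInteger.ZERO)) {
            return data[1];
        }
        BigInteger p = BigInteger.TWO;
        for (BigInteger i = BigInteger.ONE; i.compareTo(data[1]) < 0; i = i.add(BigInteger.ONE)) {
            p = p.nextProbablePrime();
        }
        return p;
    }
}
```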

erislover, you really need to understand certain very basic concepts of information theory. What you are saying is that you can establish a biunivocal (one-to-one) relation between two sets of different sizes, and that is just impossible. The fact that whatever prime you choose has many digits does not mean that when I say to you “prime number 67” I have transmitted to you all that information.

To put it simply: if I have a message (file, whatever) of n bits where all 2^n combinations are possible, there is NO way to transmit that message in fewer than n bits. It has not been done and it has been proven that it cannot be done. The fact that the 2^n messages point to functions which yield many digits does not mean that much information is contained in the message, as the number of possible outcomes is still limited.

Suppose my message X can be one of four values (0-3) and the resultant function is 23^X. The result may have many digits, but it is still limited to four possibilities. The fact that 23^3 has many digits does not mean I can represent any file of that many digits, only that particular one, so it is of no use. The transmitted message only carries 2 bits of information.

Look, if you are interested in this, get a good book on information theory. I cannot condense into a couple of posts a couple of courses I took in college. After you have understood all the basics then we can discuss the basic problem, which is: can a totally random message of n bits be compressed into fewer than n bits? I think we will agree then that the answer is NO. That is the reason no one has been able to do it.

I know I’m doing a lousy job of trying to explain this. The point is this: just because your prime (or whatever number) has n digits does not mean you have sent that much information. Suppose your prime has 45 bits. Now, are all other 45-bit numbers valid messages? The answer is NO. This means there is redundancy, and that is why you can compress it the way you say (which would be a very bad way of doing it, BTW). The information contained is not measured by the number of bits in the message alone but also by the number of alternative valid messages.

If I say “forty fibe bits” you can guess I meant five, because “fibe” is not a valid word. This means there is redundancy and the possibility of compression. But if I say the thread number is 95765 you have no way of knowing if it is right or wrong, because even if I change a digit it is still a valid number. But suppose now that only numbers divisible by nine are valid. Now I have redundancy and the possibility of compression: the easiest way is to divide by nine:

original message: 50670 (5 digits)
“compressed” message: 5630 (4 digits)

When I transmit the compressed message I am not sending 5 digits of information but only 4. If all numbers with 5 digits were valid there would be no way to compress them into 4 digits. It is only because I am declaring many of them invalid that I can do that. The reconstructed message has 5 digits again, but one of them is redundant.
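In code, under that same assumption that only multiples of nine are valid messages, the whole thing is a divide and a multiply:

```java
// The divide-by-nine example above: under the stated assumption that only
// multiples of nine are valid messages, dividing squeezes the redundancy out
// and multiplying puts it back.
public class DivByNine {
    static long compress(long message) {
        if (message % 9 != 0) {
            throw new IllegalArgumentException("not a valid message");
        }
        return message / 9;       // 50670 -> 5630: five digits down to four
    }

    static long decompress(long compressed) {
        return compressed * 9;    // 5630 -> 50670: the redundant digit restored
    }
}
```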