239 (in decimal) is a prime number. It is the fifty-second prime number.
11101111 is 239 in binary.
110100 is 52 in binary.
Have I sent the same information? 239 is the “raw data” whereas 52 is the compressed data. Decompression involves asking “what is the fifty-second prime number?” We have “encoded” an 8-bit number into a six-bit number, quite literally.
As I mentioned, this alone would not compress all files (since, quite obviously, not all files are prime numbers). In fact, chances are pretty good it would compress none of them. But this does show that redundancy depends on how you look for it; it isn’t “built in” to whatever data you are looking at.
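Here is a rough sketch of that 239 ↔ 52 round trip, using sympy’s prime() and primepi() purely as an illustration; it only works when the input happens to be prime, so it is not a general compressor.

```python
# Sketch: "compress" a prime by storing its index, decompress by looking the index up.
# Only works for prime inputs; assumes sympy is available.
from sympy import prime, primepi

def encode(p):
    return int(primepi(p))   # index of the prime p: 239 -> 52

def decode(k):
    return int(prime(k))     # k-th prime: 52 -> 239

assert encode(239) == 52
assert decode(52) == 239
# 52 fits in 6 bits, 239 needs 8 -- but only primes can be encoded this way.
```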
sailor, we are in agreement here, but I think it would be clearer if you said that you cannot compress EVERY file of a certain length into a smaller file. Obviously some compression algorithms do actually work for certain files, which is where your redundancy point comes in. A compression algorithm that achieves really high compression on some files will not be able to reliably compress many files, and one that works well overall will generally not achieve very high compression rates.
But I also think you are misunderstanding me when I say that a compression algorithm itself contains information. I was only giving an extreme example with my Encyclopedia Britannica. (Although it most certainly is an example of a function which maps at least one very long string to a very short one. It may not use ‘redundancy’ to compress things, but it sure does compress that one long string.) What I meant was not that the compression algorithm contains specific examples of things it should compress, but that the very steps the algorithm takes are themselves a quantity of information. Or perhaps I should say work. More total work is probably done compressing a file, sending it, and decompressing it than just sending it, but the point is that the work gets done much more quickly, minimizing the information that passes through the bottleneck: internet transfer (or what have you). My point is that compression works, but it is not magic. You have to do the work somewhere, but if you play your cards right and use the same compression algorithm multiple times, you will have saved yourself time and space.
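A toy sketch of what I mean, with a stand-in string in place of the actual Britannica: the table baked into the program is where the information lives, and the “compressed” output is just a one-byte key.

```python
# Toy "compressor" whose built-in dictionary carries the information.
LONG_TEXT = "Pretend this is the full text of the Encyclopedia Britannica. " * 1000

def compress(s):
    # Maps exactly one very long string to a single byte; everything else passes through.
    return b"\x01" if s == LONG_TEXT else b"\x00" + s.encode()

def decompress(b):
    return LONG_TEXT if b == b"\x01" else b[1:].decode()

assert decompress(compress(LONG_TEXT)) == LONG_TEXT
assert decompress(compress("hello")) == "hello"
```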
>> We have “encoded” an 8-bit number into a six-bit number, quite literally
You really do not understand the issues involved. You cannot encode all 8-bit numbers into 6-bit numbers. You can only encode 1/4 of them. If you transmit the original 8 bits, then you have two redundant bits.
You have not transmitted that much information, because 238 is not a valid message, nor is 237, nor 236, etc. You have to show me you can encode all 256 possible 8-bit messages into shorter codewords, but you are encoding 52 possible messages into 52 different codewords. The amount of information carried by the message “239” corresponds to picking one of 52 possibilities, not one of 256.
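To make the counting concrete, here is a quick check; the truncation below is just an arbitrary 8-to-6-bit map, and any other such map collides in exactly the same way.

```python
# Pigeonhole check: any map from 8-bit values to 6-bit values must collide,
# so it cannot be decoded uniquely.
def encode6(x):
    return x & 0b111111          # keep only the low 6 bits (arbitrary choice)

images = {encode6(x) for x in range(256)}
print(len(images))               # at most 64 distinct outputs for 256 distinct inputs
```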
By your definition, if I say “Hamlet” I have just encoded an entire play into one word. You are playing fast and loose with definitions.
As I have said and I repeat and maintain: There is no way to compress n random bits into fewer than n bits. It has not been done and it has been proven by those who know that it cannot be done.
It would be helpful if you can show us how to encode 8 bits into 6 bits. Just as an example.
If you still think you can do it I wish you good luck and I leave you on your way to fame, glory and endless amounts of cash.
Personally, I think you will end in the company of the free-energy crowd, the water motor, perpetual motion, and other such “inventors” and “discoverers”.
The problem is expressing ‘all files’ as prime numbers.
What **sailor** is saying is that converting a given file to a unique prime number is going to grow your information space just as quickly as your compression algorithm is going to reduce it.
I think you may be working under the misapprehension that all files/executables are encoded as prime numbers, or are easily converted to a unique prime number.
The fact that some files (messages) are not valid is the basis of redundancy. If all possible values are equally valid, then there is no redundancy. If all 5-digit numbers are valid, then there is no redundancy and compression is not possible. If only multiples of 9 are valid, then you have redundancy and the possibility of compression.
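A quick sketch to put numbers on that multiples-of-9 case: only 10,000 of the 90,000 five-digit values are valid, so an index into the valid set needs fewer bits.

```python
import math

ALL_5_DIGIT = 99999 - 10000 + 1          # 90000 possible messages
MULT_OF_9   = 99999 // 9 - 9999 // 9     # 10000 valid messages

print(math.ceil(math.log2(ALL_5_DIGIT))) # 17 bits to cover any 5-digit number
print(math.ceil(math.log2(MULT_OF_9)))   # 14 bits if only multiples of 9 are valid

def encode(n):                            # n must be a 5-digit multiple of 9
    return n // 9 - 1112                  # smallest valid value, 10008, maps to 0

def decode(k):
    return (k + 1112) * 9

assert decode(encode(10008)) == 10008
assert decode(encode(99999)) == 99999
```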
The information contained in the message is not measured by how many bits the message happens to contain, but by which one of all the possible messages it picks out; the fewer the valid possibilities, the less information it carries.
Suppose I ask “is the price (a) 13654 or (b) 13645?” The answer, a or b, only carries 1 bit of information; I don’t care if you express it as “(a)” or as “13654”.
Look it is pointless to argue this as the examples posted so far are faulty. It is just playing fast and loose with definitions. If you are interested in the topic, get a good text and study it. I will believe you can compress n bits into fewer bits when I see it.
Well, in no way do I feel that all programs are prime numbers. I thought I made that clear enough.
This is quickly becoming a series of personal attacks. Could someone link me to what that famed omega number is? (if it is, in fact, a single number and not a type of number (like a Gödel number)).
Interesting… so it is a type of number and not an actual number (or so it seems from those two links).
Well, I just used my “method” on the following text file:
test TEST *test*
which is normally 128 bits long, but I got it down to 112. The method I used does not remove a single damn thing. It does not test for redundancy! It simply encodes numbers. Can anyone guide me in a direction which would give me a number I shouldn’t be able to “compress” without me giving up the algorithm?
>> This is quickly becoming a series of personal attacks
Well, I can’t find any. I do not feel anybody has attacked me and I hope nothing I have said is interpreted that way as it certainly is not meant that way although I do feel a bit frustrated as I feel I cannot explain this well. OTOH I guess I cannot condense into a few posts what took me a whole semester to learn.
I do not understand this fascination with primes (or with pi in another thread). Primes are useful in generating encryption keys for public key cryptography but once you have the keys all the rest is done without reference to any primes or pi or any other fancy number. In compression primes are no more useful than any other number. You just cannot use primes to do the impossible.
All this is becoming just generic talk. I’d like to see how you can compress n bits into fewer than n bits. I don’t care if you do it using primes, the rule of three or vaseline. I just want to see it done.
Try compressing all of the possible 128-bit files and checking the results. Take an average, then add the size of your program (including all DLLs, tables, etc. that it needs). If you believe the overhead of the program will distort these results, use a larger file size. Oh, and do be wary of non-scalable results such as prime factorisation, etc.
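Something along these lines, scaled down to 16-bit files so it actually finishes, with zlib standing in for the algorithm under test:

```python
import zlib

# Exhaustively "compress" every possible 16-bit file and compare average sizes.
total_in = total_out = 0
for n in range(2 ** 16):
    data = n.to_bytes(2, "big")            # one 16-bit "file"
    out = zlib.compress(data, 9)
    assert zlib.decompress(out) == data    # accuracy check
    total_in += len(data)
    total_out += len(out)

print(total_in / 2 ** 16, total_out / 2 ** 16)   # average output is not smaller
```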
Also try running the algorithm against typical files such as large PowerPoint presentations, large Word documents with graphics, MPEGs, MP3s, etc., and check the results. Check for both accuracy and reduction. If it works better than WinZip (in the same amount of time) for these files then you are probably onto a winner regardless of the general case.
That is not saying anything about the efficiency of the WinZip compression routine, merely giving you a common standard against which to compare yourself.
Er, I haven’t written the program yet; I wanted to get a working model up first. I did that one by hand. I’m working on running the output back through as input a second time to see how much more condensed it could get, because 128 --> 112 isn’t anything to get excited about. However, I should be able to get it down much farther on successive runs.
I understand the impossible nature of what I propose, but please understand I am not hoping to get a universal compression algorithm done. I understand, fully, that in some cases the output may not be smaller than the input, in which case one would simply get the message that this is the case.
Also, one tricky thing is that it won’t initially reduce to one file: it actually reduces to two. The next run becomes three files, and so on, where n uses yield n+1 files. Those, then, could be archived together in a straightforward fashion.
I will attempt to program it out this week but I’ve never worked with file reading/writing in programming yet. No better way to learn than to do it, however! After such a case I could certainly manage to run all possible 128 bit states.
>> Well, I just used my “method” on the following text file: test TEST *test* which is normally 128 bits long, but I got it down to 112. The method I used does not remove a single damn thing
Well, you just said you removed 16 bits. I do not know how to repeat this. There is no way in this world to compress all possible 128-bit messages into 112 bits. You can compress at most 1/65536 of them, and the rest are invalid. When you have more bits expressed than the amount of information carried by the message, this is called redundancy. If you have 2^112 valid messages and 2^128 possible messages, what you have is tons and tons of redundancy, and this is what allows compression.
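The arithmetic behind that 1/65536:

```python
# There are 2**16 times more 128-bit files than 112-bit names,
# so at most 1 in 65536 can get its own unique 112-bit encoding.
print(2 ** 128 // 2 ** 112)   # 65536
```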
Not to mention the example: ASCII text, English, one word repeated three times. No redundancy? <sigh>
erislover, you simply have no clue what redundancy means in this context or the very basics of data compression or how you measure information.
One other thing: your secrecy is just foolish. You have not invented anything, I can guarantee you that. But people who do invent such things go out of their way to have them published and reviewed by others.
Anyway, I give up as it seems I am unable to explain this in a way you can understand it. Maybe others can do better.
Some bad news for you, I’m afraid: you’ll have to figure in the overhead of the file segmentation as well then, which in Windows is not always a trivial task.
>> I will attempt to program it out this week but I’ve never worked with file reading/writing in programming yet. No better way to learn than to do it, however! After such a case I could certainly manage to run all possible 128 bit states
I’d like to see that too. 2^128 cases? Have you any idea how long that would take? If your computer can do 1000 per second it will take 1,000,000,… (well it’s 28 zeros) years to finish. I’m not waiting around.
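The arithmetic on that estimate, roughly:

```python
seconds = 2 ** 128 / 1000                 # 2**128 cases at 1000 per second
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.2e}")                     # ~1.1e+28 years, i.e. a 1 followed by 28 zeros
```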
sailor, if he is not doing any checking of the data, then the fact that the data has redundancy is irrelevant to him. There are of course algorithms that could do a lot better in this case than his does. I will await judgment till further data is provided for the two tests given above. Though I agree with sailor’s analysis of the situation in general: you cannot uniquely define 2^128 cases in fewer than 128 bits, and using knowledge contained within the algorithm to store information is either not going to scale well or is going to lose data, erislover. Hope you have fun though.