File compression: the general case

:o, I hadn’t actually done the maths. Sorry erislover, your best bet is to take a random sample of files from different users and environments, compress them, test them, then analyze the results. A general case will never be accepted without giving up the method, but an empirical case study will still sell a product.

Well, I had noticed the file segmentation thing (even small files I write by hand still show up as 1K in size), but once I actually scrape up a program for it I can create some larger files to work with.

I would mention that the test I gave above, though it is itself redundant, is not compressed by checking for redundancy. Sailor, you are welcome to give me any text or series of bits (preferably in hex for easier reading) and I will tell you how small the end file would be in bits (though I would have to use a .tar’er to get it all into one file, and I am not sure exactly how it accomplishes this).

If you would like to hear what I have to say about the method, then email me. What I can certainly say right now is that every file that is not (in hex):
787878787878…78 {EOF}
can lose 1 bit per byte. Every occurrence of 78h will cause that particular byte to remain the same size in its new form (though it will no longer be 78h).

I read that and it sounds crazy, but damn it if I am not looking right at the data in front of me. I wish I could see what it is that I am possibly doing wrong, because it is as clear as, well, clear things. A glass of pure water. :stuck_out_tongue:
This means that a worst case will yield no loss but no gain. It also means, the way the algorithm works, that after it is used (on a sufficiently large file) there will actually be redundancy to eliminate where there was none before. In fact, for every file larger than 255 bytes there will be guaranteed redundancy (of at least one occurrence). For every multiple of 255 bytes there will be guaranteed redundancy of {size}/255, rounded down to the nearest integer.

Fucking a, it sounds crazy. But I swear I am looking right at it.

The best compression one could get out of a single run is, then, 12.5%. I have a sneaking suspicion that the method can only be used once (ERLing a .ERL file seems to produce a file exactly as large as the original, at least in this case), but I’d need to do some number crunching to test that for certain.

Actually, with some number theory I may be able to show conclusively, without explicitly testing all cases, that all cases would compress to some value between 0% and 12.5% (the best case).

I am doing honest math here, no “pi magic” like I did in that encryption thread (damn good memory, sailor :)). I really do want to share it but not on a public message board.

Just to reiterate, there is no compression scheme which will compress all files. Another way of proving this is to ask what happens when you run your program several times in a row: if you had a general compressor, it’d eventually get every file down to a single bit. For instance, Hamlet compresses down to “1”, and King Lear is “1”, and Julius Caesar is “0”. Man, that Shakespeare guy sure could write, couldn’t he?
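To put numbers on the counting behind that (nothing aimed at any particular algorithm, just arithmetic): there are more files of exactly n bits than there are files of fewer than n bits, so any scheme that shrinks some of them must grow others.

```python
# Pigeonhole sketch: count the possible files at and below a given bit length.
n = 16
files_of_length_n = 2 ** n                        # 65,536 distinct 16-bit files
shorter_files = sum(2 ** k for k in range(n))     # lengths 0..15: 65,535 total
print(files_of_length_n, ">", shorter_files)      # at least two inputs must collide
```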

On the other hand, nobody wants to be able to compress all files. They just want to be able to compress the interesting ones… And depending on your definition of “interesting”, that can be very possible.

One word of warning, by the way: Don’t use text files for your tests. ASCII text only uses 7 bits out of each byte, so you can always chop an eighth of the file off. Another caution, from reading your response just now: If you’re using variable-sized bytes for your compression, you need to keep track of where one byte ends and the next begins.
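The 7-bit trick looks like this, just so we’re picturing the same thing; a minimal sketch that packs plain ASCII into a continuous 7-bit stream (the unpacker needs the original character count, or a terminator, to undo the padding):

```python
def pack_ascii(text: str) -> bytes:
    # Each ASCII character fits in 7 bits; concatenate them and repack as bytes.
    bits = "".join(format(ord(c), "07b") for c in text)
    bits += "0" * (-len(bits) % 8)                 # pad the tail to a whole byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

sample = "This is plain ASCII text."
print(len(sample), "bytes ->", len(pack_ascii(sample)), "bytes")   # 26 -> 23
```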

I can’t help but say that back in the late 1980’s, one of my coworkers found an ad in a computer magazine for a “revolutionary” compressor. Claimed that it could compress any file to less than 100 bytes, if you ran the output back through the compressor enough times.

This would, indeed, have been revolutionary, as it would have proven that there were no more than 2[sup]800[/sup] files possible, ever. This is, of course, just ultrafilter’s explanation run backwards.
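For anyone who wants to check that count, the arithmetic is quick:

```python
# Fewer than 100 bytes means fewer than 800 bits, so at most
# 2**0 + 2**1 + ... + 2**799 distinct compressed outputs could exist.
possible_outputs = sum(2 ** k for k in range(800))
print(possible_outputs == 2 ** 800 - 1)            # True: fewer than 2**800 files, total
```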

Anyway, about the 0x78… What you appear to be saying, in decimal, is that any number from 0 to 255 that isn’t 120 can be expressed as a number from 0 to 127. Actually, I guess 120 falls into the 0 to 127 range, too, so, all 8-bit numbers can be expressed as 7-bit numbers.

This would have tremendous implications for all of mathematics.

Here’s an exercise which should be fairly easy (well, easier than dealing with 128-bit-long samples): run all 256 different 8-bit values through your algorithm and see what happens. If any two results are identical (say, 163 compresses to the same output bits as 217) - and as sailor has already pointed out, something like this must happen - then you do not have a lossless algorithm. A decompressor faced with that output would pick the wrong original value 50% of the time.
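In code, the exercise amounts to something like this; `erl_compress` is just a stand-in name, since only you know the real mapping:

```python
def erl_compress(value: int) -> str:
    # Placeholder: identity encoding as 8 bits. Swap in the real algorithm here.
    return format(value, "08b")

def check_collisions() -> None:
    seen = {}
    for value in range(256):                       # every possible single byte
        out = erl_compress(value)
        if out in seen:
            print(f"collision: {seen[out]} and {value} both encode to {out!r}")
            print("=> cannot be losslessly decoded")
            return
        seen[out] = value
    print("no collisions among single bytes (necessary, but not sufficient)")

check_collisions()
```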

OK, after my post last night I reread that mystical sentence and said: no way, I’ve got to be doing something wrong.

And, sort of, I was. As it turns out, a little less than 40% of the numbers have an ERL representation that is just as long as their regular representation. No worries, though.

A C programmer at work here is going to build me a skeleton program for file read and file write so all I have to do is write the algorithm and insert it where appropriate. I expect, hopefully, to have a working copy up by the end of this week (Sunday). If I had a damn laptop it would be sooner, but alas I don’t, and when I go out of town I can’t do computer testing. I can work to refine the algorithm, though.

So, given normally distributed random data we could expect compression of just under or just over 5.75% when we add the two files’ bit lengths together. The catch is that these two files should have redundancy where even the original file did not, in which case my ERL files can be zipped (or otherwise normally compressed) and voilà! It also appears that the worst-case scenario will be no change in file size, or possibly a one-byte increase (this is in ERLing, not zipping afterwards, which should still reduce any ERL file).

Chronos, the figures I’ve got are in reference to the numbers representable by one byte; that is, 0-255. Though I did use a text file for testing, it is in no way required. Also, the bit length in my ERL files is constant. I know, I know, it sounds like I’m getting something for nothing, but I assure you I am not.

Dave, I don’t think I can ERL an ERL file without expanding it. This is a one-time operation, interestingly enough (or it seemed to be; for larger files I might be able to do it, but my intuition tells me otherwise). It can be run once with some compression of 0%-12.5% and then that can be zipped. Or should be able to be zipped, at any rate (I don’t have a clear grasp of how the ZIP algorithm works mathematically, so I can only test it and see). I have a ton of MP3 files which would serve as a perfect introductory test for possible lossiness; if they don’t seem lossy then I should be able to whip up a program to test the files byte-by-byte for perfect retention.
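Something like this ought to do for the byte-by-byte check (just a sketch; the file names are whatever comes back out of decompression):

```python
def files_identical(path_a: str, path_b: str) -> bool:
    # Compare two files byte-by-byte and report the first mismatch, if any.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    if a == b:
        return True
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            print(f"first difference at byte {offset}: {x:02X} vs {y:02X}")
            return False
    print(f"files differ only in length: {len(a)} vs {len(b)} bytes")
    return False

# usage: print(files_identical("original.mp3", "roundtrip.mp3"))
```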

Anyone who feels I am completely out of my skull should keep an eye out for updates, possibly this weekend. While it wouldn’t surprise me to find out I am wrong, it is definitely a fun diversion.

erislover wrote:

In the broadest sense, ZIP goes looking for redundancy. It replaces large, common bit-patterns with very short ones.

What I’d like to see you do when you’ve got the program working, though, is take a bunch of test files, run them through your algorithm and ZIP, and compare the size of the output to just ZIPping the original file.
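Something along these lines would settle it (zlib’s DEFLATE is the same engine ZIP uses); obviously I don’t have your ERL pass, so there’s just a stand-in where it would go:

```python
import zlib

def erl(data: bytes) -> bytes:
    # Stand-in for the (unpublished) ERL pre-processing step.
    return data

def compare(path: str) -> None:
    original = open(path, "rb").read()
    zipped = len(zlib.compress(original, 9))
    erl_then_zipped = len(zlib.compress(erl(original), 9))
    print(f"{path}: raw={len(original)}  zip={zipped}  erl+zip={erl_then_zipped}")

# usage: compare("testfile.bin")   # repeat over a pile of representative files
```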

Of course, you’ve now mentioned twice (or perhaps three times) that your algorithm creates redundancy. This does not bode well. As you pointed out, every file longer than 256 bytes will have some amount of redundancy in it (technically, any bit stream longer than two bits will, really). So the real issue is whether that redundancy can be eliminated efficiently, that is, re-encoded in fewer bits than the original. Adding redundancy is the opposite of what you want to be doing.

Um, MP3 should not tolerate bad bits very well. It’s audio that’s been compressed 10-to-1 already. Start with UNcompressed audio (.WAV) files, or uncompressed images (.BMP or .DIB under Windows).

Audio, by the way, is generally recognized to be difficult to compress without generating lots of loss. Since your algorithm can’t compete at all with MP3’s 90% reduction in original file size, I’d suggest starting with something else so as to not get too discouraged.

And don’t worry about whipping up any programs for checking. On Windows, Linux, and Solaris computers (at least), there are programs for doing just that, and they come as “standard equipment.” Windows’ program is FC (for File Compare), for example. It’ll tell you exactly which bytes are different between two files.

I, just so you know, have done a lot of tinkering with “brand-new” compression schemes for all sorts of data. Not a single one of them could compete with ZIP except under the most extreme conditions (an image compressor I wrote way back beat ZIP handily when the image was in, say, only two colors and consisted of big squares or rectangles - “real-world” images made it crap out, though, and ZIP was much better).

Oh, so it was you there too? I remembered the thread but I did not remember it was you. You don’t give up do you? :wink:

Not to discourage you from whatever experiments you want to carry out but I just have to say that you are most certainly wasting your time. You are not breaking any new ground, but rather you are muddling through an area that has been thoroughly investigated and which is covered in many books and college courses. To think you can get around the basic laws set in those works by ignoring them is plain foolish. If you are truly interested in the topic, the first thing to do is to study what others have done in that field before you.

Off-the-top-of-the-head ideas are like dandruff: small and flaky!

The world and the Net are full of people like you who claim all sorts of things like free energy, perpetual motion, etc., which contradict the basic principles of math, physics, chemistry, and other sciences. Those people have not bothered to study those sciences and then prove them wrong. They just claim things which are impossible, and when asked to explain them, their explanations are nonsensical.

Look, if you tell me you can express pi exactly with only 23 digits, or that you can build an over-unity generator, I do not need to analyze your invention to tell you that you are wrong. If I did, I would be wasting my time. Maybe I would discover the flaw in little time, or maybe it would take me longer. Maybe it would take me so long that I would never find it. So what? I know there is a flaw.

There are riddles about how you can fit 13 people in 12 rooms, one to a room. Many people have difficulty finding the flaw in the reasoning, or may be unable to find it at all. Still, they know damn well you cannot fit 13 people in 12 rooms. What you are proposing is the equivalent of fitting 8 people into 7 rooms, one to a room, and still having an empty room left over. Sorry, no matter how you explain it, it can’t be done.

Don Lancaster has spent an inordinate amount of time debunking sloppy science.

When a magician makes a young woman disappear into thin air, I do not know the trick but I do know there is a trick because I know you cannot make a woman disappear into thin air unless your name is Gary Condit.

**
They told him it was impossible;
They told him the job couldn’t be done,
He rolled up his sleeves and set himself to do it;
He tackled the job that couldn’t be done;
And, by golly, . . . . he couldn’t do it!**

Gary Condit jokes? :stuck_out_tongue:

I don’t know, guys, what I am seeing on paper is that it just might work. I have found that my ERLing will not reduce all files (as was to be expected) but should reduce normally distributed random files tolerably well.

If anything would make this method great, though, it is zipping the file afterwards. :shrug: I prolly won’t get my skeleton program until later in the week, so wish me luck on bit operations.

If ZIPping the processed file produces significant compression, that would mean your procedure was NOT doing the job as well as ZIP, wouldn’t it? Why not just ZIP it instead?

I await your further tests with much anticipation.

Why are you so aggravated, Sailor? He’s as much as admitted he’s just horsing around with some concepts he is trying to get a handle on. He is quite probably 100% wrong, but he’s not asking for love, money, or the key to your house. He is in the process of discovering why he is wrong, and that will lead to greater insights for him (IMO).

Everybody learns differently. Give him some slack and let him get to his own “Eureka!” about the issue.
[sub]Experience is what you get when you didn’t get what you wanted.[/sub]

erislover – If you’re doing this just for intellectual amusement, then go for it, sounds like fun.

However, I’m going to pile on with those doubting any commercial application. I’m assuming we’re talking about lossless compression, like PKZIP, not lossy like MPEG.

The bottom line, mathematically speaking, is that no compression program can compress all files (even if you don’t count the one-time overhead of sending the compression program to the receiver). Every compression algorithm will compress some set of files well, have another set of files it compresses not so well, and even a set of files it makes longer. Chronos made the point well – his imaginary compression program is just an extreme example of what every program does, which is to look for certain patterns and give those a code that’s smaller than the pattern. If the patterns it looks for don’t come up, then no compression at all happens.
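A toy run-length coder shows the trade-off in miniature; it shrinks the files full of the runs it looks for and doubles the files that have none:

```python
def rle_encode(data: bytes) -> bytes:
    # Each run of identical bytes becomes a (count, value) pair, count <= 255.
    out, i = bytearray(), 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

runs = b"\x00" * 1000                 # the pattern it looks for: 1000 -> 8 bytes
no_runs = bytes(range(256)) * 4       # no adjacent repeats: 1024 -> 2048 bytes
print(len(rle_encode(runs)), len(rle_encode(no_runs)))
```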

So really, the key in writing a good (i.e. useful) compression program is figuring out what kind of files it’s most likely to be working on, and optimizing the algorithm to work really well with most of those files (which means it will work poorly with other kinds of files).

Now, I have no experience or expertise in commercial file compression (just some amateur math and info theory and a fair amount of programming experience), but I suspect that PKZIP and other available programs already strike a pretty good balance for a wide range of computer files. I doubt that there are enough improvements to be made to PKZIP’s algorithm that would make it worthwhile abandoning the established format.

But again, writing the program does sound like a fun exercise, so don’t let our naysaying stop you from having fun.

Can I play? Is this the scheme?

  1. Read in a bit of the file.
  2. Append this to the buffer.
  3. Is the number made up of the bits in the buffer a prime?
  4. If so, output which prime number it is and clear the buffer.
  5. Repeat.

This would compress all files, correct? Since the index of a prime is always less than the prime itself. Sure, not every number is a prime, but if you keep reading in bits you’ll probably get to one eventually. I guess the “tricky bit” is that the lookup dictionary is a mathematical property that cannot be represented by a function.

Or am I a loony too? :slight_smile:

To make myself clear, when I say “which prime it is”, I mean something like 11 is the 5th prime.

Nevermind. After reading the comp.compression FAQ, I am apparently a loony:

FYI, GIF compression is non-lossy only if the original has 1…256 distinct colors, since each color is mapped into a color table. A typical corporate logo works well because it is often just a small number of solid colors, but most photographs don’t – blue sky is many shades of blue, and the color-table slots get “used up” pretty quickly.

If more than 256 colors are in the original, something has to give and some colors are dropped or shifted, making a lossless scheme into a lossy scheme.

That’s why a B&W image using 8-bit grayscale could be made into a lossless GIF image (but sometimes JPG makes a smaller file that looks just as good; it depends on the material).
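If you want to check a particular image, counting its distinct colors tells you whether GIF can hold it losslessly. Here’s a quick sketch using the Pillow imaging library (the file name is just an example):

```python
from PIL import Image

im = Image.open("photo.png").convert("RGB")        # example file name
distinct = len(set(im.getdata()))                  # count unique RGB triples
print(distinct, "distinct colors ->",
      "GIF can be lossless" if distinct <= 256 else "GIF will drop/shift colors")
```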

There’s no free lunch!

Not only is ZIP a pretty good algorithm, it is more than just a single one. It does some rudimentary testing of the source data, then decides which of several schemes in its bag of tricks to use.

It’s hard to top ZIP for most files. But don’t let that stop you from trying! :slight_smile:

sailor wrote:

As I just heard recently, I forget from whom, “you cannot think ‘outside the box’ until you know what’s inside the box.”

SmackFu wrote:

Besides the argument you found in the FAQ, even if you could use “magic bits” which take no space to separate the primes, there are two fundamental problems:

First, leading zeroes: 011, 0011, 00011, and 000…00011 all represent the prime 3, but they cannot all be encoded with the same bits. The decompressor won’t know how many zeroes to insert.

Second, when looking at a bit stream, any two bits in which the first bit is a 1 will be encoded as a prime: 10 = 2, and 11 = 3. You’d never, ever encode 101, since you’d first hit the 10 and encode it as a 2. You’d never encode a prime higher than 3.
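Here’s that greedy scan spelled out on an arbitrary bit string; notice the output is nothing but 2s and 3s, and a run of leading zeroes collapses into the very same 3 that a plain 11 produces:

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def greedy_prime_scan(bits: str):
    # Append one bit at a time; emit the buffer as soon as it reads as a prime.
    out, buf = [], ""
    for b in bits:
        buf += b
        if is_prime(int(buf, 2)):
            out.append(int(buf, 2))
            buf = ""
    return out, buf                                  # buf = leftover, undecodable tail

print(greedy_prime_scan("1011001110101101"))         # ([2, 3, 3, 2, 2, 3], '01')
```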

So, encode 10 as 0, 11 as 1, and alternate, between magic bits, encoding 1x bit pairs and lengths of runs of zero bits. 0 * 11 * 1 (where * is a magic bit) would represent the bit stream 1000011.

Now, the job is to figure out how to make the magic bits real without goofing up the encoding or adding too much overhead…
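For concreteness, here’s a little decoder for that notation, writing the magic bits as literal “*” separators, which is exactly the part you can’t actually have for free:

```python
def decode(tokens: str) -> str:
    # Tokens alternate between a 1x bit-pair code (0 -> 10, 1 -> 11)
    # and a binary count giving the length of a run of zero bits.
    out = []
    for i, tok in enumerate(tokens.split("*")):
        if i % 2 == 0:
            out.append("10" if tok == "0" else "11")
        else:
            out.append("0" * int(tok, 2))
    return "".join(out)

print(decode("0*11*1"))                              # -> 1000011
```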

Astro, I am not aggravated, and I am not even trying to dissuade him from horsing around. What I am trying to tell him is that if he is seriously interested in learning about this subject, he is going about it the wrong way. But if he is just interested in messing around and he knows that is what he is doing, then I am as ready as anyone to share and enjoy the experiment (although blowing up stuff in the yard is more fun). But let’s be realistic about it, OK? I am all for messing about and having fun, but let’s not pretend it’s serious learning (or “fighting ignorance,” as some people like to quote).

Musicat, yes, I am aware of the 256-color limit for GIF. I use Irfan as my graphics viewer (I love it), and it has an option to reduce the number of colors to whatever level you need so you can see what it will look like as a GIF.

Another interesting fact: GIF files carry their header information at the beginning, whereas ZIP files carry theirs at the end, so a file can be a GIF and a ZIP file at the same time. Check it out. Some other interesting sites I found:
http://www.sct.gu.edu.au/~anthony/info/misc/tgz_vs_zip
http://www.simtel.net/pub/msdos/gif/
http://www.geocities.com/SoHo/Cafe/2260/crypto.html

sailor, you *are* quite obviously aggravated, you *are* discouraging him by hinting that he is a crackpot (not just someone who wants to work out his own understanding of the issue), and you have expressed your side of the issue many times. We are all about as impressed as we are going to be with your contributions.

He wants to experiment with this.
You’ve told him it’s impossible.
Wish him luck and get over it.

This post will be deleted in:

3…
2…
1…