New Original Pure Math Question

This is a new problem, but I think it’ll take a lot of the same thought as the last one did. And, since we had so much fun with that one, it’s about time for some more!

The Lempel-Ziv-Welch (LZW) algorithm is used to compress a data string. Since it turns out to be so efficient for raster data, it’s the algorithm employed by the popular GIF image format. Without going into the details, it converts a sequence of numbers into another sequence of numbers. If the original numbers are all in the range 0-3, for instance, then the first number of the output sequence will be in the range 0-5, the second will be in the range 0-6, the next in the range 0-7, then 0-8, and so on, for as long as is needed.

GIF specifications take the compression one step further with something called variable-length codes. Basically, each number is allocated only the minimum number of bits it requires. For the first one, which is in the range 0-5 (000 to 101), three bits are needed, so we might refer to it as AAA. The same is true for the second and third numbers, which we’ll call BBB and CCC. For the fourth number, however, since its range is 0-8 (0000 to 1000), four bits are allocated, so it’s referred to as DDDD. This holds up through the eleventh number, KKKK, which has the range 0-15. The next number, LLLLL, has five binary digits, since its range is 0-16. I hope that’s clear, but I’m not sure about it. Ask me for a better explanation. Anyway, these are sequenced one after the other, each code starting at the least significant unused bit of the current byte, into the following sequence of 8-bit bytes:


CCBBBAAA
EEEDDDDC
GGGFFFFE
IIIHHHHG
KKKJJJJI
MMLLLLLK
NNNNNMMM

and so on. Now the question. Is there a better way? Something (Tim) said in the other Pure Math thread about wasted bits got me thinking: there’s no way that AAA = 110, for instance. So, while no individual bit is wasted, the fact that there are bit sequences that will never appear makes me think that there’s a more efficient way to turn the variable-length numbers into the final sequence of bits. Any thoughts?
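In case anyone wants to play with this, here’s a rough Python sketch of the packing scheme I described above (the function name and the way I wrote the width-growing rule are mine, not anything official from the GIF spec):

[code]
def pack_codes(codes, first_width=3, first_max=5):
    """Pack variable-width codes into bytes, starting at the least
    significant bit of each byte.  Code i has range 0..(first_max + i);
    the width grows by one bit whenever the range no longer fits."""
    out, acc, nbits = [], 0, 0
    width, top = first_width, first_max
    for code in codes:
        acc |= code << nbits        # drop this code in above the bits already waiting
        nbits += width
        while nbits >= 8:           # peel off any completed bytes
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
        top += 1                    # the next code's range is one bigger...
        if top >= (1 << width):     # ...so widen the field when it no longer fits
            width += 1
    if nbits:
        out.append(acc & 0xFF)      # flush the partial final byte
    return bytes(out)

# fourteen codes A through N, all zero, just to check the byte boundaries
print(len(pack_codes([0] * 14)))    # 7 bytes, matching the diagram above
[/code]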

Is there a better way to compress than LZ? Well, maybe, but there are plenty of corporations and graduate students spending lots of time trying to build a better compression mousetrap. I doubt just posing the problem will produce any instant insights.

Or maybe I’m just being cynical.

There are wasted bits … for example, you don’t need three bits for 0-5, you only need log2(6) ≈ 2.585 bits. If you use three, you’ve wasted 0.415 bits, or about 16% more than you needed.

Of course, since computers can’t store fractional bits directly, you have to round things somewhere. You don’t have to round to the nearest bit with every code, though; theoretically you could just round once at the end of the file, so you’d never waste more than 1 bit (of course files are kept to the nearest byte, so you’d actually waste a maximum of 7.99999… bits).

As an example, say you need to store 100 different AAA codes (each having a value from 0-5). If you use 3 bits each, you’ll use 100x3 = 300 bits. If you use log2(6) bits each, you’ll only use 256 bits. Here’s one way to do this:

  1. Since there are 6 choices for each AAA code, there are 6^100 different possibilities. Number each of these in an ordered list.

  2. Find your particular combination of 100 AAA codes in the list. Get the index of that entry.

  3. Convert that index to binary. Since there are only 6^100 entries, you only need ceil(log2(6^100)) = 256 bits to encode this number.

  4. Send the binary number - to decode, use it as the index in the list.

Of course, you don’t really need a list of all 6^100 combinations, but conceptually that’s what you would do.
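If you’d rather see the idea in code than as a conceptual list, here’s a minimal Python sketch (function names are just mine), treating the 100 codes as the digits of one big base-6 number; the resulting integer is exactly the index into that ordered list:

[code]
def encode_base6(codes):
    """Treat a list of values 0-5 as digits of one big base-6 integer (the 'index')."""
    index = 0
    for c in codes:
        index = index * 6 + c       # position of this combination in the ordered list
    return index

def decode_base6(index, n):
    """Invert encode_base6: peel the n base-6 digits back off."""
    out = []
    for _ in range(n):
        index, c = divmod(index, 6)
        out.append(c)
    return out[::-1]

codes = [3, 0, 5, 1] * 25                 # any 100 values, each 0-5
idx = encode_base6(codes)
print(idx.bit_length())                   # at most ceil(log2(6**100)) bits
assert decode_base6(idx, 100) == codes
[/code]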

Arjuna34

Oops, that should be 259 bits, not 256.

Arjuna34

If the question is “Is there any way to do this with fewer bits?” then of course Arjuna34’s answer is correct. The only way I can think of to better it is if some numbers are more likely than others. Then you could arrange for those numbers to be at the beginning of the index, thereby requiring fewer bits to encode. Of course, the disparity in likelihoods would have to be large enough to justify devoting memory to a list of those likelihoods.

If the question is “Is there a better way?” then Arjuna34’s answer may or may not be correct. After all, it requires much more calculation. With computers now approaching a gigabyte of active memory, saving a few kilobytes probably isn’t going to be worth what looks like several hundred clock cycles.

Well, the question is “Is there a way to do it with fewer bytes?” The problem with Arjuna34’s answer is that it uses a fixed-size code: every successive code is a number in the range 0-5. LZW codes aren’t like that, however; the range of the codes increases by one with each successive code, so their lengths vary. Read my OP again, and see if it’s still not clear. This problem, however, is circumvented with a slight alteration to Arjuna34’s answer: instead of the lookup table having 6[sup]100[/sup] entries, it’ll have 6×7×8×9×…×105.
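Concretely, something like this sketch (just my illustration, with made-up function names) is what I have in mind: the same trick as before, but the base grows by one with each successive code.

[code]
def encode_mixed_radix(codes, first_base=6):
    """Pack codes whose ranges grow by one each step into a single integer."""
    index, base = 0, first_base
    for c in codes:
        index = index * base + c    # digit i is in base (first_base + i)
        base += 1
    return index

def decode_mixed_radix(index, n, first_base=6):
    """Invert encode_mixed_radix, peeling digits off from the largest base down."""
    out = []
    for base in range(first_base + n - 1, first_base - 1, -1):
        index, c = divmod(index, base)
        out.append(c)
    return out[::-1]

codes = [5 + i for i in range(100)]         # each code at the top of its range
idx = encode_mixed_radix(codes)
print(idx.bit_length())                     # 552 bits in this worst case
assert decode_mixed_radix(idx, 100) == codes
[/code]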

The other problem is that a lookup table with 6[sup]100[/sup] entries is absolutely ridiculous. I’m not saying the decompression algorithm has to be completely simple, but I’m looking for something that’s doable before the Big Crunch.

And Bill H., I’m not actually looking to improve on LZW itself, and even if I were, I’m sure there are plenty of other avenues to explore. I was just wondering about this one.

Arjuna34 doesn’t say it explicitly, but he’s describing, in essence, arithmetic coding, which does allow you to use fractional numbers of bits. Here’s how it works:

The basic idea of arithmetic coding is that, at any point in time, you have a range of values representing the probability of having received all the symbols seen so far, and a guess at how probable each possible next symbol is. You look at the value of the next symbol and take the sub-range of probabilities assigned to that symbol. This gets multiplied into the range accumulated from the prior symbols, giving a new, narrower range of values. When you reach the end of the symbols, you can output any number in the range of values you currently have.

I’ll make it a little clearer with an example of arithmetic coding without any other kind of compression. Assume you can only receive the values 0, 1, 2, and 3 at each step and that you have static probabilities of 20%, 5%, 50% and 25%, respectively. You start out with the range [0,1), and the input stream is: 2, 0, 3, 3, 0, 1, 2, 2, 0, 3, 2, 2, 2, 2, 3, 2.

The four symbols are assigned probability ranges. 0 is the range 0% to 20%, 1 is the range 20% to 25%, 2 is the range 25% to 75%, and 3 is the range 75% to 100%. So, for the first symbol, 2, our range goes to [.25,.75). For the next symbol, 0, since its probability range is 0 to 20%, the lower value stays the same, while the upper value drops so that we keep only the bottom 20% of the current range, so our range goes to [.25,.35). Next, for 3, our lower value jumps up to the 75% point of the current range, and the upper value stays the same, so our range is [.325,.35). For the next values, the range goes to:

3: [.34375,.35)
0: [.34375,.345)
1: [.344,.3440625)

etc.

Now, if the input stream had stopped after the symbol 1, we would have outputted any convenient number in that range, say, .344. From the static probabilities, the decoder can see that .344 is between .25 and .75, so it knows the first symbol is 2. It then subtracts off the lower bound, .25, and divides by the probability, .5, to get a new fraction, .188, which is in the range between 0 and .2; i.e., the next symbol is 0. And so on, and so forth, until it has decoded the last symbol.
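Here’s a small Python sketch of that static example, using exact fractions to keep the arithmetic clean (and stopping after a fixed count of six symbols rather than a stop code, just to keep it short):

[code]
from fractions import Fraction as F

# cumulative probability ranges for symbols 0, 1, 2, 3 (the static model above)
CUM = {0: (F(0), F(1, 5)), 1: (F(1, 5), F(1, 4)),
       2: (F(1, 4), F(3, 4)), 3: (F(3, 4), F(1))}

def encode(symbols):
    lo, hi = F(0), F(1)
    for s in symbols:
        width = hi - lo
        lo, hi = lo + width * CUM[s][0], lo + width * CUM[s][1]
    return lo, hi                       # any number in [lo, hi) will do

def decode(x, count):
    out = []
    for _ in range(count):
        for s, (a, b) in CUM.items():
            if a <= x < b:
                out.append(s)
                x = (x - a) / (b - a)   # subtract the lower bound, divide by the probability
                break
    return out

lo, hi = encode([2, 0, 3, 3, 0, 1])
print(float(lo), float(hi))             # 0.344 0.3440625
print(decode(F(344, 1000), 6))          # [2, 0, 3, 3, 0, 1]
[/code]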

For adaptive arithmetic coding, instead of every symbol having a static probability, the probabilities are recalculated after every symbol is processed. In this instance, the probabilities of 0, 1, 2, and 3 would have been initialized to 25%. After the first 2, the probabilities would then be 40% for 2, and 20% for each of the others, and so forth. Note that this also works when the number of possible symbols changes at every step, as in LZW compression. Also, in any implementation, you’ll need a stop-code symbol, to tell the decoder when you’ve reached the last symbol.
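The bookkeeping for the adaptive version can be as simple as a table of counts from which you recompute the ranges; a tiny sketch of my own (not from any particular implementation):

[code]
from fractions import Fraction as F

counts = {0: 1, 1: 1, 2: 1, 3: 1}       # every symbol starts out equally likely (25%)

def ranges():
    """Recompute the cumulative ranges from the current counts."""
    total, lo, out = sum(counts.values()), F(0), {}
    for s in sorted(counts):
        hi = lo + F(counts[s], total)
        out[s] = (lo, hi)
        lo = hi
    return out

print(ranges()[2])      # (1/2, 3/4): a slice 25% wide
counts[2] += 1          # the model sees a 2...
print(ranges()[2])      # (2/5, 4/5): now 40% wide; the other symbols shrink to 20%
[/code]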

I’m glossing over a whole bunch of stuff, but if you want to have the details filled in, there are references for arithmetic coding around on the web that explain it better than I could. Also, note that there are a whole bunch of patents on arithmetic coding, mostly by IBM, so you couldn’t (legally) use it in a program without licensing all the rights from the respective patent holders.

I was really addressing the sub-question of whether there’s a better way to encode fractional-bit patterns, not whether there’s a better way to compress than LZW (which Punoqllads did address well). You don’t really need a 6^100 (or 105!, or whatever) lookup table, though, you just need to be able to compute an arbitrary entry as needed. This is easy to do with some big-number arithmetic routines (you’d need 552-bit arithmetic for 6x7x8x…x105 ≈ 9E165 “table entries”). In real life you’d probably break up the 552-bit chunk into sub-sections (by re-arranging the factors 6,7,8,…,105) until you had groups that were as close as possible to powers of two. For example, 6x7x8x…x105 factors into:
2^98 x 3^49 x 5^24 x 7^17 x 11^9 x 13^8 … x 103^1.
You can pull out the 98 2’s, which lets you store 98 bits exactly, leaving 454 bits for the “table”. You can play with combining the other factors to get as close as possible to powers of two, to further break it up.
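If anyone wants to double-check those figures, a few lines of Python will do it:

[code]
from math import prod

N = prod(range(6, 106))     # 6 x 7 x ... x 105
print(N.bit_length())       # 552 bits to hold an index in 0..N-1
twos = 0
while N % 2 == 0:           # count how many factors of 2 can be pulled out
    N //= 2
    twos += 1
print(twos)                 # 98
[/code]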

Anyway, this is only tangentially related to the OP’s question about compression. I’ve done some of this fiddling around with fractional bits in situations where I only HAD a fixed number of bits (for example, 32 bits, or 512 bits), and had to make a bunch of data fit. For me, it was worth a lot of fiddling to squeeze 34.32 bits into 31.98 bits, for example, because I only had 32 bits available! My programming’s often on an 8-bit processor, with limited memory, so I couldn’t use a 2^32 lookup table either :wink:

Arjuna34

Alright, so this isn’t as purely mathematical as I’d anticipated. I still think it’s interesting. But before I go on, I guess a disclaimer would be appropriate: the information posted here could conceivably be used to do something illegal, so don’t.

Now then, I have not read any articles on “arithmetic coding”, but I do have something to say. In my (admittedly small) experience with LZW-compressed raster data, taking the time to assign relative probabilities to the various entries wouldn’t be very productive. Most entries tend to show up once or twice, and no more. However, even with static probabilities, all of which are equal, it sounds like arithmetic coding is a good idea.

I do, however, have an alternative, which strikes me as being just as efficient, but much simpler. Basically, you only read in enough bits to establish whatever code it is you want. Say for instance you’re reading in code AAA. It can be anywhere from 000 to 101. Read in the two least significant bits. Say they’re 01. Well, the whole code could be either 001 or 101, so you have to read in the last bit to establish which it is. But say they’re 10. Then you know the code has to be 010, since 110 is out of range, and you could in fact go right on to the next code. So, you’ve coded a three-bit code with only two bits! This works just as well with variable-length codes as with fixed-length codes. If the codes are truly random, then in the long run, the length of the data stream should have the same expected value as with arithmetic coding, right?
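If I’m reading you right, the scheme looks something like this sketch (bits kept in a plain Python list, and the function names are just mine):

[code]
def write_code(bits, value, top):
    """Append 'value' (0..top) low-bit-first, dropping the high bit when it's implied."""
    width = top.bit_length()               # nominal width, e.g. 3 for top = 5
    for i in range(width - 1):
        bits.append((value >> i) & 1)
    low = value & ((1 << (width - 1)) - 1)
    if low | (1 << (width - 1)) <= top:    # high bit is ambiguous: must send it
        bits.append((value >> (width - 1)) & 1)

def read_code(bits, pos, top):
    """Read one code written by write_code; returns (value, new position)."""
    width = top.bit_length()
    value = 0
    for i in range(width - 1):
        value |= bits[pos + i] << i
    pos += width - 1
    if value | (1 << (width - 1)) <= top:  # high bit could be 0 or 1: read it
        value |= bits[pos] << (width - 1)
        pos += 1
    return value, pos

bits = []
for v in (2, 5, 0):                        # codes 010, 101, 000 in the 0-5 range
    write_code(bits, v, 5)
print(bits)                                # 2 bits for the 2, 3 each for the 5 and the 0
pos, decoded = 0, []
while pos < len(bits):
    v, pos = read_code(bits, pos, 5)
    decoded.append(v)
print(decoded)                             # [2, 5, 0]
[/code]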

You might also want to look into Huffman codes, which assign codes of varying lengths to symbols, depending on their frequency. For a Huffman code representing letters of the alphabet, for instance, the symbol for ‘e’ would be shorter than that for ‘q’, as contrasted to ASCII, where each character takes up the same amount of space. If I’m recalling my algorithms class correctly, they’re the optimal way to do so. It’s not quite relevant to what you’re talking about, but it’s close enough that I thought it might be of interest.
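For what it’s worth, here’s a bare-bones sketch of building such a code greedily (the letter frequencies are just rough figures for illustration):

[code]
import heapq
from itertools import count

def huffman(freqs):
    """Build a Huffman code (symbol -> bit string) from a {symbol: frequency} map."""
    tie = count()                             # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # merge the two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# frequent letters end up with short codes, rare ones with longer codes
print(huffman({"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "q": 0.1, "z": 0.07}))
[/code]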