How does WinZip work?

How do those zip files work? How do they make the program smaller? Surely all the information in it is needed?

There’s a whole field of mathematics that studies information compression. It’s easy to convey the meaning of data without actually writing that data out explicitly. If I want to say

00000000000000000000

I could just as well write:

20 zeroes.

This is why compression works best on data with a lot of repetition, like images.

WinZip uses a specific compression algorithm, but I forget which (Sorenson?), and it does the sort of thing illustrated above to the entire file.

WinZip uses Lempel-Ziv, with some extra optimizations such as not compressing files that don’t get smaller.

The basic idea of most lossless compression techniques is that the original file doesn’t contain the data represented in an optimal fashion - for instance, if you have a text file containing only 6 different characters, they are still represented as eight bit bytes. Even if you didn’t want to get clever, you could clearly represent the 6 characters as 3 bit quantities, and pack the file into a series of bytes to be read as a stream of 3 bit quantities with a translation table in front of it. If the file is big enough, scrunching the 8 bits down to 3 will make up for having to include a table for unpacking it.
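To make that concrete, here’s a rough Python sketch of the 3-bit packing idea; the alphabet, function names, and sample string are all made up for illustration.

```python
# Minimal sketch: pack a string drawn from a 6-symbol alphabet into 3-bit codes.
# The alphabet and helper names here are hypothetical, just to illustrate the idea.

def pack_3bit(text, alphabet="abcdef"):
    codes = {ch: i for i, ch in enumerate(alphabet)}   # the "translation table"
    bits = "".join(format(codes[ch], "03b") for ch in text)
    bits += "0" * (-len(bits) % 8)                     # pad out to whole bytes
    return alphabet, bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def unpack_3bit(alphabet, packed, n_symbols):
    bits = "".join(format(b, "08b") for b in packed)
    return "".join(alphabet[int(bits[i:i + 3], 2)] for i in range(0, 3 * n_symbols, 3))

text = "deadbeefcafe" * 20                             # 240 characters, 6 distinct symbols
table, packed = pack_3bit(text)
print(len(text), "bytes before,", len(packed), "bytes after (plus the small table)")
print(unpack_3bit(table, packed, len(text)) == text)   # True
```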

The techniques actually used are more sophisticated than that. This link gives a description of how Huffman coding and Lempel-Ziv encoding work:

http://www.data-compression.com/lossless.html

One thing to remember is that no lossless compression algorithm will be able to compress every input.

For example, a single byte can represent one of 256 values. Suppose you have a two byte sequence (256^2 = 65536 possible combinations). By compressing it to a single byte, you can only have 256 possible compressed sequences. No matter what your algorithm is, you will only be able to reproduce 256 of the original input sequences. Since this isn’t acceptable in a lossless algorithm, you need to add extra bytes telling the decompressor what to do.
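You can watch that counting argument play out with a toy “compressor” that just throws away the second byte; this is a made-up stand-in, not a real algorithm, but it shows why 65536 inputs can’t survive a trip through 256 outputs.

```python
# Pigeonhole sketch: any scheme that maps every 2-byte input to 1 byte must
# lose information. "first_byte_only" is a made-up stand-in compressor.

def first_byte_only(data: bytes) -> bytes:
    return data[:1]

seen = {}
collisions = 0
for a in range(256):
    for b in range(256):
        out = first_byte_only(bytes([a, b]))
        if out in seen:
            collisions += 1          # two different inputs, same compressed form
        else:
            seen[out] = (a, b)

print(len(seen), "distinct outputs for", 256 * 256, "inputs;", collisions, "collisions")
# 256 distinct outputs for 65536 inputs; 65280 collisions
```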

One simple algorithm is run-length compression. Suppose you have the sequence “AAAAABCCCC”. You can see that there’s an obvious pattern here. You could represent it with the smaller sequence “5A1B4C”, meaning “five As, one B, four Cs”. But this algorithm will not be able to compress “ABCDE” - “1A1B1C1D1E” is twice as big as the original.
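Here’s a rough Python sketch of run-length encoding along those lines; the exact output format (count, then symbol) is made up for illustration.

```python
# Toy run-length encoder/decoder. The count-then-symbol format is made up and
# would break on digits in the input; it's only meant to show the idea.

def rle_encode(s: str) -> str:
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                     # find the end of the current run
        out.append(f"{j - i}{s[i]}")   # run length followed by the symbol
        i = j
    return "".join(out)

def rle_decode(s: str) -> str:
    out = []
    count = ""
    for ch in s:
        if ch.isdigit():
            count += ch                # accumulate the run length
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

print(rle_encode("AAAAABCCCC"))   # 5A1B4C      -- shorter than the input
print(rle_encode("ABCDE"))        # 1A1B1C1D1E  -- twice as long as the input
print(rle_decode("5A1B4C"))       # AAAAABCCCC
```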

Any lossless algorithm will have the same problem; some outputs will be larger than the corresponding inputs.

The ZIP algorithm is very good at compressing some inputs (like text) and very bad at compressing others (like random data). Luckily, random data isn’t very common, so you’ll rarely run into files that can’t be zipped efficiently.

But try this: zip a file, then zip the resulting compressed file, then zip that resulting compressed file, and so on. Eventually the output file will be bigger than the input file; this is because ideally compressed data looks totally random, and every time you compress it, it gets closer to total randomness. A perfect compression algorithm will produce an output that has no discernible patterns, and therefore cannot be compressed any more.
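You can see this with Python’s zlib module (DEFLATE, the same compression family ZIP uses) standing in for WinZip; the sample data is made up.

```python
import zlib

# Rough illustration using zlib (DEFLATE) in place of WinZip: compress some
# repetitive text, then keep re-compressing the output.

data = b"A fat cat sat on the rat trap. " * 200
print("original:", len(data), "bytes")

for i in range(1, 5):
    data = zlib.compress(data)
    print(f"after pass {i}:", len(data), "bytes")

# Typical result: the first pass shrinks the data a lot; later passes tend to
# make it slightly bigger, because the output of pass 1 already looks close to
# random and has no patterns left to exploit.
```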

IIRC WinZip will use a number of techniques in compression depending on the nature of the data to be compressed. The basic algorithms I am aware of are:

Huffman coding, which works well on textual data in which some characters generally occur more frequently than others (‘e’ & ‘s’ are very common in English, ‘q’ much less so). In this scheme common characters are given very short codes and rare ones get longer codes.

Run Length Encoding (RLE) is a very simple scheme where repeated sequences are replaced by one copy of the sequence and a repeat count. This works well, for example, on simple graphics where there may be blocks of solid colour.

Lempel-Ziv, as already mentioned, assigns short codes to common patterns in the data. This is more sophisticated than RLE because the patterns can be distributed throughout the data.
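To illustrate the dictionary idea, here’s a rough sketch of LZW, one simple member of the Lempel-Ziv family. (The ZIP format’s DEFLATE actually combines LZ77 matching with Huffman coding, so this isn’t literally what WinZip does; the sample text is made up.)

```python
# Sketch of LZW: grow a dictionary of patterns seen so far and emit one code
# per longest known pattern. In a real implementation each code would be
# written with ~9-12 bits rather than as a Python int.

def lzw_encode(text: str):
    dictionary = {chr(i): i for i in range(256)}   # start with single characters
    next_code = 256
    current = ""
    output = []
    for ch in text:
        if current + ch in dictionary:
            current += ch                          # keep growing a known pattern
        else:
            output.append(dictionary[current])     # emit code for longest match
            dictionary[current + ch] = next_code   # remember the new, longer pattern
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

sample = "A fat cat sat on the rat trap. " * 4
codes = lzw_encode(sample)
print(len(codes), "codes for", len(sample), "characters")
```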

As also mentioned above there is a limit to how much compression can be achieved without losing information. This is based on a value known in Information Theory as the entropy of the data source. Do not confuse this with entropy in thermodynamics - a favorite trick of creationists.
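If you want to put a number on it, here’s a quick Python sketch of the Shannon entropy of a string, which gives a rough lower bound on bits per character for any lossless code that treats characters independently (the sample text is made up).

```python
import math
from collections import Counter

# Shannon entropy of the character distribution: roughly the fewest bits per
# character any code needs if it handles each character on its own.

def entropy_bits_per_char(text: str) -> float:
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "A fat cat sat on the rat trap."
print(f"{entropy_bits_per_char(sample):.2f} bits/char vs. 8 bits/char as plain bytes")
```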

A simple example of text file compression could look like:

A fat cat sat on the rat trap. (30 characters)

The compression program would spot the many occurrences of ‘at{space}’ and create a table which says “replace every occurrence of ‘at{space}’ with ‘*’”. The compressed file would then look like:

A f*c*s*on the r*trap. (22 characters, 27% compression before adding the table)

We could go further and say replace ‘the{space}’ with ‘#’ and replace ‘trap.’ with ‘@’. Then the compressed file would read:

A f*c*s*on #r*@ (15 characters, 50% compression before adding the table)

The unzip program would read the table and replace each symbol with the appropriate string.

The second step above would not help the overall compression because there is only one instance of each string, so the table entry for those strings would actually be larger than the character savings. Recurring patterns are required for compression.
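Here’s a toy Python version of that substitution scheme. The table, placeholder symbols, and function names are all made up for illustration, and it assumes the placeholder symbols never appear in the original text.

```python
# Toy dictionary-substitution "compressor" matching the example above.

table = {"at ": "*", "the ": "#", "trap.": "@"}

def substitute(text: str, table: dict) -> str:
    for pattern, symbol in table.items():
        text = text.replace(pattern, symbol)     # shrink: pattern -> symbol
    return text

def expand(text: str, table: dict) -> str:
    for pattern, symbol in table.items():
        text = text.replace(symbol, pattern)     # restore: symbol -> pattern
    return text

original = "A fat cat sat on the rat trap."
compressed = substitute(original, table)
print(compressed, len(compressed))               # A f*c*s*on #r*@ 15
print(expand(compressed, table) == original)     # True
```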

Wow, thanks guys. I thought it might work like Dr Jackson says, but obviously this is simplified, and most are much more complicated. Is ZIP the best compression format, or is it just more popular (like Windows)?

A huffman encoding algorithm determines the frequency of each byte. The most common bytes are translated into shorter sequences of bits, and the least common into longer sequences. As mentioned, this works well for text files. If a file involves eight bits for each letter, it will be smaller if it only uses two for each ‘e’ and twenty for each ‘q’. Of course, the mapping back to the original bytes also must be saved with the file.

[warning: computer babble below]

The best way to accomplish this is to construct a binary tree, with each letter at a leaf. The tree is built so that the bytes with the highest frequencies have the shortest paths from the root. Each byte is replaced by the path from the root to that byte’s location in the tree, with a left child represented by a ‘0’ and a right child represented by a ‘1’. In the compressed file, a representation of the tree is stored, so that the computer parses it one bit at a time, following the path in the tree until a leaf is reached. Really quite ingenious, IMHO.
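Here’s a minimal Python sketch of that tree construction. The names and data structure are just one way to do it, not how any particular zipper actually does it.

```python
import heapq
from collections import Counter

# Minimal Huffman coder: repeatedly merge the two least frequent subtrees,
# then read off each symbol's code as its path from the root
# (left = '0', right = '1').

def huffman_codes(text: str) -> dict:
    # heap entries: (frequency, tie_breaker, tree); a tree is either a single
    # character (leaf) or a (left, right) pair (internal node).
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))  # ...merged
        counter += 1

    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):
            walk(node[0], path + "0")         # left child
            walk(node[1], path + "1")         # right child
        else:
            codes[node] = path or "0"         # leaf (single-symbol edge case)
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("a fat cat sat on the rat trap")
for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(ch), code)                     # common characters get shorter codes
```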

Truly random data cannot be compressed, but note that computer programs are far from random. They contain instructions and data which can be very redundant. Redundancy is what allows compression.

Aside: Redundancy is also what makes cryptanalysis of encrypted data possible. For this reason, the first thing PGP and other encryption programs do is compress the original data and, this way, take out the redundancy. Most often, a file encrypted with PGP is smaller than the original.