With compression techniques like ZIP, I know that the program simply looks for repetition of words/phrases and then writes them to a table, representing each one with something much smaller such as “1” or “a2”.
Wouldn’t it be more efficient to have a dictionary built in to the ZIP program? Rather than writing the table to the file, the program would recognise that the encoded symbol is catalogued in its built-in table, saving space in the file.
Of course, you wouldn’t stop at words; you would add computer jargon as well. I know that when a program is ‘compiled’ it turns into “mumbo jumbo”, but there would be a lot of repetition in that mumbo jumbo as well. Why not also add that to a built-in table?
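For what it's worth, the zlib library (which implements the same DEFLATE method that ZIP uses) supports something close to this idea under the name "preset dictionary". Here is a minimal sketch using Python's `zlib` module. The word list `PRESET`, the sample `message`, and the helper names `deflate` and `inflate` are made up for illustration; a real built-in dictionary would be much larger.

```python
import zlib

# Hypothetical built-in dictionary: common words the program would ship with.
PRESET = (b"the and that have with this from they would there their what "
          b"about which when compression dictionary file program")

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress data, optionally priming the compressor with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress blob; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

message = b"they would have this program with that compression dictionary"
plain = deflate(message)            # table/back-references built from the message alone
primed = deflate(message, PRESET)   # can refer back into the preset dictionary

print(len(message), len(plain), len(primed))
print(inflate(primed, PRESET) == message)
```

The gain shows up mostly on short inputs whose words are already in the dictionary; on a long file the compressor finds the repetition by itself soon enough.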
Doesn’t sound like you’re going to do any better than Huffman encoding (which is the only type I know, and is at least similar to what Zip uses). Basically, it does something similar to what you describe, but with bytes rather than words being what is shortened. So if the byte 10001110 is the most common byte, it might end up as 01 instead. The downside is that the least common bytes get longer codes, which can exceed the original 8 bits. This works well for text files, where certain letters are far more common than others.
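To make the byte-level idea concrete, here is a rough sketch of building a Huffman code table in Python. It is a textbook-style illustration, not what any particular Zip implementation does; the function name `huffman_codes` and the sample `text` are invented for the example.

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict[int, str]:
    """Return a prefix-free bit string for each byte value that occurs in data."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct byte still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {byte: code_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {b: ""}) for i, (b, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {b: "0" + c for b, c in left.items()}
        merged.update({b: "1" + c for b, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = b"this sentence has far more e's and spaces than z's"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[b]) for b in text)

print(encoded_bits, "bits instead of", 8 * len(text))
print("space:", codes[ord(" ")], " z:", codes[ord("z")])
```

Frequent bytes such as the space end up with short codes and rare ones such as `z` with longer codes. The bit count printed here ignores the cost of storing the code table itself.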
Shortening whole words to abbreviations that could be looked up in a dictionary seems like it would be very processor intensive, and could lead to problems with non-text files. It would also randomize the text that goes through the Huffman encoding, making it less effective. Since Huffman encoding can get to about 50% compression with text files, you really don’t want to ruin its effectiveness by making various alphanumerics just as common as the letter e.
The advantage of having the dictionary stored in the file and not with the zip program is that it is adaptive. The zip program will compute the ‘best’ dictionary for the file it is compressing. The dictionary for a plain text file in English will be different than one for a German text file, and both dictionaries would be very different from one for an audio file.
Algorithms used by Zip are not optimal for the special case of pure text files in a known language. Huffman coding, for example, usually works on single-character symbols. I have seen worked examples (sorry, can’t provide a reference) where the symbols are whole words or even phrases, which produces significantly better theoretical compression ratios.
On the downside, sender and receiver would need identical dictionaries, different languages would need separate dictionaries, and in any case most files are not pure text.
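The identical-dictionary requirement is easy to see with zlib's preset dictionary support. In this small sketch the two dictionaries differ by a single byte, and the receiver with the slightly different copy cannot decode the data at all. The names `SENDER_DICT`, `RECEIVER_DICT` and `try_inflate` are hypothetical, chosen only for this example.

```python
import zlib

SENDER_DICT = b"compression dictionary file program table symbol"
RECEIVER_DICT = b"compression dictionary file program table symbols"  # one extra byte

def try_inflate(blob: bytes, zdict: bytes):
    """Return the decompressed bytes, or None if zlib rejects the dictionary."""
    try:
        decomp = zlib.decompressobj(zdict=zdict)
        return decomp.decompress(blob) + decomp.flush()
    except zlib.error:
        return None

original = b"the program writes each symbol to a table in the file"
comp = zlib.compressobj(zdict=SENDER_DICT)
blob = comp.compress(original) + comp.flush()

same = try_inflate(blob, SENDER_DICT)         # receiver has the identical dictionary
different = try_inflate(blob, RECEIVER_DICT)  # receiver's copy is slightly off

print(same == original)
print(different)
```

zlib stores a checksum of the dictionary in the stream header, so a mismatch is detected up front rather than silently producing garbage.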
That isn’t quite true about ZIP encoding; it is near optimal for ASCII text. You have to realize that the computer doesn’t know the data is text. It just sees zeroes and ones, and there are many more ways to compress data than text-based approaches.