Why do some file types compress so much more than others?

In the case of spreadsheets, a huge amount of the size comes from format and function information. How much overhead there is depends on the exact formats, functions, and type of file.

I’ve just run a little test:
On my Desktop I had several data-export flat files (values separated by tabs). I imported one of them into Excel and saved it a few times in different formats.
Original size, txt: 4.75 KB
.xls: 53.0 KB
.xlsx: 14.9 KB

Now add the autofilter (don’t actually filter anything):
.xls: 56.5 KB
.xlsx: 15.0 KB

The differences in size are linked to format information, which the txt doesn’t bother to include and the Excel files don’t really need, and to differences in how data is stored in xls versus xlsx. I remember one instance when a coworker sent a large Excel file with multiple colors, conditional formatting, filters… which took half an hour to download, when he only wanted us to see three lines. The rest of the team let me tear into him up one side and down the other for sending that monster instead of just the relevant data.
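Part of the size difference between .xls and .xlsx, by the way, is the containers themselves: an .xlsx file is really a ZIP archive of XML parts, while the old .xls format is a binary container with a lot of fixed overhead. If you’re curious, here’s a quick Python sketch that lists what’s inside an .xlsx (the filename is just a placeholder for whatever workbook you have handy, not one of the files above):

```python
import zipfile

# An .xlsx file is a ZIP archive of XML parts; listing its members shows
# where the bytes beyond the raw cell values actually go.
# "export.xlsx" is a placeholder name, not a file from the test above.
with zipfile.ZipFile("export.xlsx") as xlsx:
    for info in xlsx.infolist():
        print(f"{info.filename:40s} {info.file_size:>8d} -> {info.compress_size:>8d} bytes")
```

The sheet data, styles, shared strings and so on each live in their own XML part, which is why even a tiny table picks up several kilobytes of packaging.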

You are right, and I admit my post was not at all clear. I had in mind tab- and comma-separated data files like your original one. I was suggesting that rather than zip and unzip files like that manually, you may as well just use (if possible) a disk format that compresses every block of data at the time it is written. That will not be relevant to most users, of course, only to those who have ‘big data’ stored in some of those formats.

That only yields benefits if each of those algorithms is not individually capable of achieving maximum compression. There will always be a point at which attempting to compress a file further actually results in a bigger output (that is, the content can be compressed no further, and the structure necessary to describe the compression adds something to the size).

Were it not so, it would be possible to repeatedly compress the output of a compression algorithm until the output was one byte - but that would mean that essentially there are only 256 different uncompressed files in the world, which is of course nonsense.
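You can watch this happen with a few lines of Python: feed zlib its own output a few times and the gains dry up almost immediately. The exact sizes will depend on the input, but the shape of the curve won’t.

```python
import zlib

# Start with something highly redundant, then keep feeding the compressor
# its own output. The first pass shrinks it dramatically; later passes
# quickly stop helping and can even grow the data, because the compressed
# bytes look nearly random and every pass adds its own header overhead.
data = b"blah " * 200_000          # ~1 MB of very repetitive input
for pass_number in range(1, 6):
    data = zlib.compress(data, 9)
    print(f"after pass {pass_number}: {len(data):>9,} bytes")
```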

This is a wonderful example of a reductio ad absurdum argument.

What jharvey963 means is that for a file f with known structure (like the sparse matrices they were manipulating), first applying a particular transformation A and then a general-purpose compression algorithm B can yield a file B(A(f)) that is much smaller than B(f). This is not a general-purpose technique.
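Here is a toy Python version of the idea, with a made-up sparse matrix standing in for their data. Even though the general-purpose compressor already handles the runs of zeros fairly well, the transformed version comes out a few times smaller here; with a transformation tailored to the real data, as in their case, the gap can be far larger.

```python
import random
import zlib

# A hypothetical sparse 1000 x 1000 matrix: almost all zeros, ~1000 non-zero cells.
random.seed(0)
dense_rows = [[0.0] * 1000 for _ in range(1000)]
for _ in range(1000):
    dense_rows[random.randrange(1000)][random.randrange(1000)] = random.random()

# f: the naive serialization, every cell written out as text.
dense_text = "\n".join("\t".join(repr(v) for v in row) for row in dense_rows).encode()

# A(f): a domain-specific transformation, storing only (row, col, value) triples.
sparse_text = "\n".join(
    f"{r}\t{c}\t{v!r}"
    for r, row in enumerate(dense_rows)
    for c, v in enumerate(row)
    if v != 0.0
).encode()

# B(.): a general-purpose compressor applied to both representations.
print("B(f):   ", len(zlib.compress(dense_text, 9)), "bytes")
print("B(A(f)):", len(zlib.compress(sparse_text, 9)), "bytes")
```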

Fun with compression:

I made a highly compressible file: 1,000 pages of the word “blah” (which ended up being 828,000 “blahs”) in MS Word, and saved it from there.

Saved as a DOC: 13,056,000 bytes
Saved as a RTF: 47,083,520 bytes
Saved as a TXT: 4,140,002 bytes

Total for the 3 files: 64,279,522 bytes

I then saved the three files bundled into a ZIP and into a RAR

Inside the original RAR

DOC is 427,842 bytes, for a ratio of 30.5:1
RTF is 28,288 bytes, for a ratio of 1,664:1
TXT is 2,094 bytes, for a ratio of 1,977:1

Inside the original ZIP

DOC is 2,509,997 bytes, for a ratio of 5.2:1
RTF is 197,034 bytes, for a ratio of 239:1
TXT is 6,108 bytes, for a ratio of 678:1

RAR size: 458,413 bytes, for a total ratio of 140:1
ZIP size: 2,713,599 bytes, for a total ratio of 23.7:1

I then RARed the ZIP and zipped the RAR

RARed ZIP: 316,686 bytes, 8.57 times smaller than the original ZIP
Zipped RAR: 238,972 bytes, 1.92 times smaller than the original RAR

RARed ZIP total ratio of 203:1
Zipped RAR total ratio of 269:1
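For anyone who wants to reproduce the flavor of that last step without WinZip and WinRAR, here is a rough Python equivalent using two different general-purpose compressors. The second pass can still squeeze the first pass’s output, because the first compressor’s token stream is itself repetitive for input this redundant; the exact numbers will of course differ from the ZIP/RAR ones above.

```python
import lzma
import zlib

# Highly redundant input, in the spirit of the "blah" document above.
original = b"blah " * 1_000_000

first_pass = zlib.compress(original, 9)   # deflate, the algorithm inside ZIP
second_pass = lzma.compress(first_pass)   # a different, stronger compressor standing in for RAR

print("original:             ", len(original), "bytes")
print("deflate only:         ", len(first_pass), "bytes")
print("lzma over deflate:    ", len(second_pass), "bytes")
print("lzma on the original: ", len(lzma.compress(original)), "bytes")
```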

That’s nothing! Please download this ZIP file. Try unzipping it. :slight_smile: (It’s not the file which expands into 45 PB, in case you are worried.)

But that’s not what lossy compression means. With JPEG, every time you open the image in an editor and resave it, you lose more detail. With GIF you can open it, resave it without changing anything, and still have the same image. There’s no “loss”, because the image never had more than 256 colors to begin with.

That is true only if the image has 256 colors or fewer to begin with. Saving a photographic image as a GIF is a lossy process.

(I come to this perspective as someone who had to choose between saving images as RLE Bitmaps and GIFs on a sub-200 MB HD before JPEG existed.)
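If anyone wants to see both effects without hunting for a photo, here is a quick sketch using Python’s Pillow library, with a synthetic gradient standing in for a photographic image. Repeated JPEG saves keep nudging the pixels, forcing the many-colored image into a 256-color palette changes it once, and a round trip of an already-paletted image through GIF changes nothing:

```python
from io import BytesIO

from PIL import Image, ImageChops

def roundtrip(im, fmt, **opts):
    """Save the image to an in-memory file in the given format and reload it."""
    buf = BytesIO()
    im.save(buf, fmt, **opts)
    return Image.open(buf).convert("RGB")

# A smooth gradient stands in for a photo: tens of thousands of distinct colors.
photo = Image.new("RGB", (256, 256))
photo.putdata([(x, y, (x + y) // 2) for y in range(256) for x in range(256)])

# JPEG generation loss: every re-encode throws away a little more.
generation = photo
for _ in range(10):
    generation = roundtrip(generation, "JPEG", quality=75)
print("10 JPEG generations changed the pixels:",
      ImageChops.difference(photo, generation).getbbox() is not None)

# Forcing the same image into a 256-color palette (what GIF requires) is lossy too.
paletted = photo.quantize(colors=256)
print("256-color quantisation changed the pixels:",
      ImageChops.difference(photo, paletted.convert("RGB")).getbbox() is not None)

# But an image that already fits in a palette survives a GIF round trip untouched,
# which is why re-saving a GIF does not keep degrading the way a JPEG does.
print("GIF round trip of the paletted image is lossless:",
      ImageChops.difference(paletted.convert("RGB"),
                            roundtrip(paletted, "GIF")).getbbox() is None)
```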

To add to this, GIF was not designed to store photographic images; JPEG was. Conversely, if you start with something like a black-and-white bitmap image, say of a page of a book, converting it to JPEG will result in annoying ringing artifacts (due to the way JPEG deals with high-frequency components) and should be avoided.
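In the same vein, here is a Python/Pillow sketch with a crude stand-in for a scanned page of text. The original has exactly two gray levels; the decoded JPEG comes back with many more, which is the ringing halo you see around sharp edges:

```python
from io import BytesIO

from PIL import Image, ImageDraw

# A black-and-white "page": crisp black strokes on a white background.
page = Image.new("L", (256, 256), 255)
draw = ImageDraw.Draw(page)
for x in range(10, 250, 20):
    draw.line([(x, 10), (x, 245)], fill=0, width=2)

buf = BytesIO()
page.save(buf, "JPEG", quality=75)
decoded = Image.open(buf).convert("L")

# The original uses only two gray levels; the JPEG version picks up many
# intermediate values around the edges: the visible ringing artifacts.
print("gray levels before:", len(set(page.getdata())))
print("gray levels after: ", len(set(decoded.getdata())))
```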

It may not have been designed specifically to store photos, but storing photos was indeed one of its intended uses. Its creators at CompuServe designed GIF to be a general-purpose, device-independent method for encoding both “pictures and drawings” (as the specification puts it). CompuServe distributed photos as GIFs, and encouraged its users to share GIF-encoded photos of their own. The format was limited to 256 colours not because the designers didn’t have photos in mind, but because none of CompuServe’s customers had any hardware capable of displaying more than 8 bits of colour. The first standardized consumer graphics card to support 8-bit colour, IBM’s VGA, was released in 1987, the same year that the GIF format was designed, and it was years before VGA (and its clones) saw widespread adoption. 16-bit colour displays did not become the norm until the mid-1990s. Until then, for the vast majority of home users, photos encoded as GIFs looked no worse than photos encoded as JPEGs. (Though for much of that time, JPEGs were rare or nonexistent—the format was designed about five years after GIF, and didn’t have the backing of a major online service provider. Some JPEG images and viewing programs circulated in the BBS scene, but the format didn’t really take off until graphical Web browsers became popular.)

Absolutely right. This is not a general-purpose technique. This method worked very well in our case, but only because we had intimate knowledge of the format of the data we had to compress. Our (OK, no false humility here, “my”) method gave lossless compression of between 20:1 and 40:1. Our customers were very happy to have their data load times reduced by a factor of between 20 and 40. Waiting 60 seconds for a load is very doable, whereas waiting 40 minutes was not.

J.