Talking about WinZip and the like. With files like Excel, you can typically reduce the size by about 80%. With picture and video types, barely at all.
What’s different about these file types that produces this disparity?
The pictures and videos are already compressed.
Data compression works by finding redundant information and replacing it with a marker of some sort. If your spreadsheet contains 6000 cells which all have the word “PENDING,” for example, that word can be replaced by the number 42, and a chunk of data prepended to the file that says “when you see a 42, it means ‘PENDING.’” Hooray, you have done compression.
That’s lossless compression, because when you reverse the process you get 100% of the original data back.
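Here’s a toy sketch of that substitution idea in Python (the cell values and the 42 token are made up for illustration; real compressors build these dictionaries automatically):

```python
import json

# Toy dictionary-substitution "compressor" for the PENDING example above.
cells = ["PENDING"] * 6000 + ["DONE", "PENDING"]

dictionary = {"42": "PENDING"}          # the chunk prepended to the file
compressed = [42 if c == "PENDING" else c for c in cells]

# Reversing the process recovers every original cell, so nothing is lost.
restored = [dictionary["42"] if c == 42 else c for c in compressed]
assert restored == cells

print(len(json.dumps(cells)), "->", len(json.dumps([dictionary, compressed])))
```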
Pictures, video, and audio often use lossy compression. That works by eliminating data which is redundant in a way that minimizes the human-noticeable differences between the original and the compressed version. Most raster image formats (JPEG, GIF, PNG) use lossy compression. Others are designed for raw image data with no compression. Once you’ve eliminated a lot of redundant data with lossy compression, running it through WinZip is not going to make it any smaller.
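You can see the “doesn’t get any smaller the second time” effect directly with Python’s zlib (a stand-in here for what WinZip does internally):

```python
import zlib

# Highly repetitive data compresses dramatically the first time...
data = b"PENDING " * 100_000
once = zlib.compress(data)

# ...but compressing the already-compressed output gains essentially nothing,
# because the first pass already squeezed out the redundancy.
twice = zlib.compress(once)

print(len(data), "->", len(once), "->", len(twice))
```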
Thanks.
But that raises the question of why spreadsheet-type programs don’t pre-compress their files in the same manner.
Ninjaed, but different files have different information content. A file full of random numbers cannot be compressed at all; a file consisting of the letter ‘a’ repeated many times can obviously be compressed a lot; and most files fall somewhere between these two extremes.
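A quick demonstration of those two extremes, using Python’s zlib:

```python
import os
import zlib

random_bytes = os.urandom(1_000_000)   # no redundancy to find
repeated_a = b"a" * 1_000_000          # almost nothing but redundancy

print(len(zlib.compress(random_bytes)))  # a bit MORE than 1,000,000 bytes
print(len(zlib.compress(repeated_a)))    # roughly a thousand bytes
```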
No need for that. It is better to keep those kinds of files (mostly) human-readable and, to save disk space, let the operating system compress and decompress the data on the fly as it is written and read.
I suspect that in this case, the reason is that most spreadsheets are so small that it’s not worth the effort to compress them. Adding compression support would take programmer time, as well as additional time in each release to validate the compression. Also, the extra code that performs the compression is new code in which bugs can lurk. Changing the file format can be especially dangerous, as bugs in reading files can be a great way to spread a virus.
Yeah. If they weren’t compressed, video and audio files would be yuge. Files that are mostly made up of text and formatting codes (word processing docs, spreadsheets, etc.) are usually relatively small, and if they aren’t you can choose to compress them yourself if you need to. If you don’t need to, you might as well keep the process of saving and opening such files as simple and quick as possible.
Some spreadsheet file formats are always compressed: the Open Document formats are ZIP files containing other files in a standardized directory structure. Those inner files are XML documents (so, text with complicated rules, basically), and the ZIP file format compresses its contents.
ZIP doesn’t give the highest compression ratio (that is, the ratio of original file size to compressed file size, such that larger is better), but it’s fast and very well-known and well-supported, so it’s easy to deal with ZIP files.
Anyway, Open Document-format files, being compressed by nature, likely won’t compress well a second time.
The current MS Office formats with extensions ending in x, like .docx and .xlsx, are already compressed. However, there is a balance between speed and efficiency: the more thoroughly one looks for repeated data, the more time it takes to compress and decompress the file.
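You can check this yourself: .docx/.xlsx and the OpenDocument formats are plain ZIP archives, so Python’s zipfile module will list the XML parts inside and how much each one was compressed (the file name here is hypothetical):

```python
import zipfile

# Works on .docx, .xlsx, .ods, .odt, etc. -- they are all ZIP containers.
with zipfile.ZipFile("example.xlsx") as z:
    for info in z.infolist():
        print(f"{info.filename}: {info.file_size} -> {info.compress_size} bytes")
```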
As others have said, pretty much every image or video today is in some form of compressed format. RAW video files are huge, as are RAW images. There has been a bit of a shift in this arena. It used to take extra software or equipment to perform the compression, and doing it on the fly at run time was expensive. Then, with smartphones becoming the go-to camera, they ended up putting those functions onto silicon stacked onto the back of the sensor.
It’s pretty nuts when you look at what a “sensor module” can do now: various image and video formats, lens compensation, compression, etc. Crazy.
Not GIFs
GIFs are lossily compressed–by reducing the number of colors or grey scales to 256.
Wait…aren’t PNGs lossless? And they support 24-bit RGB files, so they’re not limited to a 256 color palette.
Also, about video compression, one example–one movie I have on my HD right now is 1h 46m and 1250x720p. Using h.265 compression, the file size of the video stream (meaning not counting the audio) is 453 MB (and it looks great–I love me some h.265.) If that movie had been completely uncompressed, the video alone would have been 392 gigabytes. So an 886:1 compression ratio.
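For anyone who wants to sanity-check that figure, here’s the back-of-the-envelope arithmetic; the 24 fps frame rate and 3 bytes per pixel (8-bit RGB, no chroma subsampling) are assumptions, so it lands in the same ballpark rather than on the exact numbers above:

```python
width, height = 1250, 720
bytes_per_pixel = 3                      # assumed 8-bit RGB
fps = 24                                 # assumed frame rate
seconds = 1 * 3600 + 46 * 60             # 1 h 46 m

uncompressed = width * height * bytes_per_pixel * fps * seconds
compressed = 453 * 10**6                 # the 453 MB H.265 video stream

print(f"{uncompressed / 10**9:.0f} GB uncompressed")
print(f"{uncompressed / compressed:.0f}:1 compression ratio, roughly")
```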
You’re right - PNG does use lossless compression.
Compression / decompression is a very interesting topic, and one I spent a big chunk of my professional programming life on. Here’s just a few quick facts I learned or discovered.
Compressing data (losslessly) only works if the data is not completely random.
The original format of the data has a huge effect on the amount of compression possible. We needed to transfer (at the time) large amounts of two-dimensional matrix data (consisting of ones and zeros) over Ethernet. If you stored the data in the file by rows, very little compression was possible because the ones and zeros were roughly random. However, if you stored the data by columns there was massive redundancy, allowing very large compression ratios.
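Here’s a minimal sketch of that effect with Python’s zlib. The matrix is made up (every column happens to be constant), so it’s only a stand-in for the real data, but it shows how storage order changes what redundancy the compressor can actually see:

```python
import random
import zlib

random.seed(0)

# Hypothetical wide matrix of ones and zeros in which every column is constant:
# columns are hugely redundant, but a single row read left to right looks random.
COLS, ROWS = 400_000, 64
col_bits = [random.getrandbits(1) for _ in range(COLS)]

def pack(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

row = pack(col_bits)     # one packed row: ~50 KB of random-looking bytes
by_rows = row * ROWS     # identical rows, but 50 KB apart -- beyond zlib's 32 KB window
by_cols = b"".join((b"\xff" if bit else b"\x00") * (ROWS // 8) for bit in col_bits)

print(len(by_rows), "->", len(zlib.compress(by_rows)), "bytes stored by rows")
print(len(by_cols), "->", len(zlib.compress(by_cols)), "bytes stored by columns")
```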
Surprisingly, 2 stage compression actually works. By “2 stage” I mean running the data file through 2 different compression algorithms. In our particular case, using a custom-designed variant of run-length encoding followed by LZW compression gave very good compression ratios.
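A sketch of the same idea, with a minimal run-length encoder feeding into DEFLATE (standing in here for LZW); the data is made up, and whether a second stage helps at all depends entirely on the data:

```python
import zlib

def rle_encode(data: bytes) -> bytes:
    """Minimal run-length encoding: (count, value) byte pairs, count capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

# Long runs of zeros with occasional ones, vaguely like sparse matrix data.
data = (b"\x00" * 500 + b"\x01" * 20) * 2000

stage1 = rle_encode(data)       # stage 1: custom run-length encoding
stage2 = zlib.compress(stage1)  # stage 2: DEFLATE on the RLE output

print(len(data), len(stage1), len(stage2), len(zlib.compress(data)))
```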
J.
Entirely true, but this could use some more context. Plain text (a text file, as opposed to a Word document, ODF document, HTML document, or similar) has about five bits of entropy per eight-bit byte (very roughly, entropy is how much “new information” is in a given signal), without taking the large-scale structure of the document into account. The fact is, human-readable text is easy to compress even using fairly primitive methods.
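If you want to put a number on that, here’s a quick zeroth-order estimate in Python (single-byte frequencies only, ignoring word- and sentence-level structure, which is why smarter estimates come out lower); the file name is hypothetical:

```python
import math
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    """Zeroth-order entropy: byte frequencies only, no larger-scale structure."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

sample = open("some_plain_text_file.txt", "rb").read()   # hypothetical file
print(bits_per_byte(sample))   # typically ~4-5 bits per 8-bit byte for English prose
```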
In fact, I’d venture to say that the only time people send around data which is close to being completely random is when they’re sending or receiving encrypted data, such as by visiting an HTTPS website.
Which means you effectively can’t compress completely random data lossily, either: that random block of data is either encrypted, in which case lossy compression would destroy anyone’s chance of decrypting it, or it’s pure entropy, like you’d use to seed a PRNG, in which case lossy compression would reduce the amount of entropy in the data and make it less useful for the task.
That’s not compression; that’s a limitation of the file format. PNGs and JPEGs are also limited in the number of distinct colours they can encode (normally 16,777,216). There are plenty of applications out there (medical imaging, for example) that require a much higher colour depth.
Maybe for the particular combination of data and compression algorithms you used, but this doesn’t hold in the general case. In fact, this is the essence of the OP’s question—he’s observed that precompressed data, such as modern image and video formats, does not get smaller when run through another compressor.
According to Shannon’s experiments, the entropy of average English text is somewhere around 2.5 bits/letter, not 5 or 7 or 8.
This may have been said, but the huge compression ratios reported by Darren Garrison for H.265 videos are of course due to “lossy compression”, which means it looks the same to you but the original data cannot be reconstructed, just like the way MP3 files are smaller than FLAC. For some purposes (e.g., scientific, or video mastering) it may be necessary to work with the original data even though it’s huge.
JPEG2000 offers a lossless-compression mode for your medical images, by the way. It’s not the ultimate compression algorithm; e.g., BPG is better, but at some point you (the hospital) need to make a decision and pick a format for archival purposes.
As psychonaut partly pointed out in post #17, GIF compression MAY be lossy, but it may not, depending on the source. Making a GIF out of a landscape pic will undoubtedly cause some colors to be shifted or omitted from the final table (lost), but a GIF of a corporate logo drawing, using only a few colors, might have no loss at all.
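If you have Pillow handy, you can watch that happen; the file names are hypothetical, and quantize(colors=256) only approximates the palette reduction that saving as GIF implies:

```python
from PIL import Image

def distinct_colors(img):
    # getcolors() returns (count, color) pairs; 1 << 24 covers every possible RGB value
    return len(img.convert("RGB").getcolors(maxcolors=1 << 24))

for name in ("landscape.jpg", "corporate_logo.png"):   # hypothetical images
    img = Image.open(name).convert("RGB")
    reduced = img.quantize(colors=256)                  # roughly what GIF conversion does
    print(name, distinct_colors(img), "->", distinct_colors(reduced))

# A photo loses colors (lossy); a logo with only a few colors typically keeps them all.
```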