Talking about WinZip and the like. With files like Excel, you can typically reduce the size by about 80%. With picture and video types, barely at all.
What’s different about these file types that produces this disparity?
The pictures and videos are already compressed.
Data compression works by finding redundant information and replacing it with a marker of some sort. If your spreadsheet contains 6000 cells which all have the word “PENDING,” for example, that word can be replaced by the number 42, and a chunk of data prepended to the file that says “when you see a 42, it means ‘PENDING.’” Hooray, you have done compression.
That’s lossless compression, because when you reverse the process you get 100% of the original data back.
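Here’s a toy sketch of that substitution idea in Python (the cell values and the 42 token are made up for illustration; real compressors build these dictionaries automatically):

```python
import json

# Toy dictionary-substitution "compressor" for the PENDING example above.
cells = ["PENDING"] * 6000 + ["DONE", "PENDING"]

dictionary = {"42": "PENDING"}          # the chunk prepended to the file
compressed = [42 if c == "PENDING" else c for c in cells]

# Reversing the process recovers every original cell, so nothing is lost.
restored = [dictionary["42"] if c == 42 else c for c in compressed]
assert restored == cells

print(len(json.dumps(cells)), "->", len(json.dumps([dictionary, compressed])))
```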
Pictures, video, and audio often use lossy compression. That works by eliminating data which is redundant in a way that minimizes the human-noticeable differences between the original and the compressed version. Most raster image formats (JPEG, GIF, PNG) use lossy compression. Others are designed for raw image data with no compression. Once you’ve eliminated a lot of redundant data with lossy compression, running it through WinZip is not going to make it any smaller.
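You can see the “doesn’t get any smaller the second time” effect directly with Python’s zlib (a stand-in here for what WinZip does internally):

```python
import zlib

# Highly repetitive data compresses dramatically the first time...
data = b"PENDING " * 100_000
once = zlib.compress(data)

# ...but compressing the already-compressed output gains essentially nothing,
# because the first pass already squeezed out the redundancy.
twice = zlib.compress(once)

print(len(data), "->", len(once), "->", len(twice))
```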
Thanks.
But that raises the question of why spreadsheet-type programs don’t pre-compress their files in the same manner.
Ninjaed, but different files have different information content. A file full of random numbers cannot be compressed at all; a file consisting of the letter ‘a’ repeated many times can obviously be compressed a lot; and most files fall somewhere between these two extremes.
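A quick demonstration of those two extremes, using Python’s zlib:

```python
import os
import zlib

random_bytes = os.urandom(1_000_000)   # no redundancy to find
repeated_a = b"a" * 1_000_000          # almost nothing but redundancy

print(len(zlib.compress(random_bytes)))  # a bit MORE than 1,000,000 bytes
print(len(zlib.compress(repeated_a)))    # roughly a thousand bytes
```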
No need for that. It is better to keep those kinds of files (mostly) human-readable and, to save disk space, let the operating system compress and decompress the data on the fly as it is written and read.
I suspect that in this case, the reason is that most spreadsheets are so small that it’s not worth the effort to compress them. Adding compression support would take programmer time, as well as additional time in each release to validate the compression. Also, the extra code that performs the compression is new code in which bugs can lurk. Changing the file format can be especially dangerous, as bugs in reading files can be a great way to spread a virus.
Yeah. If they weren’t compressed, video and audio files would be yuge. Files that are mostly made up of text and formatting codes (word processing docs, spreadsheets, etc.) are usually relatively small, and if they aren’t you can choose to compress them yourself if you need to. If you don’t need to, you might as well keep the process of saving and opening such files as simple and quick as possible.
Some spreadsheet file formats are always compressed: the Open Document formats are ZIP files containing other files in a standardized directory structure. Those inner files are XML documents (so, text with complicated rules, basically), and the ZIP file format compresses its contents.
ZIP doesn’t give the highest compression ratio (that is, the ratio of original file size to compressed file size, such that larger is better), but it’s fast and very well-known and well-supported, so it’s easy to deal with ZIP files.
Anyway, Open Document-format files, being compressed by nature, likely won’t compress well a second time.
The current MS Office formats with extensions ending in x, like .docx and .xlsx, are already compressed. However, there is a balance between speed and efficiency: the more thoroughly one looks for repeated data, the more time it takes to compress and decompress the file.
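You can check this yourself: .docx/.xlsx and the OpenDocument formats are plain ZIP archives, so Python’s zipfile module will list the XML parts inside and how much each one was compressed (the file name here is hypothetical):

```python
import zipfile

# Works on .docx, .xlsx, .ods, .odt, etc. -- they are all ZIP containers.
with zipfile.ZipFile("example.xlsx") as z:
    for info in z.infolist():
        print(f"{info.filename}: {info.file_size} -> {info.compress_size} bytes")
```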
As others have said, pretty much every image or video today is in some form of compressed format. RAW video files are huge, as are RAW images. There has been a bit of a shift in this arena. It used to take extra software or equipment to perform the compression, and doing it on the fly at run time was expensive. Then, with smartphones becoming the go-to camera, they ended up putting those functions onto silicon stacked onto the back of the sensor.
It’s pretty nuts when you look at what a “sensor module” can do now: various image and video formats, lens compensation, compression, etc. Crazy.
Not GIFs
GIFs are lossily compressed–by reducing the number of colors or grey scales to 256.
Wait…aren’t PNGs lossless? And they support 24-bit RGB files, so they’re not limited to a 256 color palette.
Also, about video compression, one example–one movie I have on my HD right now is 1h 46m and 1250x720p. Using h.265 compression, the file size of the video stream (meaning not counting the audio) is 453 MB (and it looks great–I love me some h.265.) If that movie had been completely uncompressed, the video alone would have been 392 gigabytes. So an 886:1 compression ratio.
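For anyone who wants to sanity-check that figure, here’s the back-of-the-envelope arithmetic; the 24 fps frame rate and 3 bytes per pixel (8-bit RGB, no chroma subsampling) are assumptions, so it lands in the same ballpark rather than on the exact numbers above:

```python
width, height = 1250, 720
bytes_per_pixel = 3                      # assumed 8-bit RGB
fps = 24                                 # assumed frame rate
seconds = 1 * 3600 + 46 * 60             # 1 h 46 m

uncompressed = width * height * bytes_per_pixel * fps * seconds
compressed = 453 * 10**6                 # the 453 MB H.265 video stream

print(f"{uncompressed / 10**9:.0f} GB uncompressed")
print(f"{uncompressed / compressed:.0f}:1 compression ratio, roughly")
```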
You’re right - PNG does use lossless compression.
Compression / decompression is a very interesting topic, and one I spent a big chunk of my professional programming life on. Here’s just a few quick facts I learned or discovered.
Compressing data (losslessly) only works if the data is not completely random.
The original format of the data has a huge effect on the amount of compression possible. We needed to transfer (at the time) large amounts of two-dimensional matrix data (consisting of ones and zeros) over Ethernet. If you stored the data in the file by rows, very little compression was possible because the ones and zeros were roughly random. However, if you stored the data by columns there was massive redundancy, allowing very large compression ratios.
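Here’s a minimal sketch of that effect with Python’s zlib. The matrix is made up (every column happens to be constant), so it’s only a stand-in for the real data, but it shows how storage order changes what redundancy the compressor can actually see:

```python
import random
import zlib

random.seed(0)

# Hypothetical wide matrix of ones and zeros in which every column is constant:
# columns are hugely redundant, but a single row read left to right looks random.
COLS, ROWS = 400_000, 64
col_bits = [random.getrandbits(1) for _ in range(COLS)]

def pack(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

row = pack(col_bits)     # one packed row: ~50 KB of random-looking bytes
by_rows = row * ROWS     # identical rows, but 50 KB apart -- beyond zlib's 32 KB window
by_cols = b"".join((b"\xff" if bit else b"\x00") * (ROWS // 8) for bit in col_bits)

print(len(by_rows), "->", len(zlib.compress(by_rows)), "bytes stored by rows")
print(len(by_cols), "->", len(zlib.compress(by_cols)), "bytes stored by columns")
```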
Surprisingly, 2 stage compression actually works. By “2 stage” I mean running the data file through 2 different compression algorithms. In our particular case, using a custom-designed variant of run-length encoding followed by LZW compression gave very good compression ratios.
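A sketch of the same idea, with a minimal run-length encoder feeding into DEFLATE (standing in here for LZW); the data is made up, and whether a second stage helps at all depends entirely on the data:

```python
import zlib

def rle_encode(data: bytes) -> bytes:
    """Minimal run-length encoding: (count, value) byte pairs, count capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

# Long runs of zeros with occasional ones, vaguely like sparse matrix data.
data = (b"\x00" * 500 + b"\x01" * 20) * 2000

stage1 = rle_encode(data)       # stage 1: custom run-length encoding
stage2 = zlib.compress(stage1)  # stage 2: DEFLATE on the RLE output

print(len(data), len(stage1), len(stage2), len(zlib.compress(data)))
```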
J.
Entirely true, but this could use some more context. Plain text (a text file, as opposed to a Word document, ODF document, HTML document, or similar) has about five bits of entropy per eight-bit byte (very roughly, entropy is how much “new information” is in a given signal), without taking the large-scale structure of the document into account. The fact is, human-readable text is easy to compress even using fairly primitive methods.
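If you want to put a number on that, here’s a quick zeroth-order estimate in Python (single-byte frequencies only, ignoring word- and sentence-level structure, which is why smarter estimates come out lower); the file name is hypothetical:

```python
import math
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    """Zeroth-order entropy: byte frequencies only, no larger-scale structure."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

sample = open("some_plain_text_file.txt", "rb").read()   # hypothetical file
print(bits_per_byte(sample))   # typically ~4-5 bits per 8-bit byte for English prose
```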
In fact, I’d venture to say that the only time people send around data which is close to being completely random is when they’re sending or receiving encrypted data, such as by visiting an HTTPS website.
Which means you effectively can’t compress completely random data lossily, either: that random block of data is either encrypted, in which case lossy compression would destroy anyone’s chance of decrypting it, or it’s pure entropy, like you’d use to seed a PRNG, in which case lossy compression would reduce the amount of entropy in the data and make it less useful for the task.
That’s not compression; that’s a limitation of the file format. PNGs and JPEGs are also limited in the number of distinct colours they can encode (normally 16,777,216). There are plenty of applications out there (medical imaging, for example) that require a much higher colour depth.
Maybe for the particular combination of data and compression algorithms you used, but this doesn’t hold in the general case. In fact, this is the essence of the OP’s question—he’s observed that precompressed data, such as modern image and video formats, does not get smaller when run through another compressor.
According to Shannon’s experiments, the entropy of average English text is somewhere around 2.5 bits/letter, not 5 or 7 or 8.
This may have been said, but the huge compression ratios reported by Darren Garrison for H.265 videos are of course due to “lossy compression”, which means it looks the same to you but the original data cannot be reconstructed, just like the way MP3 files are smaller than FLAC. For some purposes (e.g., scientific, or video mastering) it may be necessary to work with the original data even though it’s huge.
JPEG2000 offers a lossless-compression mode for your medical images, by the way. It’s not the ultimate compression algorithm; e.g., BPG is better, but at some point you (the hospital) need to make a decision and pick a format for archival purposes.
As psychonaut partly pointed out in post #17, GIF compression MAY be lossy, but it may not, depending on the source. Making a GIF out of a landscape pic will undoubtedly cause some colors to be shifted or omitted from the final table (lost), but a GIF of a corporate logo drawing, using only a few colors, might have no loss at all.
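If you have Pillow handy, you can watch that happen; the file names are hypothetical, and quantize(colors=256) only approximates the palette reduction that saving as GIF implies:

```python
from PIL import Image

def distinct_colors(img):
    # getcolors() returns (count, color) pairs; 1 << 24 covers every possible RGB value
    return len(img.convert("RGB").getcolors(maxcolors=1 << 24))

for name in ("landscape.jpg", "corporate_logo.png"):   # hypothetical images
    img = Image.open(name).convert("RGB")
    reduced = img.quantize(colors=256)                  # roughly what GIF conversion does
    print(name, distinct_colors(img), "->", distinct_colors(reduced))

# A photo loses colors (lossy); a logo with only a few colors typically keeps them all.
```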