The Discrete Cosine Transform (DCT) is used in JPEG and MPEG image compression. Given an input image (a 2D array of color values), the transform calculates a same-sized array of coefficients that is very easy to compress into a much smaller amount of data without losing much picture information.
My question is: Why does it work?
It really is an incredible trick - just by multiplying each original pixel against a series of cosine values, you get something that’s easy to compress. But why??
I know this is a rather obscure question, and am expecting 0 posts, but hopefully someone out there will have the answer…
The transformed data isn’t inherently more compressible, it’s just that the human eye will notice the compression less if it’s done that way. You don’t notice one gradient changing into a slightly different gradient the way you might notice pixels getting slightly blockier.
That’s why JPEG is mostly only good for compressing photos - try compressing a line drawing or text (something with a lot of sharp edges), and you can easily tell the difference. You need to combine a lot of cosines to get something that doesn’t look like a gradient, and any loss will ruin it.
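To put a rough number on the gradient-versus-sharp-edge point, here is a minimal sketch. It assumes an orthonormal 1D DCT-II over 8 samples; the two sample rows and the helper name `high_freq_share` are made up for illustration. It measures how much of the non-constant signal energy lands in the higher-frequency coefficients.

```python
import math

def dct(signal):
    """Orthonormal 1D DCT-II (written out directly, no libraries)."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, x in enumerate(signal)))
    return out

def high_freq_share(signal, cutoff=2):
    """Fraction of the non-DC energy carried by coefficients k >= cutoff.

    Hypothetical helper for this illustration; cutoff=2 is an arbitrary choice.
    """
    coeffs = dct(signal)
    ac_energy = sum(c * c for c in coeffs[1:])
    high_energy = sum(c * c for c in coeffs[cutoff:])
    return high_energy / ac_energy

gradient = [0, 1, 2, 3, 4, 5, 6, 7]    # smooth ramp, photo-like
sharp_edge = [0, 0, 0, 0, 7, 7, 7, 7]  # hard edge, line-drawing-like

print("gradient  :", round(high_freq_share(gradient), 3))
print("sharp edge:", round(high_freq_share(sharp_edge), 3))
```

The ramp keeps almost all of its energy in the lowest cosine, while the edge spreads a noticeably larger share into the higher ones, which are the coefficients a lossy step is most likely to damage.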
Yes, JPEG works best with graphics that have many colors and no sharp or strong transitions, like photos. GIF works best with graphics that have a limited number of flat colors, which usually means sharp transitions.
http://www-mtl.mit.edu/Courses/6.095/notes/dct.html shows the specific cosine patterns that are used by DCT. When compressing an image with sharp edges, you might see bits of garbage that look like these patterns… that’s because a block with sharp edges most likely has some component of all the patterns, and if one of them is dropped it leaves a pattern-shaped ghost.
JPEG considers high-frequency data less important than low-frequency data (since it appears less in natural scenes), so you’ll see artifacts like the patterns toward the lower right more often than the ones toward the upper left.
I might be wrong with respect to the DCT, but in general a transform does not yield data that is easier to compress. The actual transform does not get rid of any data. What it does is change the data into a different domain that makes it easier to get at, and throw away, the data that is not needed in your case – in the case of JPEG, it is the detail you cannot see.
In the case of the DCT, the throwing away of data might be part of the transform, so the above is not strictly true.
You’re right, KeithB. The DCT is just a linear transform, completely invertible. What it does do is separate the low-frequency changes from the high-frequency changes (i.e. quickly changing patterns in the image). Since the eye doesn’t see high-frequency patterns well, these are quantized more heavily (i.e. with fewer bits) than the low-frequency ones, or even discarded. When you put the image back together with the inverse DCT, it looks pretty much the same (hopefully).
The pixels aren’t multiplied by cosines; what happens is that the image is decomposed into a series of 2D cosines. It can be shown that any sampled signal (of any dimension, such as 1D for speech or WAV files, or 2D for images) can be broken down into a summation of cosines of various frequencies and amplitudes. The “coefficients” are what’s computed with the DCT: the signed amplitude of each cosine at a predefined frequency. (A general sinusoid is a*cos(2*pi*w*x + p), where a is the amplitude, w is the frequency, p is the phase, x is the position, and pi = 3.14…; the DCT fixes w and p in advance for every cosine, so a is the only thing left to compute.) If you apply these coefficients to the cosines, and add all the cosines up, you’ll get the original image back. Thus, the coefficients represent the image. Actually, the image is broken up into small blocks (8x8 pixels is typical), and the coefficients are found for each block.
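The "apply the coefficients to the cosines and add them up" idea can be sketched in 1D. This is only an illustration, assuming the orthonormal DCT-II convention and a made-up row of eight brightness values; the function names are hypothetical, not from any library.

```python
import math

def dct_coefficients(signal):
    """Orthonormal DCT-II: one signed amplitude per fixed-frequency cosine."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs

def sum_of_cosines(coeffs):
    """Rebuild the signal: scale each cosine by its coefficient, then add."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            value += scale * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        signal.append(value)
    return signal

pixels = [52, 55, 61, 66, 70, 61, 64, 73]  # made-up sample row
coeffs = dct_coefficients(pixels)
rebuilt = sum_of_cosines(coeffs)

print([round(c, 2) for c in coeffs])   # eight amplitudes, lowest frequency first
print([round(v, 6) for v in rebuilt])  # matches pixels up to rounding error
```

Nothing is lost at this stage: eight numbers go in, eight numbers come out, and the second function recovers the first list up to floating-point rounding.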
Now, if you keep the coefficients exactly, you haven’t compressed anything. But since the eye is less sensitive to high frequency changes, the coefficients of the high-frequency cosines can be encoded with fewer bits, or even discarded. These compressed coefficients are then saved as a JPG file.
Then, to put the image back together, you apply the compressed coefficients to the cosines, and add them up.
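Here is a rough sketch of that whole round trip on one 8x8 block: transform, store the coefficients coarsely, then rebuild. It assumes an orthonormal 2D DCT-II, a made-up smooth block, and a hypothetical `step_size` rule that simply gets coarser with frequency. Real JPEG encoders use quantization tables, not this formula.

```python
import math

N = 8  # block size

def basis(k, i):
    """Value of the k-th orthonormal DCT-II cosine at sample i."""
    scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    return scale * math.cos(math.pi * (2 * i + 1) * k / (2 * N))

def dct2(block):
    """Forward 2D DCT: coeffs[u][v] is the amplitude of one 2D cosine pattern."""
    return [[sum(block[y][x] * basis(u, y) * basis(v, x)
                 for y in range(N) for x in range(N))
             for v in range(N)] for u in range(N)]

def idct2(coeffs):
    """Inverse 2D DCT: weight each cosine pattern and add them all up."""
    return [[sum(coeffs[u][v] * basis(u, y) * basis(v, x)
                 for u in range(N) for v in range(N))
             for x in range(N)] for y in range(N)]

def step_size(u, v):
    """Hypothetical quantization step: coarser for higher frequencies."""
    return 4 + 6 * (u + v)

# Made-up smooth block: a gentle diagonal gradient
block = [[100 + 4 * x + 2 * y for x in range(N)] for y in range(N)]

coeffs = dct2(block)
# Lossy step: keep only the nearest multiple of the step size
quantized = [[round(coeffs[u][v] / step_size(u, v)) for v in range(N)]
             for u in range(N)]
dequantized = [[quantized[u][v] * step_size(u, v) for v in range(N)]
               for u in range(N)]
rebuilt = idct2(dequantized)

zeros = sum(q == 0 for row in quantized for q in row)
max_error = max(abs(rebuilt[y][x] - block[y][x])
                for y in range(N) for x in range(N))
print("zero coefficients out of 64:", zeros)
print("largest pixel error:", round(max_error, 2))
```

For a smooth block like this one, most of the quantized coefficients come out as zero, and long runs of zeros are what the later lossless coding stage squeezes down. The rebuilt pixels differ from the originals only by a small amount.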
There are an infinite number of ways to decompose an image; using cosines via the DCT is just one.
One other point about JPEG: They separate the image into brightness and color components and take advantage of the deficiencies in human sight to compress things: