CPU vs. GPU?

Today's video card specs read like the PC specs of just a few years ago. Video cards alone have processors running between 350 and 500 MHz, with 128 to 256 MB of DDR RAM. While pondering this, I got to wondering: what makes the chips from nVidia or ATI better at processing graphics than AMD's or Intel's chips? Why not just slap an Opteron on an AGP card, give it some dedicated memory, and call it the GPU?

A GPU's task is to move huge amounts of data in and out and to do fairly simple processing on it. You shovel data in one end, spit it out the other, and forget about it. Thus you need wide memory buses, but you can get by with relatively low clock speeds and very little cache. A CPU is tasked with all sorts of things, but is often called on to do fairly intensive processing on a relatively small dataset. Thus you care more about caching, branch prediction, and higher clock speeds.

Graphics work is inherently easy to parallelize - that is why today's GPUs have numerous pixel pipelines (up to 16 in the very high-end cards) to perform operations on several different pixels simultaneously. A CPU couldn't do this job as well; of course, there are plenty of computing problems that are not easy to parallelize, and those are what demand a really fast CPU.
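To see why pixel work parallelizes so well, here is a hedged sketch in C (the function name and the brightness operation are just illustrative, not anything a real GPU exposes). Each loop iteration reads and writes only its own pixel, so there are no dependencies between iterations - exactly the property that lets a GPU hand different pixels to different pipelines at the same time:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-pixel operation: brighten an 8-bit grayscale image.
 * No iteration depends on any other, so the loop body could run on
 * 16 pixels at once in 16 identical pipelines without changing the result. */
void brighten(uint8_t *pixels, size_t n, uint8_t amount)
{
    for (size_t i = 0; i < n; i++) {
        unsigned v = (unsigned)pixels[i] + amount; /* widen to avoid wrap */
        pixels[i] = v > 255 ? 255 : (uint8_t)v;    /* clamp to 8 bits */
    }
}
```

A problem like this is often called "embarrassingly parallel": the speedup from adding pipelines is nearly linear, because no pipeline ever waits on another.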

And in many cases, GPUs are even more complicated than CPUs. For example, the NV40 GPU used in the GeForce 6800, 6800 GT, and 6800 Ultra cards has roughly 220 million transistors. In comparison, an Opteron has just over 100 million transistors.

RandomLetters: How many of the transistors in the GPU used in your example are duplicated? If that processor works on, say, 8 or 12 or 16 pixels at once, it follows that many of its transistors belong to units that are simply repeated 8 or 12 or 16 times. (For example, a GPU with 200 million transistors that processes 10 pixels at a time might have 18 million transistors in each of 10 identical pipelines, plus 20 million others.) I'd guess CPUs also have a certain amount of duplication (in fact, doesn't the ability to process more than one instruction per clock cycle imply this?). But this duplication might explain why the NV40 has over twice as many transistors as the Opteron.

Even with the same number of transistors, the graphics processor will have an advantage in graphics operations. The general-purpose CPU will spend transistors on functions that the graphics processor does not need (perhaps general-purpose floating point, or a branch prediction unit to keep the instruction pipeline full, or any number of other optimizations). The corresponding transistors in the graphics processor will be dedicated to two things: more parallelism in graphics operations (as discussed above), and architectural features unique to graphics operations.

For a trivial example, the adder in a graphics processor may offer an "add with saturate" instruction - if you take a pixel with a gray value of 230 (of 255) and add another with 90 (of 255), the result will be 255 rather than wrapping around; this is used for overlays, transparencies, etc. A general-purpose CPU may have no motivation to put this in hardware (it can always be simulated in software with more instructions). But this is a common operation in a graphics processor, and having it in hardware, running in one cycle in parallel across all pipelines, adds up to a real performance win over the CPU once you run the instruction over every pixel on the screen.
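As a minimal sketch, here is what that saturating add does, emulated in software (the function name is hypothetical; the point is that a general-purpose CPU needs the widen-and-compare steps below, where the GPU's adder does it in one hardware operation):

```c
#include <stdint.h>

/* Software emulation of 8-bit "add with saturate": the sum clamps
 * at 255 instead of wrapping around past the top of the range. */
static uint8_t add_saturate_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;  /* widen so the sum can't wrap */
    return sum > 255 ? 255 : (uint8_t)sum;     /* clamp at the maximum value */
}
```

With the values from the example, add_saturate_u8(230, 90) yields 255, where a plain 8-bit add would have wrapped around to 64.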

I chose this (simple) example because it was easy to explain. There are countless interesting and subtle (and more difficult to explain in one post!) graphics algorithms that may be in hardware on a graphics processor, but would need to be implemented in software using the more general instructions on a general purpose CPU. These take transistors to implement on the graphics processor, and give the graphics processor better performance.