Explain to me the rise of GPUs (blockchain, AI)

ok, to me it seems that GPUs were once just graphics cards that helped represent the visual interface on monitors…

Then about 15 years (or so?) ago, they became a thing of their own, due to “mining” … and now they are even more a thing with AI.

I understand (well, not really :wink: ) they do some mathematical calculations differently from a CPU, a fact that is relevant for mining and AI …

Have GPUs become computers in their own right? … how did that all happen?

Modern x86 CPUs have hundreds of instructions built into the hardware. That makes them fast, but very bloated. GPUs have a much smaller instruction set, so more kinds of operations have to be done in software, but they have the advantage of being much smaller. So while you can stick only dozens of x86 cores on one chip, you can put thousands of GPU cores on one.

At one time the data flow on GPUs was one-way: numbers went to the GPU to be crunched, video came out, and nothing went back to the rest of the computer. But then GPUs started being designed to send output data back to the CPU, and people realised that they could be used for much more than rendering video; you just have to write code that doesn't expect the large instruction set found on CPUs. That means some calculations may take longer on a GPU core than on a CPU core, but that is far more than made up for if you have, for example, a computer with 16 CPU cores and 10,000 GPU cores. With the right kind of data your GPU can be hundreds of times faster than your CPU. And taking advantage of that came years before cryptocurrency mining.
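To make that concrete, here's a toy sketch of what that looks like in CUDA: the same element-wise job written as a single CPU loop and as a kernel where each of thousands of lightweight threads handles just one array element. The names and sizes are purely illustrative, not from any particular product.

```cpp
#include <cuda_runtime.h>

// CPU version: one core walks the whole array sequentially.
void add_cpu(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// GPU version: launch one lightweight thread per element.
__global__ void add_gpu(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) c[i] = a[i] + b[i];                  // each core does one tiny piece
}

int main() {
    const int n = 1 << 20;                          // ~1M elements
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // enough blocks to cover every element
    add_gpu<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();                        // results are now visible to the CPU

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Each GPU thread here is doing almost nothing, but there are a million of them in flight, which is exactly the trade the post above describes.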

There is a GPU on the way with possibly 24,576 cores.

The CPU(s) have a large instruction set to run the operating system, various applications and utilities, et cetera, all while balancing the load so that they don't get tied up or overwhelmed doing one task (at least in theory). The GPU, on the other hand, is really dedicated to doing a lot of floating point calculations critical to high refresh rate graphics output (especially 3D projection, where a lot of coordinate transformations have to be performed), and so is also really good at doing general numerical computations very fast.

With the Compute Unified Device Architecture (CUDA) and supporting frameworks, it is also capable of parallelizing distributed computations (breaking a huge problem like inverting matrices or doing tensor operations into smaller computations, doing the math, and putting the results back together), so it is great for solving large discretized problems like finite element or computational fluid dynamics models, running global climate or astrophysical simulations, or doing blockchain cryptography and training large heuristic models like LLMs on networks of GPUs. High end workstations and PCs used to have something called a math co-processor as a separate chipset to do this kind of thing, but those have been integrated into the CPU and can't really be expanded the way you can just add individual video cards controlled by the CPU.
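As a rough sketch of that "break it up, do the math, put it back together" pattern, here is a toy CUDA kernel for a dense matrix-vector product, with one thread per output row. In practice you would call a tuned library like cuBLAS rather than hand-roll this; the names here are just illustrative.

```cpp
// y = A * x, where A is rows x cols in row-major order.
// Each thread independently reduces one row; the full vector y is the reassembled result.
__global__ void matvec(const float *A, const float *x, float *y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int j = 0; j < cols; ++j)       // this thread's small, independent piece of the problem
        sum += A[row * cols + j] * x[j];
    y[row] = sum;
}
```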

You can actually run a GPU completely standalone, as people have made dedicated ‘lightweight’ operating systems for the purpose, and it would probably make a hell of an embedded controller (albeit overpowered for most applications and not designed for physical robustness in dynamic environments or for low energy consumption), but really GPUs happen to be well suited for scalable high speed distributed computation because that is what video output requires. Even if you created some kind of special chipset and firmware architecture for high performance computation, it would essentially look like a GPU, and the way GPUs are packaged makes it easy to network them and to provide power and cooling to them.

Stranger

The other thing with GPUs is that they’re very efficient at running the same calculations in parallel on their many cores. That kind of processing is very useful for graphics, and so that was the first major market for it, mostly games (because lots and lots of people play games). But once they existed as a common commodity item, all of the more niche uses for that sort of highly parallel processing (like physics simulations) were able to use them, too. There aren’t nearly enough physicists to make it economical to develop new parallel processors just for them, but it’s easy to repurpose the gaming hardware for it.

That said, they’re not a cure-all for all computing tasks. A lot of tasks can be massively parallelized, but some can’t, and take a long time just because there’s a long list of things that need to be done one after the other. You can still sort of parallelize those a little, by starting one subroutine and then starting two other subroutines, one for if the answer to the first is yes and one for if it’s no, and then just tossing out the result that doesn’t match what you get. But that’s pretty inefficient.
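In GPU terms, that "compute both answers and throw one away" idea looks roughly like this toy kernel (purely my illustration; the two formulas stand in for the yes-branch and the no-branch):

```cpp
// Each thread evaluates both possible outcomes, then keeps whichever one the
// condition actually calls for, instead of branching and waiting.
__global__ void speculate_both(const float *cond, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float if_yes = out[i] * 2.0f + 1.0f;          // result assuming the answer is "yes"
    float if_no  = out[i] * 0.5f - 1.0f;          // result assuming the answer is "no"
    out[i] = (cond[i] > 0.0f) ? if_yes : if_no;   // discard the one that doesn't apply
}
```

Threads in a GPU warp effectively do this anyway when they diverge: both paths get executed and the results that don't apply are masked off, which is part of why divergent code wastes so much work.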

Never mind: comprehensively ninja-ed.

Also, the highest-end GPU “video cards” don’t even have video output ports (and cost up to $40,000).

That’s really the reason GPUs are used for things like crypto mining/blockchain and AI calculations. Those tasks are readily decomposed into smaller, simpler calculations that can all be run in parallel, and GPUs are built specifically to do smaller, simpler calculations in parallel (classically it was rendering triangles on screen, IIRC).

Otherwise, your CPU would have to chug through each of those calculations one by one sequentially, which is a much slower process.
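A very rough sketch of that mining-style decomposition: every GPU thread tries a different nonce independently. The "hash" below is a toy integer-mixing function standing in for the double SHA-256 of a block header that real Bitcoin mining uses; only the shape of the work matters here, not the cryptography. `found` is assumed to be initialised to `UINT_MAX` before launch.

```cpp
__global__ void try_nonces(unsigned int base, unsigned int target, unsigned int *found) {
    // Each thread gets its own candidate nonce to test.
    unsigned int nonce = base + blockIdx.x * blockDim.x + threadIdx.x;

    // Toy stand-in "hash": a couple of integer mixing steps, NOT real SHA-256.
    unsigned int h = nonce * 2654435761u;
    h ^= h >> 16;
    h *= 2246822519u;
    h ^= h >> 13;

    // "Difficulty" check: is the hash below the target? If so, report the nonce.
    if (h < target) atomicMin(found, nonce);
}
```

Millions of such independent guesses per launch, none of which depend on each other, is about as GPU-friendly as a workload gets.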

An analogy might be that you have one person who’s pretty fast at doing anything and everything in a mailroom. Let’s say that the mailroom has a task to put 10,000 stamps on envelopes. Your one person could probably do that in a few days, but you also have the option of putting 100 summer-job high school kids on it. Each one is a bit slower at it than the full-time person and they can’t do much besides put stamps on envelopes, but there are 100 of them, so they can do it in an hour. That’s the graphics case.

Now when the mailroom gets another 10,000 envelopes that need stickers, you can use those 100 high school kids to do that in a short time. (cryptomining example).

But if your task requires someone to sort the mail into categories, set a machine to mark them, and then resort them on the far end, that’s something the full-timer is going to have to do, because it’s too complicated for the high school kids. (most normal computing).

Unless that is, you can decompose some part of it into something those kids can do- sorting is an example of a problem that is able to be decomposed into something that GPUs (or high school kids) can do.
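For example, one decomposable building block of GPU sorting is an odd-even transposition sort of a small chunk inside a single thread block; real GPU sorts (bitonic or radix sorts, such as those in the Thrust or CUB libraries) compose steps like this across many blocks. A toy sketch:

```cpp
// Sorts a small array in place. Launch with one block of n threads (n <= 1024):
//   odd_even_sort<<<1, n, n * sizeof(int)>>>(d_data, n);
__global__ void odd_even_sort(int *data, int n) {
    extern __shared__ int s[];
    int tid = threadIdx.x;
    if (tid < n) s[tid] = data[tid];                     // stage into fast shared memory
    __syncthreads();

    for (int phase = 0; phase < n; ++phase) {
        int i = 2 * tid + (phase & 1);                   // even phases: pairs (0,1),(2,3)...
        if (i + 1 < n && s[i] > s[i + 1]) {              // odd phases:  pairs (1,2),(3,4)...
            int t = s[i]; s[i] = s[i + 1]; s[i + 1] = t; // swap out-of-order neighbours
        }
        __syncthreads();                                 // everyone finishes the phase together
    }
    if (tid < n) data[tid] = s[tid];
}
```

Each phase is exactly the kind of simple, identical, independent comparison the "high school kids" can do; the cleverness is all in how the phases are stitched together.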

I wouldn’t use the number of instructions as any sort of metric. Many general purpose CPUs have a low number of simple instructions. That is what defined RISC. x86 exists despite its ISA shortcomings and insane instruction set.

The other thing about GPUs is that there are a whole lot of things they are bad at that don’t matter for parallel computing of arithmetic.

The core driver is the maxim: make the common case fast, make the uncommon case correct.

Most CPUs spend their time doing stuff other than arithmetic and are optimised for their common case. A general purpose CPU benefits from caches, branch prediction logic along with speculative execution, and needs to cope with all the cruft an operating system brings, including interrupts, exceptions and security, in an efficient manner. This burns a lot of silicon to make fast. Especially the caches, but even all the extra junk attached to the main pipeline increases area massively. Adding virtual memory is another headache, with translation lookaside buffers to manage, more exceptions, and yet more junk in the way between computation and memory.

A GPU, at least in the basic geometry engine or raster engine, needs to do linear algebraic operations and bit blitting on a large and regular dataset. This typically involves doing the same thing on each data item. It is rare to need to branch. And there is no need for complicated memory access control. Usually GPUs have their own optimised memory.

This leads to an architecture that contains memory control mechanisms good at striding large amounts of memory in regular patterns. It allows streamlining of memory control and pipelining of accesses. There are no caches. No need, and they would just get in the way. The regular patterns of accesses mean operands are always ready. Similarly branching is rare. So the instructions can be crafted to execute without contention for internal resources, with no hazards to cause a stall, and can run without worrying about needing to back out.
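A small illustration of that regular-access point (my own toy example): when adjacent threads read adjacent addresses, a warp's loads coalesce into a few wide memory transactions, which is exactly the streaming pattern those memory controllers are built for. The strided version is the anti-pattern, shown for contrast.

```cpp
// Grid-stride loop: thread 0 touches 0, N, 2N, ...; thread 1 touches 1, N+1, ...
// so in every iteration a warp reads one contiguous chunk of memory (coalesced).
__global__ void scale_coalesced(float *data, float k, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        data[i] *= k;
}

// Anti-pattern for contrast: each thread walks its own far-apart slice, so a warp's
// 32 loads land in 32 different memory segments and the streaming advantage is lost.
__global__ void scale_strided(float *data, float k, int n, int chunk) {
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    for (int j = 0; j < chunk && start + j < n; ++j)
        data[start + j] *= k;
}
```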

Some general purpose CPUs did gain some additional capability for GPU-like operations, such as MMX and its ilk. Or the i860. But these still existed in the environment of the CPU, and you didn’t get the scale possible with a GPU. (A graphics system that included entire boards filled with i860 processors was a thing.)

A modern GPU is thus a very specialised processor. Cutting down what it needs to be good at to a very tightly constrained set means a huge amount of the cruft needed by a general purpose CPU can be eliminated. The result is vastly smaller in terms of device count and thus chip area. Where a top end CPU chip might struggle to place 64 cores on a die, as a huge fraction of that is taken up with caches and the CPU cores themselves are bloated out, a GPU die can be filled with legions of tightly designed cores and highly efficient memory controllers feeding them. Small can also mean higher clock rates. It all adds up.

Early GPUs also took advantage of the limited precision needed for graphics. You don’t need 64 bit IEEE floating point when all you are doing is calculating the location of a line on a screen. Even early GPUs with limited floating point capability were good enough for some physics - lattice gauge QCD for instance. The GPU manufacturers realised that there was value in putting the extra precision of proper floating point in to address these new markets.

AI on GPUs really means large scale matrix manipulation, which is what the deep learning transformer paradigm needs. That seems to be the current hotness in AI, so it is a big deal. Blockchain in the abstract often requires some metric of computational work done to limit chain extension (although not always). Bitcoin just involves calculating hashes, so it is trivially parallel, and each calculation involves grinding along a large slab of data, so it is a great fit for each individual calculation as well.
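The matrix side of that, as a toy sketch: a naive dense matrix multiply with one thread per output element. Real deep-learning stacks call tuned libraries (cuBLAS, cuDNN) and tensor cores rather than a hand-rolled kernel, but the shape of the work is the same.

```cpp
// C = A * B for square N x N matrices in row-major order.
__global__ void matmul_naive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];   // dot product of one row with one column
    C[row * N + col] = sum;                       // each thread owns exactly one output cell
}
```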

Old 1980s supercomputers looked like a bunch of cabinets full of nodes; so do new supercomputers, but instead of a few or hundreds of cores, there are a total of millions of cores; each node will include both “CPU” and “GPU” chips (and each individual chip includes loads of cores) so you do get flexible general-purpose high-performance computation.

Supercomputers are an interesting counterpoint.

The 1980s were still dominated by big iron vector machines, mostly from Cray. In the early 1980s the X-MP (where MP meant multi-processor) dominated, with up to 4 processors. It was followed around 1985 by the Cray-2, which also topped out at four processors, and later by the Y-MP, where the processor count still only reached 8. The external design of these machines made them some of the best looking supercomputers ever built.

1988 saw the introduction of the best looking supercomputer ever made: the Thinking Machines CM-2. This was one of the first massively parallel supercomputers, and boasted up to 65,536 teensy one-bit processors.
(This machine was the source of the famous Seymour Cray quote: “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?”)
The CM-2 was quickly followed by the CM-5, which introduced significant scalability, was a totally different design, and included four 64-bit-wide vector units per node. A single-cabinet machine could hold 128 nodes. (There was an entry level machine with 32 nodes that wasn’t expandable.) Adding cabinets added interconnect, storage, and nodes. It was famous for being the supercomputer in the original Jurassic Park movie. Despite its looks, inside each cabinet were three very tall 19 inch racks. The building block was a box containing 32 nodes on 8 huge double-decker PCBs slotted in vertically, with four such boxes fitting in a single rack. Power supplies would fill another entire rack. Boxes could be cascaded to make bigger machines, but the network grew quickly, making machines with lots of nodes very large. The biggest ever built was 1024 nodes, although it theoretically could go to 4096. That 1024 node machine held the number one spot on the first ever TOP500 supercomputer list in 1993.

Cray introduced the T3D to compete, with up to 128 more conventional nodes, and it was just a huge box. As was the T3E.

After that, supercomputers became boring; the mid to late 90s saw the start of racks and racks of processors. SGI was briefly the big dog with the Origin series of machines. This was a ccNUMA (cache coherent non-uniform memory architecture) design, so shared memory across all the nodes. That was quite something.
Their additional trick was that their very high end Reality graphics machines could be added to the same system - they used the same boxes and the same interconnect. So you could have a visualisation supercomputer. But the graphics engines didn’t contribute to the computations, they just got fed the results. The peak of SGI came in late 1998 when ASCI Blue Mountain was commissioned. That was when supercomputers really took off: suddenly the science that could be done per dollar became compelling. However, an SGI Origin could only accommodate 128 CPUs on the ccNUMA interconnect, so ASCI Blue Mountain was really 48 such machines with a different interconnect binding them together. But a room full of racks it was.

Sadly SGI eventually went under. A number of significant missteps sealed their fate. (They exist now as a brand name on a company that fills racks with ordinary compute nodes.) Nowadays, if you are in the big end of the supercomputer business, each machine is to some extent custom designed: a basic machine of conventional CPU based nodes, connected via a high performance interconnect of varying topology, plus the possible addition of a range of local accelerators. These might be GPUs, or could also be FPGA systems. The exact configuration depends upon the use cases. Well funded research labs will look at the problems they want to solve, perform a lot of benchmarking and optimisation, and iterate down to a specific architecture that gives them the best bang for their money. The interconnect is the unsung hero of a lot of machines.

But tell me this isn’t the best looking supercomputer ever built. Those LEDs meant something as well; they were not just for show. The design was famously partly due to Richard Feynman, whose son worked for Thinking Machines. Richard sketched out the way the 10 dimensional hypercube of the processor interconnect could be embedded into 3D space. So the design of this machine is very much a case of form following function.

https://www.mission-base.com/tamiko/cm/index.html

The CM-5 managed to look almost as imposing. The LEDs could be configured to show status information from the processing nodes, but most of the time they ran a couple of pre-canned random patterns. They were clearly mostly for vanity, as the bottommost LED panel was partially covered up by a black panel, and cabinets without any processors (containing either storage or network) had nothing to connect the LEDs to anyway. The other big trick the CM-5 had was that the storage nodes connected directly to the network and could stream data broadside from RAID arrays with hundreds of disks straight into the compute nodes. IO bandwidth scaled with the number of disks, so customers could fill an entire cabinet with disks. That the interconnect’s bisection bandwidth scaled with the number of nodes was the CM-5’s other big trick.

They’re very very efficient at matrix multiplication.

The high-level stuff has been covered well already. But here’s a lesser-known trick that’s used.

CPUs tend to have a ton of onboard cache. The reason is that main memory has horrific latency; typically hundreds of clock cycles. The CPU can’t do anything while it’s waiting, so all of those cycles are wasted and your program would be hundreds of times slower without the cache. CPU makers go for massive overkill here since even very rare cache misses (a miss is when the cache doesn’t have the data and you have to go to memory) can kill performance.

GPUs have a trick that allows them to get away with much less cache (which they instead use for more math units). Recall above how GPUs have thousands of cores running fairly simple programs. But GPUs are so parallel that they can do one better: they’re running tens or hundreds of thousands of programs (called threads) at once.

Suppose you’ve skimped on the cache and now your thread has a cache miss. Instead of waiting around doing nothing, you pick one of the many other threads waiting to run. Perhaps that thread executes a few instructions and then itself has another cache miss. Again, no problem: there are so many threads available that you can always pick one that’s ready. Eventually you might go through dozens or a hundred threads before cycling back to the original.

So the cycles are not wasted like they are on a CPU. You still need some caching for bandwidth reasons–although GPUs have lots more bandwidth than typical CPUs, it’s never enough. But it means you can use small, highly optimized caches rather than giant ones that eat all your die space.

GPUs also support user-managed caching (called shared memory on NVIDIA GPUs). This carves out a region of cache that the user can explicitly read/write to. Although painful to write code for compared to a CPU, it can be 10x as efficient as a normal cache, since the user only puts exactly what they need there when they need it.
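Here's a toy sketch of that user-managed caching trick applied to the naive matrix multiply earlier in the thread: each block stages a small tile of A and B into shared memory once, and every thread in the block reuses it many times instead of re-reading the same values from slow global memory. The tile size and launch shape are just illustrative choices.

```cpp
#define TILE 16

// C = A * B for square N x N matrices, with explicit staging through shared memory.
// Assumed launch: dim3 block(TILE, TILE); dim3 grid((N+TILE-1)/TILE, (N+TILE-1)/TILE);
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // explicitly managed on-chip storage
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Each thread loads one element of the current A tile and one of the B tile.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();               // wait until the whole tile is staged

        for (int k = 0; k < TILE; ++k) // every thread reuses the staged tile TILE times
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // don't overwrite the tile while others still read it
    }
    if (row < N && col < N) C[row * N + col] = sum;
}
```

Every value pulled from global memory gets reused TILE times by the block, which is where that roughly order-of-magnitude efficiency win over relying on the normal cache comes from.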

There are some older CPUs that used something akin to the threading trick above, called a barrel processor.

It fell out of favor for CPUs, but GPUs have simpler per-thread processor state (especially since they group threads together into bundles of 32 or 64), and they also use a more dynamic thread-picking system than a fixed “rotating barrel”.

What you said is mostly correct, but I should point out that all GPUs have had virtual memory for a long time now. Pretty much all 64-bit, too (or 57-bit, or whatever the latest CPUs provide).