1990 Cray supercomputer vs. my grandmother’s Dell

You ended your analogy by saying “On one hand it has many capabilities of older race cars, but probably can’t do the one thing the race car could: Race.”, which was the misleading part.

And in a circular kind of way, the fastest supercomputer today is powered by Intel processors and Nvidia graphics cards (admittedly, there are 7,000 of them).

Made in China: The Fastest Computer in the World - The Atlantic
The fastest computer is made in China.

Which gets back to the fundamental difference that Francis Vaughan described eloquently in his post: modern computing clusters, which are really just a compact rack of individual (albeit often multi-core) computers arranged in a master-slave hierarchy (sometimes on multiple levels), devour a problem by killing it with a million nibbles; the great supercomputers of yesteryear were able to crunch whole solution sets in large chunks. In terms of the number of logic operations required to perform a given calculation, vector supercomputers were far more efficient, whereas breaking a problem up among a bunch of smaller, less capable computers is less computationally efficient. However, the cost of building individual processing units and volatile memory is now so cheap that this measure of efficiency is no longer very useful; it is much, much cheaper, faster, and easier to abstract the problem to a parallel virtual machine that breaks the problem into discrete computational chunks, passes them to individual children, waits for the answers, and then quilts the results back into a unified solution.
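In MPI terms, that scatter/compute/gather pattern really is just a handful of calls. Here is a minimal sketch of my own (assuming MPI, e.g. Open MPI or MPICH, is installed; the "work" of squaring numbers is a trivial stand-in for a real computational chunk):

```c
/* Minimal sketch of the scatter/compute/gather pattern described above.
 * Compile with mpicc, run with mpirun -np <N>. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4   /* elements handled per process (toy size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which worker am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many workers in total */

    int n = CHUNK * nprocs;
    double *problem = NULL;

    if (rank == 0) {                        /* master builds the full problem */
        problem = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) problem[i] = (double)i;
    }

    /* Hand each process its own chunk of the problem. */
    double local[CHUNK];
    MPI_Scatter(problem, CHUNK, MPI_DOUBLE, local, CHUNK, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* Each child chews on its own nibble. */
    for (int i = 0; i < CHUNK; i++) local[i] = local[i] * local[i];

    /* Quilt the partial results back together on the master. */
    double *result = (rank == 0) ? malloc(n * sizeof(double)) : NULL;
    MPI_Gather(local, CHUNK, MPI_DOUBLE, result, CHUNK, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("first few results: %.0f %.0f %.0f ...\n",
               result[0], result[1], result[2]);
        free(problem);
        free(result);
    }
    MPI_Finalize();
    return 0;
}
```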

Note that this isn’t all just hardware; the computational methodology for creating abstracted computation environments, and the message passing interfaces that transparently support it, have advanced significantly in the past two decades. In theory you could take a bunch of Apple IIes and gang them together to solve a large computational problem, but the reality is that neither the network interface nor the tools for breaking a problem up into manageable parts existed at that time. It is also the case that basic algorithms for solving large sparse matrices have steadily improved, so it takes less computation to solve the same system to an acceptable degree of precision.

The other problem is alluded to in Francis Vaughan’s post; it has been only in the last ten years or so that commodity computing hardware has been capable of 64-bit computation, allowing it to carry out the high precision floating point arithmetic that is necessary when accurately solving large matrices. You could break up a large problem across the aforementioned cluster of Apple IIes, but they would only be able to provide a limited degree of precision in the answer, which may not be sufficient to give a high fidelity result.
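As a toy illustration of what limited precision does to a long calculation (my own example, nothing any Apple IIe actually ran), compare accumulating the same sum in 32-bit and 64-bit floating point:

```c
/* Toy illustration of why floating point word size matters:
 * summing ten million small values in single vs double precision. */
#include <stdio.h>

int main(void)
{
    float  sum32 = 0.0f;   /* 32-bit: roughly 7 significant decimal digits */
    double sum64 = 0.0;    /* 64-bit: roughly 15-16 significant digits     */

    for (int i = 0; i < 10000000; i++) {
        sum32 += 0.1f;
        sum64 += 0.1;
    }

    /* The exact answer is 1,000,000; the single precision sum drifts visibly. */
    printf("float : %f\n", sum32);
    printf("double: %f\n", sum64);
    return 0;
}
```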

To address the question of the OP in a more general way: your grandmother’s Dell is “more powerful” for most general computing purposes than the Cray C90 could hope to be. But for certain types of operations or metrics, the vector supercomputers remain technically faster, though not in ways that make them comparable to modern computing clusters for any practical purpose.

Stranger

Something worth mentioning is the converse. If your problem wasn’t a large highly regular one, and more like common garden variety program code, vector machines had no advantage, and indeed all that special extra capability really just got in the way. Crays were little faster than other contemporary machines with similar clock rates at running general purpose code.

Another difference: the Cray vector architecture did not support virtual memory, so many modern operating system mechanisms didn’t exist. These early machines used Cray’s homegrown OS; later, Cray adopted Unix. They also used separate minicomputers to manage a lot of the more mundane OS tasks, leaving the serious work of number crunching to the vector processor. In many ways you could regard the vector processor(s) as the add-on processors.

That’s interesting, and makes it even more directly equivalent to a modern desktop system with a high end graphics card.

So, it really was a benchmark machine for its time, eh?

For certain types of problems, a high end FPGA board can be equivalent to a desktop supercomputer today. For a few thousand dollars you can get teraflops of highly parallel computing power.

I’ve got two cards out of a Cray 1. One of the other government labs was selling them off through their museum about 15 years ago. They are astoundingly heavy. The entire structure is built on a thick copper plate, used to get the heat out.

Can you get supercomputers to run Windows? If I was super wealthy could I buy this:

tinker around with it a bit and run my games super fast?

Obviously it wouldn’t be the optimal way to do it, as it’s not designed for that, but if money’s no object is this the way to get games running fastest?

No, it won’t be running Windows.

Any games would require some serious re-writing to realise any performance gains. Most PC games on the market benefit little going from a dual-core to a quad core processor.

Nope. Even if you could wave a magic wand and get Windows installed on it (and got around the licensing problem, since Windows licenses are limited by number of CPUs) and recognizing all that hardware, you still hit the problem that games are not trivially parallelizable. On the other hand, you’d rocket up to the #1 spot in Folding@Home, I’d wager.

I’m disappointed. Maybe I’ll just get an i7 instead.

The supercomputer in question has 7,168 NVIDIA Tesla M2050 GPUs and 14,336 Intel Xeon CPUs. So it seems it’s basically a room full of blade servers. Most likely each blade has 2 CPUs and one GPU.

It should be easy to get Windows installed and running on each blade. The hard part is writing the software to break down a problem and distribute the workload onto all these individual computers. Also you need to spend something like $1 million on Windows licenses.

And even more so now, it would seem.

Tangent question:
When programming supers, how are the individual CPUs/GPUs addressed from the controlling program? At some low level either the IP or MAC address is being used, but I’m wondering if there is typically some logical identifier at a higher level, like CPU number 1 or CPU number 2,347, etc.

Does each of the CPUs typically have a communications program running that sends and receives work over the network, in addition to the program actually performing computations? Or is there special hardware to keep the CPU (which I assume doles out work to the attached GPUs) from having to handle that? In other words, how much does it look and act like a cluster/network of PCs, and in what ways is it significantly different other than scale?

Depends upon your supercomputer - real supercomputers use special interconnects that provide for a range of possible programming paradigms, and very fast speeds.

At one end you have the SGI UV systems, which are a single system image with a single shared memory space. Just like a multi-core desktop but much, much bigger. Here however locality issues mean that you still need to be aware of where your data is, and there is an underlying notion of CPU identity and some ability to direct where data and processing occurs. Everything after the CPU itself is custom hardware, so these machines are quite a bit more expensive than more conventional clusters. However for the right job they are pretty hard to beat.

More common supercomputer interconnects - Myrinet, Infiniband, and the like - actually do usually provide an IP connection, and even a virtualised Ethernet connection, but these are not usually used for the computational effort, being used instead for control duties. IP places a large and unwelcome burden on the communication. In general you will see the actual code written to use a general purpose message passing library - such as MPI - which abstracts over the actual interconnects and protocols. The specialised interconnect manufacturers provide customised implementations of MPI that work efficiently with their hardware. MPI just identifies the target CPU as a number. (The hardware itself does of course have a MAC address.) Most programs are written in such a way that they dynamically configure themselves to work on the number of CPUs that are made available at the time. So you make calls to the MPI library asking how many CPUs there are (or you have been given), and each instance of the program will ask for the identity of the CPU it is running on. MPI provides a whole raft of ways of creating parallel communication between all the CPUs, so that all the common paradigms are directly supported, and any specialised hardware tweaks can be used. For instance, global synchronisation is something that is really worth adding hardware in the interconnect to effect.
The communications hardware will handle all of the work of moving data from the memory of one machine to another. The local CPU only needs to issue a request to the hardware. An important optimisation in any code is to identify where it is possible to overlap communication and computation, so getting the data you need for the next bit of work before sending out the work just done is important. (This is of course not trivial to effect for the entire program, so paradigms like red-black interleaving are used to allow this to work.)
Another thing - the interconnects usually work directly in user space. Myrinet for instance maps the control registers of the controller card into the user program’s memory. Requests to the hardware are therefore made directly from user code and do not require a system call. This significantly improves performance. However there is now essentially zero security. This is perhaps one of the more obvious differences between a supercomputer (even if it is little more than a stack of blade servers) and a computational cluster. Clusters tend to be managed to provide a full security model. Once you are past the management front end of a real supercomputer, anything that gets in the way of performance is thrown out.
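Very roughly, the code ends up looking something like this. This is a minimal MPI sketch of my own (the ring exchange is a toy stand-in for a real halo exchange); it shows the rank/size queries and the overlap of a non-blocking send/receive with local computation:

```c
/* Sketch: each process asks MPI who it is, then overlaps sending its
 * boundary data to a neighbour with doing useful local work.
 * Compile with mpicc; the "work" here is deliberately trivial. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many CPUs did we get? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which one am I?           */

    int right = (rank + 1) % nprocs;         /* neighbours in a ring      */
    int left  = (rank - 1 + nprocs) % nprocs;

    double boundary_out = (double)rank;      /* data the neighbour needs  */
    double boundary_in  = 0.0;
    MPI_Request reqs[2];

    /* Start the communication, but don't wait for it yet. */
    MPI_Isend(&boundary_out, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&boundary_in,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    /* Meanwhile, crunch on the interior work that doesn't need the
     * neighbour's data - the overlap of communication and computation
     * mentioned above. */
    double interior = 0.0;
    for (int i = 0; i < 1000000; i++)
        interior += (double)i * 1e-6;

    /* Now block until the boundary data has actually arrived. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d of %d: got %.0f from rank %d (interior work = %.1f)\n",
           rank, nprocs, boundary_in, left, interior);

    MPI_Finalize();
    return 0;
}
```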

Thanks Francis Vaughan, exactly what I was looking for.

Interestingly, we seem to have moved backwards in the physical size of supercomputers.

The premier Cray computer, the Cray-1 (from 1975, not 1990), was physically a very compact machine (see photo); two people probably could have joined hands around it. But the current supercomputers are moving back toward ENIAC: room-sized monstrosities.

When comparing the performance of your average desktop to the Crays of the early 1990s or even 80s, you must take into consideration not the “peak MFLOPS” ratings but the sustained MFLOPS ratings. Peak MFLOPS mean little if the machine never reaches such performance levels. The Army actually analyzed the cost/benefit of a cluster of P4 2.8 GHz machines versus going with a Cray solution. They discovered that although the P4 2.8 GHz had a high peak rating of 5.6 GFLOPS, in practice it only reached 2% of its peak performance due to bandwidth limitations! Please see the results of the Army’s study right here https://cug.org/5-publications/proceedings_attendee_lists/2003CD/S03_Proceedings/Pages/Authors/Muzio_slides.pdf. Needless to say, their conclusion was that the Cray solution was more cost effective and easier to program and maintain.
NASA’s high performance computing lab also did a comparison between their old Cray X-MP/12 (one processor, 2 megawords of memory) and a dual Pentium II 366 running Windows NT. They had to redesign the space shuttle’s solid rocket boosters back in the late 80s after the Challenger disaster, and the Cray X-MP was used to model air flow and stresses on the new design. Some years later the code was ported to a Windows NT workstation and the simulation rerun for comparison. The result is that a single processor Cray X-MP was able to compute the simulation in 6.1 hours versus 17.9 hours on the dual Pentium II. The Cray X-MP could have up to four processors with an aggregate bandwidth of over 10 GB/sec to main memory; this kind of SUSTAINED bandwidth between CPU (not GPU) and main memory was not matched on the desktop until about 4 years ago. The Pentium IIs had either a 66 MHz or 100 MHz bus speed, so we are talking a maximum bandwidth of only 800 MB/sec (528 MB/sec with the 66 MHz bus) and around 330 MB/sec sustained transfer rates (remember, PCs use DRAM and the Crays mostly used very expensive SRAM memory). The importance of bandwidth to real world number crunching performance can be seen in the STREAM benchmark. Please go to http://www.streambench.org/ to see exactly what I mean.
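If you want to see the sustained-bandwidth effect on your own machine, a crude home-made triad in the spirit of STREAM looks something like this (a rough sketch of my own, not the official benchmark; array size and timing are simplified):

```c
/* Rough home-made version of the STREAM "triad" idea: a[i] = b[i] + s*c[i].
 * With arrays far larger than cache, the MB/s measured here reflects
 * sustained memory bandwidth, not the CPU's peak flop rating. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* ~160 MB per array: well out of cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    const double s = 3.0;
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];     /* 2 flops, 24 bytes moved per element */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double mbytes = 24.0 * N / 1e6;
    printf("triad: %.0f MB in %.3f s -> %.0f MB/s, %.0f MFLOPS (a[0]=%.1f)\n",
           mbytes, secs, mbytes / secs, 2.0 * N / 1e6 / secs, a[0]);

    free(a); free(b); free(c);
    return 0;
}
```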
In 1990 the Cray C90 was the baddest supercomputer on the planet, and at $30 million fully configured it was also by far the costliest. Here’s a photo of it: Home | Computational and Information Systems Lab. The Cray C90 could have up to 16 processors, with 16 GB of memory, and could achieve a maximum performance of around 16 GFLOPS. “Well gee, my cheapo Phenom X6 can do well over 16 GFLOPS because that’s what it says on my SiSoft Sandra score, so I have a Cray C90 sitting under my desk blah blah…” You are completely wrong if you think this. The SiSoft Sandra benchmark tests everything in cache, which is easy for the CPU to access. Real world problems, the kind that Crays are built to solve, can’t fit into a little 4 MB cache, and thus we come to sustained bandwidth problems.

The C90 can fetch 5 words per clock cycle (for each processor) from main memory and has a real world bandwidth of 105 GB/sec; compare this to a relatively modern quad-core Core i7 2600 that gets a measly 12 GB a second sustained bandwidth. “But the Core i7 2600 is clocked much higher than the C90, which only operates at 244 MHz per processor.” Ahhh, but if the data is not available for the processor to operate on, then it just sits there, wasting cycles, waiting for the memory controller to deliver data to it. Without getting into too much detail (if you want a lot of detail, read my analysis of the Cray 1A versus Pentium II below), the real world performance of the C90, working on data sets too large for a typical PC’s small cache, works out to roughly 8.6 GFLOPS, while the Intel Core i7 2600 will achieve only about 1 GFLOPS sustained on problems out of cache. So far there are no desktops, and won’t be for quite a few years, that come EVEN close to the real world sustained bandwidth (and thus sustained performance) of a C90. Now for problems that do fit into the tiny cache and can be mostly pre-fetched, of course the desktop will be superior to the old Crays. Here is a rough comparison I made between a Cray 1A and a Pentium II 400; read on only if you want to be bored to death:

The Cray 1A had a clock cycle time of 12.5 ns, or an operating frequency of 80 MHz. It had three vector functional units and three floating point units that were shared between vector and scalar operands, in addition to four scalar units. For floating point operations it could perform 2 adds and a multiply per clock cycle. It had a maximum memory configuration of 1 million 64-bit words, or 8 megabytes, at 50 ns access time, interleaved into 16 banks. This interleaving had the effect of allowing a maximum bandwidth of 320 million words per second into the instruction buffers, or 2560 MB/sec. Bandwidth to the 8 vector registers of the Cray 1A could occur at a maximum rate of 640 MB/sec. The Cray 1A possessed up to eight disk controllers, each with one to four disks, and each disk having a capacity of 2.424×10^9 bits, for a maximum total hard disk capacity of 9.7 gigabytes. There were also 12 input/output channels for peripheral devices and the master control unit. It cost over 7 million in 1976 dollars and weighed in at 10,500 lbs with a power requirement of 115 kilowatts. So how does this beast compare with my old clunker of a PC with 384 MB of SD100 RAM and a P2 400 MHz CPU?

Well, let’s take a simple triad operation, with V representing a vector register and S representing a scalar register:

S * V0 + V1 = V2

Without getting into too much detail, this equation requires 24 bytes of data to be moved each time it is performed. There are two floating point operations going on here: the multiplication of the scalar value with the vector, then the addition of the second vector. Thus, assuming a problem too large to just loop in the Cray 1A registers, and a bandwidth of 640 MB/s, the maximum performance of a Cray 1A would equal (640/24) * 2 = 53 MFLOPS on large problems containing data which could not be reused. This figure correlates well with the reported performance of the Cray 1A on real world problems:

http://www.ecmwf.int/services/computing/overview/supercomputer_history.html.

True bandwidth on a Cray 1A would also have to take into account bank conflicts plus access latency, so about 533 MB/sec sustained is a more realistic figure. On smaller problems with reusable data the Cray 1A could achieve up to 240 MFLOPS by utilizing two addition functional units and one multiplication functional unit simultaneously through a process called chaining. So you see, the Cray 1A could be severely bandwidth limited when dealing with larger heterogeneous data sets.

My Pentium II 400 has 512 KB of L2 cache, 384 megabytes of SD100 RAM, and a 160 GB 7200 rpm hard drive. Theoretically it can achieve a maximum of 400 MFLOPS when operating on data contained in its L1 cache, although BLAS-based benchmarks place its maximum performance at 240 MFLOPS for double precision operations, which is what we are interested in here. Interestingly, this is about the same as what a Cray 1A can do on small vectorizable code. However, once we get out to problem sizes of 128 KB or 256 KB or even 512 KB, my Pentium II would beat the Cray 1A even in its greatest strength, double precision floating point operations, due to the bandwidth advantage of the L2 cache over the Cray’s memory. At 1600 MB/s bandwidth my computer can do up to 133 MFLOPS for problems under 512 KB in size but larger than the L1 cache.

Once we get beyond 512 kilobytes the situation shifts, as data would then need to be transferred from the SD100 RAM. The theoretical bandwidth of SD100 RAM is 800 MB/sec, still greater than the Cray 1A, but here we run into some issues. The Cray 1A had memory comprised of much more expensive SRAM, while my memory is el crapo DRAM, which requires refresh cycles. So with these taken into account my DRAM actually has a theoretical maximum bandwidth of about 533 MB/s and a real world maximum sustained bandwidth of a little over 300 MB/s. This means that for problems out of cache, my Pentium II gets slowed to a measly 315/12 = 26 MFLOPS. In this special situation, where the problem is vectorizable, the Cray 1A is still faster than my Pentium II; not bad for a computer that is 30 years old.
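All of this is the same back-of-the-envelope formula: sustained bandwidth divided by the 24 bytes moved per triad, times the 2 flops per triad. It can be wrapped in a few lines of C if you want to play with it (the bandwidth numbers plugged in are just the figures quoted above, treat them as illustrative, not as measurements):

```c
/* Back-of-the-envelope estimate of bandwidth-limited triad performance:
 * each triad element moves 24 bytes and does 2 flops, so
 *   MFLOPS ~= (sustained MB/s / 24) * 2
 * The bandwidth figures below are the ones quoted in this post. */
#include <stdio.h>

static double triad_mflops(double mb_per_sec)
{
    return mb_per_sec / 24.0 * 2.0;
}

int main(void)
{
    printf("Cray 1A @  640 MB/s peak      : %.0f MFLOPS\n", triad_mflops(640.0));
    printf("Cray 1A @  533 MB/s sustained : %.0f MFLOPS\n", triad_mflops(533.0));
    printf("P2 L2   @ 1600 MB/s           : %.0f MFLOPS\n", triad_mflops(1600.0));
    printf("P2 DRAM @  315 MB/s sustained : %.0f MFLOPS\n", triad_mflops(315.0));
    return 0;
}
```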

Once we get to problems greater than 8 megabytes, the advantage shifts completely back to my Pentium II, as the Cray 1A must then stream data from its hard disks (which were slower than Ultra ATA/100) while my computer can go right on fetching data from RAM. The Cray 1A could not realize its full potential as it was hampered by bandwidth and memory size issues, yet in certain situations could outperform a desktop computer from 1998. Solid state disks, more memory ports, and larger memories were utilized in the subsequent Cray X-MP to address these problems.

A desktop like the Core 2 Duo E6700 can do over 12 gigaflops, BUT only on problems that are small and fit into its cache. Once the data gets out of cache, today’s modern computers get their butts kicked by the old school Crays from the 80s. Just visit http://www.streambench.org/ to see what I mean.