Link to Cecil’s column from 2003.
From the column:
The 15 years are within a year or so of being up. Has the performance of the Japanese Earth Simulator been matched “by something you can buy for $900 at Wal-Mart”?
Absolutely no way. The only thing that may approach that at the “lower” end is an average public university compute cluster, and I’d wager you’d have to leverage the entire cluster at once (not typically done) to get that performance.
The best consumer-grade GPU right now only hits around 9 TFLOPS, and most computers can’t physically handle enough RAM sticks to get past maybe 64GB (and that’s if you’re buying the really high end RAM).
Note that supercomputers are much more powerful now, though, and also much more accessible to the user.
For instance, you can rent time on Amazon’s pretty reasonably.
Comparing the article to current events, the most powerful supercomputer is now the Sunway TaihuLight in Wuxi, China. It runs at 93,014.6 teraflops, or 93 petaflops.
https://www.top500.org/system/178764
The lowest-ranked, #500, is still 286.1 teraflops, or over 7 times faster than the Earth Simulator.
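For scale, the Earth Simulator's own Linpack number was about 35.86 teraflops, so the ratios work out roughly like this (a quick back-of-the-envelope in Python, using the Top500 figures quoted above):

```
# Rough scale check using published Top500 Linpack (Rmax) figures.
earth_simulator = 35.86        # TFLOPS, 2002 Earth Simulator
taihulight      = 93_014.6     # TFLOPS, Sunway TaihuLight
rank_500        = 286.1        # TFLOPS, slowest machine on the current list

print(f"TaihuLight vs Earth Simulator: {taihulight / earth_simulator:,.0f}x")
print(f"#500 vs Earth Simulator: {rank_500 / earth_simulator:.1f}x")
```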
The newest Nvidia TitanX Pascal graphics card can do 11 TFLOPS, but even the TitanX itself is more than $900 ($1200 MSRP). The high-end desktop i7 processors from Intel manage maybe 0.5 TFLOPS and, again, more than $900 just for the CPU - no motherboard, RAM, etc.
The Earth Simulator's 10TB of main memory by itself costs way more than $900 to match. Even 1TB - 64 slots filled with cheap consumer 16GB memory modules - runs around $4000, and 10TB means ten times as many modules.
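A quick sanity check on that, assuming a ballpark $60 per consumer 16GB module (prices obviously move around):

```
# How much commodity DRAM it takes to match the Earth Simulator's 10TB.
target_tb        = 10
module_gb        = 16
price_per_module = 60          # USD, assumed ballpark for a consumer 16GB module

modules = target_tb * 1024 // module_gb     # 640 modules
print(f"{modules} modules, roughly ${modules * price_per_module:,}")
```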
Comparing modern compute with a supercomputer of the '70s (which by default really means a Cray-1) is a fraught issue. You can't compare peak compute; you need to compare sustained real-world compute. And it is important to note that even back then Seymour Cray understood that simple fast compute did not a supercomputer make. The ability to get data in and out was just as critical, and an unsung part of the design of any real supercomputer is the effort put into IO. Even the Cray-1, for its time, had jaw-dropping IO capabilities.
There has been a significant change in supercomputers which makes just comparing the supercomputers of now with those of, say, 15 years ago difficult: more money is being spent on modern supercomputers. Not just nominal amounts of money - the very big top-10 supercomputers cost significantly more in real terms than a Cray-1 did. A Cray-1 cost about $9 million, or $35 million in today's money, not the multiple hundreds of millions that go into the absolute top end. Even at the lower end, you see institutions that once would have spent, say, $1 million in today's money on a supercomputer spending five times that now. The reason is that you get so much more for your money now that new science is possible, so you get much more science for your dollar, making spending money on a supercomputer (rather than, say, some experimental equipment) a better use of the money - at least by the metrics that are often used to measure science for the dollar.

The other little secret about many of the top-end supercomputers is that they rarely get used as a single big computer for one really huge job. They are all clusters, and the cluster management will almost always run them as a set of individual sub-clusters, each running different jobs, with a front-end queue manager allocating jobs out. This makes it much easier to get buy-in from the disparate research groups, and so, rather than have lots of individual high-performance compute systems scattered around the various research areas, you amalgamate it all into one huge prestigious machine. When it is commissioned you run Linpack on it, get a bragging-rights number (Gigaflop Harlotry, as an old colleague termed it) and then leave the machine to run as a set of smaller sub-units day to day.
And then you have to ask just what “a computer” is. Is Seti@Home a supercomputer? If not, just what distinguishes it from all of the other clusters? Can a typical home user then say that they own a supercomputer (or at least, a small piece of one)?
I note that under Moore’s (revised) law of doubling every eighteen months, 15 years of such doubling constitutes almost exactly a 3-orders-of-magnitude improvement, or one SI prefix.
So Cecil’s ‘typical desktop’ benchmark of 1.8 Gflops in 2001 would be expected to improve to 1.8 Tflops in 2016. So how we doing on that mark? Can you get a $900 computer (probably a laptop) that does nearly 2 Tflops?
Powers &8^]
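The doubling arithmetic does work out almost exactly - a quick sketch using the 1.8 Gflops desktop figure quoted above:

```
# Doubling every 18 months over 15 years = 10 doublings = 1024x, or about 10^3.
baseline_gflops = 1.8                  # Cecil's 2001 "typical desktop"
doublings = 15 / 1.5                   # 10
factor = 2 ** doublings                # 1024
print(f"{factor:.0f}x -> {baseline_gflops * factor / 1000:.2f} TFLOPS expected by 2016")
```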
For what it’s worth, I had a lengthy calculation some 30-odd years ago that took 17 hours on an original IBM PC. (And the original IBM PC was a good deal faster than most mainframes of my youth.) My iPhone 6 Plus does it in about half a second. My iMac in about 0.085 seconds.
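Spelling out the speedup from those timings:

```
# Same calculation: ~17 hours on an original IBM PC vs well under a second now.
ibm_pc_seconds = 17 * 3600
for name, seconds in [("iPhone 6 Plus", 0.5), ("iMac", 0.085)]:
    print(f"{name}: roughly {ibm_pc_seconds / seconds:,.0f}x faster")
```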
One big difference between a "classic" supercomputer and a modern one is that the classic only had one or two very fast CPUs (and maybe a vector unit) and was built from esoteric components using exotic methods. The modern one is built from a massive number of replicated components, which are either standard off-the-shelf parts or somewhat customized versions of standard parts. Seymour Cray (who I once had the pleasure of meeting at NASA Ames) is reported to have said "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
However, the writing was on the wall for giant single- (or dual-) processor systems, as microprocessors began to catch up with (and surpass) the performance of the giant systems at a minuscule fraction of the space / power requirements.
I have 5-year-old+ x86 systems running in my dining room that have 192GB of RAM installed. Although I will freely admit that isn’t typical.
The Student Cluster Competition has produced systems that show excellent performance - the most recent winner achieved 12.57 LINPACK TFLOP/s. One of the competition rules is that entrants are limited to a 3 kW power budget.
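That works out to an impressive performance-per-watt figure (just dividing the two numbers above):

```
# Efficiency of the winning student cluster under its 3 kW cap.
tflops, watts = 12.57, 3000
print(f"{tflops * 1000 / watts:.2f} GFLOPS per watt")
```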
That’s all well and good, but the real question is how well it will run Cookie Clicker?
One major limit on the “two strong oxen” model of supercomputer has been the speed of light: Modern clock cycles are fast enough that signals can’t get from one side of the chip to the other in a single cycle. Now, you can do multiple operations in one clock cycle, but that’s not going to get you much more than an order of magnitude, and you can arrange your chip so signals only have to make it a small fraction of the way across, but if you’re effectively isolating parts of your chip that way, then why not make it multiple chips?
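To put rough numbers on the speed-of-light point (these use the vacuum figure; real signals on a chip or board propagate a good deal slower):

```
# Distance light travels in one clock cycle - an upper bound on signal reach.
c = 299_792_458            # m/s in vacuum
for ghz in (1, 3, 5):
    print(f"{ghz} GHz: {c / (ghz * 1e9) * 100:.1f} cm per cycle")
```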
It gets messier than that. A Cray-1 was partially speed-of-light limited - and it had fewer transistors in its entire design than you get on a single die for even a modest CPU nowadays. The backplane was hand-wired with custom lengths of wire to tune the timing, and the circuit boards had wiggly traces to tune delays.
A modern chip can't even access cache memory fast enough - the limits on how fast you can access cache scale with the size of the cache, and set a floor on what you can do. Thus any sort of SIMD/vector operation starts to win big, because you can (potentially) fetch all vector operands in a single wide operation. So vector registers and vector operations get you back a lot of performance.
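You can see the flavour of the vector win even on an ordinary desktop - numpy's wide, vectorised path versus a pure-Python element-at-a-time loop (the exact ratio will vary wildly from machine to machine):

```
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

t0 = time.perf_counter()
c_loop = [x * y for x, y in zip(a, b)]   # one element at a time
t1 = time.perf_counter()
c_vec = a * b                            # one wide, vectorised operation
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.3f}s  vectorised: {t2 - t1:.4f}s")
```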
In general, you have to find real parallelism in any code if you want it to go fast. Otherwise you are limited to pretty much the performance any modern desktop can provide. Even spending unlimited money might only double your performance over a run-of-the-mill generic desktop. No matter how much money you have, Word or Excel won’t go any faster.
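That's essentially Amdahl's law. A small sketch of why unlimited hardware might only buy you that factor of two when half the work is serial:

```
# Amdahl's law: speedup = 1 / (serial_fraction + (1 - serial_fraction) / N)
def amdahl(serial_fraction, n_procs):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

for n in (2, 16, 1024, 10**9):
    print(f"{n:>10} procs, 50% serial: {amdahl(0.5, n):.2f}x")
```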
Algorithm design for very large clusters needs to be built around the physical structure. The topology and latency of the interconnect dictate the way things need to be done, and more importantly, what needs to be done is the main driver in the design of the interconnect and its topology. But you have to fit your system into three-dimensional space, and so there is only so much you can do with interconnections before you run out of space to hold them. A really nice interconnect like a fat-tree grows in hardware much faster than linearly with the number of interconnected nodes, even though the latency only grows with the log of the number of nodes. Eventually you have to call a stop, otherwise there is just no space for the actual compute. So you find 3D toroids in many very, very big systems. And you see a very large variation in interconnect technology.
Tricks such as scaling each slab of data computed so that the latency of communication and the time to compute are similar allow you to have the communications subsystem run flat out moving data whilst the compute does its job, with each ready for the other at the next step, thus minimising idle time in both. But this requires cognisance of the interconnect's physical layout as well as its bandwidth and latency. You need to avoid hot spots in the network, as these will pace the entire computation. Indeed, the real killer can be stragglers that pace the entire system's operation to something well below its intrinsic capability. It isn't easy.
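A toy model of that overlap trick - when communication is hidden behind compute, each step costs the larger of the two rather than their sum (the per-slab timings here are made up):

```
# Per-step cost with and without overlapping communication and compute.
compute_ms, comms_ms, steps = 40, 35, 1000     # assumed per-slab timings

serial_total  = steps * (compute_ms + comms_ms)       # no overlap
overlap_total = steps * max(compute_ms, comms_ms)     # perfect overlap

print(f"no overlap: {serial_total/1000:.0f}s  overlapped: {overlap_total/1000:.0f}s")
```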
Fortunately, big computing jobs are often of a form that’s inherently easily-parallelizable. The biggest computing job I ever did, about a decade ago, took two weeks on my personal computer (I did it over winter break, when I wasn’t going to be on that computer anyway). It consisted of doing the same calculation repeatedly, with about a million different values for one of the parameters. I just gave it multiple processes at once, each covering a different range of the values, and when a processor finished one, it started on the next, without having to care what other processors were doing (or even what it was doing, on different iterations). If I had a hundred thousand processors, I could have done it a hundred thousand times faster. If I had a million processors… well, it wouldn’t have been a million times faster, because some runs took much longer than others, but I still could have used them.
Around the same time, one of my professors had a problem that took him a few weeks on a supercomputer cluster. His problem was looking for a very large number of possible correlations, on a large data set. Again, each check was independent of the results of all of the others, and so he could make use of as many processors as he had checks to perform.
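That kind of independent sweep is about the easiest thing in the world to spread over processes - a minimal sketch, with simulate() standing in for the real calculation:

```
from multiprocessing import Pool

def simulate(parameter):
    """Stand-in for the real calculation; each run is independent of the others."""
    return sum((i * parameter) % 7919 for i in range(10_000))

if __name__ == "__main__":
    parameters = range(10_000)           # in practice, as many values as you have
    with Pool() as pool:                 # one worker process per core by default
        results = pool.map(simulate, parameters, chunksize=100)
    print(len(results), "runs finished")
```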
All too often people get overly focused on scientific computing and floating point calculations. Hence MFLOPS, etc.
But most of the really big computing systems today do mundane things like banking, credit cards, credit tracking, etc. Not a lot of floating point, a small amount of integer ops, quite a bit of string ops, and a large amount of DB/Data structures ops. All of which is reasonably parallelizable (in theory, reality is something else). It’s not flashy. No one is racing to build the biggest transaction system just so they can brag about it. (And in fact these companies tend to keep quiet about their systems.)
Even the NSA, with its mammoth computing systems, isn't doing much floating point stuff.
Companies like Google and Amazon have huge server farms that put the biggest supercomputers to shame. Some they use themselves, some they rent out.
Forget FLOPS. That’s so 1970s.
Even the newest model of the Apple Watch has two processors (or, in modern jargon, cores), and all the main Apple operating systems include a feature (“Grand Central Dispatch”) to make it easy to chop large problems into parallel streams.
And whereas, ever since the '60s, IBM mainframes had come in half a dozen or so separate models with a speed range of perhaps 100-to-1, now they come in only two speeds ("fast" and "cheaper, but almost as fast"), and if you want more speed, you just add more engines.
Huge cloud clusters are not supercomputers. They are just rooms full of boxes. They rent them out by the hour and they are very often used as standalone machines. Often it is little more than companies moving their servers out of their own machine room into Amazon's. Many run web servers. Front-ending your commercial web services with rooms full of boxes you rent makes lots of sense. It isn't a supercomputer. If you tried to use a typical cloud cluster for a serious parallel computation - as opposed to just running bunches of independent stuff - you would find you were very quickly running out of steam, as all the bits that make a supercomputer a supercomputer become so clearly important. (I do use AWS cloud machines for some serious compute, but it isn't really all that big - a few hundred cores for maybe ten odd hours per run. And the code runs a very easily partitionable problem.)
Transaction processing is a very specialised problem all of its own. The problem with financial transactions is that they don't parallelise well. Transactions are usually ACID (Atomic, Consistent, Isolated, Durable) and this is really hard to do with simple x86 servers when you are doing lots of transactions per unit time. There is a reason IBM, HP and Sun can still sell insanely priced big iron. There is a lot of smarts in those machines that allows them to provide highly reliable OLTP - reliable enough that you would trust your money to them.
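The atomicity part is easy to illustrate, even if real OLTP is a different beast - in this toy Python/sqlite3 sketch the transfer either commits both updates or neither:

```
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 1000), ("bob", 0)])

try:
    with conn:  # atomic: commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 500 WHERE name = 'bob'")
except sqlite3.Error:
    pass        # the transfer never half-happens

print(conn.execute("SELECT * FROM accounts").fetchall())
```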
Far from floating point being old school, it is arguably a much bigger deal than it ever was when you are talking big-time compute. Even when you look at some of the really big dedicated clusters, you discover you are talking animation rendering (Weta for a time owned one of the largest clusters and used it solely to render LOTR). Then you discover that seismic processing remains one of the biggest users of commercial cycles, with companies like Saudi Aramco having a constant presence in the higher reaches of the Top 500. Then you get the grand challenge problems - turbulence, protein folding, drug discovery, much of machine learning - and you find that far from being old school, floating-point performance is the key enabler.
Frequently it wasn't the processing power that set supercomputers apart, it was the internal bandwidth (the number of lanes that data could travel to get from one part of the machine to another). That's where the supercomputers shone. Your newfangled Pentium 4 might be a whole lot faster than the bank's 10-year-old supercomputer, but that didn't matter if you couldn't get an equivalent amount of data on and off the box to perform the work.
Nowadays, we've got USB 3.1 and Thunderbolt 3, which have been created to handle things like multiple 4K video streams. (Ultra HD Blu-ray 4K streams at 108 Mbps)…TB3 delivers 40 Gbps…Now all of these are just numbers, and meaningless ones for the normal individual. Suffice it to say, it's a bunch, and getting faster all the time.
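To make those numbers slightly less meaningless - how many of those streams fit down one TB3 link:

```
# How many Ultra HD Blu-ray-class streams fit in a Thunderbolt 3 link.
tb3_gbps    = 40
stream_mbps = 108          # figure quoted above
print(int(tb3_gbps * 1000 / stream_mbps), "simultaneous streams, in theory")
```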
More importantly, supercomputers have migrated from expensive custom hardware to COTS (Commercial Off The Shelf) hardware that the average person can purchase, if they have enough money…they just combine a LOT of them. There is no philosophical difference between running your weather simulation on your Nvidia Titan GPU at home and renting time on an AWS compute cluster…you're just renting a slice of someone else's computer.
If your problem takes one computer 1000 hours to solve, it might take 1000 computers 1 hour to solve…and the cost to rent will be the same.
At the same time, there’s the $5 Raspberry Pi Zero…it’s kinda sorta equivalent to a Pentium III, but can play back full 1080p HD video streams (PIII’s couldn’t do that.)
They give them away for free when you subscribe to MagPi magazine. Moore's law isn't JUST about doubling speed, it's also about halving the cost to put components onto chips.
A server farm is not a supercomputer, and neither is a cluster of standard Intel PCs connected via Ethernet. I'd say at the very least you need something much faster, like InfiniBand networking, between the compute nodes to make them act as a supercomputer. Various companies still make cache-coherent shared-memory supercomputers as well, e.g. SGI (who used to be Silicon Graphics many years ago) still makes the UV line of ccNUMA supercomputers.
https://www.sgi.com/products/servers/uv/
Cray also still exists, but like everyone else they're making supercomputers from Linux Intel machines (with their own custom, extremely high-speed, low-latency networking and their own Linux kernel patches). They're a bit Ship of Theseus-like - pieces of the business were sold off back when SGI owned Cray (a server division went to Sun, and the supercomputer business itself later went to Tera Computer, which took the Cray name) - but they still exist in name anyway.