Are today's PCs more powerful than a 1970s supercomputer?

On the topic of how supercomputers tend to be different from consumer-level computers:

I’ve noticed that GPGPUs designed for production or heavy computing work have a lot of VRAM. Nvidia’s Titan cards, for example, usually have 50 to 100% more VRAM than the next chip down.

To make a concrete comparison, the 980Ti has 6GB of VRAM and that’s about as much as it can handle for gaming. The Titan X (Maxwell) has twice as much. What’s the point of having twice as much VRAM? Does it make other forms of computing faster/more power efficient? If so, precisely* how?

  • *Precisely = in words that someone with an interest in CS but no formal training in it can understand.

For gaming, more VRAM means you can have more high-resolution textures and higher-polygon-count models loaded on the card at the same time. Loading from main RAM to VRAM is relatively slow, so you want to avoid swapping textures as much as you can. For computation on the GPU it’s the same thing: more VRAM means you can have more complex computations loaded on the graphics card and running without having to swap between RAM and VRAM.
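If you want to see the cost of that swapping for yourself, here’s a minimal timing sketch (assuming PyTorch and a CUDA-capable card; the matrix size is arbitrary). It compares a matrix multiply where the data has to be re-copied from main RAM each time against one where the data already lives in VRAM:

    # Rough illustration of why you want data resident in VRAM.
    # Assumes PyTorch and a CUDA-capable GPU; exact timings vary by machine.
    import time
    import torch

    x_cpu = torch.randn(4096, 4096)   # ~64 MB of float32 in main RAM
    x_gpu = x_cpu.to("cuda")          # one-time copy over PCIe into VRAM

    def timed(fn):
        torch.cuda.synchronize()      # make sure prior GPU work has finished
        t0 = time.perf_counter()
        fn()
        torch.cuda.synchronize()      # wait for this GPU work to finish
        return time.perf_counter() - t0

    print("copy from RAM + matmul:", timed(lambda: x_cpu.to("cuda") @ x_gpu))
    print("matmul on VRAM only:   ", timed(lambda: x_gpu @ x_gpu))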

I do 3D oceanographic simulations for a living, and our team has been courted by Amazon and Microsoft Azure. The default options (if you just go to buy via their web interfaces) can’t match our current in-house Infiniband-connected cluster by a long way, but if you’re buying enough that it’s worth it to them, they’re willing to work on special set-ups (e.g. guaranteed to all be on the same rack). Still running tests but it’s in the ballpark of comparable.

We’ve checked out Cray; these days it’s (mostly) a sold-on name (as you say, Ship of Theseus) that puts together and services off-the-shelf racks (GPU or CPU), and is no more special than a dozen other companies we get quotes from.

Mostly what coremelt said. While RAM to CPU cache is actually a significant slowdown in parallel applications, it’s a highway compared to the dirt road from RAM to GPU memory. That’s still way faster than sending over a network, but pushing globs of data through the transfer buffer is far from ideally fast. If you actually try to stream straight from RAM to GPU (or, god forbid, disk to GPU), one example at a time, you’re likely not going to get much better than CPU speeds. In fact, it may even be significantly slower.
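To put ballpark numbers on the highway/dirt road picture (theoretical peak figures for roughly 2016-era parts; sustained rates are lower, and your hardware will differ):

    # Rough peak bandwidths, in GB/s, for ~2016 hardware (illustrative only).
    ddr4_dual_channel = 34    # CPU <-> main RAM ("the highway")
    pcie3_x16         = 16    # main RAM <-> GPU over the transfer bus ("the dirt road")
    gddr5_980ti       = 336   # GPU <-> its own VRAM

    data_gb = 4  # shipping 4 GB of working data around
    print("RAM -> GPU over PCIe:", data_gb / pcie3_x16, "s")
    print("GPU <-> VRAM on-card:", data_gb / gddr5_980ti, "s")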

The second reason is that things like Deep Networks just require a lot of memory, straight up. Loading both the network’s weights and a reasonably sized RGB image into memory is a pretty sizable chunk, and that’s not including the space for the output buffer which is proportional to the number of classes.

You still usually only send mini-batches of maybe a couple hundred examples to the GPU at a time, but the bigger you can make your GPU batches, the faster it can run because, well, dirt road. It wouldn’t be nearly as big of a deal with most datasets: most common benchmark datasets could comfortably fit in GPU memory on their own, and so could a lot of real-world ones. But big images (especially for things like video) plus deep network weights really eat up memory quickly.
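To give a feel for how the memory goes (illustrative sizes only; the parameter count assumes a roughly VGG-16-sized network, and real training needs much more once activations, gradients and optimiser state are counted):

    # Rough VRAM budget: network weights plus one mini-batch of RGB images.
    # Illustrative numbers only, not any particular setup.
    bytes_per_float = 4
    weights    = 138_000_000        # ~138M parameters (roughly VGG-16 scale)
    image      = 3 * 224 * 224      # one RGB image at 224 x 224
    batch_size = 256

    weights_gb = weights * bytes_per_float / 1e9
    batch_gb   = batch_size * image * bytes_per_float / 1e9
    print(f"weights: {weights_gb:.2f} GB, one batch: {batch_gb:.2f} GB")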

I’ve had to stream my batches from disk before because my PC actually hit swap loading entire datasets at once, causing a full PC lock and necessitating an actual power button hard reboot. If there’s enough data to break 16GB of RAM, you have more than enough to make transfer speed a problem.

Read up on Moore’s law and clock rate.

We hit a wall around 3 GHz around 2005. You WILL NEVER see 5 GHz or 6 GHz without lots of fans and liquid cooling, or putting the machine in a cold room.

And Moore’s law is starting to break down now. It’s only expected to take us to around the year 2020, when we will be at 10 nm. Intel claims 10 nm will be hard to come by without some future technology, and anything below 10 nm is very unlikely.

Some are looking at 3D stacking and building upward, which may keep Moore’s law going until we have quantum computers to take over where Moore’s law ends.

The other thing is that in the past 10 years CPUs have improved a lot, but nowhere near as much as from 2000 to 2005.

A 2016 computer is not 4 or 5 times more powerful than a 2006 computer.

You sure about that? In 2006 we had the Intel Core 2 and Pentium Dual-Core. Compare them to a mid-range i5. This article seems to indicate it is about 4x faster:

Of course, if you take GPU performance and RAM into account, then it’s far more than 4x more powerful from 2006 to 2016.

And I’m a bit disappointed that no one has yet pointed out that Moore’s law says nothing about clock speed. Moore’s law is that the density of transistors doubles approximately every two years. Apparently it’s now dropped to doubling every 2.5 years, which is still pretty impressive. There are a number of technologies which may enable Moore’s law to continue for at least another 20 years; 3D stacking of ICs is one, and there are more listed here:
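To put numbers on what that doubling rate means over the 2006-to-2016 span discussed above (just the compounding arithmetic, nothing more):

    # Moore's law is about transistor density, not clock speed:
    # cumulative growth over a decade at two different doubling rates.
    print("doubling every 2.0 years:", 2 ** (10 / 2.0))   # ~32x the transistors
    print("doubling every 2.5 years:", 2 ** (10 / 2.5))   # 16x the transistors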

That was the Pentium 4, which was the last generation that tried to compete based on raw clock speed. That turned out to be a spectacular flop, so much so that they completely abandoned the concept and adapted the design of a notebook (mobile) CPU as the basis for future generations.

Intel still makes obscenely fast parts - they just aren’t listed on the Intel web site. For example, the Xeon X5698 has a factory clock speed of 4.4GHz and is air-cooled. It was only sold to a few select OEM partners like Dell (PDF spec sheet). Link to picture of an actual X5698.

(Relatively) recently, Intel has been producing a greater number of specialty CPU models (though most of these are selective binning of existing parts, not completely new designs) for specific customers. I’ve looked through some Intel lists (like this PDF) and found comments like “AWS” (Amazon) on CPU part numbers that don’t show up elsewhere on Intel’s site.

Part of the issue is that there is really just not that much demand for 5 or 6 GHz processors. Very few people need ultra-fast single-thread performance. Servers benefit more from more cores than from a higher clock speed. On the consumer side, games and media encoding/decoding both benefit more from extra cores or the GPU than from a faster clock.

In fact I’d say 99.9 percent of users don’t do any tasks that would be better served by higher clock rate instead of more cores.

The story is really quite complicated.

Back when Cray ruled the roost there were a couple of companies that made competitive machines. There were other vector machines, and Cray compatibles. Among these, Cray bought FPS (Floating Point Systems) and didn’t really know quite what to do with it. It became the Cray Business Systems Division. It made a multiprocessor Sparc-based machine, the CS6400, which sold passably well, and then embarked upon the successor machine, the Starfire. Again, not a supercomputer in terms of raw floating point, but a highly parallel shared-memory machine aimed at commercial customers. Very reliable, partitionable, hot-swappable bits.

In the meantime Cray were doing badly in the supercomputer world. Big vector was no longer the flavour of the month, and they were being beaten on sales. The big dog in the competition was SGI, who were riding high on their graphics and parallel shared memory MIPS based machines. SGI bought Cray.

But SGI discovered that not only had they bought Cray’s supercomputer lineup, they had also acquired this commercial Sparc-based machine in Oregon. So some idiot decided that perhaps the best trick with this oddity was to see if they could unload it to the most obvious buyer. The most obvious buyer was Sun. This decision was probably the most pivotal in the litany of mistakes that finally killed SGI. The story is that Sun paid about $100m for the division. What SGI should have done was shut down the division, put every machine made in a crusher, send the lot to landfill, then destroy the plans.

The machine was re-badged as the Sun E10000, and it went on to sell something like $3 billion worth. It made Sun: in the new Internet era it was perfect as the transaction backend for so many new Internet companies. SGI, on the other hand, had no counter, and despite bringing to market the first of their ccNUMA machines (the Origin 2000) they were relegated to a minor player selling only into scientific customers, whilst Sun rode off into riches. The E10000 was not perfect, and a problem with the caches nearly sank it. But in the end it was a wildly successful machine.

SGI then made a number of other stupid mistakes (getting into bed with Microsoft being the most egregious) and let Nvidia eat their lunch. The death stroke was probably adopting the Itanium processor as a replacement for MIPS, which was running out of steam. Itanium never lived up to its promise, and far from outrunning x86-64 processors it got overtaken. SGI switched to x86, but by then it was too late. Eventually they were bought out by Rackable, who really did little more than use the badge to re-badge their own gear, subsume what was left of SGI’s once-phenomenal worldwide support organisation, and keep the ccNUMA technology on life support. The engineering quality that was once part of what you expected from SGI vanished.

Nice summary! I used to work with SGI systems back when they were called Silicon Graphics. Watching the slow-motion self-destruction of SGI was rather sad. I think SGI’s basic problem was that they got addicted to the huge profit margins of their super-high-end systems and would cripple their cheaper architectures to protect the big iron. Of course that doesn’t work when your competition just creams you on price/performance, which is what happened when Intel workstation-class machines got good enough to come near SGI’s workstations in performance.

Really, they could have got extremely rich off the PC graphics card market, at first for CAD/3D and then for consumers. Their Infinite Reality graphics was massively ahead of anything available at the time (1996), but they weren’t interested in making a cheaper version of it because they wanted to keep its price extremely high for their rack-sized Onyx computers. Predictably, a bunch of employees left and founded 3dfx, the makers of the first breakout consumer 3D card, the Voodoo Graphics. 3dfx was eventually bought by Nvidia, and SGI went bankrupt, several times.

I would bet that actually Rackable / SGI still sells very big ccNUMA machines, to the Department of Defense. Nuclear bomb simulation is one task that can’t be effectively carried out on a supercomputer that doesn’t have shared memory.

I assume your comments about speed vs. number of cores are within the context of current constraints (meaning even if we increase speed, it’s only within a small window).

The ideal state is always to have one processor (core) that can service both high single-thread computing needs and lower multi-thread needs. Multi-core is more limited: it requires more effort to transform the problem, and some parts of a problem can’t be split at all.

Seems like a common pattern. A smart business would see the writing on the wall and create a 2nd business:
Business #1 - Expensive, high margin - milk it knowing that a transition is in process
Business #2 - Cheaper, lower margin - ramp it up

Not sure I agree with that. All modern operating systems are multi-threaded and it’s common for people to run multiple applications at the same time, both of which are better served by multiple cores than by one super-fast core.

Simple example of trade-offs (ignoring many variables) and assuming >=10 threads:
CPU A - one core at 10 GHz
CPU B - ten cores, each at 1 GHz

CPU A cons:
Possible issues with real-time systems compared to having separate cores responding to simultaneous events (and separate hardware allowing parallel I/O).

CPU B cons:
More transistors dedicated to duplicated execution logic/units/structures that can’t be used for cache.
Diminishing returns due to overhead of managing/sharing resources (avoiding conflicts).

This is theoretical and when you include all variables it becomes a pretty complex mapping from specific problem space to specific solution space. But in general, one fast core (if possible) is better than dividing up that execution into smaller chunks.
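One way to put a number on the “parts that can’t be split” cost is Amdahl’s law. A tiny sketch, with the serial fraction picked arbitrarily for illustration:

    # Amdahl's law: speedup over a single 1 GHz core when a fraction s of the
    # work is inherently serial (s = 0.10 is an arbitrary illustrative value).
    def speedup(n_cores, s):
        return 1.0 / (s + (1.0 - s) / n_cores)

    s = 0.10
    print("CPU A, 1 core @ 10 GHz :", 10 * speedup(1, s))  # faster clock speeds up everything, ~10x
    print("CPU B, 10 cores @ 1 GHz:", speedup(10, s))      # only the parallel part scales, ~5.3x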

One of the critical limitations with high clock speed is the power draw. The power needed to drive a processor at those speeds is ridiculous. Power versus clock is very non-linear, and ten 1GHz cores draw a lot less than one 10GHz core.
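Roughly speaking, dynamic power goes as capacitance × voltage² × frequency, and reaching a higher clock also needs a higher voltage, so power climbs roughly with the cube of the clock. A crude first-order sketch (ignoring leakage and assuming voltage scales linearly with frequency, which is only a rough approximation):

    # Crude CMOS dynamic-power model: P ~ C * V^2 * f, with V assumed to
    # scale linearly with f. Output is in arbitrary relative units.
    def relative_power(freq_ghz, n_cores, base_ghz=1.0):
        v_scale = freq_ghz / base_ghz
        return n_cores * (v_scale ** 2) * (freq_ghz / base_ghz)

    print("10 x 1 GHz:", relative_power(1.0, 10))   # ~10 units
    print("1 x 10 GHz:", relative_power(10.0, 1))   # ~1000 units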

There is only so much logic that can be thrown at a core. The trick to getting a core fast is to feed it quickly, and that means caches. Not just I and D (instruction and data) caches, but things like the branch prediction tables, translation lookaside buffer, and writeback buffers, all of which need to run as fast as the core. So they are under power restrictions as well; plus, in order to run really fast, you need to be able to get the signals through them 10 times faster, so they become size-limited. This size limit paces the entire thing as well. You can’t make a single-core machine run ten times as fast and have ten times as much cache. The cache may actually be significantly slower than the smaller cache you had on the 1GHz core. It is a balancing act, and one for which there is no easy answer.

That ten-times speed also only happens on the processor die. Once you get off the die you have real problems. Access to memory is never fast. You can get high bandwidths by adding wires, but the latency never gets better, and it is the latency that kills you on single-threaded tasks. Latency tolerance comes with multiple threads, but this requires that your task be usefully partitionable. Hyperthreading is one way to get some latency tolerance with multi-threading on a single core. But it isn’t cost-free, and really only gets you a mediocre speed bump on some tasks. For highly compute-intensive tasks one usually disables it, as problems with controlling cache locality start to rob you of performance.

Sometimes you really can’t get that multi-threaded performance, but a single core with lots and lots of cache can deliver. There are problems for which using only one of the cores on a very high-end chip, and giving it the use of all of the cache on the chip, is the answer. Even here the dominant pacing item is the speed at which you can operate the cache. The time it takes the signals to propagate through the control logic limits things, and that time scales up with the size of the cache. (Which is why level-one caches remain tiny in comparison with the lower levels.)

I’m talking about the typical computer use case where many processes are running at the same time: you’re working in Photoshop, playing music, have Facebook open auto-updating itself, and downloading something in the background. In this case 10 x 1 GHz cores might be better, because context switching is not free and the 1 x 10 GHz core will be constantly context switching, while the 10 x 1 GHz can actually run 10 threads without context switching. Or, more realistically, a quad-core 2.5 GHz CPU will be better than your 1 x 10 GHz CPU for the typical usage pattern, even if we could make a 10 GHz processor with reasonable power requirements.

In the special case of exactly 10 threads you’re right: context switching could be eliminated in the multi-core example. If we are talking about real life, it still needs to context switch because of the number of processes and threads running (typically many more than 10).

But even though more cores can reduce context switching (when threads <= cores), they also increase concurrency issues on shared resources, which limits the gain from going multi-core.

There is also an increase in memory bandwidth and cache requirements with multiple cores, but most likely the same requirement exists for one fast core, because in both cases they are accomplishing the same amount of work in the same amount of time (theoretically).

It will still do less context switching even when threads > cores, 10 times less to be precise, so each core is getting interrupted less often. I know it’s complicated and messy in practice; I’m just trying to say that a 10 GHz single core is not always preferable to a quad-core 2.5 GHz or a 10-core 1 GHz. It depends entirely on the use case, even if you don’t take power usage into account. In practice, of course, the reduced power usage makes multi-core systems the clear winner, especially for server farms where power and heat are big limitations on the total amount of work you can do in a given volume of space.
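As a toy illustration of why spreading the same threads over more cores means each thread waits less for its next turn (the quantum and thread count are made-up values, and this ignores everything a real scheduler actually does):

    # Toy scheduling model: with T runnable threads sharing C cores and a fixed
    # time quantum, each thread waits roughly (T/C - 1) quanta between turns.
    quantum_ms = 10   # arbitrary illustrative quantum
    threads = 40      # arbitrary illustrative number of runnable threads

    for cores in (1, 4, 10):
        wait_ms = max(threads / cores - 1, 0) * quantum_ms
        print(f"{cores:2d} core(s): each thread waits ~{wait_ms:.0f} ms between turns")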

Don’t forget about the biggest advantage that 10x1GHz has over 1x10GHz: The former actually exists. I mean, if we’re going to be putting together wishlists for computers that don’t exist, I can wish for a heck of a lot better things than just more gigahertz.