Are today's PCs more powerful than a 1970s supercomputer?

Not that I can find :wink: The slowest 10-core CPU I can find is an Intel Xeon E5, 10 cores at 1.7 GHz:
http://processors.specout.com/l/1224/Intel-E5-2450LV2

We’d need a 15 GHz or so single core to come close to that… (yes, I know it’s not that simple)

The Intel HEDT i7 chips are close to being core-multiplier-unlocked Xeons, but currently top out at the 10-core i7-6950X, which will clock at 4 GHz without too much fuss. So: nice clocks for single-threaded work, plenty of cores for multi-threaded. One of my computers uses the previous top HEDT chip, an i7-5960X at 4.6 GHz with 8 cores. That’s over a 50% overclock from the base clock an equivalent Xeon would run at. I really appreciate what it can do when encoding video; 8 cores with hyperthreading is pretty nice for a home PC, and it’s no slouch on “normal” PC stuff either. Still, it’s no supercomputer, and I admit it’s more than I could get by with.

Yes, but the point is that we are forced into multi-core; it’s not (generally) a choice. Carving resources into smaller chunks restricts which problems can be solved efficiently, whereas a single equally powerful CPU can handle both serial and parallel workloads well.

Other than memory speed (just because it’s so obvious), what would be on your wish list?

I think that once a core supports more than one thread, context switching between that pair of threads is the same amount of work as switching between the same pair of threads on a single core: once per time slice. If there are 2 threads per core (20 total), then every core is switching context every time slice, just like the single core would.

In both cases, there is a context switch at every time slice on every core, whether single or multi. It’s a wash.
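
To make the counting explicit, here is a trivial sketch (the 10 ms quantum is just an assumed scheduler time slice, not a number from anyone’s post). Per core, the switch rate comes out the same whether it’s one core with two threads or ten cores with two threads each:

```cpp
// Context switches per second for the two setups being compared:
// 1 core running 2 threads vs. 10 cores each running 2 threads.
// The 10 ms time slice is an assumption for illustration only.
#include <cstdio>

int main() {
    const double time_slice_s = 0.010;                    // assumed scheduler quantum
    const double switches_per_core = 1.0 / time_slice_s;  // one switch per slice per core

    std::printf("1 core,   2 threads     : %.0f switches/s on that core\n",
                switches_per_core);
    std::printf("10 cores, 2 threads/core: %.0f switches/s on each core\n",
                switches_per_core);
    // Same per-core switching rate in both cases: the "wash" described above.
    return 0;
}
```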

But there is still the cache and bus issue for multi-core. Trying to share resources takes time, energy and complexity. Either you carve out chunks of resources to avoid conflicts, which hurts performance when the workload is uneven, or you share resources, which costs performance in the arbitration needed to avoid conflicts.

Agreed that the specific workload determines the best solution; the point was just that, ignoring the thermal issues that prevented higher clocks, 1x10 GHz can efficiently support more workloads than 10x1 GHz.
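
To put rough numbers on that, here is a back-of-the-envelope sketch using Amdahl’s law (the serial fractions are made-up illustrations, not measurements from this thread). The single 10 GHz core speeds everything up by 10x, while 10x1 GHz only approaches that for perfectly parallel work:

```cpp
// Back-of-the-envelope comparison via Amdahl's law: speedup over a 1x1 GHz
// baseline for ten 1 GHz cores vs. one 10 GHz core, at assumed serial fractions.
#include <cstdio>

// Speedup of n equal cores over one such core, for a workload with serial fraction s.
double amdahl(double s, int n) { return 1.0 / (s + (1.0 - s) / n); }

int main() {
    const double serial_fractions[] = {0.0, 0.1, 0.3, 0.5, 1.0};
    std::printf("serial%%   10x1 GHz   1x10 GHz\n");
    for (double s : serial_fractions) {
        // The single fast core speeds up serial and parallel parts alike.
        std::printf("  %3.0f      %5.2fx     10.00x\n", s * 100.0, amdahl(s, 10));
    }
    return 0;
}
```

Even a 10% serial portion caps the ten slow cores at about 5.3x, which is the sense in which the one fast core supports more workloads.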

Well, for starters, a few thousand qubits, and the ability to operate on them.

And, of course, SGI lost the Hollywood market when ordinary desktops reached the point (partly driven by video games) where they could do extreme FX. I miss the real SGI; they had some neat stuff, and Irix was my introduction to Unix.

(At the announcement of the iPhone 7, one game company mentioned that their new dark-Oz game has, at one point, 400 flying monkeys on the screen. Someone joked that that could be the new performance metric—number of flying monkeys.)

Arguably, the latter does exist. While they never quite shipped with these clocks, the Pentium 4 could overclock to 5 GHz with some effort. The P4 has a double-pumped ALU, so internally a 5 GHz P4 is (partly) running at 10 GHz.

Of course the P4 was notorious for its low IPC, which was the tradeoff they made to reach those high clocks. Intel realized that this wasn’t sustainable and changed their strategy with the Core series.

We can do better than that! The AMD Bulldozer FX-8150 was renowned for taking extreme overclocks. In 2012 someone managed to get one to just over 9 GHz, running two cores. No details on the cooling setup, and I’m sure it only ran for a few minutes, but hey.

Context switching is a fraught problem. One switch can be vastly more or less expensive than another, and you can put some hardware in place to help, up to a point. Hyperthreading is intended to make switching fast between threads that already have context on the core. It does this by replicating registers and keeping more than one thread in flight, but Intel’s version of hyperthreading only supports two threads. The originator of fast thread switching was the MTA, and Burton Smith. That machine could hold a huge number of threads (I think the last version was 1024), and it could and did switch contexts on every clock cycle. The ability of x86 hyperthreading to swap between two different threads in a clock cycle means you can get some useful concurrency without taking the full hit for a context switch. But once you have more than two threads it doesn’t work out.
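
As an aside, you can see the two-threads-per-core arrangement from software. A minimal sketch, assuming Linux and the usual sysfs layout (not portable, purely for poking around):

```cpp
// Lists, for each logical CPU, the hyperthread siblings it shares a physical
// core with. Assumes the standard Linux sysfs topology files are present.
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    const unsigned logical = std::thread::hardware_concurrency();
    std::cout << "Logical CPUs: " << logical << "\n";
    for (unsigned cpu = 0; cpu < logical; ++cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                        "/topology/thread_siblings_list");
        std::string siblings;
        if (std::getline(f, siblings))  // e.g. "0,8" on a 2-way SMT part
            std::cout << "cpu" << cpu << " shares a core with: " << siblings << "\n";
    }
    return 0;
}
```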

A big hit with context switching is often loss of cache locality. The next thread finds itself with a cache full of the other thread’s junk, and pays a very significant cost to refill it. Thread affinity additions to the operating system’s dispatcher can help greatly, at least ensuring that a thread gets run on the same core as it did last time, to take advantage of any context that is left in the caches. But this depends upon the nature of the task. Many big numeric compute tasks are best served by ensuring there is only one task thread per core and making sure the OS keeps it on that core and doesn’t dispatch it to a different one. Keeping a core or two out of the compute task and trying to get the OS to run housekeeping on those is worthwhile as well.
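
If you would rather not rely on the dispatcher’s affinity heuristics, you can pin threads yourself. A minimal Linux/glibc sketch (pthread_setaffinity_np is non-portable, and the per-core compute kernel is left as a placeholder), one pinned worker per core:

```cpp
// Pins one worker thread to each core so the scheduler doesn't bounce threads
// around and spill their cache context. Linux/glibc only; build with -pthread.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Restrict the calling thread to a single core.
static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c) {
        workers.emplace_back([c] {
            pin_to_core(static_cast<int>(c));  // stay put, keep the caches warm
            // ... per-core compute kernel would go here (placeholder) ...
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```

Combined with leaving a core or two for OS housekeeping, as you describe, each compute thread gets to keep its cache to itself.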

The MTA didn’t have caches at all. The round-robin nature of the thread dispatch meant that you could fetch an operand from memory and have it ready before the next time your thread came to the top of the queue. Thus the architecture was latency tolerant, but only if your problem could be subdivided into thousands of threads. So it suffered from the opposite problem to current processor designs. (I really loved that architecture. I heard Burton talk about it over 20 years ago, and there still hasn’t been anything as innovative in processor design since.)

That’s still less than 10 GHz, though :).

If we’re going for “works for long enough to run a benchmark” records, some Celerons have gotten to 8.5 GHz. That’s a 17 GHz ALU frequency.

This is exactly how GPUs work. GPUs have caches, but they’re largely there for bandwidth, not latency hiding. Memory latency is hidden by, as you say, having thousands of threads at the ready. Issue a fetch and immediately switch to a new thread. By the time you get around to the first thread, the memory fetch has finished (actually, GPUs have it much worse: the texture units, for instance, have both memory latency and quite a bit of math latency in blending the texels together).

Cache locality is a problem, though graphics has the advantage that spatially located threads (i.e., pixels that are near each other) will likely have coherent memory access. So the trick is to ensure your threads get scheduled with spatial coherence, which requires tricks like tiling.

Context switch overhead is also a problem. GPUs bundle threads together (into warps, in NVIDIA terms) so they can share a program counter and some other per-thread state.
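
To make the “issue a fetch, then go do something else” idea concrete without writing actual GPU (or MTA) code, here is a toy CPU-side analogy; it’s GCC/Clang-specific because of __builtin_prefetch, and the table size and batch width are arbitrary assumptions. Keeping a batch of independent lookups in flight lets the latency of each one overlap with work on the others, which is the same principle a warp scheduler or the MTA’s round robin exploits with real hardware threads:

```cpp
// Toy latency-hiding sketch: random lookups into a table much larger than the
// caches, done in batches so several memory accesses are in flight at once.
// Not GPU code; just the same "keep many requests outstanding" principle.
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t N = std::size_t(1) << 25;   // ~32M entries, bigger than any cache
    std::vector<std::uint32_t> table(N);
    for (std::size_t i = 0; i < N; ++i)
        table[i] = static_cast<std::uint32_t>(i * 2654435761u);

    std::mt19937_64 rng(42);
    std::vector<std::size_t> idx(std::size_t(1) << 22);
    for (auto& v : idx) v = rng() % N;

    const std::size_t BATCH = 16;                 // lookups kept in flight at once
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i + BATCH <= idx.size(); i += BATCH) {
        // "Issue the fetches" for the whole batch first...
        for (std::size_t j = 0; j < BATCH; ++j)
            __builtin_prefetch(&table[idx[i + j]]);
        // ...then come back and consume them; by now the loads have (hopefully)
        // completed, just as another hardware thread would have run meanwhile.
        for (std::size_t j = 0; j < BATCH; ++j)
            sum += table[idx[i + j]];
    }
    std::printf("checksum: %llu\n", static_cast<unsigned long long>(sum));
    return 0;
}
```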

It lived on for a while. Burton’s company, Tera Computer, was the one that bought the Cray name from SGI. The Cray XMT was going to use a variation of the MTA processors, but I’m not sure if it ever shipped.

As recently as 2011 Cray was still talking about using their own “Threadstorm” processors, which were an evolution of the MTA. AFAIK they are now using Xeon Phi processors in their current XC series supercomputers. Is it possible that Intel bought the tech off Cray and part of it made it into the Xeon Phi? It seems to be at least somewhat in the same ballpark from what I can tell.

Missed the edit window:
Yep, Threadstorm was an evolution of the MTA, and it did ship in a few systems but is now discontinued:
National Lab Pushes Graph Platforms to New Points (read the comments)

Yeah, I was going to mention that part of Cray’s history. It was a real David and Goliath thing, and I suspect a big part of it was a way of re-badging a refinancing. There was no way that a tiny company which had managed to ship one or two development machines (I think at the time SDSC was about the only place that had one) had the financial clout to actually buy Cray. The MTA sat on the books for a year or so and then vanished. In the meantime Cray were driving down a whole slew of technical directions, a vector MIPS being my favourite. Burton went to Microsoft as a research fellow or somesuch, I think. That is a bit like the Princeton Institute: really good people go there for a sinecure and never do anything of value again.

According to Wiki they bought the remains of Cray from SGI for $35 million (plus a million shares) after Sun had already taken the E10000 line. Maybe Tera just had some investors with very deep pockets?

Reading the wiki, I also just learnt that there is a tiny remaining part of SGI left, called Graphics Properties Holdings Inc., and yep, they are making money by suing AMD, Sony, Apple, Samsung, etc. OK, they’re not strictly patent trolls, since they did make products that used these patents. Still, that’s an ignoble end for such an innovative company.

I has a sad…