Nowadays there are very few specially designed supercomputer processors, so we have to go back a while, to a more golden age, to see the difference. The lack of purpose-designed supercomputer chips is mostly down to economies of scale. With a market of millions of processor chips, the mainstream manufacturers and designers can push the design cycle much faster and build much, much more sophisticated systems than a niche market like traditional supercomputing can support.
If we talk supercomputers of yore, we are usually talking big numeric problems. These problems are usually very regular, and often expressible in terms of linear algebra, which means matrix arithmetic. Indeed, about half the problems out there were numeric solutions to partial differential equations, which is little more than: populate a huge matrix, invert it, calculate the answer, rinse and repeat. The other half were finite-element-style systems, again lots of reasonably regular calculations over large data sets. The critical thing about most of these systems is that they need the accuracy of 64-bit floating point. With only 32-bit floats, the error that creeps in would often make the systems simply too unstable to give useful answers.
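To make that error creep concrete, here is a minimal C sketch (my own example, not from any particular solver): it just sums the same constant ten million times in both precisions. An iterative solver compounds exactly this kind of drift on every step.

```c
#include <stdio.h>

int main(void)
{
    float  s32 = 0.0f;   /* 32-bit accumulator */
    double s64 = 0.0;    /* 64-bit accumulator */

    /* 0.1 added ten million times should give exactly 1,000,000 */
    for (long i = 0; i < 10000000L; i++) {
        s32 += 0.1f;
        s64 += 0.1;
    }
    printf("32-bit sum: %f\n", s32);  /* drifts visibly off the mark */
    printf("64-bit sum: %f\n", s64);  /* essentially exact */
    return 0;
}
```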
This leads to computer architectures that are very good at very regular application of 64-bit floating point operations to large data sets. Fast floating point means deep pipelines, and large data sets mean that you need lots of fast memory. Modern machines depend upon very fast caches that can provide operands inside a very few clock cycles. There is a constant trade-off in cache sizes: bigger caches are slower, which is why you see such tiny level-one caches.
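You can see how much the caches matter with a simple experiment. This C sketch (the matrix size is arbitrary, and the timings will vary by machine) sums a big matrix twice: once walking memory sequentially, once striding across it and defeating the cache. The arithmetic is identical; only the access pattern changes.

```c
#include <stdio.h>
#include <time.h>

#define N 2048

int main(void)
{
    static double m[N][N];   /* 32 MB, all zeros; we only care about timing */
    double sum = 0.0;
    clock_t t0, t1;

    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];          /* sequential: cache-friendly */
    t1 = clock();
    printf("row-major:    %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];          /* strided: cache-hostile */
    t1 = clock();
    printf("column-major: %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return (int)sum;   /* keep the compiler from discarding the loops */
}
```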
The key to the performance of these older vector machines was that they were designed to operate upon slabs of data at a go: vectors of data. Not only would you have registers of data, like a conventional machine, but vector registers that actually contained, say, 64 operands. You could issue a single vector instruction and have the operation applied to all 64 elements. Reads and writes to memory worked the same way. Because the processor knew at the start of the instruction all the work it needed to do on a very large amount of data, it could efficiently muster large resources. In principle multiple ALUs, with deep pipelines, could all be usefully employed. Because the operands didn’t conflict there was no need to worry about hazards in the pipelines, or lots of the horrendous tricks used in a modern superscalar design. Probably the last vector machine was the Cray X1, which was an amazing design, but sadly overtaken by systems based upon commodity x86 processors. With a pedestrian clock of 800 MHz it still offered 12.8 Gigaflops per processor. Six years later there still isn’t anything that can touch that.
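In C terms, what a single vector instruction accomplished looks like the loop below (vaxpy is my own illustrative name, not a real instruction). Every iteration is independent, so the hardware could keep several deep pipelines completely full; note that 12.8 GFlops at 800 MHz works out to 16 floating-point results completed every clock.

```c
#define VLEN 64  /* elements per vector register */

/* The work of one vector multiply-add instruction: the same
   operation across a whole register of operands. No element
   depends on any other, so there are no pipeline hazards
   for the hardware to resolve. */
void vaxpy(double a, const double x[VLEN], const double y[VLEN],
           double out[VLEN])
{
    for (int i = 0; i < VLEN; i++)
        out[i] = a * x[i] + y[i];
}
```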
For that matter, that Cray-1 supercomputer I mentioned above couldn’t do anything without another computer handling its input and output. Seriously - it had disk drives, but didn’t have any I/O capability beyond that.
That situation wasn’t unusual in those days. My first job was as a computer operator on a CDC 1604 (I may have that model number wrong) in the mid-sixties. That computer had tape drives, but no other peripherals. My main task was using another computer to perform card-to-tape and tape-to-print operations as a combined front-end and back-end for the 1604.
I’m curious about which smart phones WON’T let you save and edit local text files. What phones did you look at? Because, as others have said, this is completely incorrect. Heck, Apple makes a ton of money selling iPod Touches, which are basically iPhones with the phone part removed.
How do these devices decode MP3s? I remember when I did a short research paper on MP3s in high school, and I learned that the first computer that could decode an MP3 in real time was the Pentium 120 MHz (or maybe it was 133 MHz).
I always equated my third-generation iPod with that Pentium. I loaded Linux on it, and it could just barely play MP3s at 160 kbps. While Apple apparently had a way to play MP3s and still allow you to do other things on the device, I assumed that was a different codec implementation than the general-purpose one that would be used in Linux.
Anyways, my point is that, if you actually look at the things the iPhone is good for, I think you’ll find it would be the fastest computer a lot later than you think.
Actually, dedicated vector-processing number-crunching computers are still around, and with far greater economies of scale than ever before. Your CPU isn’t designed for the same kinds of tasks that the old Crays were, but your video card is. And a lot of high-end scientific computing nowadays is done on video cards.
Looking at Geekbench scores, my iPhone 4 scores about 50% better than a PowerPC G4 500MHz and slightly higher than a G4 700. Not bad for something that fits in your pocket, I’d say.
This was typical of Seymour Cray’s designs.
And that design is still used today, for high performance. For example, early PCs had the CPU doing the output for the integrated video; now there are specialized video cards, with their own graphics processor onboard to do that work for the machine.
It was commonly said that you never sold a Seymour Cray machine.
When you were already maxing out your one (or two) CDC 7600s, you installed a Cray-1 - but you kept the 7600s to do the I/O processing for the Cray.
Recent NVIDIA GPUs can achieve about 500 GFLOPs of double-precision performance per processor at ALU clocks under 1200 MHz. Yeah, it’s a GPU - but the programming limitations on them are similar to vector processors. They even support supercomputer-type features such as ECC memory.
The Crays and 7600 were after my time, but in the 6600 the CPU itself was only a sort of “idiot-savant.” Most O.S. operations (including all I/O) were handled by the PPU, a separately-architected processor.
The PPU executed ten threads concurrently with a single processor! This mechanism (compared to the rotating barrel of a revolver) is an elegant low-cost approach to higher performance. It conflicts with caching, but caching is a high-cost approach to speed, so I’m surprised Cray’s barrel got so little further interest - or did it?
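A toy model of the barrel in C (the thread count matches the 6600’s PPU; everything else is illustrative): one execution unit, ten thread contexts, and the unit advances a different thread every cycle, round-robin. While one thread waits on memory, the other nine keep the execution unit busy.

```c
#include <stdio.h>

enum { NTHREADS = 10 };   /* the PPU's ten contexts */

int main(void)
{
    int pc[NTHREADS] = {0};   /* one program counter per thread */

    for (int cycle = 0; cycle < 20; cycle++) {
        int t = cycle % NTHREADS;   /* rotate the barrel */
        pc[t]++;                    /* issue one instruction for thread t */
        printf("cycle %2d: thread %d executes its instruction %d\n",
               cycle, t, pc[t]);
    }
    return 0;
}
```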
An Nvidia T20 GPU can clock in at 515 GFlops of double precision. But it isn’t a single core; rather, it contains 448 cores. That is still pretty damn impressive. But the Cray X1 was a single core. It was also based upon a MIPS ISA, and could be used as a general-purpose processor (albeit no faster than any other 800 MHz MIPS processor).
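If I have the Fermi numbers right (they’re from memory, so treat this as a sanity check rather than a datasheet), that peak figure is just arithmetic: 448 cores × 1.15 GHz × 2 flops per fused multiply-add, at half rate for double precision, gives 448 × 1.15 × 2 / 2 ≈ 515 GFlops. The X1’s 12.8 GFlops came instead from one core retiring 16 flops per 800 MHz clock.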
On the other hand, Cray now ship machines that can be filled with GPU boards, and they killed the X1 five years ago, so it is pretty clear how things have worked out. But it is hard to take it away from the X1; it was an amazing bit of work. It was later available as a dual-core chip, but things were already winding down for it by then. However, had the X1 been a success we would probably be seeing eight-plus-core chips with 3+ GHz clocks, processing capability on par with GPUs, and a possibly easier-to-use ISA. But there was never a market sizeable enough to support it.
Yeah, I seem to have gotten two different issues confused. You and I discussed my search for a smartphone in January in this thread. But looking back at it, I see that part of my problem was that I also wanted a physical keyboard, and that’s missing from a lot of the modern phones.
BTW, I ended up getting a cheap used Palm Centro on eBay. It is an AT&T one, so I simply put my SIM card into it, and was using it and loving it for about a month. I never used any of its data abilities, because I didn’t want to pay for the data plan, and was more than happy to have a phone and a PDA in the same device.

Then AT&T discovered that I was using a smartphone for my phone calls, and informed me that this requires me to pay for a data plan. I argued with them for about 2-3 hours across 4-5 calls, telling them that it is a totally legal AT&T phone that I didn’t jailbreak or anything, and that since I wasn’t using the data features I shouldn’t have to pay for them. With great difficulty they were able to show me where the contract says that I’m required to get a data plan simply because I’m using a smartphone. So, rather than pay their extortion fee, I put the SIM card back into my dumbphone, and now I have two separate devices in my pocket. It turns out that the inconvenience of having two devices is balanced by having double the battery power!
I did stuff like that on an IMSAI 8080, too, but with 48K RAM (64K cost too much and we had to allow for EPROM addressing space), we could only capture a few seconds of single-channel audio. Kinda made it difficult to store Elton John’s Greatest Hits album.
And real-time compression was out of the question since the CPU wasn’t fast enough.
If Francis Vaughan’s good comments are too technical, I’ll try and offer an analogy.
Computer chip manufacturers arrange the basic building blocks of electrical circuits. If they arrange them one way, the result is optimal for integer calculations. If they arrange them another way, it is optimal for floating point. Of course, you can have a computer with both types of circuits, but there’s an inherent space and power limitation inside the enclosure, so the computer designers consider the type of “tasks” the machine is intended for and favor the architecture that is most appropriate.
For the analogy, consider making computation devices out of the basic building blocks of wood and sticks. One approach might be a wooden abacus (see photos). This type of device would be optimal for integer computations. Another approach with wood is to construct slide rules (see photos). This is great for multiplying and dividing large numbers, and for floating point calculations. Notice you’ll have a similar limitation to the computer chips above: in a given volume of physical space, you can’t construct a piece of wood that’s optimal at both types of calculation. However, if you prioritize the “tasks” you want to solve, you’ll know ahead of time which computational device is more appropriate. Now, a big point to emphasize is that you can do floating point operations on the abacus, but it requires more “rules” and many more manipulations of the fingers shifting beads around. It’s “slower” than the slide rule dedicated to that type of calculation.
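As a side note on why the slide rule is so good at multiplication: it adds lengths on logarithmic scales, using the identity log(a) + log(b) = log(a×b). This tiny C snippet does the same thing numerically (the input numbers are arbitrary):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = 3.7, b = 42.0;

    /* "Slide the scales": adding logarithms multiplies the numbers. */
    double product = exp(log(a) + log(b));

    printf("%g * %g = %g\n", a, b, product);   /* ~155.4 */
    return 0;
}
```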
If you need a device to tally votes or count vertices on a graph, a device optimized for integer operations is fine. If you need to calculate the vectors of stars or ballistic trajectories, you’ll want something with horsepower dedicated to floating point.
As a disclaimer, I know my analogy is incomplete but one can keep adding details to it until you have the equivalent of a PhD in electrical engineering and can design computer chips from the raw materials of sand.
Vector processing is alive and well, especially in the cell phone industry. Many digital signal processors have something very much like the vector math processors of old supercomputers, for pretty much the same reason the old supercomputers did: there is a huge amount of math to be done to get the high data rates out of cell phone modems. Some of this math is done in specific dedicated hardware; some is done in more general-purpose DSPs. The trend is to move the stuff in dedicated hardware to more powerful DSPs with vector processing engines. All this vector processing is done in processors separate from the application processor that runs the apps you get for your iPhone or Android phone.
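The inner loops of that modem and media math are overwhelmingly multiply-accumulates over blocks of samples, which is exactly the regular arithmetic a vector engine streams through. Here is a generic C sketch (illustrative, not any particular DSP’s code) of one such kernel, an FIR filter tap sum:

```c
/* One output sample of an FIR filter: a dot product of the most
   recent input samples against the filter coefficients. Every
   multiply is independent, so a DSP vector unit can issue them
   in slabs, just like the old vector machines. */
float fir_output(const float *samples, const float *coeffs, int ntaps)
{
    float acc = 0.0f;
    for (int k = 0; k < ntaps; k++)
        acc += samples[k] * coeffs[k];   /* multiply-accumulate */
    return acc;
}
```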
BigT, the phones I am familiar with have separate DSPs that handle the heavy computation for MP3 and movie decoding and encoding.
I wouldn’t equate a GPU core with a CPU core. They aren’t equivalent in functionality. GPU processing is arranged somewhat hierarchically, with clusters of “cores” and clusters of clusters. Unlike CPU cores, GPU cores can’t operate independently; in reality they more resemble a single slice of a vector unit on a CPU.
So while it may be true that the Cray X1 had the fastest single-core FP throughput per clock, for some definition of a “core”, to me that seems like too limiting a definition given the variety of implementations out there. In terms of what’s been packed onto a single die, the X1 has been beaten.
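To make the “slice of a vector unit” point concrete, here is a hedged C sketch (the 32-lane width is illustrative, not any vendor’s actual layout): lanes in a cluster step through the same instruction together, so a data-dependent branch is handled by computing the result for every lane and selecting per lane, rather than by each “core” taking its own path.

```c
enum { LANES = 32 };   /* illustrative lane count for one cluster */

/* Per-element absolute value, lockstep style: every lane computes
   the negated value, then a per-lane select picks the result.
   No lane branches on its own - which is how lockstep hardware
   handles divergence. */
void lockstep_abs(float v[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        float negated = -v[lane];
        v[lane] = (v[lane] < 0.0f) ? negated : v[lane];
    }
}
```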
There’s been plenty of interest. Most recent Intel CPUs support “Hyper-Threading”, which is 2 threads per physical core. IBM has similar technology for its POWER architecture. And Sun’s UltraSPARC Tx-series has a barrel-type arrangement with 4 or 8 threads per core.
Probably the main reason there isn’t even more interest is that massively threaded systems are useful mainly for servers; on the desktop, single-thread performance is still crucial.
The granddaddy of hyper-threading was/is the Tera MTA (Multi-Threaded Architecture).
For a long time it was really the only new idea in computer architecture. Tera pushed the idea for quite a time, and actually made (a few) machines. When Tera bought Cray (a minnow swallowing a whale) the MTA continued on for a while, but was killed. Burton Smith, who was the leading light, left and joined the evil empire.
The MTA was amazing. The second version had 256 threads, and it would switch contexts every clock cycle. It had no data caches, but latency to memory was - surprise, surprise - about 256 cycles. So by the time a thread’s turn came around again, its memory request had completed, and it always saw its operands ready. It had other neat architectural features, especially tagged memory. The big, big problem was getting a compiler to make use of so many threads. I heard Burton talking the MTA up in 1996, and some years later he was still using exactly the same benchmark result - a massively parallel sort. There is a clear lineage from the MTA to hyper-threading and the other multi-threaded designs.
No argument there. In the end, all that matters is what you can do with a given area of silicon. GPUs remind me much more of classic SIMD machines like the CM-2 and CM-5.
It is worth mentioning that vector supercomputers are not really dead, but they are very niche. NEC still makes the SX-9, which gets 100 GFlops per core. They must do it for love. OTOH, for the right problem they are amazing.
Matching the architecture to the problem has always been the big issue at the very high end of supercomputing. This is why machines like the SX-9 and SGI UV still exist.