Ok, I understand that a given processor can only perform a certain number of operations per second. Why can’t someone simply connect hundreds of processors together in a supercomputer and get all the processing power he wants? Or if this already happens, what (apart from sheer size) keeps one from making a computer 1000 times faster by linking together 1000 computers? Is the speed of light relevant here?
Almost all modern supercomputers work this way. The problem is writing software that can be parallelized efficiently.
While it is true that most modern cluster computers (which comprise virtually all high end supercomputers) do this, parallelization (i.e. breaking the problem into smaller individually solvable chunks) is actually only one of the problems. Even if you write software that can efficiently break a given problem into any number of segments (which is actually pretty easy for large computational problems like computational fluids analysis grids or climatology models), you still have more issues, including but not limited to:
- linearizing the model such that it can be broken into a discrete system of equations,
- creating some kind of message passing interface that allows nodes to renormalize conditions at the internal problem boundaries,
- exceeding communications bandwidth between nodes and within the cluster LAN,
- having enough local memory or rapid access virtual memory to integrate chunks of the solution back together efficiently,
- having the capacity to visualize or integrate the final solution in a way that is meaningful,
- et cetera, et cetera.

The net result is that while clustering can allow you to solve problems much larger than you could ever build a monolithic scalar computing system capable of solving, the efficiency does not increase in proportion to the number of nodes, and in fact, once you get past a certain size of system, the efficiency drops off rapidly, which makes it prohibitive to make too large of a machine. (How much and what drop-off depends to some extent on the type of problem, the computational intensity of the nodal solutions versus that required to reintegrate into a holistic solution, what the specific bottlenecks in the system are, et cetera.)
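One standard way to put rough numbers on this drop-off is Amdahl’s law. It is used here only as an illustration and ignores communication costs, which make real clusters scale worse. The sketch assumes a hypothetical workload in which 95% of the run time parallelizes perfectly and 5% is serial (reintegration, boundary renormalization, I/O); both figures are assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Ideal speedup on `nodes` processors under Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


def efficiency(parallel_fraction: float, nodes: int) -> float:
    """Speedup per node; 1.0 would mean perfect linear scaling."""
    return amdahl_speedup(parallel_fraction, nodes) / nodes


if __name__ == "__main__":
    # Hypothetical workload: 95% parallel, 5% serial (an assumption).
    for n in (1, 10, 100, 1000):
        s = amdahl_speedup(0.95, n)
        e = efficiency(0.95, n)
        print(f"{n:5d} nodes: speedup {s:6.2f}, efficiency {e:6.1%}")
```

Under these assumptions the speedup can never reach 20 (that is, 1/0.05) no matter how many nodes are added, so the per-node efficiency keeps falling as the machine grows.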
The way of getting around this problem–to a certain extent–uses a concept that computer scientists call recursion; in essence, you make a cluster of clusters, each of which has an optimal number of nodes. Some of the largest machines are even clusters of clusters of clusters, so that the actual compute nodes are great-grandchildren of the head node. (In such systems, the head node is often called the “god node”, and it may itself be a small cluster of machines, or at least a substantial self-contained multiprocessor/multicore machine.) The availability in the last few years of multicore machines with large memory capability and bandwidth has to some extent reduced some of the LAN bottlenecks by doing much of the solution internal to computation nodes and even (depending on the software) a lot of the precomputation necessary for the head node to integrate the final solution. As for the visualization issue, this is also an area where a dedicated microcluster with a lot of graphics computation power and dedicated GPU memory is used to provide near-real-time rendering.
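The cluster-of-clusters idea can be sketched by asking how many levels of head nodes are needed if no head node is allowed more than a fixed number of children. This is a toy model, not taken from any real cluster scheduler; `fan_out` (the assumed “optimal” number of children per head node) and the node counts are hypothetical.

```python
def levels_needed(compute_nodes: int, fan_out: int) -> int:
    """Levels of head nodes so that no head has more than `fan_out` children."""
    if compute_nodes < 1 or fan_out < 2:
        raise ValueError("need at least 1 compute node and a fan-out of 2 or more")
    levels = 1
    capacity = fan_out  # compute nodes reachable with the current depth
    while capacity < compute_nodes:
        capacity *= fan_out
        levels += 1
    return levels


if __name__ == "__main__":
    # Hypothetical fan-out of 32 children per head node.
    for n in (32, 1024, 32768):
        print(f"{n:6d} compute nodes -> {levels_needed(n, 32)} level(s) of head nodes")
```

With a fan-out of 32, a flat cluster tops out at 32 compute nodes, while three levels reach 32,768, and the top node still talks directly to at most 32 children.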
The original thrust behind cluster computing was actually to make use of inexpensive commodity hardware. Even though high end companies like SGI, IBM, and Cray Computing went from large multiprocessor servers, mainframe computers, and high throughput scalar machines to embracing clustering, the vast majority of modern clusters are x86, x86-64, or IA64 (though a few, mostly graphics rendering clusters for cinematic animation, are based on PowerPC archs) and are running some flavor of Linux or FreeBSD, and thus cost a fraction of what older supercomputers did even though they are many orders of magnitude greater in computing throughput.
Still, the essential problem with digital computers remains: all they can really do is add ones and zeros, and actually much, much slower than the absolute laws of physics would permit due to thermodynamic limitations. (Flipping ones and zeros actually generates a lot of heat, even if you aren’t wasting most of it keeping vacuum tubes hot.) The next true revolution in computing will probably be not a revolutionary advance in digital computing architecture and parallelization or virtualization, but the ability to process a problem for a wide array of solutions and then pick the most likely by using principles of quantum mechanics, allowing computational efficiency that scales down to the highest possible thermodynamic efficiency. However, this technology (despite what you may read in Popular Science or Scientific American) is still in its infancy, and it will likely be decades (at least a couple) before you have a quantum supercomputer built into your wristwatch.
Stranger, you get out much?
Great answer. Was looking forward to reading it when I saw that you posted in the thread.
Well, I’m going to play Dark Heresy on Friday night, and I reorganize my DVD collection on alternate Sundays, so it’s not like I don’t have a social life.
I actually manage a couple of small HPC Beowulf clusters as well as a handful of Unix and Linux machines, so the topic isn’t far afield from what I do normally, albeit my diminutive systems are nothing compared to the Earth Simulator, Blue Gene, or Roadrunner. The above is a very, very simplified precis on cluster and grid computing–I didn’t really go into the history of clustering, like Gene Amdahl’s seminal work and the original VAX cluster systems, because the post would be ten times as long and still not be thorough–but at least it gives you the basic idea of how clusters work and why you can’t just scale them up infinitely.
I had to chuckle a little when I saw that beowulff answered this question and in fact talked about distributed computing.
This is old, old stuff. The Illiac IV, built in the late '60s and early '70s, had 64 processors. It was already gone when I got to Illinois in 1973.
There are lots of different architectures for multi-processing. MIMD, multiple instructions, multiple data, is what we mostly have today with cheap processors and cheap memory. But there is also SIMD (single instruction, multiple data) in which every processor more or less executes the same instruction stream, being able to mask data at certain times. In the late '70s dataflow engines were pretty popular - TI even built one. In dataflow data flows to different processing nodes, which have a request/acknowledge system to keep things from getting bunched up. My department had a dataflow simulator on which I put some data dependency graphs for my microprogramming work.
There are a couple of issues with multiprocessors. First, the ones I’ve played with work best when you can parcel out small chunks of work on an as-needed basis. I wasn’t doing big array calculations, but rather fault simulation, where you insert an error into the simulated model of the circuit, run a test, and see if you caught it. You can split chunks of faults among processors, which we did with a fairly primitive network built using the AT&T version of Ethernet. If one node takes a lot of time, you don’t gain much, since the whole job waits on it. You also have to handle a node going down, so that the job can be restarted on another machine. Plus, most jobs have a steep efficiency decrease as you add processors, so you might get double performance at 2 but only 4x performance at 10.
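The scheme of handing out small chunks on demand, and restarting a chunk elsewhere when its node dies, can be sketched with a shared work queue. This is a single-process simulation and not the actual fault simulator described above: `run_chunk`, the node names, the “even-numbered faults are detected” rule, and the failure pattern are all hypothetical stand-ins.

```python
from collections import deque


def run_chunk(node, chunk, dead_nodes):
    """Pretend to fault-simulate one chunk; raise if the node is down."""
    if node in dead_nodes:
        raise ConnectionError(f"{node} is down")
    # Stand-in for "inject fault, run test, record whether it was caught".
    return [fault for fault in chunk if fault % 2 == 0]


def distribute(faults, nodes, dead_nodes, chunk_size=4):
    """Hand out chunks on demand; requeue a chunk when its node fails."""
    pending = deque(
        faults[i:i + chunk_size] for i in range(0, len(faults), chunk_size)
    )
    alive = deque(nodes)
    detected = []
    while pending:
        if not alive:
            raise RuntimeError("no working nodes left")
        chunk = pending.popleft()
        node = alive.popleft()
        try:
            detected.extend(run_chunk(node, chunk, dead_nodes))
            alive.append(node)  # node is free again, back of the line
        except ConnectionError:
            # Restart the chunk on another machine; the failed node
            # is not put back into rotation.
            pending.appendleft(chunk)
    return sorted(detected)


if __name__ == "__main__":
    caught = distribute(list(range(20)), ["n1", "n2", "n3"], dead_nodes={"n2"})
    print(caught)
```

Small chunks keep the cost of a failure low: only the one chunk that was on the dead node has to be rerun.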
And now many new processors have multiple cores. The Sun Niagara 2 has 8 (each of which can run several threads) and I’ve seen talks about one with 100, though it isn’t out yet. The move to this has been done for reasons other than pure performance, though.
BTW, I took a seminar class on this in 1978, where we read a ton of papers, and concluded that the big issue was the reliability of the interconnect between the processors. Moving them mostly on chip solves that problem pretty well.
And the interconnection (between compute nodes) is still a major issue; InfiniBand was supposed to be the cat’s kimono, only every InfiniBand implementation I’ve seen to date has significant performance underruns compared to marketing claims. And 10Gb Ethernet is a bald-faced lie; I’m waiting for 40Gb Ethernet (in order to get 10 GbE performance claims) before building up the next cluster, and 100GbE is a pipe dream as far as I’m concerned. So moving to multiple cores-on-board, or better yet, true multi-cores is a great alternative…as long as you can provide enough memory bandwidth and SDRAM-on-board to accommodate processing requirements.
You’re right about x86 and its continuing growth in the TOP500, but you’ve got IA64 and Power backwards (and both are being used for far more than just graphics-rendering clusters; 5 of the top 10 machines in the world are based on Power and probably spend most of their time on nuclear weapons research).
Power: 68 machines
IA64: 16 machines
Bandwidth wasn’t really the issue, since if your application has a lot of interactions it isn’t a good one for distribution. Reliability was a much bigger problem, since interconnect is usually less reliable than the things interconnected. (Especially back then.) I work at a lower level, and SerDes stuff is a real pain in the ass, since a lot of test equipment doesn’t understand protocols. There is some effort in this; one of the papers submitted to a workshop I’m running on futures discusses it.
The primary applications for which my clusters are used are computational fluid dynamics and structural finite element analysis, so there is a modest amount of crosstalk between nodes (or rather, from the compute node to head node and back to compute node) to normalize boundaries between problem subsets. But when the solution nears completion, the compute nodes are often competing with one another to dump their results to the head node at the same time across the cluster LAN, which can cause bandwidth chokeholds if you have a large system (>64 compute nodes) and are doing a lot of iterative solutions. It can also be a problem with grid computing, though most wide grid management systems can be configured so that the individual compute nodes only upload their results in response to a query by the head system.
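The boundary normalization described here can be sketched with a one-dimensional toy problem: a Jacobi relaxation split into two chunks that swap edge values on every sweep. This is only an illustration in plain Python, not an MPI program and not the method of any commercial solver; the grid values and the `jacobi_step` helper are hypothetical, and the two “nodes” exchange values directly rather than through a head node.

```python
def jacobi_step(values, left, right):
    """One relaxation sweep: each cell becomes the mean of its two neighbours.

    `left` and `right` are the neighbouring values just outside this chunk.
    """
    padded = [left] + values + [right]
    return [
        (padded[i - 1] + padded[i + 1]) / 2.0
        for i in range(1, len(padded) - 1)
    ]


def solve_whole(grid, left, right, sweeps):
    """Reference solution on a single machine."""
    for _ in range(sweeps):
        grid = jacobi_step(grid, left, right)
    return grid


def solve_split(grid, left, right, sweeps):
    """Same problem on two 'nodes' that exchange one edge value per sweep."""
    mid = len(grid) // 2
    a, b = grid[:mid], grid[mid:]
    for _ in range(sweeps):
        # The exchange: each node sends its edge cell to its neighbour.
        a_edge, b_edge = a[-1], b[0]
        a = jacobi_step(a, left, b_edge)
        b = jacobi_step(b, a_edge, right)
    return a + b


if __name__ == "__main__":
    start = [0.0] * 8  # hypothetical initial field
    whole = solve_whole(start, 100.0, 0.0, sweeps=50)
    split = solve_split(start, 100.0, 0.0, sweeps=50)
    print(all(abs(w - s) < 1e-12 for w, s in zip(whole, split)))
```

Each sweep costs one message in each direction per internal boundary, so the amount of crosstalk grows with the number of chunks even though the arithmetic per node shrinks.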
RaftPeople, I stand corrected on the use of PowerPC clusters, but while the really high end clusters may use Power or IA64 processors, the majority of cluster systems I’ve seen are still x86 or x86-64, and most commercial vendors for the applications I use have only recently (in the last couple of years) provided production-level IA64 support. The really high end machines like Blue Gene or the Earth Simulator are running proprietary, home-built codes to solve cutting edge problems and thus don’t have to worry about running on ‘standard’ architectures (and if they have a bug, the team running the solver is often the one that also delves into the code to fix it) but commercial code tends to be a lot less ‘bleeding edge’ and more conservative about supporting newer architectures.
The OP is operating on the assumption that every problem can be solved much more quickly with parallel computation. Many can be, but there are others where we don’t know how to get that speedup yet, and if you believe the complexity theorists, it’s not possible anyway. Extra processors are therefore of limited value for those problems.
I was thinking of something with really a lot of interaction. For instance circuit simulation, at either the transistor or gate levels. This is getting very painful; in fact we have several thousand processors in several compute ranches that mostly do simulation for verification all the time. This would be a great application to distribute since memory usage is strongly dependent on the size of the design. However, there are so many global signals that there would be a ton of traffic between processors. I know of some research a while ago attempting to partition designs to minimize interactions between partitions, but the lack of any such thing on the commercial market makes me doubt it worked. The distribution done today is on small verification tests, which distribute well.
So you guys have it easy.
Can I ask what problems you’re running into with 10 Gb Ethernet that 40 Gb Ethernet would solve?
Boolean circuit simulation is P-complete and therefore one of the problems that is suspected to not be particularly parallelizable. I’d expect general circuit simulation to be no simpler.
Well, until we start getting into non-linear coupled loads analysis of large structures with complex geometry; then busting a problem up into chunks is not always so successful (in terms of performance). Personally, I try to stay away from that silliness, and the entire ugly black magic business of computational fluids. A little bit of modal or some forced vibration of a static structure is fine, and even a bit of localized submodeling of crack propagation can be a fun day at the beach, but once you start getting all crazy with large angle rigid body motion, dynamic couplings across the structure, rate-dependent hyperelastic material properties, isentropic material deformation, and composite interlaminar separation, your results might as well be gumbo for all the good they do for you.
Most of the realistic performance benchmarks I’ve seen on a 10Gb system show it actually running at about 3-4Gb under heavy (but within spec) load. So my hope is that I can get true 10Gb performance out of a 40Gb Ethernet LAN. It’s kind of like the “divide by three” rule for guys bragging about sex.
Agreed, high-end can afford to use something other than commodity processors but x86 will continue to engulf from the low-end (which isn’t so low anymore).
Side note: For my home project needs, I was recently weighing multiple PS3s vs multiple x86 vs all other options that aren’t too expensive. Conclusion: GPUs. I’m just starting the process of porting my critical routines to NVIDIA CUDA. I realize people have been talking about this (and some doing it) for a little while but it seems like there is a real opportunity to create clusters with an order of magnitude better price/performance than x86.
I suspect you’ll be disappointed on this. The bottleneck is almost certainly the software drivers and the OS itself. Giving it faster hardware can’t help you with that.
Nope, definitely not the software–compared to function on a multi-processor single board system–or the OS, which is a highly optimized clustering Linux which runs pretty light on the compute nodes. On large clusters, the bottleneck is typically either packet interference, or the head node simply not being able to cope with the throughput required to manage the problem (hence recursive clustering). I remember when InfiniBand came out and was supposed to deliver 3x to 4x improvement over GigE, and I have yet to see a system that did so under heavy load. From what I’ve seen of 10GigE, you can get maximum performance only by tuning the system for a specific type of solution, or reducing the problem size to a point where communications aren’t the bottleneck. It’s generally not a problem on the GigE-LAN clusters I manage because they’re not that big, but I’ve seen them nearly maxing out the internal network even at 24 and 32 nodes, and the next cluster is going to utilize multiple core nodes on a quad motherboard with as much memory as I can fit, and try to keep the node count reduced. However, this is all predicated on the software development being able to handle large multi-core systems; right now they’re still struggling to make dual cores perform as efficiently as two independent processors.