How can Elon Musk build a supercomputer in 19 days when his competitors take four years?

LOL

People are better at creating stuff from scratch than finding mistakes (really dumb ones!) in others’ work.

You just created a really hard job that nobody wants from 4 normal jobs.

All that for the low, low cost of the energy usage of a small town and your self respect.

Never mind that AI will only get incrementally worse as the dataset is further diluted by more AI-written material going online and as people find new ways to sequester their original writing from AI scrapers.

Yes, but, yes, but… maybe this diaphanous, wobbling bubble is the one that won’t burst!

This is what was said when people began moving off farms, this is what was said when robots became common in factories, and presumably said again as automation took over service jobs (ATMs, self-checkout, vending machines, automated answering machines, E-ZPass vs. toll booth operators, etc. etc.). Yet unemployment stays stubbornly well below 10%. When I was young, back in the middle ages sometime, my parents used to tell me to do well in school “or you’ll end up a ditch-digger”. Today, a ditch-digger is a heavy equipment operator getting a premium wage.

Those are all examples of a single industry being automated or modernised. If the hype is to be believed, AI is going to do that everywhere, all at once, very fast. I’m not sure the pattern will fit, but I admire your optimism.

Sticking to the PR script and keeping the doors locked to prevent independent observation.

Bravo! I’ve been looking for a way to put that word in service for ages and you found the perfect metaphor.

Hype and obfuscation are Musk’s two chief weapons against reality.

Stranger

His two chief weapons are hype, obfuscation, and fanatical loyalty to his own ego… wait, his three chief weapons…

What does the term “supercomputer” mean in this context versus an older conception such as - say - a Cray X-MP? It sounds like we’re really talking about a whole bunch of computers operating in a very, very tight network.

Huge arrays of tightly-coupled machines won out over one super-speed machine a long time ago. The same thing happened to personal computers. I don’t think there was ever a production machine that was clocked much faster than 5 GHz. Now it’s all multiple processors running at reasonable clock speeds.

Still, if the software can be written to work efficiently in parallel, who cares how the computer is defined?

The Cray X-MP was a vector processor-based computer. This kind of computer is really fast at specific types of linear computations, provided the software is designed specifically for that architecture. Porting code between different vector supercomputers essentially meant substantially rewriting the code for each type of machine.
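To make the distinction concrete, here’s a rough sketch in Python (NumPy standing in for the hardware vector unit; the function names and array sizes are just made up for illustration). The scalar version grinds through one element at a time; the vectorized version expresses the whole array operation at once, which is the style a vector machine was built to chew through:

```python
import numpy as np

# Scalar style: one multiply-add at a time, the way a conventional
# scalar processor would step through the data.
def saxpy_scalar(a, x, y):
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

# Vector style: the whole array operation is expressed at once, which
# is what a vector unit (or NumPy/SIMD today) is built to execute.
def saxpy_vector(a, x, y):
    return a * x + y

x = np.arange(100_000, dtype=np.float64)
y = np.ones_like(x)
assert np.allclose(saxpy_scalar(2.0, x, y), saxpy_vector(2.0, x, y))
```

The catch described above is that on a real vector supercomputer, getting the second form to run fast meant writing to that specific machine’s architecture, so porting meant rewriting.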

Since the late nineties, essentially all supercomputers have been clusters. (The originals were Beowulf clusters: standalone workstations, often put into a rack configuration and connected by an Ethernet LAN; modern machines are purpose-designed for clustering and use a variety of dedicated interconnects to eliminate network bottlenecking.) These clusters discretize a problem by breaking it up into individual computations that are run by the individual nodes and then delivered back to a master node to reintegrate. (In the case of much larger computers there may actually be tiers of nodes or dedicated sub-clusters to perform specific types of operations.)

Because the nodes are commodity hardware they can run any software that can be built with a standard compiler, and even if the nodes are individually slower and less computationally efficient than a vector processor, the cheapness of the hardware makes it possible to achieve higher performance through volume rather than optimization.

In the last decade, most of the systems built specifically for high performance computing (HPC), such as physics and molecular chemistry simulation, visualization and rendering, and now neural network training, use graphics processing units (GPUs) as a kind of math coprocessor that does all of the real ‘compute’. Interestingly enough, these GPUs are actually optimized array processors, similar in concept to a vector processor, so in a sense they are kind of like having a bunch of Cray-type computers ganged together.
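As a toy illustration of that scatter/compute/gather pattern, here’s a sketch using Python’s multiprocessing on a single machine as a stand-in for a cluster’s nodes (the chunk count and the per-node work are arbitrary choices, not anything specific to a real HPC code):

```python
from multiprocessing import Pool

import numpy as np

# Stand-in for the work a single compute node would do on its chunk
# of the problem (here just a sum of squares).
def node_work(chunk):
    return float(np.sum(chunk ** 2))

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=np.float64)
    chunks = np.array_split(data, 8)             # "scatter" the problem to 8 workers
    with Pool(processes=8) as pool:
        partials = pool.map(node_work, chunks)   # each "node" computes in parallel
    total = sum(partials)                        # master "gathers" and reintegrates
    assert np.isclose(total, float(np.sum(data ** 2)))
```

A real cluster does the same dance across physical machines, with MPI or similar over a dedicated interconnect instead of processes on one box.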

This still doesn’t make just having one, or preposterously claiming to have built one in 19 days, very impressive. Nor is boasting about the ever more computationally heavy ‘brute force’ approach to machine cognition indicative of any kind of breakthrough. Again, your 25 watt brain with a tiny fraction of the data available in the training sets for current LLMs can still outperform a chatbot in most practical ways, and certainly in terms of correctly applying real world context. I’ll be impressed when you can build an ‘AI’ agent using the computing power on your phone and train it with a corpus that is comparable to what an 18 year old would learn in school. Which may well happen at some point, but not without substantial innovations in computing that make processors work more similarly to brains.

Stranger

Thanks. I actually understood some of that!

Glad to help.

You can think of the comparison between vector processing and scalar processing in terms of how you represent a physical problem. For instance, if you have a bounded fluid whose flow you want to simulate, you could (in theory) take the Cauchy momentum equation, derive the Navier-Stokes fluid dynamics equations from it, work out all of the terms as a global tensor of position-based equations, and then iteratively solve, with a vectorized algorithm, for any particular point or volume in the flow based upon initial and boundary conditions as the flow field develops. However, doing that would not only be incredibly difficult (usually impossible for all but the most trivial and symmetric cases), but any discontinuities, non-linearities like turbulent flow, or phase changes would create numerical singularities or convergence problems requiring a bunch of ad hoc rules for each condition. In practice you can only take that approach with the most simplified scenarios, like incompressible laminar flow of a liquid through a pipe, or else use empirical methods to compensate for the inability to write global equations.
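For reference, the incompressible Navier-Stokes equations being alluded to, in standard textbook notation (nothing specific to the post, just the usual form):

```latex
% Momentum balance for an incompressible Newtonian fluid
\rho \left( \frac{\partial \mathbf{u}}{\partial t}
            + (\mathbf{u} \cdot \nabla)\mathbf{u} \right)
  = -\nabla p + \mu \nabla^{2}\mathbf{u} + \mathbf{f}

% Incompressibility (mass conservation)
\nabla \cdot \mathbf{u} = 0
```

Even in this simplest incompressible form, closed-form solutions exist only for a handful of idealized geometries, which is the point being made above.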

In analysis of real-world scenarios using computational fluid dynamics, the continuum of the flow field is broken up into a volumetric grid of cells which approximate the flow, and each cell undergoes iterative calculations to determine its state at every step of the simulation. Because the calculations are done per cell for each step, they can be sent to a scalar processor, which performs those operations and returns the result to the master; the master integrates them all back together, harmonizes any discontinuities, and then sends them back out for the next iteration. Furthermore, the solver can change the way it represents flow behavior in each cell: if a cell is near a numerical singularity, or is experiencing a transition from laminar to turbulent flow or vice versa, or some other condition, it can change the state or apply semi-empirical methods rather than having the calculation exactly represent the state at every point in the continuum. Not only does this make this kind of analysis possible for an arbitrary flow field, it also significantly reduces the complexity of the calculations while allowing more complex multiphysical phenomena (like electrodynamic plasmas, or flows which transition from gas dynamics to liquid behavior depending upon state) to be simulated even though we cannot write linearizable equations that work across the whole range of conditions.
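Here’s a minimal flavor of that cell-by-cell approach, using a toy 2D diffusion problem in Python/NumPy rather than real CFD (grid size, time step, and diffusivity are arbitrary, and a production code would also be partitioning the grid across nodes as described above):

```python
import numpy as np

# A toy 2D diffusion solver: the continuum is discretized onto a grid,
# and every cell is updated each step from its neighbors. In a real
# cluster code, strips of this grid would be farmed out to nodes and
# the strip boundaries exchanged between steps.
nx, ny, alpha, dt, dx = 64, 64, 0.1, 0.1, 1.0
field = np.zeros((nx, ny))
field[nx // 2, ny // 2] = 100.0           # a hot spot as the initial condition

for step in range(500):
    lap = (np.roll(field, 1, axis=0) + np.roll(field, -1, axis=0) +
           np.roll(field, 1, axis=1) + np.roll(field, -1, axis=1) -
           4.0 * field) / dx**2            # discrete Laplacian, computed per cell
    field = field + alpha * dt * lap       # explicit update for the next time step

print(field.sum())                         # total "heat" is (nearly) conserved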

So, this approach wins out not only because it is cheaper to build big scalar clusters than increasingly complex vector processors, but also because it is really only practical to simulate problems of great complexity by discretizing them in this way. In fact, for most types of physical simulation that go beyond rigid body dynamics or linear circuits, some kind of piecewise discrete linearization method is the default approach, because it allows arbitrary continua and boundary conditions to be defined in very simple ways that don’t require unique formulations of the underlying equations. Solid/thermofluid/heat transfer/electrodynamic finite element simulation, computational fluid mechanics, global climate circulation models, and so forth are all done using this kind of discretization approach, solving the ‘simple’ linearized differential equations to evolve the system to equilibrium or through a defined set of states.

In the case of neural networks (which, it should be noted, do not actually work like the neurons of a real brain but may be conceptually thought of as ‘computational neurons’ in a vast, sparsely connected network), the problem is inherently discretized by dint of the modeling approach, and so it is easy to send a set of conditions for each virtual ‘neuron’ to a compute node and have it do the calculations. The difficulty in representing a brain is that real neurons have thousands of outgoing connections, and representing that ‘interconnect’ in software is extremely computationally expensive even though the individual calculations are almost trivial. (The ‘calculations’ needed to represent the processing going on in a real neuron are far more complicated than what AI systems do, and while many people like to imagine neurons as just a kind of organic TTL, the reality of biophysical simulation of a neuron is far more complex.)
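For a sense of how trivial the per-‘neuron’ arithmetic is, here’s a single fully-connected layer in Python/NumPy (the layer sizes and the ReLU nonlinearity are just illustrative choices):

```python
import numpy as np

# The per-"neuron" arithmetic in an artificial network really is this
# simple: a weighted sum of inputs pushed through a nonlinearity. The
# expense is entirely in the size of the weight matrices (the
# "interconnect"), which is why the work maps so well onto GPUs.
rng = np.random.default_rng(0)
inputs = rng.standard_normal(512)             # activations from the previous layer
weights = rng.standard_normal((1024, 512))    # one row of weights per neuron
biases = rng.standard_normal(1024)

activations = np.maximum(0.0, weights @ inputs + biases)  # ReLU layer
print(activations.shape)                      # (1024,) -> one value per "neuron"
```

All of the cost lives in those dense matrix products, which is exactly the kind of array arithmetic GPUs are optimized for.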

Stranger

Oopsie!

Stranger

Womp womp.

Also Discourse shows the article as being from 8 Nov 25 which is funny to me.

There is currently some controversy that xAI gazumped Tesla’s big purchase of Nvidia kit, with Elon requesting Nvidia jump X and xAI ahead of Tesla in the queue.

This has all sorts of interesting implications. But if Tesla were already soft-pedalling the big purchase last year, it does fit. The sincerity of the claim that Tesla is an AI company and not a car maker becomes ever more suspect.

The Memphis AI plant appears to be up and running. Grok 4 posts favorable industry-standard metrics relative to DeepSeek and Claude, as well as impressive performance in serving its user base by providing recipes for nerve agents, declaring itself MechaHitler, offering anti-Semitic commentary, and creating videos resembling nude celebrities. The Memphis plant is running on all cylinders: the natural gas turbines feeding it power have increased peak nitrogen dioxide concentration levels by 79% in its neighborhood, leading to community protests. They say those turbines are temporary and are being replaced this summer. At any rate, they don’t appear to be outfitted with pollution control equipment.

Eastern Tennessee (on the other side of the state from Memphis) has good wind potential, but its capacity has been flat for around 20 years. The state’s solar development has been lackluster. A small modular nuclear power plant at Clinch River (eastern Tennessee) was announced in 2022. Last month the Trump admin expedited permitting for a new Tennessee coal mine.

As to the OP, I’d be surprised if this was anything other than a bunch of DGX SuperPOD racks interconnected via InfiniBand. In other words, NVIDIA built them all over some undefined period of time and then delivered them in a 19-day window. With the DGX system, NVIDIA probably did all of the interconnectivity between SuperPODs (it’s not that hard, as I’ve worked with InfiniBand-connected systems before, back when Microsoft sold database rack systems (APS)). Still, with rack systems such as these, there is always significant hand-holding available even if you don’t want the white glove service.

I’m reminded of the Liberty Ships built during WWII. The first ones took a while to build, but once they worked out the process the average build time came down to about 45 days, astonishingly fast to build a ship from scratch. But they eventually built one in less than 5 days.

Ahem