How can Elon Musk build a supercomputer in 19 days, when his competitors take four years?

This link says that Musk built a supercomputer in 19 days, instead of taking three years to design and one more year to build.
How is this possible?
Is there something different about Musk’s supercomputer compared to his competitors’?

Maybe he didn’t need to design it, and just copied an existing design? But even then, it supposedly takes “over a year” to build. That’s waaay more than 19 days!

What are Musk’s employees doing that nobody else knows how to do?
(Also, if you were hiring somebody for your company, would you believe a resume submitted by a project manager saying that he “successfully managed a team which completed a 365-day project in 19 days”?)

What’s the Straight Dope here?

(I took this link from the currently-running GD thread about Musk and Tesla’s market value)

Maybe it’s another of those never-ending tech projects? Is something with an unspecified timeframe for expansion ever really “built”?

People like Musk can move much more quickly than bureaucratic organizations. When you alone have more money than many countries combined, plus the backing of various powers that be, and no real checks or balances, it’s pretty amazing what you can get done.

Musk is pretty much the real world embodiment of Atlas Shrugged. Say what you will about him (I certainly have mixed feelings), but he can move mountains the way few mere mortals can. He could probably become the cyber electric emperor of Mars if he wanted to. Maybe Earth too. What force would stop him?

Why on Earth would it take as long as 19 days? All a “supercomputer” is, any more, is a bunch of graphics cards connected together. If you can afford all of the graphics cards (spoiler alert: Musk can), it takes an afternoon.

I would start by questioning the assumptions of that unabashedly adulatory article.

There is far more to building a computing cluster than just “a bunch of graphics cards connected together”, especially a single cluster with 100k nodes.

Stranger

The “competitors” are groups like OpenAI, not the federal government, so presumably they’re no more bureaucratic than his group. Though he does have the advantage of all of the work that OpenAI has done to get to this point.

That’s why I asked: yes, the article is unabashedly fawning, but it contains a specific numerical data point: 19 days.

I’m typing this on the PC on my desk, which has a graphics card in it. If I decide to plug 100,000 more cards inside the case, it’s gonna take me more than 19 days. :slight_smile:

I suspect the 19 days doesn’t include the time to build the building.
If you order pre-configured racks of servers, and only start the clock when they are delivered, then 19 days sounds like plenty of time to hook them to power and plug in all the cabling.
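
A back-of-the-envelope sketch of that hook-it-up phase (every figure below is an assumption for illustration; none of these numbers come from the article):

```python
# Rough arithmetic: how much physical work is racking 100,000 GPUs?
# All figures here are illustrative assumptions, not numbers from the article.
gpus = 100_000
gpus_per_server = 8           # typical for dense GPU servers (assumption)
servers_per_rack = 8          # assumption; limited by power/cooling per rack
racks = gpus / gpus_per_server / servers_per_rack      # ~1,560 racks

cables_per_server = 10        # power plus several network links (assumption)
cables = (gpus / gpus_per_server) * cables_per_server  # ~125,000 cables

crew = 200                    # technicians working in parallel (assumption)
cables_per_tech_per_day = 40  # assumption
days = cables / (crew * cables_per_tech_per_day)       # ~16 days
print(f"{racks:.0f} racks, {cables:,.0f} cables, roughly {days:.0f} days of cabling")
```

Under those made-up but plausible numbers, the pure plug-everything-in phase does fit inside 19 days, which supports the “clock starts at delivery” reading.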

Cutting corners, skimping on procedures. Making decisions that will cause major issues somewhere down the line. Anything can be built quickly; it takes care to build something effective that lasts.

Okay, but if it’s that easy, why did the article claim that it takes 3 years to design, and another full year to execute?

It might very well take that long, but the design time is “invisible.”
As for “fully execute” - sure. You can have something working in 19 days, but the full system might take a bit longer.

Doing something the first time is always harder and takes longer than copying an existing design. Is Musk’s computer substantially different than any other similar computer? If not, he probably just hired the key people and gave them a blank check.

My guess is his team did not design it from scratch, but used an existing design. That saves a ton of time. It sounds impressive to turn a former manufacturing center into a data center in under 18 months, but what was the status of the facility before this project was started? And how much would it have cost if it wasn’t a rush job? The article is not very detailed (IOW, very hyped).

Well, it depends on what that “19 days” includes. If it is constructing the building, assembling the racks, installing the HVAC and UPS/power management systems, et cetera, then I would find that extraordinary. (I see upon review that it was installed in a “former manufacturing facility”, but there is no indication of whether the previous power and HVAC infrastructure was in place or adequate.) If it is just assembling the individual nodes, that is possible with a sufficient number of trained people to physically assemble the system, but an HPC node is more than just a GPU; at a minimum, it is a motherboard with a CPU, interconnect, DDRAM, SSHD, and then the operating system, drivers, a message passing interface (MPI) or some other API, any specific parallel compute framework like CUDA or OpenCL, and of course the actual software doing whatever calculations are required of the GPU.
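
To give a flavour of what that software layer actually is, here is a minimal sketch of a message-passing program using Python’s mpi4py bindings (purely illustrative; an AI training cluster would more likely use NCCL underneath a framework like PyTorch, but the division-of-labour idea is the same):

```python
# Minimal MPI sketch: each process computes part of a sum and rank 0
# gathers the result. Run across machines with something like:
#   mpirun -n 4 python partial_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the job
size = comm.Get_size()   # total number of processes in the job

# Each rank takes every size-th element, so the work is split evenly.
local = sum(range(rank, 1_000_000, size))

# Combine the partial sums on rank 0 over the interconnect.
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(f"{size} ranks computed total = {total}")
```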

Even assuming that the nodes are homogeneous and you can just propagate cloned images through the internal network, people have to physically assemble the nodes and install them in racks, flash a bootloader, and install some minimally functional OS such that they can attach to the system. I’m not familiar with AI clusters, but most clusters of this scale are not homogeneous and have specific blocks of nodes that perform particular tasks related to distributing computations or load-balancing, and of course the entire system needs to be validated and benchmarked for actual performance to assure that it is assembled as intended and that it will deliver the expected performance without memory, network, or computational bottlenecks. All of that testing will take weeks because the system needs to be stressed through a variety of benchmark cases, with adjustments made that might include proprietary firmware changes or updates, virtual or perhaps even physical reconfiguration, and maybe even changes to the underlying computation code to limit overstress or better utilize the system. While cluster computers aren’t as finicky as vector processors about achieving performance close to the idealized system, they are prone to bottleneck limitations and also to just the physical problems that such a massive, power-hungry system can have, and it can literally take months to get such a large system performing as intended, of which the article makes no mention.
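
To make “validated and benchmarked” concrete, here is a sketch of the sort of crude all-reduce timing check one might run during bring-up (all-reduce being the collective operation that dominates distributed training). This is illustrative only; real validation uses dedicated suites such as nccl-tests or the OSU micro-benchmarks:

```python
# Crude interconnect sanity check: time a large all-reduce across ranks.
# Illustrative sketch only, not a substitute for a real benchmark suite.
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n = 64 * 1024 * 1024                     # 64M float32s = 256 MB per rank
send = np.ones(n, dtype=np.float32)
recv = np.empty_like(send)

comm.Barrier()                           # line all ranks up before timing
t0 = time.perf_counter()
comm.Allreduce(send, recv, op=MPI.SUM)   # buffer-based (fast path) variant
dt = time.perf_counter() - t0

if comm.Get_rank() == 0:
    print(f"all-reduce of {send.nbytes / 1e9:.2f} GB took {dt:.3f} s")
```

A pass like this, swept over message sizes and node groupings, is one way you find the mis-seated cable or misconfigured switch hiding among 100,000 GPUs.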

I’ll also note, independent of this particular cluster, how much raw power is being consumed in order to train an “AI” chatbot to do the same thing that a 25 watt human brain does with more ‘nodes’ (counting neurons) but ostensibly less ‘compute’ in terms of raw processing power and speed. Despite the tech press eating up AI hawkers’ claims about how amazing these systems are at training chatbots to perform progressively more sophisticated language manipulation, it is clearly a brute force approach: it takes terabytes of electronic text and image data that a human would need thousands of lifetimes to consume just to create a chatbot or generative AI ‘bot’ which does things that would be impressive from a talented twelve year old, but is often factually and conceptually wrong in its responses to prompts, has severe limitations in its ability to relate its responses to real world constructions, and is approaching the total available data for training without achieving any kind of breakthrough in ‘common sense’ or reliability, all while consuming so much energy and so many resources that it is measurably contributing to ecological harms. This is like building a diesel powered go-kart out of aluminum cans and papier mâché that can get up to a speed of 20 miles an hour but consumes 50 liters of fuel per lap and spews out black clouds of unburnt hydrocarbons behind it. It is kind of impressive that they can do anything at all, but they don’t do it very well and with staggering inefficiency, and yet we’re investing billions into furthering the development of these systems and assigning them truly stunning future valuations despite it being unclear that they will ever be suited for high value, critical reliability use cases. They are truly the tulip bulbs of the 21st century, even more so than cryptocurrencies and NFTs.
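
To put rough numbers on that power comparison (the per-GPU wattage below is an assumed round figure for a modern datacenter accelerator, not a spec quoted anywhere in the article):

```python
# Rough power comparison for the brute-force point above.
# 700 W per GPU is an assumption, and it ignores CPUs, networking,
# and cooling overhead, so the real total would be higher still.
gpus = 100_000
watts_per_gpu = 700
cluster_w = gpus * watts_per_gpu    # 70,000,000 W, i.e. ~70 MW

brain_w = 25                        # the figure used in the post above
print(f"cluster: ~{cluster_w / 1e6:.0f} MW, "
      f"or ~{cluster_w // brain_w:,} brain-equivalents of power")
```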

Stranger

It certainly sounds like it really depends on how one defines “build”.

It reminds me of the Liberty Ships in WWII. At first, they took months to build. By the end of the war, they had that down to just over a month.

In one case, a publicity stunt, they went from laying down the keel to launching in under 5 days. But that was definitely a publicity stunt. There were several bits internally that needed to be finished afterward, and they had everything prepped and ready ahead of time.

There are several plausible ways of defining “build” that would allow for something like this. Does that mean they could build a supercomputer from scratch, on demand, anywhere in the world in 19 days, including preparing the building, power, permitting, etc? Almost certainly not.

But it got you talking about it, right? And that’s the actual purpose.

I mean, it wasn’t so long ago that OpenAI almost dissolved, having nearly lost Sam Altman, who then came back and staged a coup in full. Elon was on their board previously too, and caused quite a bit of disagreement.

Startups may have less red tape than the government or established FAANG, but they’re not completely immune to politics. And certainly they don’t have as much freedom as one-man shows.

I don’t think Elon has people challenging his authority like that in his own groups.

Yeah, absolutely. But then so did Google, Apple, etc., and they took way longer to get to parity.

Do we know that he’s at parity but Google, Apple and so on are not? I think all he’s done so far is to assemble the 100,000 GPUs in one data center. What has he accomplished that shows his supercomputer is at the same level of sophistication (or stupidity)?

And BTW, isn’t the real limitation the supply of the Nvidia processors? Surely there’s not an infinite supply of them?

Google and Apple also wanted to do things with their AIs that others hadn’t already done.

Challenging authority is how you get things done efficiently.

It wasn’t a 19 day project. It was 19 days from the moment the first racks started being delivered to the already up and running facility, to the point where some computing ran.

From Nvidia’s newsroom

The press release says 19 days until it started training, but omits to note whether that was training on test data to validate the system operation, or actually running production code on production data. I know which way I would bet. Systems of that size will always take time to sort out.

The infrastructure took 122 days. In parallel with that, Nvidia would have been building up racks of compute. We can assume this 122 days was from the day the spades hit the dirt, not the time taken to design, find contractors, sign contracts, get approvals, and order long lead time equipment. The large scale power and cooling systems are not off the shelf components and lead times can be significant.

The entire project is claimed to have taken 16 months. Even that will be on the back of significant early work. 16 months will be the time from signing the master contract.

Supercomputer is not a term that describes one defined thing. The really big systems are carefully designed around the problem space they are intended to address and that takes a lot of work before anyone is ordering any hardware. Raw compute is usually the easy bit. Balancing compute, memory, storage, and especially communication bandwidth and latency requires a deep understanding of the problem and how you get the best bang for buck.
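
As a toy illustration of that balancing act, here is a roofline-style check of whether a given computation is limited by compute or by memory bandwidth (the hardware numbers are made-up round figures, not the specs of any real system):

```python
# Roofline-style sanity check: which resource caps a kernel's performance?
# All hardware numbers are illustrative round figures.
peak_flops = 1e15   # 1 PFLOP/s of compute per node (assumed)
mem_bw = 3e12       # 3 TB/s of memory bandwidth per node (assumed)

# Arithmetic intensity = floating-point operations per byte moved.
# Below the crossover, memory starves the compute units; above it,
# the compute units are the limit and the bandwidth sits idle.
crossover = peak_flops / mem_bw   # ~333 FLOPs/byte here

def bound(intensity: float) -> str:
    return "compute-bound" if intensity >= crossover else "memory-bound"

print(bound(2))      # streaming/sparse kernels: memory-bound
print(bound(1000))   # large dense matrix multiplies: compute-bound
```

Get that balance wrong at design time and you have paid for petaflops that spend their lives waiting on memory, which is exactly why the design work matters more than the rack-wheeling.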

An AI supercomputer helps in that we can assume a lot of that work has already been done. Nvidia are working hard to be a one stop shop. It isn’t just about the GPUs. Nvidia’s acquisition of Mellanox jumped them up to being a dominant competitor in communications infrastructure as well. The Nvidia article trumpets their RDMA (remote direct memory access) system, something that has been a mainstay of HPC for decades; it goes back to Myrinet and InfiniBand, which got gobbled into Mellanox years ago.

It isn’t unreasonable to imagine the system was configured from a known scalable design template. That would save a huge amount of time, but it would still take significant time. Nobody is going to sign off on a design costing many hundreds of millions to billions of dollars without being clear that it isn’t a career-limiting move. (Whilst these sums sound insane, none of this touches the oil industry. A single well can cost significantly more than 100 million, and there is an exploration manager who gets to decide where it goes. That focuses the mind.)

19 days is just a vanity number and one that conveys little to nothing about the actual project.

Nvidia might like to point to it as a way of advertising their one stop shop advantage. But nobody is going to be fooled that the system was created in 19 days. Closer to two years would be my bet. That is from the genesis of an idea, to serious negotiations, contract signing, design of system, design of facility, start of logistics planning, ordering, sub contracts, work starting on construction, and so on. Wheeling the racks in is the last tiny step, but one that makes for good press and photo opportunities.