How can Elon Musk build a supercomputer in 19 days, when his competitors take four years?

If, perhaps, those racks already had the servers loaded onto them, that’s not much of an achievement.

Which, from my reading, they did. Not just the compute boxes but also the power control and rack-scale networking. It’s not clear what the cooling is: if air, it’s ready to roll; if liquid, there will be some plumbing fittings to connect.

Sounds much more like a triumph of logistics rather than computer engineering.

Or PR.

That’s a very polite term for “bullshit”.

Stranger

Presumably it takes time to decide on how to best achieve a desired goal.

Buying 100,000 graphics cards is expensive even for Musk.

You have a problem you want to solve and decide you need a really big computer to do it. But what is the best way to do that? Nvidia or AMD? Which model of card? We know the consumer ones, but Nvidia (at least) also makes professional cards. Then you need to wait for them to manufacture 100,000 cards for you. That will take a lot of time on its own.

Where will this be housed? Power delivery. Cooling. Control systems to make it all work together as you want. How many employees to keep it running?

Loads of questions and complications.

If all he wants to do is slap together 100,000 cards and can have them all delivered to one place at the same time (for 100,000 cards I’d expect that alone would take a lot of time) then sure. Get an army of people to plug them all in and you have a big computer. Now make it do something useful (the real test is if it could run Crysis with max settings).

Remember that Musk was a co-founder of, and early investor in, OpenAI (the group responsible for ChatGPT, among other projects), and their patents and research are freely available, so presumably his new project is building on that work.

Preach it.

It beggars belief how far this has run on such slim underlying science. Musk can afford to eat the costs. But the distortion of reality makes Steve Jobs look like an amateur. (I saw Steve in person doing his thing, back when Apple were doomed. Elon and the various other AI flacks take this to an entirely different plane.)

The danger for the rest of us is just what happens when the bubbles burst. There are going to be a lot of highly leveraged people taking a bath. That always knocks on.

Electrical power is an underestimated aspect of AI datacenters. I suspect that at least some of the time savings here is due to cross-pollination between the energy teams at Tesla and those at xAI.

The use of GPUs for AI training has some unusual aspects compared to traditional datacenters. In, say, the datacenters used for Google search, you can depend on the law of averages: system power use may go up and down, but averaged over tens of thousands of systems, the total is predictable. But AI training has two characteristics that differ from that situation:
- Peak power use can be dramatically different from minimum power use since the GPU power is so dominant
- Due to the way AI training works, there is very strong correlation between power use among all the systems

What this means is that peak power use can be massively higher than average power, and furthermore the rate at which the entire facility can change its power input can be very high.
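
You can see the effect with a toy simulation. This is just an illustration (the system count, time steps, and idle/peak power levels are made-up values, and real training loads are not a coin flip), but it shows why synchronized loads blow up the peak-to-average ratio while independent loads average out:

```python
import random

random.seed(0)
N_SYSTEMS = 1_000      # systems in the facility (scaled down to keep this fast)
STEPS = 200            # time steps observed
IDLE, PEAK = 0.2, 1.0  # per-system power in arbitrary units (assumed values)

def facility_power(correlated: bool) -> list[float]:
    """Total facility power over time, for lockstep vs. independent loads."""
    totals = []
    for _ in range(STEPS):
        if correlated:
            # AI training: every system computes (or waits) in lockstep
            level = PEAK if random.random() < 0.5 else IDLE
            totals.append(N_SYSTEMS * level)
        else:
            # traditional datacenter: each system's load is independent
            busy = sum(random.random() < 0.5 for _ in range(N_SYSTEMS))
            totals.append(busy * PEAK + (N_SYSTEMS - busy) * IDLE)
    return totals

for name, corr in [("independent", False), ("correlated", True)]:
    p = facility_power(corr)
    print(f"{name}: peak-to-average ratio = {max(p) / (sum(p) / len(p)):.2f}")
```

The independent case stays within a few percent of its average; the correlated case swings the entire facility between near-idle and full peak, which is exactly what the grid connection has to be sized for.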

There are some reports of destroyed transformers because the training system had idle periods that induced a resonance in the windings. If you’ve heard of “coil whine”, this is the same thing, except at a scale that can cause physical destruction. The naive solution (which has been implemented in places) is to never have idle periods–if there’s nothing to do, run some dummy calculation (I’d say they could just mine Bitcoin, but unfortunately the periods are too short).
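
A minimal sketch of that "never idle" mitigation. Everything here is illustrative (the function names, the threshold, and the CPU busy-loop standing in for what would really be a GPU kernel are my own assumptions, not anyone's actual implementation):

```python
import time

def dummy_load(duration_s: float) -> None:
    """Burn cycles so power draw never drops to idle.

    In a real cluster this would launch a throwaway GPU kernel;
    a CPU busy loop is a stand-in for the sketch."""
    end = time.monotonic() + duration_s
    x = 1.0
    while time.monotonic() < end:
        x = (x * 1.0000001) % 1e6  # pointless arithmetic to keep the chip busy

def training_step_with_filler(do_real_work, max_idle_s: float = 0.01):
    """Run the real step; if it finishes early, pad the remainder with
    dummy work so the facility's power draw stays smooth."""
    start = time.monotonic()
    result = do_real_work()
    elapsed = time.monotonic() - start
    if elapsed < max_idle_s:
        dummy_load(max_idle_s - elapsed)
    return result
```

The point of the padding is that the load transition the transformer sees is bounded, regardless of how the training workload stalls.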

You also end up having to size everything for absolute peak loads, even though the average might not be so high. Not cheap, especially when there’s a shortage of things like transformers in the first place.

The solution here is to have a hierarchy of energy caching. At the high level, this means something like Tesla Megapacks (grid-scale batteries). These can smooth the load from the provider, and furthermore are a huge cost savings since billing at this scale is basically proportional to peak use.
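
Some back-of-the-envelope peak-shaving arithmetic. Every figure below is an assumption for illustration, not an actual xAI, Tesla, or utility number:

```python
# Assumed facility profile and tariff (illustrative only)
facility_avg_mw = 70.0            # average draw of the training cluster
facility_peak_mw = 150.0          # synchronized-training peak
demand_charge_per_mw = 15_000.0   # monthly $/MW of peak grid demand

# Without storage, the bill is driven by the raw peak.
bill_no_battery = facility_peak_mw * demand_charge_per_mw

# With enough battery, the grid only ever sees something near the average;
# the battery supplies the difference during peaks.
battery_cap_mw = facility_peak_mw - facility_avg_mw
bill_with_battery = facility_avg_mw * demand_charge_per_mw

print(f"battery must deliver {battery_cap_mw:.0f} MW during peaks")
print(f"monthly demand charge: ${bill_no_battery:,.0f} -> ${bill_with_battery:,.0f}")
```

With these made-up numbers the demand charge roughly halves, which is why "billing proportional to peak use" makes the batteries pay for themselves.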

You also need smaller-scale caching that acts more locally. There are solutions here but I’m not sure if there’s anything off the shelf.

Anyone who can actually solve these problems has an enormous advantage in cost and schedule. Sure, anyone can order Megapacks (or alternatives)… but the knowledge of how to use them is not evenly distributed, and I suspect many people who have designed datacenters in the past have not anticipated all the differences that AI brings. Further, having a direct connection to Tesla means that xAI can get custom firmware, etc. that does exactly what they need.

I know that xAI is doing at least some of this. The full extent is more speculative.

Another thing Tesla may be able to bring to the table is solid-state transformers. Their Megapacks demonstrate they have the capability. This is important because, as I mentioned earlier, bog-standard transformers are hard to get and in many cases are backordered for years. Solid-state transformers (using power electronics to do MVAC->DC->AC conversion) are entirely possible, and not necessarily cheap, but more available than the alternative. And it may be possible to save money by combining roles with the power storage system.

I suppose the point is: the interconnect hardware/network designs and the operating system software already exist, and were not an xAI original creation. So getting the first nodes online is not rocket science, even for Elon’s crew. Presumably all the hardware and software accommodates expansion in real time, with no need to reinstall everything and start over.

The article is short on details, but the 19 days, as others mention above, was to get the power on and computing happening. The article implies the full center, with all 100,000 chips(?) installed, is not yet complete.

It seems to be mainly fluff, low on details.

That makes it even more impressive. Logistics is hard.

But unlike tulip bulbs in the 1630s, AI today has value, because it does something that could be very useful.

Yes, it may be only as good as a talented 12-year-old… but half of all the work done today could be done by a talented 12-year-old. :slight_smile:

Imagine the world’s offices filled with 12-year-olds, but supervised by adults.
Half of all the documents produced (by insurance clerks, by banks, by real estate agents, by accountants or their bookkeeper assistants, and by lawyers or their paralegal assistants, etc.) are routine matters that could be written by a talented 12-year-old.

During an 8-hour day, 6 hours of the work could be done by our theoretical 12-year-old, leaving one adult professional with 2 hours of work to supervise and fix the mistakes. If AI replaces all those jobs for 6 hours a day, it’s going to be a good investment.

An Nvidia 4090 retails at Walmart for $2,500. 100,000 of them is only $250 million at retail price, which I assure you Elmo (or anyone else buying 100,000 of them) wouldn’t be paying. That’s pocket lint to Elmo; it isn’t remotely expensive for him.

They’re using H100s in these machines, not 4090s. They cost more like $30k each. Not including power and everything else.
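
For scale, plugging in the two per-unit prices mentioned in this thread (retail 4090 vs. a rough H100 street price, hardware only):

```python
n_gpus = 100_000
rtx_4090_retail = 2_500  # consumer card retail price cited upthread, in $
h100_street = 30_000     # rough per-unit H100 figure cited upthread, in $

print(f"4090s: ${n_gpus * rtx_4090_retail / 1e9:.2f}B")  # $0.25B
print(f"H100s: ${n_gpus * h100_street / 1e9:.2f}B")      # $3.00B
```

So the datacenter-class build is an order of magnitude pricier than the consumer-card estimate, before power, networking, and the building itself.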

Just for the record: 19 days is 1,641,600 seconds. So if the connecting cables are standardized, the connections configure themselves automatically, and the programming just works by magic, you will be able to build it if you work day and night plugging in a card every 16.42 seconds.
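
That arithmetic checks out, and the same numbers show how a large crew changes the picture (the crew size below is just an example value):

```python
days = 19
cards = 100_000
seconds = days * 24 * 3600  # 19 days in seconds
per_card = seconds / cards  # single installer working nonstop

print(seconds)                                   # 1641600
print(f"{per_card:.2f} s per card, one person")  # 16.42 s per card, one person

crew = 100  # example crew size
print(f"{per_card * crew / 60:.1f} min per card with a crew of {crew}")
```

With a hundred people plugging in cards in parallel, the pace per person drops to a leisurely half hour per card, which is why this is a logistics story rather than a miracle.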

As a translator in one of my former lives, I can assure you that correcting a bad translation takes more time and nerves than throwing it in the bin and starting from scratch. Your working hypothesis does not apply in all fields.

As a current translator I can confirm this. Every time I’ve used AI to assist my translation, I’ve ended up chucking it all away and starting from scratch. It’s faster that way.

Huh. My wife has been a translator since the dawn of time, including certifications in medical, legal/court and financial work. She currently does public sector work including translations and says that she finds DeepL to be incredibly helpful for her work and a real time saver. Obviously everything needs a read-through and check for accuracy but she finds real value in it.

Mind you, she’s primarily translating into Spanish which is probably better tested and seated in the dataset than Dutch or Arabic might be.

But the article says they got it up and actually producing results in 19 days; the whole thing will take 16 months. So presumably, reading between the fluff, they got the first few pieces up and the software installed and configured, and then they can plug in more and more at their leisure while the software adapts as extra GPUs come online. Until then, maybe the AI can write some nice fluffy PR material…

There is an old saying that applies to many industries.

Good, fast, cheap. Pick any TWO.

Making 90% (or whatever large percentage) of the jobs in the world redundant doesn’t seem like a great investment to me. If you are the owner of a business, who is left to buy your products and services, once everyone is unemployed?