Can someone explain the hype around China’s DeepSeek?

This is almost a meme already in various AI discussion groups. Whether it plays out probably depends on the shape of the demand curve. Demand for AI is limited at the moment, but it will increase if AI becomes more capable, which it will if training/inference become sufficiently cheap. The catch is that capability has to improve fast enough relative to the extra compute being thrown at it. Jevons Paradox probably won’t kick in if running 10x the workload only improves performance by 1%.

My point was that Nvidia’s share price was built around an AI arms race over who could get more of the latest-generation chips. So even if DeepSeek used older Nvidia chips, this still cuts against Nvidia’s current strategy for growth.

It’s like if there were suddenly an algorithm that could run the latest PC games at high framerates on old Nvidia 1060 cards. That would be terrible news too, because the market is already flooded with those cards, and even if there’s something specific about the 1060 that even faster AMD rivals don’t have, it wouldn’t be hard for them to emulate it since the card has been out for so long.

The thing is, we already know there are other AI models out there that are better but basically too expensive to run. So it’s like the state of the art being, I dunno, Cyberpunk at low settings at 640x480, and then finding a big efficiency boost that lets it run on lesser hardware. People are going to immediately demand that the efficiency gains be put toward the top-end hardware so they can use it on higher quality settings.

Agree completely. But …

Pretty early on, Google decided to roll their own ultra-cost-efficient hardware and their own software stack, pretty much from the silicon up, all to drive down their cost per incremental search query.

I suspect that as the big winner is determined, or the product gets commoditized, there will be an effort to rewrite the successful engine(s) to eliminate a lot of the internal layer bloat and duplicative library use.

Most certainly. There is huge opportunity for optimizations across layers–the AI equivalent of function inlining. All kinds of redundant steps are happening at the moment.

You can fix this by hand to some extent. But what you really want is a “compiler” that can see across the entire stack and find opportunities.
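To make that a bit more concrete, here’s a minimal sketch of that idea as it exists today, using PyTorch’s torch.compile as one real, if limited, example (the model and layer sizes below are just illustrative):

```python
# A minimal sketch of the "compiler across the stack" idea. torch.compile is one
# real, if limited, example today; the model and sizes here are just illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
)

# torch.compile traces the whole forward pass and can fuse adjacent operations
# (e.g. a matmul and the activation that follows it) into single kernels,
# removing redundant memory round-trips between layers.
compiled_model = torch.compile(model)

x = torch.randn(8, 1024)
with torch.no_grad():
    y = compiled_model(x)
```

That’s the “function inlining” analogy from above, applied automatically instead of by hand.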

Another force here is that development is so rapid that we don’t know what the final “shape” of the stack will be. You don’t want to constrain yourself too much. Your optimizations now had better be either fully generic, or cheap enough that you don’t mind throwing them away and starting fresh on the next iteration.

You definitely don’t want to build hardware that narrowly optimizes one of the current models. Hardware development is too expensive to risk on something that might be obsolete in 6 months. Hence GPUs remain popular.

This is basically my thinking. The main obstacle to universal adoption of AI, as I understand it, was the massive amount of computing power it would take, so assuming that wide use of AI is a good thing (another debate altogether), having it be much more efficient is great news.

As far as it being China that’s beating us, I see this news less as evidence that China is dominating AI and more as evidence that no one is dominating AI (which of course comes as a shock to those who thought they were). The fact that a group can put together a significant improvement in performance in a fairly short time frame with a limited budget suggests to me that the field is in serious flux and there are likely other improvements that can/will be identified in short order.

Unrelated, but I asked DeepSeek what Mao Zedong did wrong. It gave a long answer of all the things he screwed up and all his terrible policy ideas. Then it deleted that answer and said ‘Sorry I don’t have an answer’ or something like that.

I thought the whole thing with AIs is that we have no idea what’s going on inside them. How can you streamline something you don’t understand?

As well as being open source, DeepSeek also lets you examine its internal monologue.

It’s quite cute seeing this discussion of how many r’s are in “strawberry”:

https://www.reddit.com/r/ChatGPT/comments/1iaqkip/deepseek_arguing_with_itself_about_how_many_rs/

We can streamline the actual human-written software code, not the opaque model that runs on top of it. Similarly, we can observe how the hardware is used and build special-purpose hardware that does less of what’s not needed and more of what is needed.

Kinda like how nowadays we can perform brain surgery even though we have no idea how to perform mind surgery.

At a high level, there’s the computation and the data. The computation is 99% multiplying by gigantic matrices. All of this structure is defined by humans and is a big component of what differentiates the different models.

But then there’s the data–the actual content of those matrices. That was generated from training. There is a forward pass–the model is run on some inputs to generate some outputs–and then there’s a backwards pass, where you look at the difference between what the model generated and the output you wanted/expected. Then the weights are tweaked slightly so the output more closely matches what you want. Repeat a trillion times.
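If it helps, here’s a toy sketch of that loop in Python/NumPy: one small weight matrix standing in for the gigantic ones, with an explicit forward pass, backward pass, and weight tweak. Everything here is made up for illustration; real training uses autodiff frameworks and billions of parameters:

```python
import numpy as np

# Toy stand-in for a model: one small weight matrix W. Real LLMs stack thousands
# of much larger matrices, but the training loop has the same shape.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))

X = rng.normal(size=(32, 4))                  # some inputs
Y_target = X @ np.diag([1.0, 2.0, 3.0, 4.0])  # the outputs we want the model to produce

learning_rate = 0.01
for step in range(1_000):
    Y = X @ W                      # forward pass: run the model on the inputs
    error = Y - Y_target           # how far off was the output?
    grad = X.T @ error / len(X)    # backward pass: gradient of the loss w.r.t. W
    W -= learning_rate * grad      # tweak the weights slightly toward the target
# (Repeat a trillion times, over far more data, and you have an LLM.)
```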

So we know how all this happens, but the actual meaning of the numbers in those matrices is mysterious. You can follow some particular computation from the beginning to the end, but it still doesn’t tell you much–it just looks like an endless parade of multiplies and additions by numbers that came from nowhere.

Nevertheless, we can streamline these computations because it’s still just matrix math. We don’t have to understand where the numbers came from to work on them more efficiently.
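For example, one common way to streamline things without understanding the weights at all is to quantize them, i.e. store and multiply them at lower precision. A toy NumPy sketch (the sizes and the simple symmetric scheme are just illustrative):

```python
import numpy as np

# "Mysterious" trained weights: we have no idea what the numbers mean.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)
x = rng.normal(size=(512,)).astype(np.float32)

# Quantize to int8: all we need is the range of the numbers, not their meaning.
scale = np.abs(W).max() / 127.0
W_int8 = np.round(W / scale).astype(np.int8)

y_full = W @ x                                     # original matmul
y_quant = (W_int8.astype(np.float32) * scale) @ x  # matmul with compressed weights

# The outputs agree closely, at a quarter of the storage (and, on real hardware,
# much cheaper arithmetic).
print(np.max(np.abs(y_full - y_quant)))
```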

This video by Computerphile is about DeepSeek and what sets it apart. They’ve been putting out really good content on how these models work since ChatGPT’s early days.

Here is a report on how the general scientific community is receiving this new LLM. (Full access now requires creating a login and password, but nothing beyond that)

Scientists flock to DeepSeek: how they’re using the blockbuster AI model

From the article, one of the things that impresses the scientific community is DeepSeek’s openness:

For researchers, R1’s cheapness and openness could be game-changers: using its application programming interface (API), they can query the model at a fraction of the cost of proprietary rivals, or for free by using its online chatbot, DeepThink. They can also download the model to their own servers and run and build on it for free — which isn’t possible with competing closed models such as o1.
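For what it’s worth, querying it over that API looks like any other chat completion call. A rough sketch, assuming the endpoint is OpenAI-compatible (the base URL, model name, and key below are my assumptions; check DeepSeek’s documentation):

```python
# Rough sketch of querying the model over the API, assuming it is OpenAI-compatible.
# The base URL, model name, and key below are assumptions; check DeepSeek's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",       # placeholder
    base_url="https://api.deepseek.com",   # assumed endpoint
)

response = client.chat.completions.create(
    model="deepseek-reasoner",             # assumed name for the R1 model
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(response.choices[0].message.content)
```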

Ahhh, exciting stuff from Chinese researchers. Some really excellent and clever work!

Here is a fantastic article on why DeepSeek is very cool. Maybe one of the best things you can read right now, even if I disagree with his premise that Nvidia stock might be over-valued. In fact, he even states my case as a possibility for why he is wrong.

If you wish to save time, just scroll all the way to:

The Theoretical Threat

  • One, this model is absolutely legit. There is a lot of BS that goes on with AI benchmarks, which are routinely gamed so that models appear to perform great on the benchmarks but then suck in real world tests. Google is certainly the worst offender in this regard, constantly crowing about how amazing their LLMs are, when they are so awful in any real world test that they can’t even reliably accomplish the simplest possible tasks, let alone challenging coding tasks. These DeepSeek models are not like that— the responses are coherent, compelling, and absolutely on the same level as those from OpenAI and Anthropic.

  • Two, that DeepSeek has made profound advancements not just in model quality, but more importantly in model training and inference efficiency. By being extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way. By some measurements, over ~45x more efficiently than other leading-edge models. DeepSeek claims that the complete cost to train DeepSeek-V3 was just over $5mm. That is absolutely nothing by the standards of OpenAI, Anthropic, etc., which were well into the $100mm+ level for training costs for a single model as early as 2024.

In five years, it will probably run on your cell.

20GB ain’t much…

I was wondering if someone knowledgeable could explain the advantage of DeepSeek being open source. I know Meta did this too, in order to try to attract users, given that its main product is selling ad space. How much difference does this make in practice? Results are still a black box…

It means you can run the models locally, and guarantee that zero information leaves your machine.

Unfortunately, DeepSeek-R1 is too big for all but some extremely powerful setups. However, there are distilled versions that can run on consumer graphics cards.

You can run DeepSeek-R1 on cloud servers that aren’t associated with China. That might be an acceptable middle ground for some.
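For anyone curious what running it yourself looks like, here’s a rough sketch using Hugging Face transformers with one of the distilled versions mentioned above (the model ID is my guess at one of the published distills; you still need a reasonably beefy GPU, or a lot of patience on CPU):

```python
# Rough sketch of running one of the distilled models locally with Hugging Face
# transformers. The model ID is an assumption; check the published distills.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "How many r's are in 'strawberry'?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Once the weights are downloaded, nothing leaves your machine.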

From what I’ve heard, they created this on the cheap, with some last-generation Nvidia chips from before the embargo, for only $6 million, which you cannot build a bionic man from (though maybe these guys can).

In my experience, it seems to return answers very similar to ChatGPT or Gemini. At least no greater wisdom.

I did install it and asked it (I’d seen a guy do this on YouTube) about “A man standing defiantly before a series of tanks” and got back references to the (Pulitzer Prize-winning) photos, with links.

I asked the same of the online version, and it started to say “Yes there is…” then that was erased and replaced with “This question is out of my scope”

How long till the question “What happened to that guy?” can be answered? “What guy?” is the current answer.

I don’t understand the computer science, but this article says the hardware cost may have actually been closer to $500 million for DeepSeek, not $6 million.

Yep, and to be clear, it’s not that DeepSeek lied about their costs. They just cited $6 million for the training cost in the paper; some journalist compared that number, not like-for-like, with the total costs for GPT-4o, and everyone else ran with it.