I don’t think I really answered this properly. I think you’re asking about the distinction between the two types of model.
Previous versions of FSD combined an image feature detector trained on human-labeled data with standard programming logic. For the image detector, basically you would get people to label images with the geometry and type of each feature. If you see a truck, you draw a bounding box around the truck and tag it with the type. Same thing with humans or bicycles or whatever else. You draw lines corresponding to the road boundaries. And so on.
You feed that into the AI training system, asking it to predict the geometry given the provided images. Eventually it gets pretty good at reproducing the labels.
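To make that concrete, here's a minimal sketch of what one human-labeled example and a single supervised training step could look like, assuming a toy PyTorch setup. Every name, shape, and number below is invented for illustration; the real pipeline predicts many feature types at once and is vastly bigger.

```python
# Hypothetical sketch of the old supervised-labeling approach: humans draw
# boxes/lines on camera frames, and a network learns to reproduce them.
from dataclasses import dataclass

import torch
import torch.nn as nn

@dataclass
class Label:
    kind: str    # "truck", "pedestrian", "lane_line", ...
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels

# One human-labeled frame (values invented for illustration).
frame = torch.rand(1, 3, 240, 320)              # camera image
label = Label(kind="truck", box=(40, 60, 180, 200))

# A toy detector: predicts 4 box coordinates from the raw pixels.
detector = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 240 * 320, 64),
    nn.ReLU(),
    nn.Linear(64, 4),
)

target = torch.tensor([label.box], dtype=torch.float32)
optimizer = torch.optim.SGD(detector.parameters(), lr=1e-4)

# One supervised training step: predict the geometry, compare to the
# human label, and nudge the weights toward the label.
pred = detector(frame)
loss = nn.functional.mse_loss(pred, target)
loss.backward()
optimizer.step()
```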
So the car then has these features available and has to act on them. You write code so the car stays between the lanes. You write code to brake if the car ahead is slowing, or accelerate if the road ahead is open and you’re under the speed limit. You write code to change lanes if there are no obstacles in the destination lane. And so on.
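Roughly, that hand-written control layer looks something like this sketch. Field names, gains, and thresholds are all invented; the real thing was far more involved.

```python
# Hypothetical hand-written control logic on top of the detector's outputs.
from collections import namedtuple

Perception = namedtuple("Perception", "lane_center_offset_m lead_closing_speed_mps")
Ego = namedtuple("Ego", "speed_mps speed_limit_mps")

def decide(p, ego):
    """Return (steering, throttle, brake) from detected features."""
    steering = -0.5 * p.lane_center_offset_m    # steer back toward lane center
    if p.lead_closing_speed_mps > 0.5:          # lead car is slowing relative to us
        return steering, 0.0, 0.6
    if ego.speed_mps < ego.speed_limit_mps:     # open road, below the limit
        return steering, 0.3, 0.0
    return steering, 0.0, 0.0

print(decide(Perception(0.2, 1.0), Ego(20.0, 25.0)))   # -> brakes for the lead car
```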
The problem here is two-fold. One, the image detector is limited by how many labels you can come up with. But there are just so many things that could be in your way that your imagination will never be good enough. You add a ladder detector since those sometimes fall off trucks. A barrel detector. A deer detector. An ostrich detector. Whatever.
Furthermore, the labeling process is difficult and not very scalable. You can speed it up a little by having the computer do most of the predictions, and then the human just cleans them up and adds new stuff as necessary. But it’s error-prone, time-consuming work. You want to train on millions or billions of images, and that’s just not possible for humans.
The second problem is the code. You have to write something to handle every possible case. Some tiny portion of the code has to be dedicated to animal avoidance, but even that is an infinitely complex problem. What size animal? How fast is it moving, and in what direction? How far away? Is there a lane free to move into or should I slam on the brakes? Etc. It’s just totally endless. Tesla says they had 300,000 lines of C++ code and it was still obviously very primitive. Again, it’s just not scalable to manually anticipate and handle every possible situation.
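To see how fast that branches, here's a hypothetical sketch of just the animal case; the thresholds and categories are made up.

```python
# Even one narrow sub-problem, "an animal is in the road," fans out into
# a tree of hand-written cases. Purely illustrative.
from collections import namedtuple

Animal = namedtuple("Animal", "size distance_m")

def handle_animal(animal, speed_mps, adjacent_lane_clear):
    time_to_impact = animal.distance_m / max(speed_mps, 0.1)
    if animal.size == "small":        # squirrel-sized: swerving is riskier than hitting it
        return "maintain"
    if time_to_impact > 4.0:
        return "coast_and_monitor"
    if adjacent_lane_clear:
        return "lane_change"
    if time_to_impact > 1.5:
        return "firm_brake"
    return "emergency_brake"
# ...and this still ignores the animal's direction and speed, the road
# surface, traffic behind you, and every species-specific quirk.
```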
So Tesla (and probably others soon) are now using end-to-end training. What this means is that you feed in the raw inputs (the camera info, the speed/direction/etc. of the car, the current control inputs, and so on), send them through a giant neural net, and predict the outputs. It’s trained using data gathered from human driving: given these inputs, did the human steer left or right, hit the brake, etc.?
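As a minimal sketch of that idea, assuming a simple behavior-cloning setup in PyTorch (Tesla hasn't published its architecture, so the shapes and names below are invented):

```python
import torch
import torch.nn as nn

# End-to-end sketch: raw sensor inputs in, control outputs out, trained
# to imitate what a human driver actually did in the same situation.

class DrivingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = nn.Sequential(             # stand-in for a much bigger net
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(3)             # [steering, throttle, brake]

    def forward(self, image, kinematics):
        feats = self.vision(image)
        return self.head(torch.cat([feats, kinematics], dim=1))

policy = DrivingPolicy()
image = torch.rand(8, 3, 240, 320)               # batch of camera frames
kinematics = torch.rand(8, 4)                    # speed, heading, current controls...
human_controls = torch.rand(8, 3)                # what the human actually did

# Behavior cloning: minimize the gap between predicted and human controls.
loss = nn.functional.mse_loss(policy(image, kinematics), human_controls)
loss.backward()
```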
How the neural net works internally is a mystery, just as the internals of ChatGPT are a mystery. You feed in a bunch of inputs and it gives you some outputs. In-between is… who knows? But it’s a lot of computation, that’s for sure.
But the advantage is that you can feed in thousands of recordings of people avoiding deer and it will figure out what a deer looks like and what humans do when they successfully avoid them. You don’t use recordings where an accident happened (although I wonder if they ever use negative training). If you find that it still hits deer, you feed in more clips, as diverse as possible, so that the neural net has the best possible grasp of what it’s looking at.
It’s much more scalable since there’s so little human curation. They can stick with recordings from top-quality drivers, which they determine based on lack of accidents and measures like drive smoothness. There’s probably some manual vetting, but much less than the image labeling task.
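A hypothetical version of that vetting step might look like the filter below. The field names and thresholds are made up; only the criteria (no accidents, smooth driving) come from the paragraph above.

```python
# Hypothetical filter for picking "good driver" clips to train on.

def is_training_worthy(clip):
    return (
        not clip["had_collision"]
        and clip["hard_braking_events"] == 0
        and clip["max_jerk_mps3"] < 2.0      # crude smoothness proxy
    )

clips = [
    {"had_collision": False, "hard_braking_events": 0, "max_jerk_mps3": 1.1},
    {"had_collision": False, "hard_braking_events": 2, "max_jerk_mps3": 3.4},
]
training_set = [c for c in clips if is_training_worthy(c)]   # keeps only the first clip
```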
And Tesla can also query the fleet when they need more recordings of particular things happening, whether that’s deer or ladders in the road. Since there are millions of Teslas on the road, they have immediate access to a huge dataset. There are so many that even very weird situations are happening to someone, somewhere.
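Conceptually, a fleet query could be as simple as pushing a trigger condition out to the cars and saving whatever clips match. This is a made-up sketch, not Tesla's actual telemetry interface.

```python
# Hypothetical fleet "campaign": any car whose perception output matches
# the trigger uploads the surrounding clip for training.

class ClipRecorder:
    def __init__(self):
        self.saved = []

    def save_clip(self, label, before_s, after_s):
        self.saved.append((label, before_s, after_s))

CAMPAIGN = {
    "name": "deer_near_road",
    "trigger": lambda frame: "deer" in frame["detected_classes"],
    "before_s": 10,
    "after_s": 5,
}

def on_frame(frame, recorder):
    if CAMPAIGN["trigger"](frame):
        recorder.save_clip(CAMPAIGN["name"], CAMPAIGN["before_s"], CAMPAIGN["after_s"])

rec = ClipRecorder()
on_frame({"detected_classes": {"deer", "road_edge"}}, rec)   # this frame gets uploaded
```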
All that said, since it is a black box, you can never be sure that it’ll detect some particular thing. Well, that’s a problem with humans and their meat brains as well. The only way to measure safety is to actually try it. But FSD will undoubtedly still hit some deer, even when they have nearly perfected things, since deer are pretty dumb aside from their sheer genius in committing suicide by car. So I’m sure we’ll see plenty more stories like this, but when you look at the statistics it’ll be vastly better than humans.
So why does the highway model lag behind the surface street model? It’s probably because the v11 highway model, despite its flaws, is still pretty good, since highways are relatively simple compared to surface streets: no pedestrians, gentle curves, no traffic lights or other signaling, and so on. The advantages of the end-to-end model are not as great since there’s just not as much stuff to deal with, the occasional deer notwithstanding. I also expect that compute performance plays a role: at higher speeds you need faster predictions, but the end-to-end model is more computationally intensive, so they probably need to optimize it further. The cameras only see so far, and the car has to react in the time that gives it; that’s much less of a problem at low speeds.
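A back-of-the-envelope calculation shows why latency bites harder at highway speed (all numbers below are rough assumptions, not Tesla specs):

```python
# Rough, illustrative numbers: distance covered per inference cycle and
# stopping distance at city vs. highway speed.

def metres_travelled(speed_mps, inference_latency_s):
    return speed_mps * inference_latency_s

def stopping_distance(speed_mps, decel_mps2=7.0):
    return speed_mps ** 2 / (2 * decel_mps2)

for label, speed in [("city, ~30 mph", 13.0), ("highway, ~70 mph", 31.0)]:
    blind = metres_travelled(speed, inference_latency_s=0.1)   # per 100 ms of model time
    stop = stopping_distance(speed)
    print(f"{label}: {blind:.1f} m per inference cycle, ~{stop:.0f} m to stop")
# At ~70 mph the car covers ~3 m per 100 ms of model latency and needs
# roughly 70 m just to stop, so a slower end-to-end model eats into a
# much tighter margin than it does at city speeds.
```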