Self-driving cars are still decades away

there was no evidence of Teslas ploughing through deer before this video either - YET IT DID… but now there is, and that is a problem … and if you keep sugarcoating it, it is still a problem, just a sugarcoated one

And that is the kind of problem they cannot afford right now, while trying to get people into Teslas without pedals and steering wheels (trusting your life to a robotaxi)

Tesla robotaxis are at least a couple of years out and will have new hardware and software well beyond what we have now. And they’ll start, like Waymo, in geofenced urban areas where they’re confident things will be safe, then slowly expand from there. No one is taking away your steering wheel tomorrow.

Yes there are. Search on YouTube.

In my experience it is because the V11 model does no (or almost no) detection of road obstructions. If an object is not a car, person, bicycle, or trash can it may as well be invisible. It does not see geese, dogs, potholes, boxes, branches, speed bumps, dips, &c.

I think V11 had no chance to do the right thing here, based on my experience of having to intervene every single time a flock of geese walks across the road in front of me.

I’ve not had V12 while the geese are here, but it does detect and handle speed bumps and dips in the road.

Cones show up pretty well, or at least they show up on the visualization. The path planning around them isn’t so hot, though. As for the rest, that fits with my experience as well, but it’s not something I have direct insight into.

I don’t think I really answered this properly. I think you’re asking about the distinction between the two types of model.

Previous versions of FSD combined a manually trained image feature detector with standard programming logic. For the image detector, basically you would get people to label images with the geometry and type of feature. If you see a truck, you draw a bounding box around the truck and tag it with the type. Same thing with humans or bicycles or whatever else. You draw lines corresponding to the road boundaries. And so on.

You feed that into the AI training system, asking it to predict the geometry given the provided images. Eventually it gets pretty good at reproducing the labels.
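To make that concrete, here’s a toy sketch of what that kind of labeled training looks like. It’s PyTorch-style code with made-up shapes and random stand-in data, nothing like Tesla’s actual pipeline; it just shows the general shape of “images plus human labels in, predicted class and geometry out”:

```python
# Toy sketch of training a labeled feature detector (NOT Tesla's code).
# The "dataset" here is random stand-in data; in reality it would be
# human-labeled camera images with bounding boxes and object types.
import torch
import torch.nn as nn

NUM_CLASSES = 4  # e.g. car, person, bicycle, trash can

class TinyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.class_head = nn.Linear(32, NUM_CLASSES)  # what kind of object
        self.box_head = nn.Linear(32, 4)              # x, y, width, height

    def forward(self, images):
        feats = self.backbone(images)
        return self.class_head(feats), self.box_head(feats)

model = TinyDetector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for the human-labeled data.
images = torch.randn(8, 3, 128, 128)
labels = torch.randint(0, NUM_CLASSES, (8,))
boxes = torch.rand(8, 4)

for step in range(100):
    class_logits, pred_boxes = model(images)
    loss = (nn.functional.cross_entropy(class_logits, labels)
            + nn.functional.l1_loss(pred_boxes, boxes))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Real detectors predict many boxes per image, of course; this just shows where the training signal comes from: the human-drawn labels.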

So the car then has these features available and has to act on them. You write code so the car stays between the lanes. You write code to brake if the car ahead is slowing, or accelerate if it’s open and you’re under the speed limit. You write code to change lanes if there are no obstacles in the destination lane. And so on.
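And here’s a caricature of that hand-written control layer sitting on top of the detections. Every name, threshold, and rule below is invented for illustration; the point is just that a human had to anticipate and code each case:

```python
# Caricature of hand-coded driving logic acting on detector outputs.
# All feature names, thresholds, and rules are invented for illustration.
from dataclasses import dataclass

@dataclass
class Perception:
    lane_offset_m: float        # how far we are from lane center
    lead_car_distance_m: float  # distance to the car ahead (large if none)
    lead_car_speed_mps: float   # speed of the car ahead
    own_speed_mps: float
    speed_limit_mps: float

def plan(p: Perception) -> dict:
    """Return steer/throttle/brake commands from hand-written rules."""
    cmd = {"steer": 0.0, "throttle": 0.0, "brake": 0.0}

    # Rule: steer back toward lane center.
    cmd["steer"] = -0.1 * p.lane_offset_m

    # Rule: brake if the car ahead is close and slower than us.
    if p.lead_car_distance_m < 30.0 and p.lead_car_speed_mps < p.own_speed_mps:
        cmd["brake"] = 0.5
    # Rule: otherwise accelerate if we're under the speed limit.
    elif p.own_speed_mps < p.speed_limit_mps:
        cmd["throttle"] = 0.3

    # ...and so on, one hand-written rule for every situation
    # someone thought to anticipate.
    return cmd
```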

The problem here is two-fold. One, the image detector is limited by how many labels you can come up with. But there are just so many things that could be in your way that your imagination will never be good enough. You add a ladder detector since those sometimes fall off trucks. A barrel detector. A deer detector. An ostrich detector. Whatever.

Furthermore, the labeling process is difficult and not very scalable. You can speed it up a little by having the computer do most of the predictions, and then the human just cleans it up and adds new stuff as necessary. But it’s error-prone, time-consuming work. You want to train on millions or billions of images, and that’s just not possible for humans.

The second problem is the code. You have to write something to handle every possible case. Some tiny portion of the code has to be dedicated to animal avoidance, but even that is an infinitely complex problem. What size animal? How fast is it moving, and in what direction? How far away? Is there a lane free to move into or should I slam on the brakes? Etc. It’s just totally endless. Tesla says they had 300,000 lines of C++ code and it was still obviously very primitive. Again, it’s just not scalable to manually anticipate and handle every possible situation.

So Tesla (and soon to be others, probably) are now using end-to-end training. What this means is that you feed in the raw inputs (the camera info, the speed/direction/etc. of the car, the current control inputs, and so on), send it through a giant neural net, and predict the outputs. It’s trained using data gathered from human driving: given these inputs, did the human steer left or right, hit the brake, etc.?
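In code terms, this is essentially behavior cloning. Here’s a toy sketch, again with an invented architecture and random stand-in data; the real network takes video from multiple cameras and is enormously larger, but the training signal has the same flavor: predict what the human driver did.

```python
# Toy end-to-end ("behavior cloning") sketch: raw inputs in, control
# outputs out, supervised by what the human driver actually did.
# Architecture and dimensions are invented for illustration.
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = nn.Sequential(            # camera frame -> features
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(              # features + car state -> controls
            nn.Linear(32 + 3, 64), nn.ReLU(),
            nn.Linear(64, 3),                   # steer, throttle, brake
        )

    def forward(self, frames, state):
        return self.head(torch.cat([self.vision(frames), state], dim=1))

policy = EndToEndPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Stand-ins for logged fleet data: a camera frame, the car's state
# (speed, yaw rate, current steering), and what the human driver did.
frames = torch.randn(16, 3, 128, 128)
state = torch.randn(16, 3)
human_controls = torch.randn(16, 3)

for step in range(100):
    loss = nn.functional.mse_loss(policy(frames, state), human_controls)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Notice there’s no “deer detector” or lane-keeping rule anywhere: whatever the net needs to know about deer, lanes, or anything else has to emerge from the training data itself.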

How the neural net works internally is a mystery, just as the internals of ChatGPT are a mystery. You feed in a bunch of inputs and it gives you some outputs. In between is… who knows? But it’s a lot of computation, that’s for sure.

But the advantage is that you can feed in thousands of recordings of people avoiding deer and it will figure out what a deer looks like and what humans do when they successfully avoid them. You don’t use recordings where an accident happened (although I wonder if they ever use negative training). If you find that it still hits deer, you feed in more clips, as diverse as possible, so that the neural net has the best possible grasp of what it’s looking at.

It’s much more scalable since there’s so little human curation. They can stick with recordings from top-quality drivers, which they determine based on lack of accidents and measures like drive smoothness. There’s probably some manual vetting, but much less than the image labeling task.

And Tesla can also query the fleet when they need more recordings of particular things happening, whether that’s deer or ladders in the road. Since there are millions of Teslas on the road, they have immediate access to a huge dataset. There are so many that even very weird situations are happening to someone, somewhere.

All that said, since it is a black box, you can never be sure that it’ll detect some particular thing. Well, that’s a problem with humans and their meat brains as well. The only way to measure safety is to actually try it. But FSD will undoubtedly still hit some deer, even when they have nearly perfected things, since deer are pretty dumb aside from their sheer genius in committing suicide by car. So I’m sure we’ll see plenty more stories like this, but when you look at the statistics it’ll be vastly better than humans.

So why does the highway model lag behind the surface street model? It’s probably because the v11 highway model, despite its flaws, is still pretty good, since highways are relatively simple compared to surface streets. No pedestrians, gentle curves, no traffic lights or other signaling, and so on. The advantages of the end-to-end model are not as great since there’s just not as much stuff, the occasional deer notwithstanding. I also expect that computer performance plays a role: at higher speeds you need faster predictions, but the end-to-end model is more computationally intensive, so they need to optimize it better or find some other workaround. The cameras only see so far away and the car needs to react in the time it has. That’s less of a problem at low speeds.

I’ll add that end-to-end training does require immense computing power. Tesla has some giant fans:
[Imgur image: Tesla’s gigantic GPU cooler]

It’s quite literally a gigantic GPU cooler, as they plan on having (may already have; not sure) 50,000 NVIDIA GPUs doing the training.

We’re a little behind the human brain in power efficiency… but at least it only has to be done once per software release. Once the neural net is trained, actually using it is a relatively low-power operation.

Switching to the decidedly lower-tech driver assist (“smart” cruise control with active lane-keeping) in my BMW, about which I’ve posted a few times. …

Yesterday was my first long drive with it: 4 hours from northern Miami down to Key West. About 90 minutes of busy urban/suburban freeway, then 2-1/2 hours of sorta-rural / small-town highway, one or two lanes each way, with 35, 45, or 55 mph speed limits.

My overall reaction is it made the tedious second low-speed highway phase of the trip much less tedious. It was quite good at driving stably for many minutes at a time when not interrupted every half mile by the ubiquitous intersections and traffic lights in my usual suburban area. It handled cars ahead slowing to turn or pulling out in front of us very reasonably.


The freeway part was interesting in a different way. I’d already noticed this effect over the last 3-4 weeks, but this is the first I’ve reported on it here. To wit:

We have lots of aggressive drivers here, and sometimes I’m one of them. Traffic was dense enough, and had enough trucks, that you’d alternate between going 50 and 80 as you were in a pack or between packs, and as the number of available lanes changed between 4 and 8 here and there.

Having the sorta self-driving feature turned on really reduces my tendency to get aggressive. The car drives dumbly: just stay in your lane and follow the car ahead at whatever speed. Which is the opposite of the opportunistic “change lanes every 30 seconds to hopscotch ahead of one slowpoke after another” method I often employ.

In terms of net exposure to accidents, I’m sure it drives more safely than hopscotching me does, but maybe less safely than follow-the-leader me does. The mental / emotional challenge is to avoid getting annoyed at the car ahead that’s clearly driving too slow for the lane it’s in. What I did find (and found surprising) was that when I got annoyed at a slowpoke, I also had a countervailing “Damn, you mean I have to do this pass manually? Too much work; I’ll just veg here.”

It’s amazing how quickly one gets used to HAL, even a decidedly stupid and listless HAL, doing all the heavy lifting.

thx, quality post …!

What’s your opinion on re-feeding the same data set into the neural system (NS)? E.g. you start out with an NS at a 1 (on an imaginary 1-10 smartness scale) … and you feed it the deer data … as time goes on and more data is fed into the system, the NS should move up the smartness scale, right?

So would the NS benefit from being fed the deer data it was fed when it was a 1 … and then being fed the same data when the NS is a solid 6? I assume (but don’t know) that the 6 NS would draw slightly different (and arguably smarter) conclusions from the data than the 1 system did.

… any educated guesses? (problem is obv. not limited to deer or Tesla)

Are you suggesting that a deer standing in the road is an edge case? That’s something that’s happened to me several times; I consider it a normal, if low-probability, event.

This is true. Musk does not release credible data, however, so who the hell knows.

Naw, there’s a strong “nudge nudge, wink, wink” suggestion that you don’t need to pay attention, because the car is doing that for you. Odds are that a driver in a fully manual car would have seen the deer.

That’s not a lot of time for a human, but it should be plenty of time for a computer to attempt to brake and/or swerve. Hell, I think I’ve avoided collisions with 2 seconds of warning.

I am by no means a specialist in this area, but I have read quite a few posts/articles that make me believe that FSD (of all makes) is quite a bit slower than an average attentive driver at making decisions and taking remedial action. I strongly believe the seat-of-the-pants “heuristics” of an experienced driver have a lot to do with it … (e.g. seeing two road-rage drivers butting heads, I stay WAY more clear than I normally would).

So - short version: NO, it seems that the 2024 and older cars are not, in practice, faster at reacting …

There is a big difference between what Tesla has always said, and what Musk has said, but even he has been dialing it back a bit, particularly with the recent addition of “(Supervised)” to the name.

The reality is that current driver attention monitors allow more than enough latitude to hit a deer. This is going to be the case for every single one of the systems, regardless of which company or what technology they are based on. Two or three seconds of diverted attention is not long enough for the attention monitor to alert the driver.

Even an attentive driver, whether using automated driving aids or not, will look away from the road for brief periods of time for any number of reasons, including things completely related to driving, such as mirror checks. You probably shouldn’t be spending 3 seconds looking at your mirrors, but a map or the radio could easily take 1-3 seconds of attention.

I’m not trying to defend FSD. It should take action for any object in the road.

This is going to depend on circumstances. Some of it is just judgement. Someone trying to drive smoothly will probably start slowing down sooner than FSD starts slowing, but I don’t think it’s that FSD is taking a long time to process, but rather that it just decides to brake later. No different than riding in the car with someone who zooms up to red lights instead of coasting in.

FSD responds instantly to many things, like a light turning green.

The very few times I’ve experienced a Tesla take evasive action the car responded extremely quickly. In those very few times I do not know if I swerved/braked or if the car did it on its own, or if we did it simultaneously. All I know is we both saw the car coming into our lane, and action was taken to avoid a collision.

And there’s the main point. This is 100% on the driver. They have a brake pedal and a steering wheel that’s perfectly functional and they should have been attentive, in particular on a country road with deer in the area. If they didn’t have time to react, neither did FSD.

No, just this particular example. Sort of a weird angle for the deer, facing directly away from the car. Usually you’d see more of a cross-section.

As echoreply said, it’s possible this version of the highway model just doesn’t see deer at all. I have no real insight there. But the end-to-end model will undoubtedly incorporate them, and if it still misses some cases then they’ll add more training samples until it does see them reliably.

It’s using eye tracking, so (unless they did something to disable it) the driver would have to be looking at the road without paying attention to it. And sure, that’s possible, but you can’t just be messing with your phone or whatever while the car is driving.

I’m not denying that FSD “should” have reacted. Just that this particular incident is essentially meaningless, and conveys zero insight into the future of self-driving, whether LIDAR is useful or not, or anything else.

That’s basically how neural net training works. Usually divided into “epochs”, where each training sample is fed in once within an epoch, but then you can run multiple epochs.

Since the neural net weights are initialized with pure noise, it’s completely dumb on the first samples, and most likely almost all the information in the samples is lost. But slowly it gets a little less dumb, and it can start to recognize features in the samples. The next time you feed in the training data, it can actually extract more information, since it has some grasp of what’s going on in the image/text/etc. And then on the next epoch, it can extract even more subtle details, and so on.

So yeah, basically what you describe is how things are done. The net’s ability to learn from the samples is itself related to how “smart” it is at the time. It runs into a limit eventually where more epochs don’t help (or can make things slightly worse), and the optimal number depends on the problem, but it’s likely in the range of 10 to 1000.
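A toy training loop shows the “same data, many passes” idea. Model, data, and epoch count are placeholders; the only point is that every epoch re-feeds the identical dataset, and the net keeps extracting a bit more from it until it plateaus:

```python
# Sketch of multi-epoch training: the same dataset (think "the deer
# clips") is fed through the network repeatedly, and later passes
# extract more than the first did because the weights are better.
# Pure illustration; the model and data are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

inputs = torch.randn(256, 10)   # stand-in for the training samples
targets = torch.randn(256, 1)   # stand-in for the labels
loader = DataLoader(TensorDataset(inputs, targets), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):                 # same data, many passes
    epoch_loss = 0.0
    for x, y in loader:                 # every sample seen once per epoch
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        epoch_loss += loss.item()
    # The loss typically falls for a while, then flattens (or creeps
    # back up from overfitting) -- the point where more epochs stop helping.
    print(f"epoch {epoch}: loss {epoch_loss / len(loader):.4f}")
```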

As I posted above, I think my car hit the brakes before I did and helped me avoid hitting a deer, and it was VERY close. I think if it had relied on my reaction time, the deer would have been struck. I know this is anecdotal, but that was my experience.

I’d be surprised if Tesla or anyone else was training models from “pure noise” any more. Most production vision models start from a pre-trained vision model that is then specialized through transfer learning (or something similar), which at least allows it to skip over the “random noise to seeing basic shapes” phase of training.

The task in xkcd: Tasks isn’t so big of a deal any more.
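For concreteness, here’s roughly what starting from a pre-trained backbone looks like, assuming a recent torchvision; the class count and the choice to freeze the early layers are just for illustration:

```python
# Minimal transfer-learning sketch: start from an ImageNet-pretrained
# ResNet instead of random weights, swap the classification head, and
# fine-tune on the new task.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # whatever the new task needs

# Weights come pre-trained, so the "random noise to basic shapes"
# phase has already been paid for.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Optionally freeze the early layers and only train the new head.
for param in backbone.parameters():
    param.requires_grad = False

backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)  # new head

# From here, train as usual on the specialized dataset; only the new
# head (and any layers you unfreeze) gets updated.
```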

Could be. You’re certainly right that most specialized models are fine-tuned from pre-trained models like ResNet. That’s certainly true of their pre-end-to-end vision model. Still, those models started with random noise.

I’m not sure about end-to-end. It’s not just a few other models smushed together. There isn’t necessarily anything they could start from. And they have such immense computing resources available that they can afford to start from scratch.

Tesla gives an FSD roadmap. Take with the usual serving of salt:

As October comes to a close, here’s an update on the releases

What we completed:

  • End-to-end on highway has shipped to ~50k customers with v12.5.6.1
  • Cybertruck build that improves responsiveness
  • Successful We, Robot event with 50 autonomous Teslas safely transporting over 2,000 passengers

What’s coming next:

  • Full rollout of end-to-end highway driving to all AI4 users, targeted for early next week, including enhancements in stop smoothness, less annoying bad weather notifications, and other safety improvements
  • Improved v12.5.x models for AI3 city driving
  • Actually Smart Summon release to Europe, China and other regions of the world
  • v13 is a package of the following major technology upgrades:
      • 36 Hz, full-resolution AI4 video inputs
      • Native AI4 inputs and neural network architectures
      • 3x model size scaling
      • 3x model context length scaling
      • 4.2x data scaling
      • 5x training compute scaling (enabled by the Cortex training cluster)
      • Much improved reward predictions for collision avoidance, following traffic controls, navigation, etc.
      • Efficient representation of maps and navigation inputs
      • Audio inputs for better handling of emergency vehicles
      • Redesigned controller for smoother, more accurate tracking
      • Integrated unpark, reverse, and park capabilities
      • Support for destination options including pulling over, parking in a spot, driveway, or garage
      • Improved camera cleaning and handling of camera occlusions

We have integrated several of these improvements and are already seeing a 4x increase in miles between necessary interventions compared to v12.5.4.
This lays the foundation for the v13 series, and we are targeting to ship v13.0 to internal customers by the end of this week.
Most of the remaining items are independently validated and will be integrated over November in a series of point releases.

We are targeting a wide release with v13.3 with most of the above improvements for AI4 vehicles around Thanksgiving!

Unfortunately for us HW3 people, while they are still pushing out updates, it sounds like HW4/AI4 is going to be accelerating away from us.

Musk did say that HW3 units will be upgraded if necessary. Contrary to various sources I’ve seen, he did not say upgraded to HW4. I suspect it’ll actually be to HW5. HW4 is too power intensive in its current form. However, HW5 on an updated process node could fit into the HW3 power envelope (possibly downclocked).

They’ll probably wait as long as possible to be sure that the specs are good enough and to lower their own costs (every car used for a trade-in can be sold without FSD, and then they won’t have to upgrade).

“Audio inputs” is a neat feature. I wonder which microphone they’re using. An external one I don’t know about, or the internal one used for phone/voice commands?

so successful, they lost 8% of their value within 24 hours? …

I hope the remainder of the update isn’t equally “elastic” concerning reality.