Labeling is a big problem. AI companies are working on ways to make it easier.
Most self-driving systems, Tesla’s current version included, have as a first stage a system that converts raw (or close to raw) imagery into a high-level description of what it sees. There is a small car at these coordinates, a semi truck at these other coordinates, road lines here, a speed limit sign there, and so on.
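To make that concrete, here’s a rough sketch (Python; the fields and numbers are my own invention, not Tesla’s actual format) of what that kind of structured scene description might look like:

```python
# A rough sketch of a structured scene description: one record per detected object.
# Field names and values are invented for illustration.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # e.g. "car", "semi_truck", "speed_limit_sign", "lane_line"
    x: float           # position in the vehicle's coordinate frame (meters ahead)
    y: float           # meters to the left (+) or right (-)
    confidence: float  # detector's confidence, 0..1

scene = [
    Detection("car", 12.4, -1.8, 0.97),
    Detection("semi_truck", 41.0, 3.2, 0.91),
    Detection("speed_limit_sign", 55.5, 6.0, 0.88),
]
```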
To train the net, a human first has to do the labeling manually: viewing the image, drawing boxes around things, and describing them (in terms the computer can understand). This is then fed into the net so that it can learn to reproduce the same description.
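In neural-net terms that’s ordinary supervised training: the human’s boxes are the target, and the weights get nudged until the net’s output matches them. A toy single-step version, assuming PyTorch (the tiny model and one-box label are placeholders, nothing like a real detector):

```python
# Toy supervised step, assuming PyTorch: the human-drawn box is the target,
# and one gradient step nudges the weights toward reproducing it.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 4))  # predicts one box (x, y, w, h)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

image = torch.rand(1, 3, 64, 64)                      # stand-in for a camera frame
human_box = torch.tensor([[0.40, 0.55, 0.10, 0.20]])  # box the human drew (normalized)

prediction = model(image)
loss = nn.functional.mse_loss(prediction, human_box)  # distance from the human's label
optimizer.zero_grad()
loss.backward()
optimizer.step()
```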
That manual labeling is incredibly tedious and error-prone work, so they’ve improved the approach in one way: once the AI is partially trained, it can give you a best guess at what it sees. The human then only has to fix up any incorrect labels, and the corrected data is fed back into the system. This reduces the workload substantially.
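The workflow looks roughly like this; every name here (`model.predict`, `human_review`, the list-of-tuples training set) is a hypothetical placeholder, just to show the shape of the loop:

```python
# Hypothetical sketch of model-assisted labeling: the net pre-labels, the human
# corrects, and corrected examples flow back into the training set.
# `model.predict` and `human_review` are placeholders, not a real API.
def label_with_assistance(model, unlabeled_images, human_review, training_set):
    for image in unlabeled_images:
        guess = model.predict(image)             # net's best guess at the labels
        corrected = human_review(image, guess)   # human fixes anything wrong
        if corrected != guess:                   # only corrections add new information
            training_set.append((image, corrected))
    return training_set
```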
But it’s still a lot of work, and worse, it’s not what you really want anyway. The labels only contain things that humans already identified as important, and that’s not necessarily everything that matters. As said above, really we want the computer to learn from patterns that we aren’t even perceiving. And even aside from that, the labels are pretty low-fidelity; they’re just not all the data you’d want.
So Tesla’s latest (unreleased) system is end-to-end: there’s no intermediate labeling step; instead it takes video input and outputs the vehicle controls (steering, throttle, brake, etc.). Somewhere deep inside the AI it must have something like those labels, since it’s still distinguishing between different elements of the scene, else it wouldn’t work at all, but it’s difficult to know what exactly it’s doing. And no human is involved in that step.
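The interface, at least, is easy to sketch; the toy architecture below (assuming PyTorch) is just a placeholder to show video-in, controls-out:

```python
# Crude placeholder for an end-to-end interface, assuming PyTorch:
# a short clip of video frames in, control values out. Not Tesla's architecture.
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    def __init__(self, frames=8):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for a real video encoder
            nn.Flatten(), nn.Linear(frames * 3 * 64 * 64, 256), nn.ReLU()
        )
        self.controls = nn.Linear(256, 3)        # steering, throttle, brake

    def forward(self, clip):                     # clip: (batch, frames * channels, H, W)
        return self.controls(self.encoder(clip))

clip = torch.rand(1, 8 * 3, 64, 64)              # 8 frames of 64x64 RGB video
steering, throttle, brake = EndToEndDriver()(clip).squeeze(0)
```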
The only labeling, as it were, is that they have millions of examples of how humans behaved in the same circumstances. One funny consequence that came up: humans apparently come to a complete stop at stop signs less than 0.5% of the time. “California stops” are near universal. Their FSD system learned to behave the same way, so they had to explicitly feed it extra examples of people coming to complete stops to train the full stop back in.
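I don’t know exactly how they did that, but the usual trick is to oversample the rare behavior; something like this sketch, where the field name and boost factor are made up:

```python
# Hypothetical fix for the "California stop" bias: oversample the rare full-stop
# examples so the net sees them often enough. The `came_to_full_stop` field and
# the boost factor are invented; this is just the shape of the idea.
import random

def rebalance(examples, boost=50):
    boosted = []
    for example in examples:
        boosted.append(example)
        if example["came_to_full_stop"]:     # the rare behavior we want reinforced
            boosted.extend([example] * boost)
    random.shuffle(boosted)
    return boosted
```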
At a very high level, DPRK is right; these systems are “universal approximators”. There is some function we want computed, which takes some input (whether video, text, or otherwise), churns on it, and produces some desired output. We give the net examples of the input/output pairs we want, and the internal weights get adjusted to fit those examples.
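You can see the whole idea in miniature, assuming PyTorch: pick an arbitrary function, hand the net example input/output pairs, and let gradient descent adjust the weights until they fit:

```python
# "Universal approximation" in miniature, assuming PyTorch: show the net
# input/output examples (here y = sin(x), chosen arbitrarily) and let gradient
# descent adjust the weights until it fits them.
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 200).unsqueeze(1)   # example inputs
y = torch.sin(x)                              # desired outputs

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

for _ in range(2000):                         # adjust weights to fit the examples
    loss = nn.functional.mse_loss(net(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(loss.item())                            # should be close to zero by now
```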
What’s remarkable, though, is how it generalizes. It wouldn’t be surprising if it could reproduce the training set exactly; you could do that with a lookup table, in principle. But somehow it usually manages to do “the right thing” even with novel inputs. It suggests, a little distressingly, that all of human intellect and creativity is not much different from interpolating points on a curve, just a “dumb” process of filling in the blanks.
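A toy way to see the contrast (plain numpy, invented numbers):

```python
# Toy contrast: a lookup table nails the training pairs but draws a blank on a
# novel input, while even a simple fitted line "fills in the blank".
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2 * xs + 1                               # training pairs (x, y)

table = dict(zip(xs.tolist(), ys.tolist()))   # memorizes the examples exactly
slope, intercept = np.polyfit(xs, ys, 1)      # fits a line through them

novel_x = 1.5
print(table.get(novel_x, "no entry"))         # -> no entry
print(slope * novel_x + intercept)            # -> 4.0, a sensible guess
```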