AI (or machine learning) is a lot like the proverbial hundred monkeys at typewriters trying to produce the works of Shakespeare. It is created primarily through sheer brute force.
When I was a kid I worked as a cashier at a hamburger place. It gets boring after a while, so I would start to guess the customer’s order before they gave it – going so far as to start filling out the order sheet. We didn’t have a big menu, but there were some unique decisions: 1/3 or 1/2 pound burger, cooking temperature, cheese, fries, drink, etc. It was surprising how often I was right. I’m sure that if I had measured my overall accuracy, I wasn’t as good as I thought I was.
I was basing my guesses on a bunch of factors like age, gender, whether it was lunch or dinner, whether they were dressed for work, etc. I might have been able to make better guesses if I had started tracking many more factors, writing them all down, and updating them. Things like the weather outside, the day of the week, the color of their shoes, etc. And if I did this for 1000 years, I might get pretty good at it – to the extent that it is a predictable problem.
This is an example of one common type of AI done manually by a human. If you had a huge spreadsheet where each row represents a single customer and their order (all of their factors and what they ordered), you could build an AI to guess orders. You could make your AI better by having more rows of examples.
You could also try to make it better by having more factors in the columns. Maybe you record their hair length, belt buckle size, number of rings, or how busy the restaurant was when they ordered. Crazy factors that might not have any obvious association with their order. Some of the ‘magic’ of AI is that it will tease out correlations even in factors that seem unrelated. Maybe people tend to order fries more when the restaurant is busy, because they don’t want to wait in line again if they are still hungry.
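To make the spreadsheet idea concrete, here is a minimal sketch of guessing an order from past rows. The menu items, factor names, and data are all made up for illustration; the "model" is just a k-nearest-neighbors vote (count shared factors, let the most similar past customers vote), which is one of the simplest ways to turn a table of examples into guesses.

```python
from collections import Counter

# Hypothetical "spreadsheet": each row is one past customer's
# observable factors plus what they actually ordered.
rows = [
    # (age_group, meal,    dressed_for_work, busy) -> order
    (("adult", "lunch",  True,  True),  "1/2 lb burger + fries"),
    (("adult", "lunch",  True,  False), "1/2 lb burger"),
    (("teen",  "dinner", False, True),  "1/3 lb burger + fries"),
    (("teen",  "dinner", False, False), "1/3 lb burger"),
    (("adult", "dinner", False, True),  "1/2 lb burger + fries"),
]

def predict(factors, k=3):
    """Guess an order: rank past customers by how many factors they
    share with this one, then take a majority vote among the k most
    similar rows."""
    scored = sorted(
        rows,
        key=lambda r: sum(a == b for a, b in zip(r[0], factors)),
        reverse=True,
    )
    votes = Counter(order for _, order in scored[:k])
    return votes.most_common(1)[0][0]

# A busy weekday lunch customer dressed for work:
print(predict(("adult", "lunch", True, True)))  # → 1/2 lb burger + fries
```

Notice that in this toy data every "busy" row includes fries – exactly the kind of correlation a real model would pick up on its own, without anyone telling it why.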
Exactly, data is everything in AI. Garbage in, garbage out.
For the hamburger example above, this means that the factors and the orders have to be labelled correctly. If 5% of the data is mislabeled, it’s going to be hard to get better than 95% accuracy. It also means that the data has to represent the entire problem you are trying to solve. If you want to guess people’s orders 7 days a week, but you don’t collect data on weekends, then it might not guess well on weekends. Similarly, if you collect data from a single store and it has some outliers, the guesses will suffer – perhaps at this location a country squire sends his butler, dressed in coat and tails, in every day to order 25 salads. It’s unlikely this happens in general, so the guesses might suffer at other stores.
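The 5% claim is easy to see with a quick simulation. This is a hypothetical setup, not real data: imagine a perfectly predictable customer whose true order is always fries, but 5% of the time the cashier writes the order sheet down wrong. Even a model that guesses the true order every single time can’t score above roughly 95% when measured against those noisy labels.

```python
import random

random.seed(0)

# Hypothetical: the customer truly always orders fries, but 5% of
# the recorded labels are written down wrong ("salad").
true_order = "fries"
n = 10_000
recorded = [
    true_order if random.random() > 0.05 else "salad"
    for _ in range(n)
]

# A perfect model always guesses "fries", yet its measured accuracy
# is capped by the mislabeled sheets.
accuracy = sum(label == true_order for label in recorded) / n
print(f"measured accuracy: {accuracy:.1%}")  # roughly 95%
```

In practice the ceiling can be even lower, since a model trained on mislabeled rows may learn the wrong pattern rather than merely being graded unfairly.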
In the context of self-driving it might be: Do you have enough examples of different road conditions, under different lighting conditions, at different times of day, with different configurations of pedestrians, cars, strollers, scooters, trash, graffiti, etc.? Are these examples labelled correctly? Are all of the strollers identified as pedestrians, or were some of them labelled as cars? Were different areas over-sampled or under-sampled, so that the AI is better in some areas and worse in others?
Self-driving is actually an entire orchestra of AIs working together, and what constitutes good or bad data depends on the type of problem each one solves. An AI that is guessing where the other cars will be in 10 milliseconds is going to have different needs than an AI that is guessing where the boundaries of the lane are.