Looking at your 3 bullet points I suppose we can collapse 1 & 3 to just “Artificial intelligence is no match for natural stupidity”. 
I’ll see what I can find. It has been mentioned in a few papers that I’ve seen in the past year or two, and has been a topic of conversation at a couple of conferences I’ve attended as well. It has mainly been in the imagery context. Certainly, it is not the case that all neural networks fail, just that there have been observed issues with high-accuracy algorithms (especially neural networks) not doing very well when transitioned to a real-world application. I know one paper was in the British Journal of Medicine talking about medical imagery so I should be able to find it.
Until I see the cites, my hunch is you’re talking about the problem of overfitting. If so, that’s not a problem with neural networks, that’s a problem with the people training that particular model (well the methods used to train and test, not the people themselves, but the people are responsible for the methods).
[quote=“BeepKillBeep, post:43, topic:658287, full:true”]I know one paper was in the British Journal of Medicine talking about medical imagery so I should be able to find it.
[/quote]
I agree with DMC that overfitting and generalization error are common failure modes of all ML models, not just neural networks, but the latest generation of neural networks tends to have so many coefficients that if you don’t have ludicrous amounts of training data, the network can basically “memorize the training/eval set” instead of learning general features, unless one is very careful.
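To make “very careful” a little more concrete, here is a minimal sketch (scikit-learn, with synthetic stand-in data and arbitrary numbers) of one standard guardrail: hold out a validation slice and stop training when the held-out score stops improving, rather than letting the network keep memorizing.

```python
# Sketch only: catch "memorizing the training set" by watching a held-out
# validation score and stopping when it stalls (early stopping).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; real image data would be far larger.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(256, 256),
                      early_stopping=True,       # carve off 10% as a validation set
                      validation_fraction=0.1,
                      n_iter_no_change=10,       # stop after 10 stagnant epochs
                      random_state=0)
model.fit(X, y)
print("stopped after", model.n_iter_, "epochs;",
      "best validation score:", model.best_validation_score_)
```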
I imagine specifically for medical imagery, there isn’t a whole lot of training data available – as far as these models are concerned, millions of images is “not a whole lot”.
Transfer learning is a neat set of techniques that lets models be trained in a domain without much data by starting from another domain where there is a lot of data.
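Roughly what that looks like in practice, sketched with PyTorch/torchvision; the `medical_loader` DataLoader and the two-class head are hypothetical stand-ins, not anyone’s actual pipeline:

```python
# Transfer learning sketch: reuse a network pretrained on a data-rich domain
# (ImageNet) and fine-tune only a small head on the data-poor domain.
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained feature extractor so the small medical dataset
# only has to train the final classification layer.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new 2-class head (e.g. cancer / no cancer).
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training pass over the (hypothetical) small labeled dataset.
for images, labels in medical_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```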
It isn’t an overfitting issue. Or at least not exclusively overfitting.
IANA anybody about AI. I’m very curious, but that’s it.
I recall reading an interesting article a few months ago about problems with AI for reading medical imagery. I just spent about 20 minutes hunting for the article, but failed. It was legit stuff in a legit source that wasn’t Newsweek or USA Today.
Anyhow, after good success with training data, they found that when they transitioned this app to live practice it failed badly. They eventually traced the problem to spurious inputs. The training was done with images from two institutions, one of which dealt with much sicker populations than the other. It’s apparently common that medical imagery includes a sort of “bug” or logo or watermark, a human-readable visible marker of which facility or piece of equipment generated the image.
The AI was looking at that bug and using it as a proxy for severity of disease. Which meant that at every institution in the USA other than the one sicker training institution, it was underestimating the severity of disease, since that very important “marker” of severe disease was absent. Not good.
This is indeed an issue, but with the training and testing, not with neural networks themselves. There are many issues with improperly trained networks, just like there are issues with improperly designed scientific experiments. When someone claims cold fusion is reproducible, we blame the experiment, not the scientific method.
Since you are curious: imaging with neural networks (I’m not an expert in medical imaging, but I’m pretty sure it’s the same) is typically done by training the model with a portion of your data, while holding back a significant portion for testing purposes. In most cases, this is done via supervised learning. Supervised learning means that the model knows the “target” or outcome for each image it is given, thus the images fed to a model looking for lung cancer would potentially be labeled “True” or “False” (more likely 1 or 0, but the principle is the same). So you show it a million images with no lung cancer and a million images with lung cancer and tell it which is which. It then looks for patterns in the string of pixels that are fed to the model that seem to indicate one or the other. There is a shitton of stuff going on, but that’s the simple version (if you want to know more, start with perceptrons, which should help you understand what is really happening).

The next part, testing, is vital for a successful implementation. You basically run your model against the images you held back, but this time you don’t share the target value with the model. Then you compare its predictions with the actual target values and measure the results.
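If it helps, here is roughly what that hold-out workflow looks like in code (scikit-learn, with random stand-in arrays in place of real scans):

```python
# Hold-out workflow sketch: train on one slice of labeled data, test on the rest.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-ins for real data: flattened pixel arrays and 1/0 "cancer"/"no cancer" targets.
rng = np.random.default_rng(0)
images = rng.random((2000, 256))
labels = rng.integers(0, 2, size=2000)

# Hold back a portion of the labeled data purely for testing.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42)

# Supervised training: the model sees both the pixels and the known targets.
model = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200)
model.fit(X_train, y_train)

# Testing: predict on the held-back images without showing their targets,
# then compare predictions against the true labels.
predictions = model.predict(X_test)
print("held-out accuracy:", accuracy_score(y_test, predictions))
```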
Based on your description of the example, the model had determined that the watermark pixels were highly predictive of the seriousness of the illness, so any image that had a different watermark from the institution with the sicker population was going to generate a lot of false negatives, as it was giving far more weight to the watermark than to the spot on the image that would actually be predictive. I’ll be honest: I consider this particular error inexcusable, unless the watermarks are so well hidden that no one would see them with the naked eye, but without the details in the paper, I’m not ready to burn them at the stake.
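For what it’s worth, a sanity check along these lines is cheap once you suspect the problem. This is a hypothetical sketch, not what the researchers actually did:

```python
# Hypothetical sanity check for the watermark failure described above: blank out
# the corner where the institutional logo sits and see whether predictions move.
import numpy as np

def mask_corner(image, size=64):
    """Zero out the (assumed) top-left corner where the watermark lives."""
    masked = image.copy()
    masked[:size, :size] = 0
    return masked

def watermark_sensitivity(model, images):
    """Fraction of cases where hiding the logo changes the model's prediction."""
    original = model.predict(images)
    masked = model.predict(np.stack([mask_corner(img) for img in images]))
    return np.mean(original != masked)  # anything well above zero is a red flag
```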
Thanks for the clear explanation. That was about my level of understanding of neural networks in general.
The general challenge is that it is extremely difficult to identify all “artifacts” that must be eliminated from training data. The watermark at least had the advantage that it was human-visible. And I have no clue how the debug team figured out that was the issue. What about other artifacts like color calibration or angular alignment? Or even operator habits: e.g. at facility 1 they tend to take a more expansive view of, say, the heart to include a little more of the surrounding tissue, vs. facility 2 that frames their pix more tightly.
Any difference can be significant to a free-flow learning model. I am reminded of another problem in non-AI software development. A time-tested proverb holds that “Any externally observable behavior of your API quickly becomes a part of your API contract, documented or otherwise.”
IOW, you can write a detailed spec & documentation to say your API does exactly this. But the clever devs using your API will of course want it to do a bit more and they will reason from what they observe about your implementation details and then take a dependency on those details. The classic example being something that returns a collection of objects / records / whatever. Per the documentation, the sequence is unspecified. But devs discover the results always come back in database index order, or chronological order or whatever. At least some of them will write their code on the assumption that the observed results order will always be true for all future versions of your API. If you later change that, you will break a lot of code. This need to maintain complete backwards compatibility, bugs and all, is the bane of all widely adopted long-lasting platforms & APIs.
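A toy version of that trap, with made-up names:

```python
# The docs say get_records() returns records in no particular order, but today's
# implementation happens to preserve insertion order.
_records = {"r1": "alpha", "r2": "beta", "r3": "gamma"}

def get_records():
    return list(_records.values())  # insertion order today; unspecified tomorrow

# A clever caller quietly takes a dependency on that accident:
latest = get_records()[-1]  # silently breaks if a future version sorts or shuffles
print(latest)
```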
The AI is taking dependencies on gosh-knows-what. It’s taking dependencies on whatever aspects of the data it can perceive. Whether those aspects are “part of the API”, a “leaking abstraction”, an “implementation detail”, or worst of all “spurious noise” of the real world.
Here’s an interesting article on point:
Be sure to flip through the short slide show of stop signs; the second & third images are key.
The point is that you and I recognize a stop sign as a red octagon with a white border, and the white letters “STOP” in the center. We have a semantic framework and a syntax framework to evaluate these things versus all the visual noise in the environment. And those syntax features (red, octagonal, ~6 feet off the ground (usually), etc.) are deliberately chosen by the designers to be distinctive = easy to “parse” from the visual clutter.
Meanwhile, the AI scene recognizer is looking at gosh-knows-what to decide what’s a stop sign. And most significantly, the evidence shows the recognizers are NOT paying attention to redness, white-border-ness, octagonality, or the presence of the letterforms S, T, O, & P arranged horizontally in sequence in block uppercase filling ~80% of the width of the octagon. As a result, there is no assurance that other unrelated changes in the environment won’t result in objects falsely identified as “stop signs” appearing on the freeway, or actual stop signs being unrecognized.
IMO as a total noob at AI, but a long time dev, if you can’t force the system to identify redness, identify octagonality, identify the letterforms, and build up its recognition confidence based on those specific factors and only those specific factors, you’re not going to have a reliable system in the engineering sense of the word “reliable”. Identifying things in nature, such as cancer or weather signals or seismic risk factors or … is a lot harder than identifying stop signs that are deliberately designed by humans to be distinctive and easy to recognize. “Easy”, that is, as long as you look at the correct designed-in recognition features almost entirely and fall back to ancillary = coincidental characteristics only rarely as tie-breakers.
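To sketch what I mean (OpenCV, with made-up thresholds; purely illustrative, not a production detector), you could gate any “stop sign” verdict on redness and rough octagonality before trusting anything else:

```python
# Sketch: require the designed-in features (red color, eight-sided outline)
# before a candidate region is even considered a stop sign. OpenCV 4.x assumed.
import cv2
import numpy as np

def looks_like_stop_sign(bgr_image):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so combine two hue ranges.
    red = cv2.inRange(hsv, (0, 100, 80), (10, 255, 255)) | \
          cv2.inRange(hsv, (170, 100, 80), (180, 255, 255))
    contours, _ = cv2.findContours(red, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        approx = cv2.approxPolyDP(contour, 0.02 * cv2.arcLength(contour, True), True)
        # Require a reasonably large red blob whose simplified outline has eight sides.
        if len(approx) == 8 and cv2.contourArea(contour) > 1000:
            return True
    return False
```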
I’m in a conference this week, so I had some time between talks to try to find some of the papers I’ve been talking about. Sadly, I was not able to find them. But, the post above by LSLGuy describes well the problem people have been experiencing. In essence, the neural network is able to find some pattern in the imagery when in the lab (but cannot express what that pattern might be), and ultimately has high accuracy (with all of the normal precautions taken to reduce overfitting). For whatever reason, that pattern seemingly does not exist so frequently outside of the data used in the lab because accuracy drops significantly. Of course, such a problem is not exclusive to neural networks; however, most other classification algorithms produce a model that can be more closely examined, whereas neural networks are (in)famously black boxes, and do not (as of yet, but I know people are working on it) allow for an easy explanation of their internal logic.
It is like the recent issues with facial recognition, except more subtle. A black woman found that a facial recognition program, with a reportedly high accuracy, was not able to detect her face. It detected white, male faces. But not hers? And not Asians? Why? Well, it turns out that a high percentage of the input imagery was of white males. To be clear, I don’t agree with everything she says, but it is a weak example of the problem (there’s little subtlety as to why it isn’t detecting her face, while the problem people have been talking about is that there is a non-human-detectable pattern that the model is latching onto which isn’t “real”).
Keep in mind, I don’t work with imagery (medical or otherwise). My work is mainly on functional or process models. For example, I’m working on a model of communication in the brain at the moment, i.e. what is the cognitive process used by humans. This hasn’t been a problem I’ve been having, so everything about it is second-hand (or worse) to me.
Joy is right that there have been many instances where blacks have suffered as a result of facial recognition, so this is a known issue (as well as many other biases). I was just trying to explain that these are the fault of the people doing the model training and testing, not the underlying libraries. I’ll acknowledge that there is some disagreement around that statement, but so far the “algorithm is to blame” side hasn’t brought much evidence to the table. There are books on the issue of bias in data science (I like “Weapons of Math Destruction” by Cathy O’Neil), and we really need to watch out for these things, but so far they seem to not be innate flaws in the code itself, but instead caused by human error somewhere in the process. A simple example to illustrate the types of problems we face is that models are more than happy to drag biased source data all the way through to the output. So, if some police force is targeting majority-black neighborhoods looking for drug dealers, more arrests will be made in those neighborhoods. Thus the model believes those to be troubled neighborhoods warranting more police presence, and the cycle continues.
One is reminded of this case:
Google’s photo recognition ML had a habit of classifying black people as gorillas. Google patched the problem by blocking all recognitions of “gorilla”. The root cause was that the algorithms were trained on a limited sample of people.
An interesting case that I encountered (being deliberately vague) is that data collection started at a given time (say 9:00 AM) and we were intentionally collecting a certain number of ground-truth examples in category “A”, a certain number in category “B”, &c.
Except it turned out that category “J” was way more common than any of the other categories, so ground-truth collection for “J” was done by 9:10 AM, while collection for other categories was spread out over the day. Then at training time, the model would learn, “if time is between 9:00 and 9:10, predict category J”, even though the correlation was an artifact of collection.
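A toy reconstruction of that artifact, just to make it concrete (pandas, with invented rows):

```python
# Toy version of the leak: collection time carries no real signal, but because
# category "J" was collected first it predicts "J" almost perfectly.
import pandas as pd

df = pd.DataFrame({
    "collected_at": pd.to_datetime(
        ["09:01", "09:05", "09:08", "10:30", "13:45", "16:10"], format="%H:%M"),
    "category": ["J", "J", "J", "A", "B", "C"],
})

# One quick screen for this kind of artifact: how well does a metadata column
# alone predict the label? Anything near-perfect deserves suspicion.
minutes = df["collected_at"].dt.hour * 60 + df["collected_at"].dt.minute
df["collected_early"] = minutes < 9 * 60 + 10
print(pd.crosstab(df["collected_early"], df["category"]))
```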
The moral of all of these stories is that the ML is good enough now to find every stupid little correlation in your training data, and if those correlations are not true in general, you’re going to have generalization error.
I remember needing to fix an existing facial recognition algorithm to better detect black faces around 10 years ago, so by now there really should be no excuse.
Although… it should be noted that this is not just an AI issue. Many (video) cameras have or had a color balance that had been optimized for white and/or Asian faces. So some contrast and hue may have been washed out even before the AI gets a look-in.
I hope that the beautify filters that some people have permanently turned on on their phones work equally well across races, but I would wager they don’t.
Absolutely true, but in that case, IngestionDateTime or whatever that field was labelled should never have been an input.
Data selection, feature selection, feature engineering, model selection, hyperparameter selection, and proper testing on unmodeled data (or even cross-validation) are all vitally important parts of the process and all subject to human error. That’s true of software engineering as well as science. It’s basically GIGO, but there happens to be a ton of “G” lying around, perfectly willing to become part of “I”, if you don’t watch out for it.
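For the lurkers, here is the kind of guardrail that last part implies, sketched with scikit-learn: keep the preprocessing inside the pipeline so nothing about the held-out folds leaks into training.

```python
# Sketch: cross-validate a whole pipeline so scaling is re-fit inside each fold
# and no information from held-out data leaks into the model.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # stand-in data

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print("per-fold accuracy:", scores)
```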
That was an interesting article. I’m unsure why they used the LISA dataset (it’s rich, sunny, and never fucking rains in La Jolla, so they probably have perfect stop signs that are dusted every day, whether needed or not).
It is certainly possible to address some of those hacking issues, such as heavier weighting on edge detection, but it is always going to remain possible to hack image recognition algorithms. One lazy way is to simply tape a sign over another one. We’ll probably address that electronically in some way, where traffic signs and stop lights send out public-key encrypted signals or something similar. I really hope I’m around long enough to see much of this become reality.
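Purely speculative, but the electronic-sign idea might look something like a signed broadcast (Python `cryptography` library; every name and value here is invented for illustration):

```python
# Speculative sketch: the sign broadcasts its type plus a digital signature, and
# the vehicle verifies it against a trusted authority's public key instead of
# relying on pixels alone.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# The transport authority's key pair (the private key never leaves the authority).
authority_key = Ed25519PrivateKey.generate()
public_key = authority_key.public_key()

message = b"STOP sign #4471 at 32.8328N,117.2713W"   # invented example payload
signature = authority_key.sign(message)               # broadcast alongside the message

# In the vehicle: accept the sign only if the signature checks out.
try:
    public_key.verify(signature, message)
    print("verified stop sign broadcast")
except InvalidSignature:
    print("ignore: possible spoofed sign")
```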
It wasn’t an input, but it also could be inferred fairly closely from other features. Think of a text processing model where the creation date is often part of the text – the model found it easiest to parse the date from the text and use the accidental correlation.
A simpler model would not have been able to “cheat” that way, but a simpler model wouldn’t have been enough to do the kind of inference we needed.
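One blunt way to take that shortcut away from the model is to scrub date-like tokens out of the text before it ever reaches it; the regex below is deliberately crude and hypothetical:

```python
# Sketch: replace date-like tokens with a placeholder so the model can't "cheat"
# off the embedded creation date.
import re

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def scrub_dates(text: str) -> str:
    return DATE_PATTERN.sub("<DATE>", text)

print(scrub_dates("Report created 2019-03-14: patient presents with ..."))
# -> "Report created <DATE>: patient presents with ..."
```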
@DMC I agree with the thrust of your post overall. But speaking to just this:
From the perspective of the business user, the “libraries” of actual computer code plus the weightings or whatever other config data emerges from the training process are the algorithm.
As a business matter, you cannot separate the library from whatever behavior it exhibits after it emerges from the training. If the library cannot be trained by people with practically attainable levels of skill and diligence, using data of realistically available quality, consistency, and curation, then the total system promises what it cannot deliver: accurate, mostly self-taught classifications of real-world data.
If one is a library author it makes sense to put your focus first and foremost on robustness and correctness of your part of the overall project: the library and the implementation details that are its algorithms. That’s a necessary, but by no means sufficient, condition for a successful product: an accurate real world classifier. The history of every sort of engineering is littered with cool inventions and cool tooling that promised a revolution but proved too difficult to master in practice.
Is black-box neural net AI one of those cool near misses? I’m certainly not competent to offer any firm conclusion. But it leaves me suspicious and I’m far from alone in that. Plenty of academics and industrial researchers raise similar concerns.
I was kind of afraid that we might be going in this direction, so I’m going to stop before we go too far down the LeCun/Gebru path that took place recently (LeCun gave up Twitter for a while due to this disagreement). I’ll say that I think that they are both right. It is the training data at fault, but in the end it doesn’t matter. Simply hand-waving it away with a simple “the people doing the training should use or create better datasets” or “I blame the data”, while accurate, doesn’t move us forward. Those of us in the industry have to continue to call out our peers who make these mistakes and offer them help in reducing their impact.
Facial recognition of black people could be solved for the most part by simply training on millions of faces of black people from all over the world. Simply stating that is not the fix, we need to do it. Luckily, bias in AI has been a major discussion point at all conferences I’ve attended recently (virtually this year), so it’s not being ignored. It just needs to continue to be a large focus until we eliminate it.
So, with that said, I’m checking out of any more discussions around bias with the potential to be taken the wrong way. It’s a problem. We need to solve it, no matter the cause. That is all.
I wasn’t trying to be contentious. I hope I didn’t sound that way. I was speaking (or trying to) at the meta level of the tool vendor needing to ensure their tool works in the real world, not just on excessively curated laboratory-purity data.
If the problem of poor training efforts leading to GIGO is well recognized, then we’re moving forward as an industry towards solving it. It might be difficult, it might be unpopular with the PHBs who were promised effort-free results, and it might take a little while. But we’ll get there. That’s excellent news.
Speaking for myself, bias in the racial or SES sense is not an issue in its own right. I agree that that way lies hysteria, failures to communicate, and rampant accusations of bad faith. Not interested in moving the discussion that way even a little bit.
I didn’t take it that way at all, but this is pretty close to how the discussion between the two started. Again, all respect to both of them as they are giants to me.