A lot of things in medicine come down to complex judgement calls. Sometimes, as with blood clots in the leg, there are point-system tools available. And often they are nearly useless: in the scoring system for a suspected clot in the leg, “blood clot is the most likely diagnosis” is itself worth a high score, which largely just restates the clinical judgement the tool is supposed to support.
One of the Gladwell books tells a story about simplifying cardiac chest pain diagnosis to three simple questions, which were reportedly more accurate than the scads of heart tracings, historical features and symptoms, and blood tests. While I am skeptical, it seems that many assume “big data” is better. While it probably sometimes is, often it can mislead, or get lost in the shuffle. And one can only pay attention to so many things at once.
I would need to know a bit more about the book, the story and the “three simple questions which were reportedly more accurate than the scads” of verifiable hard evidence gathered.
Isn’t that part of the reason full-body scans for an asymptomatic individual with no family history of certain diseases are contraindicated? The false positives, and the undue worry for the patient from finding benign masses, outweigh the benefit of catching cancers that would otherwise be missed.
For those who don’t want to watch a twenty-minute video, the thirty-second summary is that by just reflexively running a bunch of diagnostic tests you end up with a bunch of false positives that doctors then treat, to the detriment of both the patient’s health and bank account. Too much data can lead to being overwhelmed by possibilities, many of them less likely than the weird diseases the writers have Dr. House (improbably readily) diagnose. I’m a little leery of giving Gladwell credit without knowing the actual context of the claim he is making, but it is easy to get tunnel vision, diagnosing and treating the wrong condition, or reflexively treating a condition that doesn’t require treatment.
Medical tests often define “abnormal” by graphing a curve and cutting it off at, say, the 5th and 95th percentiles. If you take a hundred such measurements, several will likely come back abnormal, but not necessarily significant. This is indeed a problem with MRIs: an incidentaloma may lead to more and more tests.
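To put rough numbers on that, here is a back-of-the-envelope sketch, assuming 100 independent tests that each flag values outside the 5th–95th percentile range (real panels aren’t independent, so this overstates things a bit, but the point stands):

```python
# Rough illustration: how often does a perfectly healthy patient flag "abnormal"
# if each of 100 independent tests labels the outer 10% of the reference range abnormal?
n_tests = 100      # number of measurements (assumed independent, which real panels aren't)
p_normal = 0.90    # chance a healthy value lands inside the 5th-95th percentile range

p_at_least_one_flag = 1 - p_normal ** n_tests
expected_flags = n_tests * (1 - p_normal)

print(f"P(at least one 'abnormal' result): {p_at_least_one_flag:.5f}")  # ~0.99997
print(f"Expected number of 'abnormal' results: {expected_flags:.0f}")   # ~10
```

In other words, a perfectly healthy patient running the full panel should expect roughly ten “abnormal” flags.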
I did not discuss Gladwell and the Cook County use of Lee Goldman’s research since this thread is meant to be much more general. I have plenty of experience diagnosing cardiac chest pain and believe the algorithm to be valuable. But to read more (this is not the main thrust of this thread):
Overfitting can be toxic. Data alone, not so much.
The people building models can make all sorts of mistakes and perhaps having more possible features or something along those lines might increase the likelihood of said mistakes, but that’s not a problem of the data.
Stranger, I’m not sure I’m ready to blame those issues on data. When building classifiers, the modeler (human) gets to balance recall and precision. If false positives are bad, increase precision. If something not being caught is bad, improve recall. Since more and better data can actually improve both recall and precision together, I’d need more convincing to blame data volumes.
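Here is a minimal sketch of that knob, using made-up synthetic data (the dataset, model, and thresholds are purely illustrative, not any real diagnostic):

```python
# Illustrative only: synthetic data, not a real diagnostic model.
# Shows how the modeler trades precision against recall by moving the decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

for threshold in (0.2, 0.5, 0.8):
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds)
    print(f"threshold {threshold:.1f}: precision {p:.2f}, recall {r:.2f}")
```

A higher threshold buys precision (fewer false positives) at the cost of recall (more missed cases), and vice versa; more and better training data pushes the whole tradeoff curve outward so the choice hurts less.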
And Cook County isn’t an example of using limited data, it’s an example of limiting feature selection. They were able to limit those features considered important, and the weights thereof, as a result of, wait for it, shit tons of data modeling. It’s really a simple decision tree, which has some disadvantages, but also advantages as a model, the primary one being explainability. If I need a prescriptive method for doctors to make quick decisions, it’s a perfectly cromulent model, built on the back of heavy research.
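For flavor, here is what I mean by an explainable decision tree. This is a made-up toy rule, not the actual Goldman/Cook County criteria, but it shows why a clinician can follow and audit this kind of model at a glance:

```python
# Hypothetical toy rule, NOT the actual Goldman/Cook County algorithm.
# The point is that a small decision tree reads like a checklist a human can audit.
def chest_pain_triage(ecg_abnormal: bool, risk_factor_count: int) -> str:
    """Toy triage rule; the questions and cutoffs are placeholders."""
    if ecg_abnormal:
        return "high risk: admit to coronary care"
    if risk_factor_count >= 2:
        return "intermediate risk: monitored observation"
    return "low risk: short observation or discharge per clinical judgement"

print(chest_pain_triage(ecg_abnormal=False, risk_factor_count=1))
```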
Emergency doctors cannot be sending patients home if they are having a heart attack - not medically, not morally, not medicolegally. Everyone knows this and it does not happen often - the question is how many tests are really needed to draw the correct conclusions. Tests which provide data cost money and a few have significant risks. But I agree this is not the best example of what I was getting at.
So your problem is all of the testing, not all of the data? Okay, I’m on board. I want more and better data so we can get by with a single test or two for many health issues. We in agreement?
Sure, but that’s not a too-much-data problem, that’s a worried-about-liability problem. Simple, data-driven solutions to help us detect various illnesses at an early stage are both cost-effective and life-saving. To get there requires data.
If I run 10,000 images through a neural network and tell the model which 5,000 had stage 2 lung cancer, the model would probably be decent and might even rank up there with a panel of radiologists. If I show it 100,000 images, 50,000 of which had stage 2 lung cancer, and if they are from various sources, but all excellently labelled and imaged, I’m probably beating panels of radiologists in my predictions, even if I control for precision to avoid a large number of false positives.
In general, more, disparate, annotated, quality data helps build better models.
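The gist of that, in miniature (synthetic tabular data and a stock classifier, nothing to do with real radiology): hold the model fixed, grow the training set, and watch held-out performance climb.

```python
# Learning-curve sketch on synthetic tabular data (purely illustrative, not
# medical imaging): same model, progressively larger training sets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=120_000, n_features=30, n_informative=15,
                           flip_y=0.05, class_sep=0.8, random_state=1)
X_test, y_test = X[:20_000], y[:20_000]          # fixed held-out set

for n_train in (1_000, 10_000, 100_000):
    X_train = X[20_000:20_000 + n_train]
    y_train = y[20_000:20_000 + n_train]
    model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n_train:>7,} training examples -> held-out accuracy {acc:.3f}")
```

A simple model’s curve flattens fairly quickly; the flexible models used for imaging tend to keep benefiting from additional well-labelled data much longer, which is the point above.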
Overfitting is a serious problem in evaluating data trends in general, although I can’t think of a specific case where it is a direct medical hazard. What it can result in is nutrition or lifestyle guidance that isn’t really supported by evidence, e.g. that some special micronutrient or exercise will provide an exceptional benefit, or a supposed efficacy of a treatment or pharmaceutical intervention that is really an artifact of how the data was evaluated rather than a real trend. If that treatment has serious side effects, or if it means the patient doesn’t seek other, more efficacious treatments, then those could be considered harms stemming directly from poor analysis.
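As a toy illustration of the mechanism (synthetic numbers, nothing medical about them): fit an overly flexible model to a small noisy sample and it will happily “discover” structure that evaporates on fresh data.

```python
# Toy overfitting demo with synthetic data: the true relationship is a straight
# line plus noise, but a high-degree polynomial happily fits the noise too.
import numpy as np

rng = np.random.default_rng(0)

def make_sample(n=20):
    x = rng.uniform(-1, 1, n)
    y = 2 * x + rng.normal(0, 0.5, n)   # underlying trend is linear
    return x, y

x_train, y_train = make_sample()
x_test, y_test = make_sample()          # fresh data from the same process

for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

The high-degree fit typically looks better on the data it was built from and worse on data it hasn’t seen, which is exactly the “artifact of how the data was evaluated” problem.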
Although the o.p. used the term “big data”, I inferred that what he actually meant was a lot of data from extraneous tests that aren’t looking for something specific, or diagnostics with poor specificity (producing a lot of false positive results), rather than actual problems with the volume of data. Agreed that larger and more precise data sets allow someone developing a diagnostic test to refine it for better sensitivity and specificity, or at least to more precisely quantify the accuracy limits of the test. It is still up to the treating physician or pathologist to interpret the results in the context of the patient (including medical history, diet and lifestyle, other signs and symptoms, et cetera) rather than blindly accept a path report that indicates a single condition when others may still bear consideration.
Seems like a Pareto’s law (aka 80:20 rule) thing here. Absolutely, for most cases less data is better: those three questions will get you to a correct diagnosis ASAP, and everyone is a winner, with fewer invasive tests and a quicker diagnosis.
But for the minority of patients who have something rarer and more complicated, the opposite is true. You are likely to get treated for something completely inappropriate, and have bad outcomes. For those patients you do need that mountain of data.
When you do the statistics the average patient does better with the simpler system, but that’s not much consolation if you are that patient.
But again, and we’re back to semantics potentially, that’s not less data, that’s fewer tests. The data behind that decision tree is mountainous. That decision tree is derived from the work of Lee Goldman, who used massive amounts of data to reach his conclusions. I’m pretty sure he relied upon the unbelievably enormous volume of data produced by the Framingham Heart Study to start. In these cases, much data is the answer, not the problem.
According to the specifications, Data consists of 24.6 kilograms of tripolymer composites, 11.8 kilograms of molybdenum-cobalt alloys and 1.3 kilograms of bioplast sheeting, a polyalloy spinal support, and a skull composed of duranium and cortenide. The LD50 for cobalt(-salts) for a 100 kg person is about 20g, and there may be some toxicity to molybdenum as well, so I’d caution against overconsumption.
The OP’s post and all the subsequent replies all seem to be medical in nature, so I hope this isn’t a hijack. But when I read the thread title my first thought was in regard to the flood of internet misinformation, specifically or especially political misinformation, and how it’s led to the spread and proliferation of conspiracy theories and just plain bad ideas, like a virus. QAnon surely wouldn’t be as widespread as it is without the internet.
I regret giving medical examples, since I would have got better replies and debate merely by asking, “Is data toxic?”
If it is so valuable that civil liberties can be ignored, then the answer is yes. If you are studying for an anatomy exam, then it depends upon the difficulty of the questions.
This is something of a semantic argument, but I wouldn’t regard misinformation as being “data”, at least per the primary Merriam-Webster definition. Misinformation and baseless conspiracy theories are counterfactual and generally anecdotal (i.e. they don’t even provide data to assess), so they aren’t an example of a toxic surfeit of data but rather a perverse manipulation of it.