A lot of things in medicine come down to complex judgement calls. Sometimes, as with blood clots in the leg, there are point-system tools available. And often they are nearly useless: in the scoring system for a suspected clot in the leg, “blood clot is the most likely diagnosis” is itself worth a high score, which largely just restates the clinical judgement the tool is supposed to support.
One of the Gladwell books tells a story about simplifying cardiac chest pain diagnosis to three simple questions, which were reportedly more accurate than the scads of heart tracings, historical features and symptoms, and blood tests. While I am skeptical, it seems that many assume “big data” is better. While it probably sometimes is, often it can mislead, or get lost in the shuffle. And one can only pay attention to so many things at once.
I would need to know a bit more about the book, the story and the “three simple questions which were reportedly more accurate than the scads” of verifiable hard evidence gathered.
Isn’t that part of the reason full-body scans for an asymptomatic individual with no family history of certain diseases are contraindicated? The false positives, and the undue worry for the patient from finding benign masses, outweigh the benefit of catching cancers that would otherwise be missed.
For those who don’t want to watch a twenty-minute video, the thirty-second summary is that by just reflexively running a bunch of diagnostic tests you end up with a bunch of false positives that doctors then treat, to the detriment of both the patient’s health and bank account. Too much data can lead to being overwhelmed by possibilities, many of them less likely than the weird diseases the writers have Dr. House (improbably readily) diagnose. I’m a little leery of giving Gladwell credit without knowing the actual context of the claim he is making, but it is easy to get tunnel vision, diagnosing and treating the wrong condition, or reflexively treating a condition that doesn’t require treatment.
Medical tests often define “abnormal” by graphing a curve and cutting it off at, say, the 5th and 95th percentiles. If you take a hundred such measurements, several will likely come back abnormal, but not necessarily significant. This is indeed a problem with MRIs: an incidentaloma may lead to more and more tests.
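To put rough numbers on that, here is a back-of-the-envelope sketch, assuming 100 independent tests that each flag values outside the 5th–95th percentile range (real panels aren’t independent, so this overstates things a bit, but the point stands):

```python
# Rough illustration: how often does a perfectly healthy patient flag "abnormal"
# if each of 100 independent tests labels the outer 10% of the reference range abnormal?
n_tests = 100      # number of measurements (assumed independent, which real panels aren't)
p_normal = 0.90    # chance a healthy value lands inside the 5th-95th percentile range

p_at_least_one_flag = 1 - p_normal ** n_tests
expected_flags = n_tests * (1 - p_normal)

print(f"P(at least one 'abnormal' result): {p_at_least_one_flag:.5f}")  # ~0.99997
print(f"Expected number of 'abnormal' results: {expected_flags:.0f}")   # ~10
```

In other words, a perfectly healthy patient running the full panel should expect roughly ten “abnormal” flags.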
I did not discuss Gladwell and the Cook County use of Lee Goldman’s research since this thread is meant to be much more general. I have plenty of experience diagnosing cardiac chest pain and believe the algorithm to be valuable. But to read more (this is not the main thrust of this thread):
Overfitting can be toxic. Data alone, not so much.
The people building models can make all sorts of mistakes and perhaps having more possible features or something along those lines might increase the likelihood of said mistakes, but that’s not a problem of the data.
Stranger, I’m not sure I’m ready to blame those issues on data. When building classifiers, the modeler (human) gets to balance recall and precision. If false positives are bad, increase precision. If something not being caught is bad, improve recall. Since more and better data can actually improve both recall and precision together, I’d need more convincing to blame data volumes.
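Here is a minimal sketch of that knob, using made-up synthetic data (the dataset, model, and thresholds are purely illustrative, not any real diagnostic):

```python
# Illustrative only: synthetic data, not a real diagnostic model.
# Shows how the modeler trades precision against recall by moving the decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

for threshold in (0.2, 0.5, 0.8):
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds)
    print(f"threshold {threshold:.1f}: precision {p:.2f}, recall {r:.2f}")
```

A higher threshold buys precision (fewer false positives) at the cost of recall (more missed cases), and vice versa; more and better training data pushes the whole tradeoff curve outward so the choice hurts less.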
And Cook County isn’t an example of using limited data, it’s an example of limiting feature selection. They were able to limit those features considered important, and the weights thereof, as a result of, wait for it, shit tons of data modeling. It’s really a simple decision tree, which has some disadvantages, but also advantages as a model, the primary one being explainability. If I need a prescriptive method for doctors to make quick decisions, it’s a perfectly cromulent model, built on the back of heavy research.
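For flavor, here is what I mean by an explainable decision tree. This is a made-up toy rule, not the actual Goldman/Cook County criteria, but it shows why a clinician can follow and audit this kind of model at a glance:

```python
# Hypothetical toy rule, NOT the actual Goldman/Cook County algorithm.
# The point is that a small decision tree reads like a checklist a human can audit.
def chest_pain_triage(ecg_abnormal: bool, risk_factor_count: int) -> str:
    """Toy triage rule; the questions and cutoffs are placeholders."""
    if ecg_abnormal:
        return "high risk: admit to coronary care"
    if risk_factor_count >= 2:
        return "intermediate risk: monitored observation"
    return "low risk: short observation or discharge per clinical judgement"

print(chest_pain_triage(ecg_abnormal=False, risk_factor_count=1))
```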
Emergency doctors cannot be sending patients home if they are having a heart attack - not medically, not morally, not medicolegally. Everyone knows this and it does not happen often - the question is how many tests are really needed to draw the correct conclusions. Tests which provide data cost money and a few have significant risks. But I agree this is not the best example of what I was getting at.
So your problem is all of the testing, not all of the data? Okay, I’m on board. I want more and better data so we can get by with a single test or two for many health issues. We in agreement?
Sure, but that’s not a too-much-data problem, that’s a worried-about-liability problem. Simple, data-driven solutions to help us detect various illnesses at an early stage are both cost-effective and life-saving. To get there requires data.
If I run 10,000 images through a neural network and tell the model which 5,000 had stage 2 lung cancer, the model would probably be decent and might even rank up there with a panel of radiologists. If I show it 100,000 images, 50,000 of which had stage 2 lung cancer, and if they are from various sources, but all excellently labelled and imaged, I’m probably beating panels of radiologists in my predictions, even if I control for precision to avoid a large number of false positives.
In general, more, disparate, annotated, quality data helps build better models.
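The gist of that, in miniature (synthetic tabular data and a stock classifier, nothing to do with real radiology): hold the model fixed, grow the training set, and watch held-out performance climb.

```python
# Learning-curve sketch on synthetic tabular data (purely illustrative, not
# medical imaging): same model, progressively larger training sets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=120_000, n_features=30, n_informative=15,
                           flip_y=0.05, class_sep=0.8, random_state=1)
X_test, y_test = X[:20_000], y[:20_000]          # fixed held-out set

for n_train in (1_000, 10_000, 100_000):
    X_train = X[20_000:20_000 + n_train]
    y_train = y[20_000:20_000 + n_train]
    model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n_train:>7,} training examples -> held-out accuracy {acc:.3f}")
```

A simple model’s curve flattens fairly quickly; the flexible models used for imaging tend to keep benefiting from additional well-labelled data much longer, which is the point above.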
Overfitting is a serious problem in evaluating data trends in general, although I can’t think of a specific case where it is a direct medical hazard. What it can result in is nutrition or lifestyle guidance that isn’t really supported by evidence, e.g. that some special micronutrient or exercise will provide an exceptional benefit, or a supposed efficacy of a treatment or pharmaceutical intervention that is really an artifact of how the data was evaluated rather than a real trend. If that treatment has serious side effects, or if it means the patient doesn’t seek other, more efficacious treatments, then those could be considered harms stemming directly from poor analysis.
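As a toy illustration of the mechanism (synthetic numbers, nothing medical about them): fit an overly flexible model to a small noisy sample and it will happily “discover” structure that evaporates on fresh data.

```python
# Toy overfitting demo with synthetic data: the true relationship is a straight
# line plus noise, but a high-degree polynomial happily fits the noise too.
import numpy as np

rng = np.random.default_rng(0)

def make_sample(n=20):
    x = rng.uniform(-1, 1, n)
    y = 2 * x + rng.normal(0, 0.5, n)   # underlying trend is linear
    return x, y

x_train, y_train = make_sample()
x_test, y_test = make_sample()          # fresh data from the same process

for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

The high-degree fit typically looks better on the data it was built from and worse on data it hasn’t seen, which is exactly the “artifact of how the data was evaluated” problem.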
Although the o.p. used the term “big data”, I inferred that what he actually meant was a lot of data from extraneous tests that aren’t looking for something specific, or diagnostics with poor specificity (producing a lot of false positive results), rather than actual problems with the volume of data. Agreed that larger and more precise data sets allow someone developing a diagnostic test to refine it for better sensitivity and specificity, or at least to more precisely quantify the accuracy limits of the test. It is still up to the treating physician or pathologist to interpret the results in the context of the patient (including medical history, diet and lifestyle, other signs and symptoms, et cetera) rather than blindly accept a path report that indicates a single condition when others may still bear consideration.
Seems like a Pareto’s law (aka 80:20 rule) thing here. Absolutely, for most cases less data is better: those three questions will get you to a correct diagnosis ASAP, and everyone is a winner, with fewer invasive tests and a quicker diagnosis.
But for the minority of patients who have something rarer and more complicated, the opposite is true. You are likely to get treated for something completely inappropriate, and have bad outcomes. For those patients you do need that mountain of data.
When you do the statistics the average patient does better with the simpler system, but that’s not much consolation if you are that patient.
But again, and we’re back to semantics potentially, that’s not less data, that’s fewer tests. The data behind that decision tree is mountainous. That decision tree is derived from the work of Lee Goldman, who used massive amounts of data to reach his conclusions. I’m pretty sure he relied upon the unbelievably enormous volume of data produced by the Framingham Heart Study to start. In these cases, much data is the answer, not the problem.
According to the specifications, Data consists of 24.6 kilograms of tripolymer composites, 11.8 kilograms of molybdenum-cobalt alloys and 1.3 kilograms of bioplast sheeting, a polyalloy spinal support, and a skull composed of duranium and cortenide. The LD50 for cobalt(-salts) for a 100 kg person is about 20g, and there may be some toxicity to molybdenum as well, so I’d caution against overconsumption.
The OP’s post and all the subsequent replies all seem to be medical in nature, so I hope this isn’t a hijack. But when I read the thread title my first thought was in regard to the flood of internet misinformation, specifically or especially political misinformation, and how it’s led to the spread and proliferation of conspiracy theories and just plain bad ideas, like a virus. QAnon surely wouldn’t be as widespread as it is without the internet.
I regret giving medical examples, since I would have got better replies and debate merely by asking, “Is data toxic?”
If it is so valuable that civil liberties can be ignored, then the answer is yes. If you are studying for an anatomy exam, then it depends upon the difficulty of the questions.
This is something of a semantic argument, but I wouldn’t regard misinformation as being “data”, at least per the primary Merriam-Webster definition. Misinformation and baseless conspiracy theories are counterfactual and generally anecdotal (i.e. they don’t even provide data to assess), so they aren’t an example of a toxic surfeit of data but rather a perverse manipulation of it.