No, I appreciate you letting me off easy by framing it as semantics, but now that you point it out, you are absolutely correct: my referencing online misinformation as ‘data’ is a misnomer, if not outright wrong.
Well, people are treating conspiranoia as if it is data, as in, “There are all these studies…” The problem is, of course, that they generally cannot produce the studies, or if they can, they are completely misinterpreting the results. Not so much a problem with “toxic doses of data” as with “toxic doses of bullshit”, but unless the hypothetical consumer of information understands the difference and has the tools to question it, they can’t distinguish one from the other. It’s like the moon landing hoaxers who will shove pages of calculations ‘proving’ why NASA couldn’t have sent astronauts to the Moon, which look impressive to anyone not conversant with orbital mechanics but are evidently garbage to anyone who is.
Stranger
I do think there is an interesting conversation to be had here, as I own, have read, and enjoyed Weapons of Math Destruction.
But even in that area, the problem isn’t really the volume of data; it’s mostly that completely innocent models inherit the bias in the data they are fed and then propagate that bias forward. We practitioners in that space need to always be aware of this sort of thing, try to eliminate it wherever possible, and speak to it when we can’t.
In a predictive model that attempts to predict criminal drug use (not that I’m advocating for such a thing, but I can almost guarantee you that someone is building or has built this model), taking all of the drug crime data and feeding it into the model is going to show a much higher propensity for the perpetrator to be black. That could easily come from police focusing on certain areas when fighting drug crimes, or from courts more readily convicting black people for no reason other than race. So, if we assume that criminal drug use (not arrests or convictions) is proportionally distributed among races (and I’ve seen no compelling counter evidence), then while the model builder could be completely innocent, the model would have a built-in bias, causing the department to dedicate its resources to the same areas that caused the bias in the first place.
That sort of thing is a real issue and has highly important repercussions if not dealt with, but again, it is a data quality problem, not a data volume problem.
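To make that feedback loop concrete, here is a toy sketch (entirely hypothetical numbers, not any real department’s model): the two groups have the same underlying rate of drug use, but one is policed twice as heavily, so the arrest labels the model trains on are skewed, and the model dutifully learns the skew.

```python
# Toy sketch: equal true drug-use rates, unequal policing, biased labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
group = rng.integers(0, 2, n)            # 0 = group A, 1 = group B
uses_drugs = rng.random(n) < 0.10        # same 10% true rate in both groups

# An arrest only happens if someone uses AND is caught; group B is
# patrolled twice as heavily, so its catch rate is doubled.
catch_rate = np.where(group == 1, 0.40, 0.20)
arrested = uses_drugs & (rng.random(n) < catch_rate)

# The "innocent" model is trained on arrests -- the only labels it ever sees.
model = LogisticRegression().fit(group.reshape(-1, 1), arrested)
print(model.predict_proba([[0], [1]])[:, 1])  # roughly 0.02 vs 0.04
```

The model reports group B as twice as “risky” even though actual usage is identical, and if those scores drive where patrols go next, the skew in the training data only gets worse. Again: a data quality problem, not a data volume problem.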
And I forgot this earlier.
Please tell this to my sister-in-law!
Data is often viewed as the raw material, so any manipulation by some black box algorithm that produces a result can be characterized as due to the process, not the data, and certainly not the volume of data. (The data is an independent variable subjected to some function.) But is this always so?
Such a result might be subject to whatever assumptions were used to produce it, and this criticism is said to also apply to AI as mentioned above.
Your senses rely on proven (but complex and poorly understood) heuristics that automatically ignore most of the inputs (the data depends on the function). You do not pay attention to most of the background noise or available visual detail, yet you still pick out the conversation mentioning a subject of interest, or possibly miss (as per the famous selective-attention experiment) the person dressed as a gorilla while you count basketball passes.
Some surveys seem to ask more and more invasive questions leading to the impression that the survey is not the goal. No wonder people are increasingly reluctant to complete election polls. In this case, data quality and volume are both the issue.
Again, I haven’t seen a single example in this thread where too much data was in any way toxic or the cause of an issue. I’ll happily look at any examples anyone can bring forth, but so far I have seen zero. People will say that overfitting is caused by too much data. Not only are those people wrong about the cause, but one of the first solutions offered to counter overfitting is to get more data.
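A quick toy illustration of that last point, under made-up numbers: an over-flexible model (a degree-9 polynomial) chases the noise when it only gets a dozen points, and the cure is not less data but more of it.

```python
# Toy sketch: the same wiggly model overfits 12 noisy points
# but behaves itself once it sees 2,000.
import numpy as np

rng = np.random.default_rng(1)
true_fn = lambda x: np.sin(2 * np.pi * x)

def test_error(n_train, degree=9):
    x = rng.random(n_train)
    y = true_fn(x) + rng.normal(0, 0.3, n_train)   # noisy training sample
    coeffs = np.polyfit(x, y, degree)              # fit the flexible model
    x_test = np.linspace(0.05, 0.95, 200)          # held-out grid
    return np.mean((np.polyval(coeffs, x_test) - true_fn(x_test)) ** 2)

print(test_error(12))     # large error: the curve chases the noise
print(test_error(2000))   # small error: same model, just more data
```

Same model, same noise level; the only thing that changed between the bad fit and the good one is the volume of data, and more was better.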
Does anyone have a single example where the volume of the data was the cause of an actual problem (beyond the inability for inferior hardware to process it in a timely manner)? Just one.
Google seems to show a mention of this by Bruce Schneier. He says companies should not keep unneeded data because of the risks of compromise. He calls data toxic, but I don’t see that usage commonly applied. This is not what I was thinking of either.
The volume of my data has nothing to do with risks of compromise, other than perhaps whether I am a “juicy” target. Frankly, these days everyone with any data whatsoever is “juicy” enough to someone. We get targeted many times a day and I don’t consider our data all that exciting. We have some PII, but it’s not particularly useful PII. The likelihood of a hacking attempt getting through is not dependent on whether I am storing petabytes or kilobytes of data on my servers; it’s dependent on how well my security procedures are set up.
I’m assuming this is what you are referring to from Bruce. I read that more as “Don’t keep shit around that might get you in trouble or embarrass you, unless you actually have to for some reason”, which I don’t believe is the same as “Data is toxic in large doses.”
If you get married, perhaps lose that small pile of love letters you saved from your first girlfriend, but it’s probably less of an issue to keep that gigantic baseball card collection. As you see, it’s not a quantity issue.