Abusing statistics for fun

I love statistics. They can be put to use for good or evil. I overheard two people arguing about cell phones and cancer, and they were quoting studies and all kinds of numbers.

I randomly flipped through information about cancer rates; I found that 75% of the states that start with vowels have lower-than-average cancer rates. That has to be significant! If you live in a state that starts with an “M,” though, you’re in trouble; exactly 75% of those states have a higher-than-average cancer prevalence. Be sure to reflect on these statistics next time you move.

There must be a money making opportunity here somewhere . . .

(Please note that these don’t fall into the 73% of statistics that are made up. I have an actual source!)
http://statecancerprofiles.cancer.gov/map/map.noimage.php

You can quote a stat to make any point you choose.
4 out of 5 people know that…

I like.

Did you know that on average people have less than two arms? Total up the number of arms in the world and divide by the number of people. Since some people are missing one or two, and even fewer have an extra arm, the result will be less than two. :smiley:
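The arithmetic checks out with any made-up census you like. Here’s a sketch; every figure below is invented, but any realistic mix of amputees and extra arms gives the same punchline:

```python
# The arms average, with a completely invented census: the counts
# are made up for illustration, not real population data.
people_by_arm_count = {2: 999_000, 1: 900, 0: 99, 3: 1}

people = sum(people_by_arm_count.values())
total_arms = sum(arms * count for arms, count in people_by_arm_count.items())

print(total_arms / people)  # just under 2
```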

I’ve always been a fan of Simpson’s paradox. Stein’s paradox is also extremely interesting, but it might be a bit technical.

This issue has very important implications.

Suppose you had a mass of data available for study and two ways you could use it: either form your hypothesis first and then test it against the data, or analyze the data and use it to form your hypothesis. Most lay people would think the second choice is better, since you’re not letting your bias influence the results and are going where the data leads you. But in reality, the opposite is true, because any data will contain some statistical anomalies if you look hard enough. There’s no limit to the number of things you can correlate against each other, and if something is 1% likely to happen by pure chance, then out of 100 such things you would expect about 1 of them to happen by chance alone.

So the likelihood that you’ll find some bogus but “statistically significant” correlation in any mass of data is extremely high. The likelihood that you’ll find any specific pre-specified one is very low. The way to avoid arriving at bogus conclusions is by deciding in advance what’s going to be tested.
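You can see the “1% likely, 100 chances” arithmetic play out on pure noise. A quick sketch (the 1% level and the batch of 100 come from the post above; everything else here is invented for illustration):

```python
# Sketch of the multiple-comparisons trap: scan 100 noise-only
# "anomalies", each 1% likely by pure chance, and count the hits.
import random

random.seed(0)  # fixed seed just for reproducibility

def expected_false_hits(n_tests: int, alpha: float) -> float:
    """Expected number of chance 'discoveries' among n_tests
    noise-only comparisons, each 'significant' with prob alpha."""
    return n_tests * alpha

print(expected_false_hits(100, 0.01))  # 1.0

# Simulate it: 10,000 batches of 100 noise-only tests each.
batches = 10_000
hits_per_batch = [
    sum(random.random() < 0.01 for _ in range(100))
    for _ in range(batches)
]
avg = sum(hits_per_batch) / batches
print(f"average chance hits per 100 tests: {avg:.2f}")  # close to 1
```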

[This is a problem in a lot of peer-reviewed studies. Some researcher spends an enormous amount of energy studying some data set and does not find what he is looking for. Does he just throw out all his work? He could try to get an article published describing his failure to find anything, but he’s better off if he does find something. So he looks through the same data set until he finds something, and the reviewers have no way of knowing that this was done.

Of course, this is why scientists reserve judgement until studies have been replicated.]

By extrapolation, therefore, you can insult your (two armed) friends by telling them they’re mutant freaks due to having more than the average number of arms.

I have always liked the expression: “Figures don’t lie, but liars figure.”

And, half of the population has a below average understanding of mathematics.

Very true. In this case, there’s an additional issue: the states that start with vowels are states with low population, which means that the cancer rate in those states can vary significantly from the mean, for no reason other than chance.
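You can watch that small-population effect in a toy simulation. Everything below is invented (populations, the rate, the number of states); the only point is that the true rate is identical everywhere, yet the small states swing much further from it by sampling noise alone:

```python
# Toy model of the small-state effect: same true rate everywhere,
# wildly different observed rates in the small "states".
import random

random.seed(1)
TRUE_RATE = 0.005  # same hypothetical cancer rate in every state

def observed_rate(population: int) -> float:
    """Observed rate in one simulated state of the given size."""
    cases = sum(random.random() < TRUE_RATE for _ in range(population))
    return cases / population

small_states = [observed_rate(2_000) for _ in range(50)]
big_states = [observed_rate(50_000) for _ in range(50)]

def spread(rates):
    return max(rates) - min(rates)

print(f"spread of observed rates, small states: {spread(small_states):.4f}")
print(f"spread of observed rates, big states:   {spread(big_states):.4f}")
# The small states' rates swing much more widely, by chance alone.
```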

I’m reading this post as anti-exploratory data analysis, which is a bit too conservative. EDA is fine as long as you don’t try to use the same dataset for validation, which is the real issue that you’re concerned with.

This is a mistake you see all the time. If 25 other people try but can’t get the same results you did, is it more likely all 25 screwed up or just you did?

I recall a Dilbert cartoon where the boss is upset because 40% of sick days are taken on Mondays and Fridays.

The guy who was saying cell phones cause cancer said that cell phone use was up 300% (he didn’t say from when) and that brain cancer mortality was now at “almost 4 in 1,000!” First, I think he meant 4 in 100,000, and second, that figure is meaningless to the average person unless you put it in context. In the early 1990s it was at 4.9/100,000. I think he was basically arguing against himself. :rolleyes:

It’s not even necessarily a question of screwing up. Because of the way that we do hypothesis testing, you would expect to see false rejections of the null hypothesis about 5% of the time. If you do twenty-six studies, you’d expect to see about one rejection by pure chance.
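The arithmetic behind that is two lines (26 studies matches the “you plus 25 others” above; the 5% level is the conventional one):

```python
# 26 independent noise-only studies, each tested at the usual 5%
# significance level: how many false rejections should we expect?
ALPHA = 0.05
N_STUDIES = 26

expected_false_rejections = N_STUDIES * ALPHA
p_at_least_one = 1 - (1 - ALPHA) ** N_STUDIES

print(f"expected false rejections: {expected_false_rejections:.2f}")  # 1.30
print(f"chance of at least one:    {p_at_least_one:.0%}")             # 74%
```

So even if every single experimenter does everything right, one “positive” result among 26 is about what chance alone delivers.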

One way you can attempt to avoid errors when fishing for conclusions in pre-existing data is to analyze only half the data set, form conclusions based on that half, then see if those conclusions hold for the other half of the data set. At least that way you’re forming a reasoned hypothesis on the basis of data, and then testing that hypothesis on wholly independent data.
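Here’s a minimal sketch of that split-half idea. The dataset is pure noise and every name and number below is invented, which is exactly the point: a “finding” fished out of one half tends to evaporate on the other.

```python
# Split-half validation sketch: fish in one half, confirm on the
# other. All data here is coin-flip noise, invented for illustration.
import random

random.seed(2)

# 1,000 records: 20 coin-flip features and a coin-flip outcome.
data = [
    ([random.random() < 0.5 for _ in range(20)], random.random() < 0.5)
    for _ in range(1000)
]
explore_half, confirm_half = data[:500], data[500:]

def effect(records, feature_idx):
    """How far the outcome rate strays from 50% among records
    where the given feature is set."""
    hits = [outcome for features, outcome in records if features[feature_idx]]
    return abs(sum(hits) / len(hits) - 0.5)

# "Exploration": fish for the most impressive-looking feature.
best = max(range(20), key=lambda i: effect(explore_half, i))

# "Confirmation": retest that same feature on the held-out half.
print(f"effect on the half we fished in: {effect(explore_half, best):.3f}")
print(f"effect on the held-out half:     {effect(confirm_half, best):.3f}")
```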

I remember studying Simpson’s paradox and find it hardly surprising at all, given how much sample sizes can differ.

Stein’s phenomenon is way out there though, and seems like magic.

I always “like” hearing the car insurance commercials that say “people who switched save an average of gazillionty-five dollars!”. That people who switch save money should be obvious; they wouldn’t switch otherwise. The problem is that people are going to hear that as “people save on average if they switch!”, getting the causation completely backwards. It’s not the commercial’s fault that people are stupid like that, but it’s definitely worded in a way that makes people think the wrong thing.

My favorite has always been the Pepsi taste challenge from many years ago on TV.

What I admired was the clever wording of the stated facts at the end. Having shown a few avowed Coke lovers choosing Pepsi in a blind test, the voice over said, “In recent blind taste tests the majority of Coke lovers preferred the taste of Pepsi.”

So what does that mean? Let’s say I break up all my Coke lovers into groups of 3 and call each group a “test.” Whenever 2 or 3 people prefer Pepsi that is one test where “the majority of Coke lovers preferred the taste of Pepsi.” So how many do I need to make my tagline true? Only two, even if in 200 cases it’s not true.
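You can put numbers on the trick, too. Suppose each Coke lover independently picks Pepsi with some probability p (the values of p below are made up); the binomial arithmetic for a 3-person “test” is:

```python
# The groups-of-3 trick in numbers: how often does a 3-person
# "test" show a Pepsi majority, if each taster independently picks
# Pepsi with probability p? (Values of p are invented.)
def p_majority_of_3(p: float) -> float:
    """Chance that 2 or 3 of 3 independent tasters pick Pepsi."""
    return 3 * p**2 * (1 - p) + p**3

for p in (0.2, 0.3, 0.4):
    print(f"p={p:.1f}: {p_majority_of_3(p):.1%} of tests show a Pepsi majority")
# p=0.2 -> 10.4%, p=0.3 -> 21.6%, p=0.4 -> 35.2%
```

So even if only one Coke lover in five actually prefers Pepsi, roughly one test in ten hands you a “majority,” and the tagline only ever needed two of them.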

Beautiful. An apparently compelling stat that tells you nothing at all.

Michael Symon on “Food Feuds” had a similar experience. People had a stated preference, then picked the opposite diner when tasting blind samples.

As Archie Bunker said: “Ninety-five percent of all needless operations are totally unnecessary.”

That’s not true! 4 of 5 mathematics teachers report that fully 76% of their students are in the top 98th percentile.

And unbelievably, 3 out of every 4 people make up 75% of the population!

According to Wikipedia (they have the cite) “A 1996 survey found that 95% of all American preschoolers had watched [Sesame Street] by the time they were three years old.”

Watched just once? For twenty minutes? Does it count if they saw a clip from the show that was just Big Bird? What if it was on at daycare and they weren’t paying attention? Did they include Amish kids? I like that statistic though; I prefer my kid watching Elmo over Spongebob.