Can anyone comment on the mathematical veracity of the argument in today's "The Best of the Web" column at the WSJ, discussing a NYTimes story from earlier this week:
[…A NYTimes] news story from earlier in the week [is] headlined “Obesity Rate for Young Children Plummets 43% in a Decade.” But if you read closely, you’ll see the headline oversells the good news:
[INDENT]The figures on Tuesday showed a sharp fall in obesity rates among all 2- to 5-year-olds, offering the first clear evidence that America’s youngest children have turned a corner in the obesity epidemic. About 8 percent of 2- to 5-year-olds were obese in 2012, down from 14 percent in 2004.
It’s true that 8% is 43% less than 14%. But percentages of percentages are tricky. The decline amounts to only six percentage points, or 6% of the total population in that age range. And it’s a narrow age range, raising the possibility that the findings are the result of sampling error or short-term demographic trends. (Many children who were 2 to 5 in 2012 were born during a recession, and recessions tend to depress fertility, especially among the less affluent.)[/INDENT]
If you really want to oversell it, couldn’t you say that there were 75% more obese children in 2004 than in 2012?
Let’s say the population was the same at the beginning and the end of the decade, for simplicity. Consider the following:
(Obesity2 / Pop) / (Obesity1 / Pop) - 1 = Obesity2 / Obesity1 - 1 = % drop in obesity.
I may have mangled that a little, but you can see that the figure isn’t total garbage.
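For what it's worth, here's a quick Python sanity check of that arithmetic using the figures quoted above (14% in 2004, 8% in 2012) - the population term cancels out, just like in the formula:

[code]
# Quick check of the column's arithmetic, using the quoted figures
# (14% obese in 2004, 8% in 2012). The population cancels out of the
# relative change, as in the formula above.
obesity_2004 = 0.14
obesity_2012 = 0.08

relative_drop = 1 - obesity_2012 / obesity_2004          # ~0.43 -> "plummets 43%"
absolute_drop_pts = (obesity_2004 - obesity_2012) * 100  # 6 percentage points
oversell = obesity_2004 / obesity_2012 - 1               # 0.75 -> "75% more in 2004"

print(f"relative drop: {relative_drop:.0%}")             # 43%
print(f"absolute drop: {absolute_drop_pts:.0f} percentage points")
print(f"'75% more' framing: {oversell:.0%}")             # 75%
[/code]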
The report by the NYT is highly dubious though. It looks like they have emphasized an outlying datapoint, given that obesity among the age group immediately above it hasn’t changed much. Look even closer and you see that they rely heavily on the most recent data at the end of the decade. Caveat lector.
I just skimmed the actual paper and I get a whiff of “data dredging”. Here’s the relevant passage in the paper for this result:
The statistical test in this case was a bunch of pairwise t-tests, with no adjustment for multiple comparisons as far as I can tell. This is a naive and often problematic approach, especially for post-hoc data analysis.
A P value tells you the probability of seeing a difference at least as large as the one you observed if the two samples actually came from the same population. P = 0.05, the de facto threshold for “statistical significance”, means there is a 1 in 20 chance of seeing a difference that big even when nothing real is going on. So if you make lots of different comparisons at the same time, you should expect that some will be “significant” by chance alone. (Illustrated by this XKCD.) Here, the authors of the paper report nine statistical comparisons, of which only one is “significant”.
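If it helps make that concrete, here's a toy simulation (made-up numbers, nothing to do with the actual CDC data): draw many pairs of samples from the same population and count how often an unadjusted t-test calls them "significantly" different at p < 0.05.

[code]
# Toy illustration of the "1 in 20" point: both samples come from the
# SAME distribution, yet roughly 5% of t-tests come out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_comparisons, n_per_group = 10_000, 200

false_positives = 0
for _ in range(n_comparisons):
    a = rng.normal(loc=0.10, scale=0.30, size=n_per_group)
    b = rng.normal(loc=0.10, scale=0.30, size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(f"flagged 'significant': {false_positives / n_comparisons:.1%}")  # ~5%
[/code]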
It is possible that the authors made a lot of other unreported tests, between the 2011-2012 period and each of the four previous time periods, for each age group, adding up to 36 comparisons in total. If they made all of those comparisons, of course a few are going to be “statistically significant” by chance. This practice is a statistical sin, but it is distressingly common in the medical literature.
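Here's a sketch of what an adjustment could look like, using statsmodels' Bonferroni correction. The p values below are made-up placeholders, not the paper's numbers; the point is just that the per-test bar tightens as the family of comparisons grows.

[code]
# Bonferroni correction over a family of pairwise tests. The p values
# here are hypothetical placeholders, NOT taken from the paper.
from statsmodels.stats.multitest import multipletests

p_values = [0.03, 0.20, 0.45, 0.08, 0.60, 0.35, 0.71, 0.12, 0.50]  # 9 comparisons
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p_raw, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p_raw:.2f}   adjusted p = {p_adj:.2f}   significant: {sig}")

# With 9 comparisons the effective per-test threshold is 0.05/9 ~ 0.0056;
# with 36 comparisons it would be 0.05/36 ~ 0.0014.
[/code]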
Caveats: I am not an actual biostatistician. I am currently trying to analyze a bunch of old datasets where there are no established multiple-comparison post-hoc methods. That means I probably know just enough to be dangerous.
TL;DR: I bet that this result is just a fluke.
Believe it or not, I’m sure I know the xkcd strip you cite :::checks::: Yup.
Thanks to all. It’s amusing that a columnist from Mother Jones and one for the WSJ both picked up on that and called it dicey. And it tells you something about the implacable nature of math.
I can’t seem to get the JAMA reader to work on my iPad (it’s stuck on the last article I read - so your link just sends me to that).
Anyway - it appears you are right - according to one of the comments on a Mother Jones article quoting the original study, they did NOT correct for the multiple comparisons.
That is actually a HUGE drop at first glance - I would have guessed that with the CDC they’d be dealing with a lot of data. But a p of 0.03 isn’t that impressive to begin with, even without the multiple comparisons - you have the same issue on a different scale. Of every 20 experiments you run testing one thing where nothing is really going on, you’ll get a “significant” result about once. Obviously more papers get submitted with significant results, and you end up with claims like this.

Usually the authors throw in some caution somewhere in the paper (they did here as well, if you can believe the quote in the comments section). The media then report it, only sometimes with the caution included (and of course almost never in the headline or first paragraph, which is all someone browsing the news will read).
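Just to put numbers on that: with a simple Bonferroni correction (which may or may not be what the study's statisticians would choose), a p of 0.03 only clears the bar if it was the sole comparison:

[code]
# Back-of-the-envelope: does p = 0.03 survive a Bonferroni correction?
p_reported = 0.03
for n_tests in (1, 9, 36):   # one test, the 9 reported, or all 36 possible
    threshold = 0.05 / n_tests
    print(f"{n_tests:2d} comparisons: threshold = {threshold:.4f}, "
          f"p = 0.03 significant? {p_reported < threshold}")
[/code]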