So, an article in Slate recently described how a group of researchers showed that even studies using honest data and correct, widely accepted methods can produce bad conclusions. They did this by publishing a technically correct paper showing that people who listened to the song “When I’m Sixty-Four” became younger - not just felt younger, but actually were younger after listening than before. The study itself wasn’t linked, but a follow-up essay by the authors was:
So, for someone who isn’t mathematically inclined and is only slightly familiar with statistics and experimental design: how were the authors able to do this? What flaws in the accepted methodology allowed such a nonsensical result?
There were two experiments. In one, Penn students listened to a children’s song and a control song and reported how old they felt. In the second, Penn students listened to the Beatles’ “When I’m Sixty-Four” and a control song and reported their actual age. They felt older after listening to a children’s song and younger when listening to a song about aging.
But the paper says that the two methods were the same, which doesn’t match what’s described. I don’t understand the discrepancy, and it’s a critical one.
That’s not what I get out of reading that paper. What I see is that it says they felt older after listening to the kids’ song (compared to the control), but they - a different group of subjects - were younger after listening to the Beatles song (compared to the control).
IOW - Study 2 is exactly the same as Study 1, except that in place of the question “how old do you feel” they asked the question “how old are you”.
I think the key to the study is the sentence from the Requirement-Compliance report on page 1364 - “We conducted our analyses after every session of approximately 10 participants; we did not decide in advance when to terminate data collection.”
Notice how they have different numbers of participants in each study. IOW, by their own admission they ran each experiment until the point at which natural variation produced the ‘right’ result, and then stopped. But you do need to look carefully at the method to see that - a cursory reading of Experiment 1 would probably lead a lot of people to the conclusion ‘oh, the song changed people’s mood’, although it doesn’t show that at all. But nobody’s going to think the Beatles will actually, IRL, make you younger (though wouldn’t it be nice … ;))
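To see how much that optional stopping actually buys, here’s a rough simulation of the check-after-every-10-participants approach on data where the true effect is exactly zero. The batch size, the cap on total participants, and the plain t-test are my own illustrative choices, not the paper’s actual setup:

```python
# A rough sketch of the optional-stopping trick: two groups drawn from the SAME
# distribution (true effect = zero), tested after every batch of 10 participants,
# stopping as soon as p < .05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def peeking_experiment(batch=10, max_n=200, alpha=0.05):
    """Return True if the 'experiment' ever crosses p < alpha while peeking."""
    control, treatment = [], []
    while len(control) < max_n:
        control.extend(rng.normal(0, 1, batch))      # control song
        treatment.extend(rng.normal(0, 1, batch))    # "When I'm Sixty-Four" (no real effect)
        if ttest_ind(control, treatment).pvalue < alpha:
            return True                              # stop and declare victory
    return False

runs = 2000
false_positives = sum(peeking_experiment() for _ in range(runs))
print(f"'Significant' results despite a true null: {false_positives / runs:.1%}")
# Peeking after every batch pushes the false-positive rate well above the nominal
# 5%; with no cap on the number of peeks it would eventually reach 100% of runs.
```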
Also, they have cued the reader somewhat with their ‘this study showed the expected result’ formulation each time. It’s not immediately obvious to me that ‘feels older after listening to a kids song’ or ‘feels younger after When I’m 64’ ARE the ‘expected’ effects. They could easily have used the exact same verbiage (‘this is totes what we expected to see!’) with any random variation that happened to cross the .05 barrier. So that gives them yet another degree of freedom to ‘prove’ that something causes something else, when it does no such thing. On top of that, they asked for father’s age, mother’s age AND ‘how often they referred to the Good Old Days’ among the ‘unrelated questions’, which gives them another three degrees of freedom to look for random correlations. They’re bound to find something by pure chance here, which is of course the point.
The ‘throw 50 questions against the wall, then go on a correlation hunt’ technique absolutely is used in academia to generate garbage papers, and it really is pretty worthless. The trouble is, people’s career progression is based on how many papers they can write, and writing papers is easy compared to running studies. And even twenty vaguely related questions gives about 190 possible correlation pairs, so you’d expect to find around two ‘1% significance’ and ten ‘5% significance’ correlations just by random chance. That’s twelve papers right there! W00t!!
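Just to put numbers on that correlation hunt, here’s a quick back-of-the-envelope check of the pair count plus an actual hunt through a questionnaire of pure noise; the 20 questions and 100 subjects are made-up figures for illustration:

```python
# Back-of-the-envelope check of the correlation-hunt arithmetic, plus a hunt
# through answers that are nothing but noise.
import numpy as np
from math import comb
from scipy.stats import pearsonr

n_questions, n_subjects = 20, 100
pairs = comb(n_questions, 2)                 # 190 possible correlation pairs
print(f"{pairs} pairs -> expect ~{pairs * 0.01:.0f} at p<.01 "
      f"and ~{pairs * 0.05:.0f} at p<.05 by chance")

rng = np.random.default_rng(1)
answers = rng.normal(size=(n_subjects, n_questions))   # 20 unrelated questions

hits = sum(
    pearsonr(answers[:, i], answers[:, j])[1] < 0.05   # [1] is the p-value
    for i in range(n_questions)
    for j in range(i + 1, n_questions)
)
print(f"'Significant' correlations found in pure noise: {hits}")
```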
They analyzed their data in many different ways until one analysis produced a p-value lower than 0.05.
They stopped collecting data as soon as they obtained a p-value lower than 0.05 (which is also likely to happen sooner or later with random data).
It is very easy to produce false positives at p<0.05. Fortunately, results are not just p-values, and any experienced scientist who scrutinizes the underlying data (e.g. data tables, scatter plots, etc.) will smell out that kind of crappy p-value easily.
I have been doing (and publishing) experiment-based science for a decade; my own threshold for considering a p-value as “significant” is 0.01. That makes it much harder to produce this sort of false positive.
At the end of the day, any scientific result that is worth caring about will be reproduced by many independent teams, and this is where solid results will show.
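Out of curiosity, here’s a rough sketch of how much the stricter threshold helps once a few researcher degrees of freedom are in play - in this case, picking the best of three unrelated outcome measures on null data. The sample size and the three-outcome setup are my own assumptions:

```python
# Three unrelated outcome measures tested on null data, keeping whichever
# p-value comes out best, then comparing the .05 and .01 thresholds.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
runs, n = 5000, 40
best_p = []
for _ in range(runs):
    group = rng.integers(0, 2, n)            # two conditions, no real effect
    outcomes = rng.normal(size=(n, 3))       # e.g. felt age, father's age, mother's age
    ps = [ttest_ind(outcomes[group == 0, k], outcomes[group == 1, k]).pvalue
          for k in range(3)]
    best_p.append(min(ps))                   # report whichever measure 'worked'

best_p = np.array(best_p)
print(f"best p < .05 at least once: {(best_p < 0.05).mean():.1%}")   # well above 5%
print(f"best p < .01 at least once: {(best_p < 0.01).mean():.1%}")   # much rarer
```

With three independent looks you’d expect roughly 1 - 0.95^3 ≈ 14% of null experiments to clear .05 at least once, versus about 3% for .01, which is the gap the stricter threshold buys.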
Really, though, no p-value threshold is significant by itself. If you look at 100 jellybean flavors instead of just 20, you’re still likely to find some spurious correlations. You need to look at the whole methodology to weed out the bad from the good.
The jellybean example is a textbook case for Bonferroni correction - and sure, lowering the threshold to 0.01 won’t replace that (I wrote my post at the same time the xkcd was being posted, so I didn’t have it in mind).
My point was that reasonably correct (i.e. as correct as can be done in real-life conditions) statistical approaches will produce p<0.05 false positives waay too easily, and that the threshold should really be p<0.01.
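For concreteness, here’s the jellybean-style correction hand-rolled: with m comparisons the per-test threshold becomes alpha/m, rather than just dropping everything to a flat .01. The p-values below are entirely made up for illustration:

```python
# Hand-rolled Bonferroni correction for the jellybean setting: with m comparisons,
# each one has to clear alpha/m instead of alpha.
alpha, m = 0.05, 20                           # 20 jellybean flavours
p_values = [0.73, 0.41, 0.021, 0.58, 0.09, 0.87, 0.33, 0.64, 0.12, 0.95,
            0.27, 0.48, 0.71, 0.055, 0.83, 0.39, 0.66, 0.18, 0.92, 0.30]

naive_hits     = [p for p in p_values if p < alpha]      # the 'green jellybean' headline
corrected_hits = [p for p in p_values if p < alpha / m]  # threshold becomes 0.0025

print("uncorrected:", naive_hits)      # [0.021] - looks like a finding
print("Bonferroni: ", corrected_hits)  # []      - nothing survives
# Going from 20 to 100 flavours just tightens the threshold to 0.0005, so running
# more tests doesn't buy more 'significant' flukes.
```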