Is there a name for this effect in scientific publishing?

Essentially, publishing so many randomized controlled trials that, by sheer probability, something incredibly unlikely is almost certain to come out (note how, even with P = 0.95 of a correct result on each test, by the end of the 21 tests the chance of no false positives is still below 50%). The same thing shows up in the infamous 2012 Séralini et al. paper, where, with 20 groups, one was almost guaranteed to be “shockingly” more cancerous (although that’s not the biggest issue with that study by a long shot). Is there a name for this?
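For concreteness, here’s the arithmetic behind that “below 50%” figure, as a quick back-of-the-envelope sketch assuming 21 independent tests each with a 5% false-positive rate:

```python
# Chance of getting through k independent tests with zero false positives,
# when each test has a 5% false-positive rate (alpha = 0.05).
alpha = 0.05
k = 21  # the initial jelly-bean test plus the 20 colour-specific ones

p_no_false_positives = (1 - alpha) ** k
print(p_no_false_positives)  # ~0.34, i.e. below 50%
```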

One name would be data mining.

It’s called a Type I error (false positive). Reviewers do consider this when there are a large number of tests being conducted.

Note that a Type I error can occur even with a single test; it’s just that the more tests there are, the more probable it is that one will occur.

The file drawer effect, perhaps.

I am not sure whether you understand that the point is that the green jellybean experiment is the only one that is actually going to get published, and the fault lies not so much with the scientists as with the journal editors (and journal economics).

Missed edit window:

Now that data can be stored and made available online so much more cheaply than it ever could be on paper, there has been some talk in recent times of publishing journals devoted to negative results. Perhaps some have even started up by now. However, as the page charges for open-access pay-to-publish journals show, even digital publication costs significant money (especially if, as tends to be the case with scientific research reports, you need to have figures and graphs and stuff), so I am not sure how well this can work.

The easiest way around this is a Bonferroni correction. Every paper I’ve seen that does large numbers of tests uses this or something similar to account for the problem.
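For anyone who wants to see it spelled out, the correction itself is trivial to apply. A minimal sketch in Python, with made-up p-values purely for illustration:

```python
# Bonferroni correction: divide the family-wise alpha by the number of tests,
# and only call a result significant if its p-value clears that stricter bar.
alpha_family = 0.05
p_values = [0.003, 0.04, 0.20, 0.01, 0.65]   # hypothetical per-test p-values
m = len(p_values)

alpha_per_test = alpha_family / m            # 0.01 for five tests
significant = [p <= alpha_per_test for p in p_values]
print(significant)                           # [True, False, False, True, False]
```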

The problem is sometimes referred to as the multiple comparisons problem, or even the multiple comparisons fallacy.

It should be mentioned that scientists do take the level of significance into account when evaluating a paper. If you report a highly controversial result, you’ll have a lot more credibility if your results are significant at the .001 level than if you barely crack .05.

Another way of dealing with the same problem is to conduct a meta-analysis, combining the results of many different studies to evaluate the effect of sample size.
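If it helps to picture what that pooling looks like, the usual fixed-effect version is just an inverse-variance weighted average. A rough sketch with invented study results (all the numbers here are hypothetical):

```python
# Fixed-effect (inverse-variance) meta-analysis: pool the per-study effect
# estimates, weighting each study by 1 / SE^2 so the larger, more precise
# studies count for more.
estimates  = [0.30, 0.10, 0.05]   # hypothetical effect estimates from 3 studies
std_errors = [0.20, 0.08, 0.05]   # hypothetical standard errors

weights = [1 / se ** 2 for se in std_errors]
pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5
print(pooled, pooled_se)          # pooled estimate and its standard error
```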

I believe the specific scenario in the comic is called publication bias.

Bingo.

Actually, if the comic is referring to non-scientific publishing, it could be double publication bias.

However, it’s quite possible that the media would seize on the one significant outcome even if the paper itself had included all of them.

I’ve heard it called cherry picking, if (at some point) the only study mentioned is the odd one.

Yes, the problem is called “pharmaceutical research” (which is closely related to studies in substance safety such as whether jelly beans cause acne). :smiley:

I am being somewhat facetious, but not entirely. For example it’s been known for some time that the Kaplan-Meier efficacy of certain drugs can suddenly and mysteriously reverse itself, and no one knows why. These strangenesses are so pronounced in medicine that the Harvard Medical School has set up an entire program to study them.

Only peripherally related to your question, I know, but still… seems related to the phenomenon of interactions being so profoundly complex that it can be difficult to study them systematically or describe them in statistically reliable terms.

Yeah, it’s a type I error, but I mean specifically testing so often to try to get a handful of positive flukes that you can wave around. Possibly maliciously, possibly unintentionally but without paying attention to it.

Oh, I know how to mitigate it, I just want to make sure there isn’t already some term for it before I make up a term for it on my blog and look like a tool. :smiley:

Not quite. There is only one paper about food dye XYZ (assume that’s the green dye…) to publish!

Publication bias is where there are a number of other papers saying food dye XYZ does NOT cause cancer, but only the paper that gives a “positive” gets published, and hence the journals effectively say that “1 out of 1 papers says green colour XYZ does cause cancer.” If you included the non-published papers, it might be that actually 4 out of 5 papers say that it doesn’t.
The slide is not actually showing any error - it may well be true.
How else can a ‘green dye XYZ causes cancer’ study be initiated unless someone found some reason to target that? Of course, the next study may be a meta-study of other research done on food dye XYZ… hopefully that meta-study is not biased due to publication bias…

If you’re doing it deliberately, the best term for it would be “scientific fraud.” The problem is discussed in basic statistics courses, so if you’re doing it unintentionally the best term would be “scientific incompetence.”

This is for the case of a single investigator or team of investigators doing all of the tests (as in the case illustrated in xkcd). If different investigators are doing single tests, and only the ones with positive results publish, then it’s a case of publication bias.

What you’re talking about is the problem of multiple comparisons.

I like to describe it more picturesquely (and if I’m ever called upon to do an intro stats lecture, I will) as “if you throw enough turds at a wall, one of them will stick.” The problem has been known about since forever, and is dealt with in introductory stats classes, so a researcher really should have no excuse for not doing multiple comparisons corrections.

At one level, there are a lot of ways to ‘solve’ the problem: you can simply increase the threshold of statistical significance, or you can constrain the number of hypotheses you test, or some combination of the two. I like to use the Bonferroni correction, which smeghead alludes to: it’s the oldest, simplest, easiest to use, one of the most general and IIRC the most ‘conservative’ procedure. There are lots of other ways to correct for multiple comparisons, depending on the exact design of your experiment (Sidak, Tukey, Dunnett, Scheffe, false discovery rate, Hochberg, and others), and some of them can be more powerful.

At another level, the problem is a thorny one because typically, you ‘protect’ against type I error at the cost of increasing type II error (draw some normal distributions and you will immediately see graphically why this is the case).

Andrew Gelman recently had a blog post where he points out that typical multiple comparisons corrections have another problem: effect size and level of significance are typically not independent (in a t-test, for example, the level of significance is going to depend on effect size, standard error and degrees of freedom), so by increasing the threshold of statistical significance, you’re going to ensure that only very large effect sizes make it through your testing procedure, which is going to distort your estimates about how large the true effects are.
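To make the contrast concrete, here’s a toy sketch (with made-up p-values) putting the Bonferroni cutoff next to the Benjamini-Hochberg false-discovery-rate procedure; notice how the FDR version is less conservative:

```python
# Compare Bonferroni (controls the family-wise error rate) with
# Benjamini-Hochberg (controls the false discovery rate) on the same p-values.
alpha = 0.05
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]   # hypothetical
m = len(p_values)

# Bonferroni: reject only if p <= alpha / m.
bonferroni_reject = [p <= alpha / m for p in p_values]

# Benjamini-Hochberg: sort the p-values, find the largest rank k with
# p_(k) <= (k / m) * alpha, and reject every hypothesis up to that rank.
order = sorted(range(m), key=lambda i: p_values[i])
largest_k = 0
for rank, i in enumerate(order, start=1):
    if p_values[i] <= rank / m * alpha:
        largest_k = rank
bh_reject = [False] * m
for rank, i in enumerate(order, start=1):
    if rank <= largest_k:
        bh_reject[i] = True

print(bonferroni_reject)  # [True, True, False, False, False, False]
print(bh_reject)          # [True, True, True, False, False, False] - less strict
```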

Regardless, you need to do some correction for multiple comparisons (and really, you should address the problem in advance by running experiments with larger sample sizes, well-specified models, and clearly designed hypotheses rather than running fishing expeditions to see what looks interesting). Just my two cents, so take it for what it’s worth, i.e. not much.

This merits emphasis. In other words, if comparisons are being made here, there, and everywhere, finding a significant difference, as has been pointed out, is expected and likely of no import.

On the other hand, if the comparisons to be done were specified in advance (and, yes, limited in number) and formed the basis of some a priori hypotheses or expectations, the result is much more telling.

Such advance specification is especially important when subgroups are being compared. If the subgroups were defined before the study, and if there was a plausible or justifiable reason to form such a subgroup in the first place, comparisons that are then found to reach statistical significance can be very illuminating, unlike those formed post hoc.

I think that it shows the difference between testwise error rate and experimentwise error rate.

In the case of the comic, the testwise error rate is 0.05 (which corresponds to the 95% confidence level). So if there is no real effect, the test will erroneously conclude that there is a significant one 5% of the time.

But doing 20 tests, each with a testwise error rate of 0.05, leads to an experimentwise error rate that is much higher: for k independent tests at level α, it is 1 - (1 - α)^k. For the comic, the 20 tests result in an experimentwise error rate of about 64%. So it is more likely than not that they will erroneously conclude that something is significant.
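A quick sketch of that calculation, assuming the 20 tests are independent:

```python
# Experimentwise (family-wise) error rate for k independent tests,
# each run with a testwise error rate alpha: 1 - (1 - alpha)**k
alpha = 0.05
k = 20

experimentwise = 1 - (1 - alpha) ** k
print(experimentwise)  # ~0.64, matching the figure above
```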