Is Snopes wrong? (Statistical Significance)

If I’m amongst the “you”, I must say: cite? My posts are in response to the statistics of the quoted scenario in the OP. If you think I’ve stated anything incorrect, do point out the incorrect statements. I understand there may be other statistical questions of interest in the original source material for the quoted text, but I have neither looked at nor addressed those in this thread. The OP wasn’t asking about firearms. He was asking about the statistical issues surrounding a short quote. Of course, the fact that he wasn’t focusing on the broader context is relevant to the interpretation of the short quote, which has been duly pointed out.

If in fact the OP is asking “What have the outcomes of the Australian gun buyback program been?” then that’s a different topic altogether, one which you seem to be trying to discuss.

Maybe a slight aside: yeah, I used to work at NIH in Bethesda, MD. A lot of the researchers/postdocs in my lab were MDs with no statistical training, and running multiple t-tests was the go-to statistical analysis for many of them (one person wanted to compare 50 treatments with t-tests in a single experiment).

Most of the people responding to this OP seem more statistically sophisticated than I am, but for any non-posting readers checking out the topic (I love reading SDMB topics that I find interesting but don’t know enough about to contribute to, which covers a vast range), here’s an analogy I once used during a lab meeting when a lab mate insisted there was nothing wrong with using multiple t-tests to compare all treatments.

I said, “Let’s say there is a 5% chance (call this Type I error) I get hit by a car if I run across the Capital Beltway (Beltway = a nearby Washington, DC area driving disaster). If I do that once, yeah, it’s a 5% chance. Now, if I run back and forth across the Beltway several times, do you still think I only have a 5% chance of getting hit by a car?”

Blank looks from several postdocs, but some of the summer volunteers and interns nodded their heads in understanding. So maybe some hope.
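To put rough numbers on the analogy: if each crossing carries an independent 5% risk, the chance of at least one hit compounds quickly with repeated crossings (or repeated t-tests). A minimal Python sketch, using only the illustrative 5% figure from the analogy:

```python
# Familywise Type I error: probability of at least one false positive
# ("getting hit") across k independent tests, each run at alpha = 0.05.
alpha = 0.05
for k in (1, 2, 5, 10, 50):
    familywise = 1 - (1 - alpha) ** k
    print(f"{k:>2} crossings/tests -> P(at least one hit) = {familywise:.1%}")
```

With 50 comparisons, the chance of at least one spurious “significant” result is over 90%.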

Did you follow through on this teaching moment? It obviously suggests an experiment on the Beltway, involving… I don’t know, you probably have some mice around? Or interns? I’m pretty sure this is how the Bonferroni correction was figured out.

Yes, the Bonferroni correction made everything all right. That was their justification for not doing ANOVA with post-hoc multiple-comparison tests.
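For anyone following along, here is the arithmetic behind the correction (a minimal Python sketch; the 50-comparison count is just the example from upthread):

```python
# Bonferroni correction: run each of k comparisons at alpha / k so the
# familywise error rate stays at (no more than) the nominal alpha.
alpha, k = 0.05, 50
per_test_alpha = alpha / k                   # 0.001 per comparison
familywise = 1 - (1 - per_test_alpha) ** k   # back under 0.05
print(per_test_alpha, round(familywise, 3))  # 0.001 0.049
```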

Unfortunately, we could not get human subjects to test the Beltway crossing in real life.

And, context! Rush hour, when the Beltway is a parking-lot crawl, or off hours, when the average travel speed is 60+ mph?

I gotta admit, this experiment opens a whole new can-o-worms with subject consent!

And, more importantly, as you mention–how is the context reported in journals so I can replicate it?

Missed the edit window, but genetically modified mice are worth more than human interns in the context of running a lab.

Well, I just looked up several statistical formulas and none of them had a variable for context.

Really, I’m done here. Thanks, everybody.

Ignorance fought, battle lost.

That is unfortunate, because people were giving you accurate answers and the reasons behind them. It isn’t a simple calculator problem that you can just blindly plug numbers into, for the reasons given above. I used to teach statistics at a very prestigious university, BTW, and I am sure that other people who took the time to answer are similarly well versed in statistical theory. You should listen to people rather than just demand the result of a formula that probably isn’t applicable to the question. No one was ignoring the question that you posed. What we are saying is that there isn’t enough information given to produce a valid answer without it being hopelessly biased.

Dammit! I want some R package I can plug numbers into! What is this about context! A bunch o’ self-satisfied [insert whatever UK insult seems appropriate]!

Why doesn’t this apply to my lottery picks?

And on reflection, given the subject matter, I now realize that you were probably more interested in finding a way to support a presupposition that Scopes was wrong, rather than learning the correct statistical approach to figure out whether they were actually wrong.

Anyway, I had forgotten the right test to apply here, so at least I’ve learnt something thanks to Pasta.

Damn Scopes. So are you saying that the OP is…
( •_•)
( •_•)>⌐■-■
(⌐■_■)
Monkeying around?

For instructive purposes, can anyone provide a couple of contrasting examples of “an-increase-from-7-to-19-out-of-4.5-million” that would obviously need to be handled differently for purposes of determining statistical significance in a way that might be easy for non-experts to understand?

Oops. I got one sixth of the letters wrong… I’m pretty sure that must be [dramatic music] significant.

Well, let me take one more stab at this before I go to bed.

First of all I genuinely appreciate everyone’s input and I apologize for the snark in my last post.

Maybe the problem here is how we are defining “statistical significance”. Let’s say we have our control group and our test group with 100 subjects each. We apply some treatment to the test group and then measure them on some scale, but the numbers we enter are completely made up or we pull them out of a phone book. Whatever. Now we run whatever would be the appropriate statistical test(s) on the data, and surprise, surprise, there is a difference between the control and test group that is “statistically significant”, i.e. the p-value is less than the significance level.

So we have here a completely bogus experiment, but is the data statistically significant or not?
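A side note on the arithmetic of that scenario: if the made-up numbers are effectively random noise, a single comparison will clear p < 0.05 roughly 5% of the time, so a “significant” bogus result is entirely possible by chance alone. A minimal simulation sketch in Python (the group sizes and the t-test are just the illustrative choices from the example above):

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulate many "experiments" where control and test data are pure noise
# (no real treatment effect) and count how often p < 0.05 anyway.
rng = np.random.default_rng(42)
n_experiments, false_positives = 10_000, 0
for _ in range(n_experiments):
    control = rng.normal(size=100)   # 100 made-up control measurements
    treated = rng.normal(size=100)   # 100 made-up "treated" measurements
    if ttest_ind(control, treated).pvalue < 0.05:
        false_positives += 1
print(false_positives / n_experiments)  # ~0.05
```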

Of course that’s a pretty extreme example, but subtle errors can sneak into even the best studies. What if everything was legit but there was a problem in the way the subjects were selected? Is statistical significance to you dependent on proper methodology?

My coins case was meant to do that, but in retrospect it was too convoluted to be obvious. A more obviously different scenario would be, say, the number of food-borne *E. coli* cases in a year. If one year saw 7 people with *E. coli* infection and the next year saw 19, is this a significant change? What if the FDA had relaxed their hand-washing policy in between these two time periods? Is the rise in cases significant?

A realistic scenario for the above numbers could be that the 7 cases in the first year all came from a single salad bar that got contaminated one afternoon and that the 19 cases all came from the same salad bar contaminated one evening (during the dinner rush) in the subsequent year. The actual number of sanitation failures was the same both years – one failure each. If all the info you had was the number of infections (7 vs. 19) you might erroneously claim that the FDA’s new hand washing policy led to more infections.

This would be an example of the clumping mentioned by Riemann upthread. More generally, the individual events counted (here, *E. coli* infections) may not be independent.
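For contrast, here is roughly what the naive calculation looks like if you do assume every infection is independent: condition on the total number of cases (7 + 19 = 26) and ask how surprising a 19-to-7 split would be if both years shared the same underlying rate. A sketch in Python (this is one standard way to compare two Poisson counts, not necessarily what anyone upthread used):

```python
from scipy.stats import binomtest

# Under the independence assumption, and conditioning on the 26 total cases,
# the year-two count should look like Binomial(n=26, p=0.5) if nothing changed.
result = binomtest(19, n=26, p=0.5, alternative='two-sided')
print(result.pvalue)  # ~0.03 -- but the salad-bar clumping breaks the assumption
```

If the real unit of failure is the contaminated salad bar, the relevant comparison is 1 sanitation failure versus 1, and the apparent “significance” evaporates.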

For a different example with a different statistical “feature”, consider the following hypothetical observation: “We found that in 2014, 7 New York state employees were dismissed due to harassment charges while 19 were dismissed due to harassment charges in 2015.” Is that a significant rise? It depends on the context. If all 50 states were examined and this was one of the larger changes (i.e., low probability of occurring by chance), then it is definitely not significant, since there were 50 opportunities (50 states) where a statistically extreme scenario could be found, and one expects to find a 2%-ish situation if you look at 50 different samples. But if in fact the entire study was limited to New York state to start with and was set in a context specific to that state, then the increase is more significant.

But then to go deeper: if the claimants had looked at NY state employees as one class, NY educational employees as another, NY small business employees as yet another – so all still in NY – and then used the least likely (by chance) increase as their exemplar, the significance changes yet again. At no point is it strictly incorrect to report the number of dismissals increasing or what the probability of it happening is under, say, a model where no underlying change occurred. But inferring how significant the change is depends on knowing how the data were obtained, processed, and reported. This is one of the most common ways folks “lie” with statistics (sometimes accidentally, sometimes maliciously).
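The 50-states point is just the compounding arithmetic again: a fluctuation with a 2% chance of occurring in any one state is more likely than not to occur in at least one of 50. A minimal Python sketch (the 2% figure is the rough value from the example above):

```python
# Look-elsewhere effect: scan 50 states and report the most extreme one,
# and a "2%-level" fluctuation probably shows up somewhere by chance.
p_single = 0.02                         # per-state chance of so extreme a swing
p_somewhere = 1 - (1 - p_single) ** 50  # chance at least one state shows it
print(round(p_somewhere, 2))            # ~0.64
```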

As a point of emphasis: it is possible to quantitatively make statements of statistical significance in all of the cases above. But the math and the answers are different in each case, and it is only in very narrowly defined situations that an out-of-the-book formula yields the answer. Such formulas do still make frequent appearances, just usually as one piece of a larger, context-specific calculation.

Maybe? Your example is a good one. Let’s say the experimenters really did want to choose random answers, and they really did decide to use the phone book, and they observed a p-value of 3%. You could decide as a linguistic convention that a p-value less than 5% shall be labeled “statistically significant”. Under this convention, the above phone-book measurement would be statistically significant. But this semantic view of it misses the goal, I’d say. If I saw the above situation (“Random phone book draws lead to p-value of 3%”), a few questions I’d immediately ask are:

  • Were there 30 other research groups doing the same thing, and this study got published because they happened to be the one with a low p-value?

  • How were the phone book pages picked? Was one set of numbers drawn from the “X” pages, over-sampling specific telephone exchanges for neighborhoods with large Chinese populations?

  • Was the data point selection process alternated in any way that should wash out potential biases?

It is linguistically handy to have phrases like “statistically significant”, “definitive observation of…”, “evidence of…”, “hint of…”, etc. to imply certain levels of confidence in a claim, but the numerical p-value assumed for each possible phrase varies with the research community and with time.

There is also an art (for lack of a better word) to interpreting claims or, on the other side, designing experiments or research studies, that goes well beyond the hard numbers. Here be the dragons of “systematic uncertainties”. These do not play by the mathematically hard rules of stochastic uncertainties – random errors – which can often be well modeled. The “art” part is looking at an experimental design, understanding the issues of data collection specific to that experiment, and deciding how robust the experiment is to systematic uncertainties that may have been overlooked or whose impact may have been poorly estimated.

For the phone book case above, the phone numbers could have been chosen any number of ways, but if I knew that each drawn number was put into sample A versus sample B based on the outcome of a coin flip, I would no longer be concerned about the procedure for drawing the numbers from the phone book itself, since any initial bias would be erased by the coin flip (which I would feel comfortable trusting to be completely independent of the number drawn). Did the researchers do this or not? I would want to know, especially if an extraordinary claim was being made.
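To make the coin-flip point concrete, here is a toy simulation sketch in Python: the draw from the “phone book” is deliberately biased, but because each drawn number is randomized into sample A or B, the A-versus-B comparison stays unbiased. (The particular bias and numbers here are invented purely for illustration.)

```python
import random
import statistics

random.seed(0)

# A deliberately biased "phone book" draw: every number ends in 8 or 9.
biased_draw = [random.randrange(100_000, 1_000_000) * 10 + random.choice((8, 9))
               for _ in range(2000)]

# Coin-flip each drawn number into sample A or sample B.
sample_a, sample_b = [], []
for number in biased_draw:
    (sample_a if random.random() < 0.5 else sample_b).append(number)

# The draw itself is biased, but the A-vs-B *difference* is not:
print(round(statistics.mean(sample_a)), round(statistics.mean(sample_b)))
```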

So… this is why I stopped short of saying whether my 2.9% number upthread was statistically significant or not. The number is what the number is, calculated under a specific stated model, but we haven’t established a semantic convention for “statistically significant”, there is a broader context that may be relevant, and some artful inspection of the methods may be needed to make any actionable inference from the number.

Pasta has taken the trouble to put up some great examples and guidance.

El Zagna, let me just add a couple of things that may correct a misapprehension of yours.

When we were discussing “context” before, it was entirely oriented toward figuring out “whatever would be the appropriate statistical test”. Which, as in your own description here, precedes establishing statistical significance. See Pasta’s examples for the kind of things that we were trying to flesh out.

The context we were considering did not extend to “did they just fabricate the data”. And we were certainly not trying to make this into a discussion of potential bad-faith political motives for doing so. That is certainly something best kept to a separate discussion.

It seems to me that a basic flaw in your understanding of statistics is that you think there’s one mathematical formula that always applies, and then it’s just a binary decision about whether to tear up the result because there was dishonesty or incompetence. No, statistics is all about thinking through the kind of data you have, and how it was collected; and thus figuring out the best model and the appropriate test to use.

Thanks to the experts in this thread! This is fascinating, and I’m learning quite a bit.

Excellent point! Congratulations, you just grasped a concept which I found very difficult to impart to my students when I taught Probability and Statistics: All the math in the world won’t mean anything if your experiment is flawed. Bad data gives you bad results. Garbage in, garbage out.

When you read survey results in a magazine, they say something like “37% Yes, 52% No, 11% Undecided (margin of error +/- 4%)”. That +/- 4% at the end only addresses the random sampling error, i.e., the uncertainty that comes from polling a limited number of people rather than everyone. It says nothing at all about how biased the sample was, how poorly worded the questions were, how much the pollster’s tone of voice could have influenced the outcome, et cetera, et cetera. The number of possible flaws in how you get your data vastly outweighs the possible bad luck of random fluctuations.
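For reference, the “+/- 4%” is (roughly) the 95% sampling margin of error, and it is driven almost entirely by the sample size. A quick Python sketch (the sample sizes below are just illustrative assumptions):

```python
import math

# 95% margin of error for a reported proportion, assuming simple random
# sampling; p = 0.5 gives the widest (most conservative) interval.
p = 0.5
for n in (400, 600, 1000, 2400):
    moe = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"n = {n:>4}: margin of error ~ +/-{moe:.1%}")
```

Notice that nothing in that formula knows anything about question wording, interviewer effects, or who actually answered the phone, which is exactly the point above.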

Getting back to Australia… discussing whether the difference between 7 and 19 can be attributed to random chance is one thing. But it’s a whole other discussion to talk about whether 7 and 19 are accurate numbers to begin with. Here’s an example. Suppose that you send out two researchers into two different territories. One of them was given instructions to talk to the local police and find out how many gun deaths happened in 2008. The other was given instructions to talk to the local newspapers and find out how many people got shot in 2009. Suppose one answer comes back 7 and the other comes back 19. What would that prove? Absolutely nothing, because your methods have multiple flaws.

Well, it looks like we have been using different definitions for statistical significance. No wonder we were talking past each other. I was using something like this: (from Wikipedia) “In statistical hypothesis testing, statistical significance (or a statistically significant result) is attained when a p-value is less than the significance level (denoted α, alpha).”

It would be helpful to me if you (Pasta or Riemann or others) could give an explicit definition of what you mean by statistical significance; however, I understand that it involves more than just p-values and takes into account the methodology and other aspects of the study.

So with that, let’s take a look at the second example that I gave earlier - a perfectly legitimate study that had a flaw in its sampling methodology. Furthermore, let’s say that the study was published in a peer-reviewed journal and generally accepted, but years later the error was discovered. How would one go about talking about that study after the error was discovered? Would you say that the results were not statistically significant? Or would you say something like, “the study reported statistically significant differences, but later the methodology was found to be flawed”? Or would you use some other language?