Is Snopes wrong? (Statistical Significance)

No, that’s not the problem. I agree with the Wikipedia definition. Assumed in that definition is that you know what the hell you are doing, and you choose a valid model and an appropriate statistical test. That requires knowledge of context, not just plugging numbers into a fixed formula. That has been explained to you numerous times, now with copious examples.

This really doesn’t just come down to the semantics of how you might describe a situation where you proclaim significance and then discover an error much later.

I’ve tried to come a little way to help you understand this; Pasta has come a long way. But I’m not going to concede that you were right in some way and we were just “talking past each other”! It’s pretty clear that there are a number of people with expertise in statistics on this thread, and you’re not one of them. You’ve already been a little rude, and we’ve shrugged that off and continued to be patient and Pasta has put in a lot of effort above. Frankly, a little humility would be in order.

I think we agree on the strict definition of the term “statistical significance”. I’m happy with your Wikipedia quote. But notice: before you can decide if something is “statistically significant” by your definition, you must first choose an (arbitrary) α and you must also correctly calculate the p-value. Most of the discussion has been about these two inputs. In particular, calculating the p-value has produced the most ink (pixels?) in the thread. The methodology and model enter critically in calculating this value. Then, how small the p-value needs to be to get you out of bed (i.e., what α you’d like to use) depends on the more artful things. After all, there’s nothing special about 5% or 1% or any percent. But once you’ve correctly calculated the p-value and once you’ve decided on your threshold α (which is what all the discussion has been about), you can immediately determine if a result is “statistically significant”.

I’ve left the term in quotes in that last sentence to emphasize that it’s just a shorthand phrase for “My correctly calculated p-value of x% is less than my arbitrarily chosen threshold of y%.” Nothing of importance happens when that condition is met except that you can now get away with using the phrase “statistically significant” without raising eyebrows.
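To make that concrete, here’s a minimal Python sketch with made-up numbers (and assuming a reasonably recent SciPy, which provides binomtest). The two inputs are exactly the ones above: a p-value you have to calculate correctly, and a threshold α you picked in advance.

```python
from scipy.stats import binomtest

# Hypothetical example: 62 "hits" out of 100 "attempts", tested against a 50/50 null.
alpha = 0.05                          # arbitrary threshold, chosen in advance
result = binomtest(62, n=100, p=0.5)  # all the modelling/methodology work lives in this step
p_value = result.pvalue

# "Statistically significant" is just shorthand for the comparison below.
print(f"p = {p_value:.4f}; significant at alpha = {alpha}: {p_value < alpha}")
```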

(I’ve split my reply in two to separate out the more important previous one.)

I wouldn’t discuss the result in terms of statistical significance at all. It would no longer fit into the neat box that the shorthand phrase “statistically significant” provides. I have more to post about this sub-topic, but I’m going to hold off until I’m sure we’re on the same page regarding my previous post. If anything in that post doesn’t jibe with you, we should address that disconnect first.

First of all let me apologize if I have come across as a bit rude. This has been a frustrating conversation for me as well as others, but I appreciate everyone’s input.

It might be helpful if I give a little background information about myself. About 40 years ago I took two statistics classes at the University of Texas and did perfectly well in each. Later I wrote a Master’s thesis (in an unrelated discipline) that examined factors that might contribute to the professional success of certain students. The thesis was based on a mail survey, the results of which I dutifully punched into IBM cards and ran through SPSS. In fact I still have the SPSS manual with a couple of punched cards right here next to me. The thesis was basically a bunch of crosstabs with the results of Pearson’s correlations and ANOVA. During my time as a grad student I worked at a state agency that did victimization surveys, and I did more SPSS type work there. All that was 40 years ago, and most of the technical stuff such as which test to use for which data has faded in my memory.

So with that said, I don’t claim to have any real expertise in this stuff, but I do have a basic foundational knowledge, and I fully appreciate the importance of good methodology and proper analysis. Back then the term statistical significance had the very limited meaning that I have been using here. (At least that’s how I used it, and perhaps I only thought that’s how everybody else was using it.) So we might say something like, “Sure the results from Pearson’s correlation are statistically significant but the data is bullshit.”

From what you are saying, if I understand you correctly, you wouldn’t feel comfortable with that statement at all because proper methodology is critical to achieving statistical significance. Are we on the same page?

You seem to be fixated on a binary notion that data are either “good” or “bullshit”.

Assuming “good” data (not fabricated, clear methodology), the main skill of a statistician is in finding the best model and the appropriate statistical test for the data. If you have good data, but use the wrong model or the wrong test, then your result is wrong.

In fact, this thread’s response to your OP is a pretty good example of this process, because people were “thinking aloud” to some degree. If you read back, I suggested that Poisson was probably an ok model to start with, with caveats. But I’m a bit rusty, and frankly, I couldn’t remember the right test to use. A couple of people suggested some tests, then eventually, Pasta hit upon the right one. We then started trying to break the model, as any good scientist should do in any field. Are the assumptions of the model met? If not, in which direction will the error be? Were the data sampled in a manner consistent with the assumptions of the test? Etc. etc. So this thread was a decent example of the general process. Homing in on the best model and the most appropriate test (and the pitfalls) comes with experience. (I’m experienced but rusty, Pasta is obviously more current.)
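If it helps to see the “try to break the model” step in code rather than prose, here’s a small simulation sketch (all numbers made up purely for illustration): incidents arrive in Poisson fashion, but each incident kills more than one person, and a victim-level test that assumes independent homicides rejects a true null far more often than its nominal 5%.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n_sims = 100_000
incident_rate = 10        # assumed mean incidents per year, identical in both years (null is true)
victims_per_incident = 2  # assumed cluster size, purely for illustration

# Victim counts for two years when homicides arrive in clusters of two.
year1 = rng.poisson(incident_rate, n_sims) * victims_per_incident
year2 = rng.poisson(incident_rate, n_sims) * victims_per_incident

# Naive victim-level test: condition on the total and treat the split as Binomial(total, 0.5),
# i.e. pretend every individual homicide is an independent Poisson event.
total = year1 + year2
p_naive = np.minimum(1.0, 2 * binom.cdf(np.minimum(year1, year2), total, 0.5))

# The two years have identical underlying rates, but the clustering inflates the variance
# relative to what the victim-level model assumes, so "significant" results come up too often.
print("false-positive rate at alpha = 0.05:", (p_naive < 0.05).mean())  # well above 0.05
```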

The other thing you seem fixated on is the semantics of a situation where data are later found to be “bad” in some way - fabricated, or the study did not follow the claimed methodology. Was this “statistical significance but bad data”? Was it “never significant in the first place”? I dunno, it’s semantics, it’s just not an interesting question imo. But I think there are real issues here about what’s involved in doing valid statistical analysis, assuming good data, that you need to get to grips with.

This is what I love about the SDMB. I have two graduate semesters in stats, but I would not consider myself an expert. It’s great to read people who are.

I was hoping to get to a solid footing on the issues in post #62 first. I touch on the “bad data” issue in post #63, but it’s hard to know we’re using the same language on oddball bad data cases unless we agree about the language in good data cases. (Riemann’s last post says basically the same thing, which is a pattern in this thread!)

As a start, we should rewind to post #7 to see where things stand.

In that post, a claim was made that one can “Forget that it’s about homicides. In fact you could restate the question <in terms of generic ‘hits’ out of generic ‘attempts’>…” Given the discussion since then, would you agree that that claim is incorrect? If you think that claim is still valid, re-read the first two paragraphs of post #55 to see a counter-example, and then return here. If you still think the claim is valid, we should clarify that first.

Notice that the above question has nothing to do with bad data. All the data is assumed to be squeaky clean so far here.

(As before, I’m splitting my reply in two in the hopes that you’ll focus on the previous post first. This reply is about the “bad data” semantic issue itself, which is unrelated to pretty much everything that everyone has been discussing in the thread so far, so please don’t conflate the topics. If you are tempted to conflate them, just ignore this post entirely and instead address the previous one only.)

That sentence is odd given that you know the data are bullshit. I don’t see why anyone would construct that sentence in a real situation. You can run some program that prints out “Final p-value = 2.3%” but that’s just a computer doing what it was told. Sure, you could choose to apply the term “statistically significant” to any situation where a computer program prints that series of characters, but that would just be a semantic contrivance with no real value in communicating research findings. And it’s not what most people do now, nor what they did back then. You need to at least suspect that the results are meaningful before there’s any reason to crowbar statistical language onto them.

Again, this is all about the narrow semantic issue related to having known bad data. The more important issues in the thread center on how to apply the term in cases where all the data are good.

I apologize if my posts in this thread seemed condescending. That was not my intent.

To crystallize the differences we are talking about here: if we had 7 individual homicides in 1996 vs 19 individual homicides in 1997, then it might make sense to do the analysis initially presented in post 29 (ignoring the issues of cherry picking among the different states). However, if the 1996 data came from 5 shootings in which 1 person was killed and one shooting in which 2 people were killed, while the 1997 data came from 10 shootings in which 1 person was killed, 2 shootings in which 2 people were killed, and one shooting in which 5 people were killed, then it would be more appropriate to compare the number of incidents, so the comparison would be 6 versus 13, which would give a p-value of 0.167.
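For what it’s worth, that 0.167 is easy to reproduce: conditioning on the 19 total incidents, the 6-versus-13 split is just an exact two-sided binomial test against a 50/50 null. A minimal sketch, assuming a reasonably recent SciPy:

```python
from scipy.stats import binomtest

# Incident-level comparison from the example above: 6 incidents in 1996 vs 13 in 1997.
# Condition on the 19 total incidents and ask whether the split is consistent with
# equal underlying rates, i.e. Binomial(19, 0.5).
print(f"incidents: p = {binomtest(6, n=19, p=0.5).pvalue:.3f}")   # ~0.167

# The victim-level comparison treats all 26 homicides as independent events instead.
print(f"homicides: p = {binomtest(7, n=26, p=0.5).pvalue:.3f}")   # ~0.029
```

Same underlying events, but a different choice of the unit of analysis, and the result lands on opposite sides of the conventional 0.05 line.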
It is important to remember that a low p-value is really just telling you that the data don’t fit the model of your null hypothesis. This could be because your alternative hypothesis is correct, or it could be because neither model is correct.