First, let me apologize for taking so long between posts. Since the thread isn’t titled “Statistical analysis of police shootings,” I forgot to check it. Second, I’m mostly just reading Damuri Ajashi’s comments on my posts, so if I missed something that has already been covered earlier in the thread, I further apologize.
No, what I am saying is that they did a subset analysis that produced a result so significant that it didn’t matter whether they data dredged or not. Perhaps an example of what is or is not data dredging would be useful.
Suppose I was doing a study of lead levels in children and I reported that “Dreadville, California has a lead level in its children that is 2 times the national average, with a p-value of 10^-5 (one chance in 100,000 that this could have happened by chance).” On its own, that would seem to indicate that there was something wrong with Dreadville. But if it was later pointed out that Dreadville was the top city of 70,000 cities that I had looked at, then it becomes much less interesting. I rolled the dice 70,000 times, and one time I got a very high result. Under pure chance the expected number of cities this extreme is 0.7, so a scan like that turns up at least one roughly half the time. Dreadville just happened to be the (un)lucky one. This is an example of data dredging.
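To make the Dreadville scenario concrete, here is a quick simulation sketch. Under the null hypothesis every city’s p-value is uniform on (0, 1); we scan 70,000 of them, keep the single best, and see how often it clears 10^-5 purely by chance. (The 70,000 and 10^-5 figures are the ones from the example above; everything else is illustrative.)

```python
import random

random.seed(1)
N_CITIES = 70_000   # cities scanned, as in the example above
N_SIMS = 100        # number of simulated nationwide scans

# Under the null, each city's p-value is Uniform(0, 1). For each simulated
# scan, find the single most extreme city and check whether it beats 1e-5.
hits = 0
for _ in range(N_SIMS):
    best = min(random.random() for _ in range(N_CITIES))
    if best < 1e-5:
        hits += 1

print(f"fraction of scans whose top city looks 'significant': {hits / N_SIMS:.2f}")
# Analytically: 1 - (1 - 1e-5)**70_000, about 0.50
```

So a “one in 100,000” city falls out of a scan this size about half the time even when nothing is wrong anywhere.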
But suppose instead I reported “Flint, Michigan has a lead level 20 times the national average, with a p-value of 10^-16.” Then we have a different story: even if I rolled the dice 70,000 times, there is no way I could get a result this extreme purely by chance. There must be something different going on in Flint that makes its lead levels so high. Note that this result applies only to Flint and says nothing in particular about the rest of the Detroit metro area. In fact, it might be that Grosse Pointe Shores has a level of lead poisoning significantly lower than the national average, with a very significant p-value. That is a different headline, and it in no way disputes what was found in Flint.
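The contrast between the two cities is just arithmetic: a worst-case (Bonferroni-style) adjustment multiplies each raw p-value by the number of comparisons scanned, capping at 1. Dreadville dies under that adjustment; Flint survives easily.

```python
# Worst-case multiple-comparisons adjustment: multiply the raw p-value by
# the number of cities scanned (70,000 in the example), capping at 1.
n_tests = 70_000

for city, raw_p in [("Dreadville", 1e-5), ("Flint", 1e-16)]:
    adjusted = min(1.0, raw_p * n_tests)
    print(f"{city}: raw p = {raw_p:.0e}, adjusted p = {adjusted:.1e}")
```

Dreadville’s adjusted p-value is 0.7 (completely unremarkable), while Flint’s is about 7 × 10^-12 (still overwhelming evidence).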
The ProPublica analysis fits more in this latter category, with the added proviso that, given the strong notion in society that young black men are thugs, there is a compelling reason to concentrate on this group.
Incidentally, data dredging and cherry picking are my bread and butter. I analyze genetic data, which involves looking at tens of thousands of genes to find the ones that are likely to be important. Those at the top of the list always look great, and it’s hard to convince the biologists (who can always make a compelling story after the fact as to why a gene makes perfect sense) that these are just random noise. But if there is something in the data, it will come out: a few genuinely strong results will stand apart from the tens of thousands of garbage ones. The key is being able to tell the difference.
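For those curious how we tell the difference in practice: one standard tool in genomics is the Benjamini–Hochberg false discovery rate procedure, which lets the few genuinely strong p-values survive a scan of thousands while discarding the merely lucky ones. A minimal sketch (the p-values below are made up for illustration):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of discoveries under the Benjamini-Hochberg FDR procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha; every test ranked
    # at or below that rank is declared a discovery.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])

# Two overwhelming results and one decent one survive; the borderline
# p = 0.04 (which would pass a naive 0.05 test) does not.
ps = [1e-8, 2e-7, 0.01, 0.2, 0.5, 0.9, 0.04, 0.7]
print(benjamini_hochberg(ps))  # -> [0, 1, 2]
```

The point is exactly the one above: strength relative to the number of tests, not raw “p < .05”, is what separates signal from dredge.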
Sorry, I was trying to combine my explanation with a bit of a statistics tutorial for those who were interested and knew something of the subject. Long story short: in the worst case there could be around 3,000 possible cuts of the data (3,160 rolls of the dice), so we could multiply our p-value by 3,160. However, many of these cuts would produce subsets too small to ever yield a significant result, and many of those that remain are highly correlated, since they include many of the same shootings. So the actual amount by which the p-value should be adjusted is probably much less than 3,160.
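To illustrate the “too small to ever be significant” point: for a subset of n shootings, the most extreme outcome possible (all black or all white victims) puts a floor on the one-sided binomial p-value, so a small subset simply cannot beat the Bonferroni cutoff no matter how it comes out. A sketch, using for illustration a null black share of 1/4.54 (the 1-to-3.54 black:white figure used elsewhere in this discussion); the subset sizes are hypothetical:

```python
# Floor on the one-sided binomial p-value a subset of size n can produce:
# the all-black or all-white outcome, whichever is rarer under the null.
q = 1 / 4.54               # illustrative null black share
threshold = 0.05 / 3_160   # worst-case Bonferroni cutoff from above

for n in (5, 10, 20, 40):  # hypothetical subset sizes
    best_p = min(q ** n, (1 - q) ** n)
    print(f"n={n:3d}: best attainable p = {best_p:.2e}, "
          f"could ever reach significance: {best_p < threshold}")
```

Subsets of a handful of shootings can never produce a false positive at the adjusted threshold, so they shouldn’t count toward the 3,160 multiplier, which is one reason the honest adjustment is much smaller.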
Sorry, I must have missed it in skimming. Given that the ratio of blacks to whites is 1 to 3.54, we would only expect about 2 black victims, so finding 0 is low but not out of the realm of probable chance. Even without any adjustment for the multiple comparisons you admit to, we get a p-value of .175, or a little more than one chance in 6. The 95% confidence interval for the ratio of blacks to whites being shot is (0-1.44) on this data, meaning that if this were the only data you looked at, you could be pretty sure that the ratio of blacks to whites shot in this age group was less than 1.44 to 1. But since this subset was cherry picked, the results may be biased.
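For anyone who wants to check the p-value: with a null black share of 1/4.54, the one-sided p-value for “0 black victims out of n shootings” is just (1 − q)^n. The post doesn’t state the subset size, but n = 7 happens to reproduce the quoted .175, so that is what the sketch below uses; treat n as an assumption, not a fact from the data.

```python
# Exact one-sided computations for "0 black victims out of n shootings",
# with null black:white ratio 1:3.54, i.e. null black share q = 1/4.54.
q = 1 / 4.54

def p_zero(n, q=q):
    """P(0 black victims in n shootings) under the null share q."""
    return (1 - q) ** n

def upper_share(n, alpha=0.05):
    """Exact (Clopper-Pearson style) one-sided 95% upper bound on the black
    share when 0 of n victims are black: the largest share still consistent
    with seeing zero, i.e. the q solving (1 - q)**n = alpha."""
    return 1 - alpha ** (1 / n)

# n = 7 is hypothetical -- chosen because it reproduces the quoted p-value.
print(p_zero(7))       # about 0.175, matching the post
print(upper_share(7))  # one-sided upper bound on the black share of victims
```

The confidence-interval endpoint quoted above would follow from converting this kind of upper bound on the share into a black-to-white ratio.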
It’s possible, but I see no reason that it is necessarily so, and even if it were, it’s not enough to fully explain their results.
Well, I guess I should say that after the two studies, the calculation didn’t need to be made. It was fine to throw numbers around and stab in the dark before we had any real facts.
[QUOTE]
Yes, there is a difference if you ignore all other variables. Hell, you don’t need to cherry pick the data to see statistically significant differences. The confidence interval is even better when you look at ALL the data. Peer reviewed studies saw these differences, and they say that the difference is basically illusory. When comparing like to like, there is no difference between blacks and whites.
Do you dismiss those studies in favor of the pro-publica bullshit as well? Or even put the Pro-Publica factoid on the same level as those peer reviewed studies?
And do you really consider what Pro-Publica did to be a “study”?
[/QUOTE]
I would say that the ProPublica piece was a subset analysis of another study, demonstrating that there was a subset of the data, namely shootings of those between the ages of 14 and 19, that had a much higher discrepancy between blacks and whites than was present in the data as a whole.