10-18-2018, 09:55 PM
Damuri Ajashi
Originally Posted by Buck Godot
Ask and the Dope will provide.

There is nothing inherently bad about taking a small subset of, say, 40 samples out of a larger set of 1200, if that smaller subset is the data of interest. All that you have to do is treat it as though it were a study of 40 samples rather than a study of 1200. I understand your concern about data mining. When looking at the data you do need to account for the number of different subsets you looked at.

Since I haven't seen the data I can't be certain how it was organized and what different analyses they considered before settling on this one. Age is an obvious divisor to look at and it is possible that the sole hypothesis going in was to look specifically at the shootings of black youth (14-19 years) since anecdotally this is the group that appears to get the worst rap in the media.

If that is the case then there is no need for p-value adjustment.
It's broken down granularly by exact age at death. We should get our hands on the raw numbers, but based on a more granular analysis by realclearpolicy, it appears that statistics are kept much more granularly than 14-19.

Otherwise you would have to consider whether they analyzed a number of different subgroups and only focused on this group when they found that the other groups weren't significant. For example, they could have fully data mined by looking at every possible lower cut-point and every possible upper cut-point until they found the one that gave the best results. I doubt they did this for two reasons. First, a division into the subgroups under 14 (pre-teen), 14-19 (teen), and 19+ (adult), with possibly a few other older subgroups, seems natural rather than data derived. Second, if they did this I would expect them to report 19 and under, since the under-14 group also showed significant bias and including them would probably improve their statistics.
There were only 2 murders in the under-14 age group in those three years: one black kid and one Hispanic kid (according to the footnotes in the ProPublica article). The numbers in the article go back to 1980.

What I actually think is most likely is that the data they got from the FBI was already divided into age groups that they analyzed directly. Worst case scenario, you should multiply any p-value you come up with by the number of different groups they looked at (although you could then divide by 2 to create a false discovery rate to account for the fact that both the under-14 and 14-19 groups appeared significant).
And what if there are actually 80 age groups from 0-80?

Would it be fair for me to point out that cops kill old white men age 74-79 INFINITELY more frequently than black men age 74-79? Cops killed 6 white men in that age group and no black men. What is the statistical significance of that?

There could also be some concern about the independence of the shootings. Correlated data will act to increase the variance of any estimates, although it will not change the point estimates directly. I suspect, however, that most of the shootings are independent, given that I haven't heard any reports along the lines of "Police slay 3 black teenagers in mass shootout", which would make national headlines.
Yeah, I don't think that's a concern unless there are gangland type shootouts.

So, following Damuri Ajashi's back-of-the-envelope calculation, the black vs. white relative risk is 31.17/1.47 = 21.2. According to the article, if blacks and whites were killed at equal rates, then there would be 185 additional deaths, implying that N*21.2 = N+185, so the number of white deaths was about 9. The ratio of whites to blacks aged 15-24 in the population is about 33.3/7.32 = 3.54, and so the number of black youths shot in the study was about 9*21.2/3.54 = 54.
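The arithmetic in that paragraph can be retraced with a short Python sketch. The input figures (31.17, 1.47, 185 and the 3.54 population ratio) are taken as quoted in the post and are not re-derived here; the variable names are only illustrative.

```python
# Retrace the back-of-the-envelope estimate of the death counts.
# All inputs are the figures quoted in the post, taken as given.
black_rate = 31.17         # quoted rate for black youths
white_rate = 1.47          # quoted rate for white youths
additional_deaths = 185    # extra deaths quoted from the article
population_ratio = 3.54    # population ratio as stated in the post

relative_risk = black_rate / white_rate

# N * RR = N + 185  =>  N = 185 / (RR - 1)
white_deaths = additional_deaths / (relative_risk - 1)

# Scale by the relative risk, then by the population ratio.
black_deaths = white_deaths * relative_risk / population_ratio

print(f"relative risk ~ {relative_risk:.1f}")
print(f"white deaths  ~ {white_deaths:.1f}")
print(f"black deaths  ~ {black_deaths:.1f}")
```

Carrying the unrounded values through gives a figure within about one death of the 54 in the post, which rounds the white count to 9 first.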
Ah OK I missed the fact that the 185 deaths were ADDITIONAL deaths.

So the log odds ratio is equal to log(21.2)=3.05
The standard error of this is approximately sqrt(1/8+1/54) =.378
resulting in a 95% confidence interval of (2.31, 3.79), corresponding to a relative risk of 10.1 to 44.3 (about the same as they report)
and a Z-score of 3.05/.378 = 8.07 and a (two sided) p-value of 7.1*10^-16
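For anyone who wants to check those four lines, here is a minimal Python sketch of the same Wald-style calculation on the log scale. It assumes the counts used in the post (8 and 54) and a normal approximation, and uses only the standard library; because it does not round intermediate values, the last digits can differ slightly from the figures above.

```python
import math

# Inputs as used in the post's calculation.
relative_risk = 21.2
white_deaths = 8
black_deaths = 54

# Work on the log scale, where the estimate is roughly normal.
log_rr = math.log(relative_risk)
std_err = math.sqrt(1 / white_deaths + 1 / black_deaths)

# 95% confidence interval on the log scale, then back-transformed.
z_crit = 1.96
log_lo = log_rr - z_crit * std_err
log_hi = log_rr + z_crit * std_err
ci_lo, ci_hi = math.exp(log_lo), math.exp(log_hi)

# Z-score and two-sided p-value from the normal tail:
# P(|Z| > z) = erfc(z / sqrt(2))
z_score = log_rr / std_err
p_value = math.erfc(z_score / math.sqrt(2))

print(f"log RR = {log_rr:.2f}, SE = {std_err:.3f}")
print(f"95% CI for RR: {ci_lo:.1f} to {ci_hi:.1f}")
print(f"Z = {z_score:.2f}, two-sided p = {p_value:.1e}")
```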
I'm not arguing the math. I'm arguing the logic of even applying the math.

So while it is possible that the authors might have taken multiple looks at the data, I think it rather unlikely that the number of looks was greater than the 7x10^13 that would be required to make their result insignificant due to multiple comparisons (or data dredging, as you call it).
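The multiple-comparisons point can be sketched the same way. This assumes a plain Bonferroni correction (multiply the p-value by the number of looks) and a conventional 0.05 threshold, neither of which the post states explicitly; the 80-group case is the one raised earlier in the thread, and the helper name is made up for illustration.

```python
# Bonferroni-style adjustment: multiply the raw p-value by the
# number of comparisons, capping the result at 1.
def bonferroni(p_value: float, n_comparisons: int) -> float:
    return min(1.0, p_value * n_comparisons)

raw_p = 7.1e-16   # two-sided p-value quoted in the post
alpha = 0.05      # assumed significance threshold

# Number of looks at which the adjusted p-value would reach alpha.
looks_needed = alpha / raw_p

# Adjusted p-value if 80 separate age groups had been examined.
adjusted_80 = bonferroni(raw_p, 80)

print(f"looks needed to reach alpha: {looks_needed:.1e}")
print(f"adjusted p for 80 groups:    {adjusted_80:.1e}")
```

Under these assumptions, even 80 separate age groups leave the adjusted p-value far below 0.05.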
There are only ~3000 deaths to cherry pick. Are you saying that I have to run 70,000,000,000,000 simulations to figure out how to come up with skewed numbers? I literally just looked at the numbers for about 10 seconds to come up with the fact that white men between 74-79 are murdered by police infinitely more frequently than black men 74-79.

A better complaint is that they didn't take into account other covariates (different poverty rates between whites and blacks is the most obvious). This, plus the issue of correlated data (which is very hard to account for), might lead me to hold off on fully endorsing the final number, but I find it difficult to believe that any alternative analysis would fully eliminate such a massive effect.

- Buck Godot Statistics PhD.
A recent paper by Roland Fryer says almost exactly the same thing, but with more variables.