10-18-2018, 02:22 PM
Buck Godot
Originally Posted by Damuri Ajashi:
Do you know any statisticians? Please. Go to them. Ask them about data dredging and then ask them if pulling a subset of 3 dozen out of 3 THOUSAND datapoints is generally good statistical science.
Ask and the Dope will provide.

There is nothing inherently bad about taking a small subset of, say, 40 samples out of a larger set of 1,200, if that smaller subset is the data of interest. All you have to do is treat it as a study of 40 samples rather than a study of 1,200. I understand your concern about data mining: when looking at the data, you do need to account for the number of different subsets you examined.

Since I haven't seen the data, I can't be certain how it was organized or what different analyses they considered before settling on this one. Age is an obvious variable to split on, and it is possible that the sole hypothesis going in was to look specifically at shootings of black youth (14-19 years), since anecdotally this is the group that appears to get the worst rap in the media. If that is the case, then there is no need for a p-value adjustment.

Otherwise you would have to consider whether they analyzed a number of different subgroups and only focused on this group after finding that the others weren't significant. For example, they could have fully data mined by trying every possible lower cut-point and every possible upper cut-point until they found the pair that gave the best results. I doubt they did this, for two reasons. First, a division into under 14 (pre-teen), 14-19 (teen), and 19+ (adult), with possibly a few other older subgroups, seems natural rather than data-derived. Second, if they had done this, I would expect them to report 19 and under, since the under-14 group also showed significant bias and including it would probably improve their statistics. What I actually think is most likely is that the data they got from the FBI was already divided into age groups that they analyzed directly. Worst case, you should multiply any p-value they report by the number of different groups they looked at (although you could then divide by 2 to form a false discovery rate, to account for the fact that both the under-14 and 14-19 groups appeared significant).
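To make that worst-case adjustment concrete, here is a minimal Python sketch. The subgroup p-values and the two older age brackets are invented purely for illustration, and the Benjamini-Hochberg step is simplified (the full procedure also enforces monotonicity across ranks):

```python
# Hypothetical example: Bonferroni vs. a simplified Benjamini-Hochberg adjustment.
# The subgroup labels and p-values below are invented for illustration only.
p_values = {"under 14": 1e-6, "14-19": 7e-16, "19-34": 0.21, "35+": 0.48}

m = len(p_values)

# Bonferroni (the "worst case" above): multiply each p-value by the
# number of comparisons, capping at 1.
bonferroni = {g: min(1.0, p * m) for g, p in p_values.items()}

# Benjamini-Hochberg-style scaling: rank p-values and scale each by m/rank.
# This is less conservative when several groups come out significant
# (as noted above, both the under-14 and 14-19 groups did).
ranked = sorted(p_values.items(), key=lambda kv: kv[1])
bh = {g: min(1.0, p * m / (i + 1)) for i, (g, p) in enumerate(ranked)}
```

Even under the blunt Bonferroni multiplication, a p-value on the order of 10^-16 survives four (or four thousand) comparisons comfortably.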

There could also be some concern about the independence of the shootings. Correlated data will increase the variance of any estimates, though it does not change the point estimates directly. I suspect, however, that most of the shootings are independent, given that I haven't heard any reports along the lines of "Police slay 3 black teenagers in mass shootout," which would make national headlines.

So, following Damuri Ajashi's back-of-the-envelope calculation, with a black vs. white relative risk of 31.17/1.47 = 21.2: according to the article, if blacks and whites were killed at equal rates there would have been 185 fewer deaths, implying N*21.2 = N + 185, so the number of white deaths N was about 9. The ratio of whites to blacks aged 15-24 in the population is about 33.3/7.32 = 3.54, so the number of black youths shot in the study was about 9*21.2/3.54 = 54.

So the log relative risk is log(21.2) = 3.05.
The standard error of this is approximately sqrt(1/8 + 1/54) = 0.378,
giving a 95% confidence interval of (2.31, 3.79) on the log scale, corresponding to a relative risk of 10.1-44.3 (about the same as they report),
and a Z-score of 3.05/0.378 = 8.07, for a (two-sided) p-value of 7.1*10^-16.
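For anyone who wants to check the arithmetic, here is the whole back-of-the-envelope calculation as a short Python sketch; the inputs are the figures quoted above, and the 1/8 in the standard error is the same rough approximation used above:

```python
import math

# Back-of-the-envelope check of the relative-risk calculation above.
rr = 31.17 / 1.47                  # black vs. white relative risk, ~21.2
extra_deaths = 185                 # excess deaths reported by the article
pop_ratio = 3.54                   # white:black population ratio used above

# N*21.2 = N + 185  =>  N = 185 / (rr - 1), about 9 white deaths
n_white = round(extra_deaths / (rr - 1))
n_black = n_white * rr / pop_ratio # about 54 black youths

log_rr = math.log(rr)              # ~3.05
se = math.sqrt(1/8 + 1/54)         # ~0.378 (approximation used above)
lo, hi = log_rr - 1.96 * se, log_rr + 1.96 * se  # (2.31, 3.79)
rr_lo, rr_hi = math.exp(lo), math.exp(hi)        # roughly 10 to 45
z = log_rr / se                    # ~8.07
p = math.erfc(z / math.sqrt(2))    # two-sided normal p-value, ~7e-16
```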

So while it is possible that the authors took multiple looks at the data, I think it rather unlikely that the number of looks was greater than the 7*10^13 that would be required to make their result insignificant due to multiple comparisons (or data dredging, as you call it).

A better complaint is that they didn't take other covariates into account (different poverty rates between whites and blacks being the most obvious). This, plus the issue of correlated data (which is very hard to account for), might lead me to hold off on fully endorsing the final number, but I find it difficult to believe that any alternative analysis would fully eliminate such a massive effect.

- Buck Godot, Statistics PhD

Last edited by Buck Godot; 10-18-2018 at 02:25 PM.