Let’s discuss Bayesian statistics

The thread that inspired this one is Bricker’s question in the “Why hasn’t the Neighborhood Watch shooter been arrested?” thread, where we had a discussion of what essentially amounts to Bayesian statistics.

This is an older discussion/argument, which boils down to answering the following question: if black women are almost never raped by white men, is a black woman who claims to have been raped by a white man more likely to be lying?

This is essentially a Bayesian statistics question, which I’d like to address in this thread. And, given that this is the Dope, I imagine that I’ll get corrected a few times, and learn something in the process myself, so please chime in.

To understand Bayesian statistics, first consider the following question (which must be the most commonly asked trick question in statistics):

Suppose that in the country of Statistica one in a million people has AIDS. You know this is true because God parted the heavens and told you so. You have an AIDS test which is 99% accurate (God told you this also). You pick a random Statistica citizen and administer the test. It comes back positive. What is the probability that this person has AIDS?

Your typical person will think for a second and conclude that if the AIDS test is 99% accurate, and came back positive, there is a 99% chance that the person has AIDS.

Sounds good, right? Well, this is a trick question. You actually have two pieces of data which you have to consider; you know that the test is 99% accurate, but you also know that only one in a million people has AIDS. Suppose that instead of administering the test, you just assumed that the tested person didn’t have AIDS. You would be right 99.9999% of the time; your guess would be more accurate than the AIDS test.

So, is the chance that the person has AIDS 0.0001%? Well, no, because you also have a test that says he does have AIDS, and you have to consider that also.

Bayesian statistics is the way that we combine these two pieces of data to find the true probability that the person has AIDS. Here’s how it works: suppose you apply the test to a million people, only one of whom has AIDS. Because your test is 99% accurate, you’ll pick up about 10,000 false positives, but you’ll probably also pick up the one person who really has AIDS. So you’ll have 10,001 positive results, one of which is correct; therefore, the probability that the person has AIDS, given that they have a positive AIDS test and were picked randomly from a population with a one-in-a-million rate of AIDS, is 1 in 10,000. (All numbers rounded, so don’t get on my case.)
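
For the skeptical, here’s the same arithmetic as a short Python sketch. One assumption worth flagging: I’m reading “99% accurate” the way the trick question intends, i.e. 99% sensitivity and 99% specificity.

```python
# Bayes' theorem applied to the Statistica AIDS question.
prevalence = 1 / 1_000_000  # base rate: one in a million (per God)
accuracy = 0.99             # assumed: 99% sensitivity AND 99% specificity

# Total probability of a positive result:
# P(pos) = P(pos | sick) * P(sick) + P(pos | healthy) * P(healthy)
p_positive = accuracy * prevalence + (1 - accuracy) * (1 - prevalence)

# Bayes: P(sick | pos) = P(pos | sick) * P(sick) / P(pos)
p_sick_given_positive = accuracy * prevalence / p_positive
print(p_sick_given_positive)  # ~0.0001, i.e. roughly 1 in 10,000
```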

OK, so now that you understand the concept of Bayesian statistics, consider the following modified question:

Suppose that in the country of Statistica one in a million people has AIDS. You know this is true because God parted the heavens and told you so. You have an AIDS test which is 99% accurate (God told you this also). You pick a random Statistica citizen and administer the test. The random person you picked is a male prostitute who happens to also be an intravenous drug user who shares needles, and who doesn’t use those wax pieces of paper to cover the seat when he uses public toilets. The test comes back positive. What is the probability that this person has AIDS?

OK, let’s do the math. …carry the one, multiply by six…OK, the chance this person has AIDS is 1 in 10,000, right? If this answer sounds wrong to you, and you can figure out why, then you know the problem with applying statistics to the rape question. You also know the general problem with applying Bayesian statistics to social sciences questions, and why so many people make so many mistakes trying to do it.

In fact, by naively applying your Bayesian statistics math in this case, you’re moving away from the right answer. You’d have been better off never having heard of Bayesian statistics, because by trying to apply it here you’re making your answer more wrong than the 99% guy in the first question.
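
To see how much the base rate drives the answer, here’s a sketch that reruns the calculation above with two different base rates. The 1-in-100 figure for the high-risk group is invented purely for illustration; the point is just how wildly the posterior swings when the base rate changes.

```python
# Same Bayes arithmetic, parameterized by the base rate of the population
# the subject was actually drawn from.
def posterior(base_rate: float, accuracy: float = 0.99) -> float:
    """P(has AIDS | positive test), by Bayes' theorem."""
    p_pos = accuracy * base_rate + (1 - accuracy) * (1 - base_rate)
    return accuracy * base_rate / p_pos

print(posterior(1 / 1_000_000))  # general population: ~1 in 10,000
print(posterior(1 / 100))        # hypothetical high-risk group: ~0.5
```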

Let’s suppose that all the claims in the various threads are true: suppose black women are raped by white men only 1% of the time. Suppose that 8% of rape allegations are false. Can we combine these statistics to determine whether a particular black woman who claims to have been raped by a white man is lying?

No, we can’t. Suppose that the 1% figure is true. It’s likely true because black women are rarely alone with white men, so it’s a statistic that doesn’t apply to a black woman alone in a house full of white men. If we try to start calculating Bayesian statistics using this number, our calculations will be more wrong than if we ignored it.

Suppose that the 8% figure is true. It’s likely true because, I don’t know, some women are trying to get back at ex-boyfriends who angered them. Does this number apply to women who claim to have been raped at a frat party? Almost certainly not, and by using this statistic in your calculations, you’ll be more wrong than if you ignored it.

Now, that’s not to say that we can’t apply Bayesian statistics in this case; we just have to use the right data. We need to know the percentage of black strippers who are raped while alone in a house with drunk frat boys, and we need to know the percentage of women who claim they were raped during a drunken frat party who are lying, and then maybe we can start making some estimates. But if we apply population statistics to the wrong population, we will get the wrong answer. And it won’t just be wrong, it’ll probably be more wrong than if we had ignored the statistics entirely.

Oh, **Bricker’s** actual question was: If 8% of all rape allegations are false, it seems obvious to me that if we know nothing else about a particular rape allegation, there’s an 8% chance it’s false. Why is that wrong?

Well, it’s not wrong. It’s just that it’s only true if you know nothing about the rape case, and you’ll have to throw that number away once you learn something. And it doesn’t address the case discussed in the linked thread, which also involved the probability that a woman who claimed rape was lying.

So, let’s discuss the general case rather than **Bricker’s** specific question here.

Nothing to add, I just wanted to say thank you. I had been hoping to have a statistical discussion thread on this topic, but didn’t have enough faith in my understanding and explanatory skills to do it justice.

OK, bear with me, here.

It seems to me that it’s obvious we’ll have better statistics as we get more information, so I’m totally on board with your pointing out that the more information we have, the better (male prostitute who happens to also be an intravenous drug user who shares needles).

But here’s where you lose me: you act as though this proves we never should have said the chance he has AIDS is 1 in 10,000.

If we start out knowing nothing about him except his positive test, and the fact that he lives in a country where one in one million people has the disease, and that the test is 99% accurate… then why can’t we say that the chance for him is 1 in 10,000? Period. We know nothing else.

Now, if we start adding additional facts, sure, that calculation may change. But it seems to me perfectly fair to say that IF ALL WE KNOW is that we’ve picked a random person out of the population and given him an AIDS test, the test is positive, it’s 99% accurate, and 1 in a million people have AIDS, then the chance of him having AIDS is 1 in 10,000.

Right?

Um… that’s all I was ever saying:

Is that a correct statement, or not?

As written it’s correct (provided that God parted the heavens, leaned down, and whispered “8%” in your ear); if this is as far as you went in that thread, then it was a mistake to link back to your postings (maybe it was **Huerta88** who was saying wrong things?). But (and I don’t mean this to sound rude) a big part of the reason you’re right is that you haven’t actually said anything. Yes, if you know nothing at all, then you can apply population statistics to the general population. But you will never know nothing at all, so this is an empty statement. It’s the equivalent of me (as a non-lawyer) saying “things that are illegal are against the law.” Great as far as it goes, but probably not necessary to say, and missing the point in a discussion of whether a particular activity is illegal.

I will also append that once you know, say, that the victim was alone with drunken frat boys, the statement becomes wrong, and repeating it becomes wrong too. It is only “correct” before you know that the victim was alone with drunken frat boys. If you repeat it after you know more, then, no matter how many caveats you attach about how you know nothing, it isn’t really an honest continuation of the conversation.

How extraordinary it is, then, that you managed to write some sixteen paragraphs in rebuttal.

Could you tell us the gravamen of your complaint in perhaps one hundred words or fewer?

But I never said it about her – that is, I never said we could make any judgments about Ms. Mangum based solely on her race and the rape allegation.

Perhaps you could find a post of mine that you believe contained an inaccurate statement? So far, it seems you agree I said nothing incorrect, but simply something that was not particularly useful in evaluating a real-world case.

If you posted the above after the point where you knew more about the case than that the victim was black, then it was correct but misleading (though in your case not deliberately misleading). If you had some sort of statistics background and posted the above, then I would call it deliberately misleading, probably rising to the level of bullshit (this doesn’t apply to you; I consider you an honest poster who doesn’t understand the statistics). Consider how differently you would treat a non-lawyer posting something wrong but doing it honestly versus a lawyer posting the same thing, when you can probably assume he’s trying to bullshit the readers.

Let me repeat: arguing the general case once you know the general case no longer applies is wrong, and if you want to discuss the statistical concepts, that should be done in a different thread.

Thank you for your contribution, I guess. Feel free to read the post and ask for clarifications of things you disagree with or don’t understand.

I understand that concept… but it seems to me that it can’t have been too badly misleading, since the proper rebuttal is, I gather, basically what you said: “Yes, Bricker, if one truly knew nothing else, your statement would be correct, but since we DO know other things, your statement doesn’t remotely apply to the current actual case.”

Which I have no problem with.

Looks like a question is still pending, doesn’t it, brah?

I’ll ask it another way: What exactly is your point? Where do you feel Bricker went astray?

What about the more general issue? I’m not interested in arguing about what Bricker did or didn’t say, but I think there’s something here that’s more broadly relevant to public discourse.

Isn’t there some basic logical fallacy regarding taking generalized statistics and using them to determine the facts of a particular case?

For example: I have flipped a quarter 100 times, and 75 times it has come up heads and 25 times it has come up tails. So then I say, “According to the statistics, there is a 75 percent chance that the next flip will be heads.”

Now, we know that this is wrong, because we know the probabilities regarding a flipped coin. But what if we didn’t? Isn’t it still some kind of fallacy to take a generalized statistic and apply it to a specific case when we have not shown that the statistic grows out of an inherent characteristic of the coin?

For what it’s worth, here’s what I said in the thread started to Pit me for my assertion:

It’s unclear to me how that can be wrong, or even misleading.

That would be incorrect, as you note, but it has nothing to do with Bayes’ theorem.

Would it be, if we don’t know whether a particular quarter is fair?

I mean, I suppose we’d need to figure out the probability of having a fair quarter, and cross-reference it to the probability of flipping a fair quarter 100 times and having it come up heads 75% of the time.
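
Here’s a minimal sketch of that cross-referencing, assuming (purely for illustration) a 99% prior that a random quarter is fair, and taking as the only alternative a coin biased to land heads 75% of the time:

```python
# Bayesian model comparison: fair quarter vs. a hypothetical 75%-heads coin.
from math import comb

heads, flips = 75, 100
p_fair_prior = 0.99  # invented prior: 99 out of 100 quarters are fair

# Likelihood of exactly 75 heads in 100 flips under each hypothesis
p_data_fair = comb(flips, heads) * 0.5**flips
p_data_biased = comb(flips, heads) * 0.75**heads * 0.25**(flips - heads)

# Posterior probability the quarter is fair, given the flips
num = p_data_fair * p_fair_prior
p_fair_post = num / (num + p_data_biased * (1 - p_fair_prior))
print(p_fair_post)  # ~0.0002: the data swamp even a 99% prior on fairness
```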

I had a similar situation today in class. Yesterday I showed a group of students that if you roll 2 6-sided dice, the most common sum will be 7. But I screwed up how I showed them: before they tried the experiment, I had them predict what number would come up most.

Very few people predicted 7. Most predicted 5 or 8 or 10. And the weird thing was, most of the people were right in their prediction: if they predicted 5, that’s what came up the most for them. Little errors entered the process, leading to some sort of weird version of confirmation bias.

Today I had a kid come in, thrilled, because last night he told his mom about the experiment. They tried rolling 2d6 120 times, and it came up 7 over 80 times! He was very proud.

Given his results, there’s no way I’d predict that his next roll had a 1 in 6 chance of being a 7. Some sort of error or anomaly entered his rolling process, leading to roughly a 2-in-3 chance of rolling a 7. I’d feel comfortable predicting that his next roll had about a 2/3 chance of being a 7, because otherwise his results were astronomically unlikely.

I think there is some confusion here. Starting with the assumption of fair dice, you can calculate, using the binomial probability distribution (with outcomes {7, not-7}), the expected number of successful (dice roll 7) trials out of 120 trials, and the standard deviation of the same. Using that, you can then determine how probable it is that you would see 80 successes out of 120 trials solely due to chance. At a certain point, called the level of statistical significance (a much-misunderstood term around here), this becomes so unlikely that you endorse the alternative hypothesis (the dice are unfairly weighted) over the null hypothesis (the dice are fairly weighted).
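
As a sketch of that test (assuming fair dice, which give a 1/6 chance of a sum of 7 on each roll):

```python
# One-sided binomial test: chance of >= 80 sevens in 120 rolls of fair 2d6.
from math import comb

n, k, p = 120, 80, 1 / 6  # 120 rolls, 80 sevens, P(sum == 7) for fair dice

# P(X >= 80) under the null hypothesis (fair dice)
p_value = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
print(p_value)  # ~5e-34: far beyond any conventional significance level
```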

What is going on here is the distinction between a priori probability and what are called “informed priors.”

If I’m understanding this correctly, let’s say you’re a junior detective assigned to a case. The initial case meeting starts out with you, your boss, and a half dozen other detectives in a room with a file sitting on an empty table. The chief detective opens it up and reads the entire contents of the file: “A woman claims she was raped.”

You pipe up and say, “There’s an 8 percent chance she’s lying,” and everyone nods wisely and applauds your contribution.

You all somehow solve that case and are assigned another one. You walk into the same room, and the chief opens the case file on the empty table. This time he reads, “A female stripper claims she was raped at a frat party.”

You pipe up and say, “If we didn’t know half of that information, there’d be an 8% chance she was lying.”

While factually accurate, you’ve contributed nothing to the problem at hand. Your fellow detectives hose you down and throw powdered sugar on you.

Isn’t that kind of what you did? And it somewhat poisons the well, by bringing up irrelevant statistics, because someone is going to read the thread and not realize that it’s irrelevant. Evil Economist is giving you a pass because it wasn’t intentional, and as far as I’m concerned he’s made a convincing case.