How is the battle between Bayesians and Frequentists going? And is it a war or a friendly squabble?

OK, so here is how I, as a layperson, currently understand it:

In statistics, there are two different basic approaches for calculating the probability that a given fact is true. One is the frequentist approach, which is the “standard” statistics as I learned it in school some twenty years ago, involving hypothesis testing, standard deviations, Poisson distributions, etc. A typical statement made by a frequentist statistician might be “Our prediction is that individuals with genetic trait X have a greater chance of showing symptom Y. The null hypothesis is that these two traits are unrelated. We sampled a thousand individuals with gene X and 70 of them turned out to have symptom Y. We then sampled a thousand other individuals without X, and only 20 of them turned out to have Y. We calculated that the probability of this difference occurring by chance is less than 0.01. Therefore we consider our hypothesis confirmed.”

The other camp is the Bayesians, who focus more on the idea of having a starting assumption about the probability that a given fact is true, and then adjusting that assumption every time a new piece of evidence comes in, using Bayes’ Theorem to calculate exactly how to do the adjustment. Rather than forming a hypothesis and then confirming or rejecting it, a Bayesian would assign a certain starting value to the assumption that X are more likely to have Y, and then adjust that value every time he encountered an X with or without Y, or a non-X with or without Y.

Is this understanding correct so far?

In the things I read on-line, I get the impression that when someone explicitly states their allegiance to one of these factions, it is almost always a Bayesian, making sarcastic comments about those silly old-fashioned frequentists who are missing the obvious – this XKCD strip being a typical example. So does that mean that the Bayesians are winning, or just that they are more vocal and/or more combative?

My concrete questions:

Among working scientists with respectable credentials, what is the distribution of Bayesians versus frequentists? Is it really the case that most scientists do self-identify as one or the other, or would most statisticians say that both approaches are valid and it’s just a matter of picking the best tool for a given task? To what extent is this really an actual, serious controversy?

And how much difference does it actually make in practice? How often does it occur that two scientists, presented with the same objective facts, would come to radically different conclusions depending on which camp they are in?

I think it’s safe to assume that the “frequentist” in the XKCD strip linked above, who believes that the Sun is about to go supernova (with p < 0.05, hence statistically significant) on the basis of a test with a 1/36 chance of giving the wrong result, is a strawman. But does it often happen that scientists disagree on the meaning of the outcome of a given experiment, purely on the basis of whether they are using Bayesian or frequentist reasoning?

My understanding was that these aren’t so much different camps as ways of dealing with different data sources. The Bayesian approach works better if you are getting your data points one at a time and need a moment-by-moment answer; the “Frequentist” approach is basically to gather all the data ahead of time and have no answer until you have sufficient data for a “correct” one. In the long term, their answers should converge, anyway.

The strawman part of that cartoon is using the Frequentist “approach” for a single data point rather than a collection, which is half of the joke. (The other half is the bet that can’t be lost: you’re not going to have to pay the $50 if the sun actually has exploded.)

Absolutely. Law of large numbers and all that.

Congrats if you manage to witness first hand sufficient sun-like stars going nova to get into LLN territory.

Mostly. There is an important nitpick that even most practicing scientists get wrong as a matter of course, and this mistake is the root cause of a lot of the hubbub (and is at the heart of the xkcd joke).

The phrase “calculating the probability that a given fact is true” is the incorrect part. Frequentist statistics and Bayesian statistics answer different questions. The “given fact” that each answers is not the same given fact. At a deeper level: the things that are allowed to have probabilities assigned to them are different in the two approaches.

Bayesian view: Consider a quantity or fact of nature whose value / truth we do not know. We model this lack of knowledge as a probability distribution (or just “probability” for discrete quantities like “fact is true”). If we have a probability model for our current knowledge, then we can update that model when new data is presented. In a hypothesis-testing example, we can say something like, “[b]There is a 72% chance that the hypothesis is true, given our prior assumptions about its truth.[/b]”

Frequentist view: Consider a quantity or fact of nature whose value / truth we do not know. We assume that value is a fixed quantity. It makes no sense to talk about a probability distribution for it. For any assumption about the underlying value, we can answer questions about the probability of observing certain experimental outcomes. We can say something like, “[b]If the hypothesis is true, we have a 72% chance of observing what we do.[/b]”

The common mistake is to use frequentist statistics to correctly form the second bolded statement above and then incorrectly treat it to mean what the first bolded statement says. In truth, frequentist statistics can never answer anything about the probability that some fact of nature is true because probability is not defined that way in frequentist statistics.

In your example:

Everything up to “We calculated that the probability of this difference occurring by chance is less than 0.01” is exactly right. The final sentence (“Therefore we consider our hypothesis confirmed”) would be at best misleading. The trouble is that this phrasing is extremely common even though it’s sloppy, and even most practicing scientists do not appreciate what they are saying when they are saying it. In the xkcd sun example, the frequentist is wrong in concluding that the sun has exploded. He can only state something like, “If the sun has exploded, the chance of seeing our experimental outcome is XX%.” That is very different from, “The sun has probably exploded.”
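To put rough numbers on the difference between those two kinds of statement, here is a quick sketch. The 1/36 comes from the strip; the tiny prior probability that the sun has just exploded is a number I made up purely for illustration.

[code]
# Sketch of the xkcd example. The 1/36 is from the strip (the detector
# lies if it rolls two sixes); the prior P(sun exploded) is an arbitrary
# tiny number chosen purely for illustration.

p_lie      = 1.0 / 36     # detector lies
p_exploded = 1.0e-9       # made-up prior that the sun just exploded

# Frequentist-style statement: P(detector says "yes" | sun has NOT exploded)
p_yes_given_not_exploded = p_lie                    # ~0.028, i.e. p < 0.05

# Bayesian-style statement: P(sun exploded | detector says "yes"),
# via Bayes' Theorem
p_yes_given_exploded = 1.0 - p_lie
p_yes = p_yes_given_exploded * p_exploded +
        p_yes_given_not_exploded * (1.0 - p_exploded)
p_exploded_given_yes = p_yes_given_exploded * p_exploded / p_yes

puts format("P(yes | not exploded) = %.4f", p_yes_given_not_exploded)
puts format("P(exploded | yes)     = %.2e", p_exploded_given_yes)   # still tiny
[/code]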

The trouble on the Bayesian side is the issue of “prior knowledge”. You can’t calculate anything unless you have some model for the starting probability distribution for the thing you are trying to measure. The answer you get depends directly on this input probability, but for true unknowns of nature, there is no guidance on how to choose this.

As I like to say (source long lost) – Bayesians answer exactly the question people want to know using assumptions no one believes. Frequentists use exquisite mathematical rigor to answer a question that nobody cares about.

So, both have issues.

There is quite a mix. It’s mostly a matter of what is expected/common in your field.

Most use only one in their regular work and don’t think about the issue at all. They use what they learned in grad school, and that’s that. A reasonably sized minority dive into the issue and make conscious choices about it all. Both are valid approaches to the extent of their applicability.

It’s a controversy insofar as people make philosophical mistakes of inference, like unjustifiably assuming their prior probability distribution is the One True Prior (Bayesian) or erroneously assigning probabilities to underlying facts of nature (frequentist). Most of the time it’s not a big deal, but if someone’s claiming some big discovery, or someone’s claiming to have refuted someone else’s result, or someone’s arguing not to fund something, etc., it can become heated.

It makes all the difference since they answer different questions. It makes little difference when you squint at the whole thing and say, “Screw it, do I believe I’m on to something interesting or not?!”(*)
[sup]Note: this is a Bayesian question. :)[/sup]

Thanks, Pasta, that’s exactly the kind of information I was looking for!

I suppose that’s the “third” joke in that panel – the empiricist pointing out that sun-like stars don’t explode at all.

For statisticians, the Bayesian controversy has largely been resolved in favor of the Bayesians. The general feeling is that it’s fine to use Bayesian methods, but they need to work well in a frequentist sense, and Bayesian methods are better when they’re less sensitive to the exact beliefs that you started with. Now the discussions tend to be much more technical, and they have to do with computational issues and what exactly a Bayesian should do in a given situation.

My own impression (as a mathematician who knows zilch about statistics) is that most working scientists are frequentists, but don’t understand what they are doing. This became clear to me when my uncle the biochemist asked me the following question. He said that in his area, you had to have p < .05 to have a publishable result. I took that to mean that there was less than one chance in 20 that the results were a statistical fluke. That’s right, isn’t it? (You see how little I know about stats.) Then my uncle went on to say that if they get p = .1, say, they report, “There was no effect.” That can’t be right, he said. And I agreed. In fact, I told him that the experiment should be repeated and if you still get p = .1, then you should conclude that p = .01, assuming the trials were independent. Of course, if you actually repeated the experiment you would likely get a totally different p and the effect might be reversed.

One thing I don’t understand about Bayesians is where their priors come from. I mean, if I toss a coin three times and it comes up heads twice, I might take a prior that it will come up heads 2/3 of the time and continue to experiment. If the coin is actually fair, the Bayesian stats will converge to 1/2. But there is no way I could have started with 1/2 if I tossed it only thrice. And even my prior of 2/3 seems like a frequentist choice. And even had I studied stats, whoever taught me would have likely been committed to one school or the other and would not likely have discussed the question. Let me finish with a quote from a statistician colleague of mine: “After all, statistics is not exactly rocket science.”

Another nice one, Pasta.

Great. I’ll remember that, I’m pretty sure.

I have a value of one (as of this post) and have no idea how to gauge the truth-value (is that a word that totally confuses the issue?) of that response.

Yes, that’s also the part I’m still struggling with. My understanding from what I’ve read is that normally you’d start by talking to domain experts and ask them to give their best guess as to how likely a given hypothesis is. Pasta or ultrafilter, could you weigh in here please?

I guess frequentists might complain that this is circular reasoning – starting out with the assumption you’re trying to prove, instead of letting the data speak for itself. To which Bayesians would then probably retort that the frequentists are doing the same, only in a less formal manner – after all, nobody would seriously believe an extremely unlikely claim (such as that the sun just went supernova) on the basis of a test with a 1-in-20 chance of getting the wrong outcome by chance. So the p-value you need to be taken seriously is influenced by the inherent plausibility of your claim. So the p-threshold serves a similar purpose to the Bayesian prior, but in a less explicitly defined manner. Right?

Well, once you have already flipped three times, it’s not really a ‘prior’ anymore, is it?

Obviously I’m not an expert either, otherwise I would not have needed to start this thread in the first place. But here is my understanding of how a Bayesian would approach it:
[ul]
[li]Start by making a default assumption about the coin. If you have no reason to suspect that the coin is crooked (e.g. you just took it out of your own wallet), you would assign a large prior to the assumption that the coin is fair. If you have reason to be suspicious, but you don’t know whether the coin will be more likely biased towards heads or towards tails, then you start with a probability distribution in which all possible biases are equally likely.[/li][li]After every coin flip, adjust your expectations. If you started out with a firm belief that the coin is OK, it will take quite a few flips with an obviously skewed distribution before your assessment of that probability has sunk below the 50% line. If you did not start with such a strong assumption, you will be more easily convinced, but still after just three flips you won’t be able to draw any firm conclusions yet (except that the coin is obviously not so biased that it always gives the same result), hence your probability distribution will still look fairly flat at that point (see the sketch just below this list).[/li][/ul]
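For concreteness, here is roughly what I imagine the “starting assumption” step looks like, as a grid of candidate biases. The grid size and the shape of the “probably fair” prior are arbitrary choices on my part, just a rough sketch:

[code]
# Rough sketch: two possible priors over a grid of hypotheses about
# P(heads). The grid size and the strength of the "probably fair" peak
# are arbitrary choices, just for illustration.

biases = (0..100).map { |i| i / 100.0 }   # candidate values of P(heads)

# Prior 1: complete ignorance -- every bias equally likely.
flat_prior = Array.new(biases.size, 1.0 / biases.size)

# Prior 2: strong belief that the coin is close to fair -- weight each
# bias by how near it is to 0.5, then normalize so the weights sum to 1.
weights    = biases.map { |b| Math.exp(-((b - 0.5)**2) / 0.005) }
fair_prior = weights.map { |w| w / weights.sum }
[/code]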

Xkcd again has a nicely relevant entry. There are many layers to this joke. The forefront one is that the media certainly doesn’t understand statistics, but another is that a given result in a field that uses the “p<0.05” approach cannot be usefully examined in isolation.

Not all fields that use frequentist statistics use this p-value cut-off silliness. It shows up notably in biological and social sciences, and indeed the consequence is that there is a lot of noise out there. Part of the reason this cut-off shows up in biology, etc., is that even in the best of cases, you aren’t going to get huge significance given the complexity of the systems (hard to control for externalities) and given the often small numbers of trials practically obtainable. Thus, you’re sort of stuck sifting for any hints of significance, which necessarily means your signals will be living amongst noise.

Physics, as a counterexample, doesn’t use “p<0.05” as a standard rubric. This is in part because physicists are often trying to measure something rather than demonstrate a yes/no effect, but when the latter is the goal the cut-off criterion for claiming discovery with a straight face is much higher and more fluid. Physicists as a community recognize that there are a lot of experiments out there, so you can’t get too excited about p<0.05. And since physics can yield highly significant results, it’s not unfair to require it of experimenters, meaning the noise is much more suppressed in the literature.

If a man on the street comes up to you with a coin-flipping game, you can fold together all your worldly knowledge about grifters, coin-making, etc., to estimate the probability a priori that the coin is biased. In a more scientific setting, though, the priors that get chosen are usually of the “throw your hands up and just use something generic” variety. People will use phrases like “uninformed prior” or “uniform prior” to imply that they are using a prior that embodies complete ignorance. But even these have issues. A common case would be someone trying to measure a parameter of nature, call it a. This might be the ratio of the populations of pigeons to rats in NYC, or it might be the Hubble constant, or whatever. You can do calculations assuming initially that all values of a are equally likely, but the very choice of a as the variable of interest is arbitrary. If you had chosen to cast the problem in terms of the ratio of the populations of rats to pigeons (a’=1/a) instead of pigeons to rats, and you said all values of a’ are equally likely, you’d get different results. More generally, this comes up as needing to decide whether it’s log(parameter) or 1/(parameter) or (parameter)[sup]2[/sup] that should have equally likely values along its domain.
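A toy illustration of that last point (the pigeons/rats framing is from above; the 0.1 to 10 range is an arbitrary choice of mine): a “uniform” prior on a and a “uniform” prior on a’ = 1/a, over the same numerical range, assign very different prior probabilities to the same statement about nature, namely that pigeons are outnumbered by rats (a < 1).

[code]
# "All values equally likely" depends on the parameterization.
# a = pigeons/rats, a' = 1/a = rats/pigeons, both restricted to the
# arbitrary range 0.1..10. Ask each "uninformed" prior for the
# probability that a < 1 (fewer pigeons than rats).

lo, hi = 0.1, 10.0

# Uniform prior on a: P(a < 1) is the fraction of the range below 1.
p_under_uniform_a = (1.0 - lo) / (hi - lo)

# Uniform prior on a' = 1/a: the event a < 1 is the event a' > 1.
p_under_uniform_inverse = (hi - 1.0) / (hi - lo)

puts format("uniform in a:   P(a < 1) = %.3f", p_under_uniform_a)        # ~0.09
puts format("uniform in 1/a: P(a < 1) = %.3f", p_under_uniform_inverse)  # ~0.91
[/code]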

When informed (so, not uninformed) priors are used, they are the posterior probabilities from past measurements. A nice thing about Bayesian statistics is that folding in past experimental results is trivial. Those past experiments, somewhere in their history, likely assumed some base uninformed prior, but if the subsequent measurements have had sufficient precision since, the original choice of prior may not matter that much(*).

Yes, in a sense. The underlying difference is that the Bayesian prior deals with what might be true about nature (regardless of experimental issues) whereas a frequentist p-value threshold deals with potential false positives (regardless of how likely something in nature might be). The issue with the frequentist “protection” is that it doesn’t care if a hypothesis is crazy or not, and the issue with the Bayesian “protection” is that you can’t say up-front what’s crazy without introducing real bias.

Exactly right.
[sup]Experts will notice a bias still lurking in this statement.[/sup]

Just loved that xkcd. Imagine finding 20 (he should have had 21) flavors of jelly beans. Is there any way to date an xkcd panel?

My girlfriend is an actual, real life statistician who manages other statisticians and has forgotten more about statistics than most people would ever know. I asked her about the “Bayesian vs. frequentist” battle and she reports real statisticians use both methods.

A lot of people choose priors that lead to mathematically convenient posteriors, simply because sampling from a posterior distribution is hard unless you can write down a closed form for it. That’s not a good thing to do, but as the practice of statistics goes, it’s not the worst thing people do.
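The textbook example of such a mathematically convenient choice (not specific to anything in this thread) is the Beta prior for a coin’s P(heads): the posterior after any number of flips is again a Beta distribution, so the update is just adding the observed counts. A quick sketch with made-up numbers:

[code]
# Conjugacy sketch: with a Beta(a, b) prior on P(heads), the posterior
# after seeing h heads and t tails is Beta(a + h, b + t) in closed form,
# so no numerical integration or sampling is needed. Numbers are made up.

prior_a, prior_b = 2.0, 2.0    # a mild prior belief that the coin is fair
heads, tails     = 26, 24      # arbitrary example data

post_a = prior_a + heads
post_b = prior_b + tails

posterior_mean = post_a / (post_a + post_b)
puts format("posterior mean of P(heads) = %.4f", posterior_mean)   # ~0.5185
[/code]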

One of my projects for next year is to work my way through Lindley’s Understanding Uncertainty. Lindley was one of the most influential Bayesians of the twentieth century, and if I had to pick one person who made Bayesian statistics respectable, he’d be pretty high on the list. This book is his magnum opus for the nonspecialist, and based on what I’ve heard I’m looking forward to it.

Thanks, that book goes on my list!

I’ve also heard good things about Edward Jaynes’ Probability Theory: The Logic of Science. Any opinions about that one?

That’s true, but it’s also a very dangerous statement. I’ve seen too many scientists who claim to be using Bayesian statistics, but who then put no effort at all into choosing or defending their priors, on the grounds that “it’ll all converge to the same answer in the end, anyway”. But how long “long term” is depends on how far off your prior was. For any given conclusion and any given set of data, there is some prior that will give you that conclusion, which means that if you’re not at least somewhat careful about choosing your priors, you can never meaningfully make any conclusion at all.

Actually, no, you can’t. There is no uniform distribution on all real numbers. There is a uniform distribution on any finite range of real numbers, but that’s dangerous, too: Suppose you take as your prior that the ratio of pigeons to rats could be anywhere from 0 to 100, distributed uniformly. If the real answer is that there are 101 pigeons for every rat, then you’ll never get the right answer, starting from that prior.

Damn. Some laymen. $105!

This is not relevant here. You can still implement the prior that all values are equally likely even if the probability distribution that would, in isolation, capture this statement is improper. In practice, this is just a matter of not putting the prior probability distribution in at all and then normalizing the posterior probability in the end (which will be proper). This is equivalent to the likelihood function for the data becoming the posterior probability distribution function for the unknown quantities. So, while there are certainly lots of difficulties when dealing with uninformed priors, this isn’t one.
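On a discrete grid, that looks something like the following (the grid and the example data here are arbitrary; for a continuous parameter the normalization is an integral rather than a sum, but the idea is the same):

[code]
# Sketch: skip the prior entirely, treat the likelihood of the data as
# the unnormalized posterior, and normalize at the end. Grid and data
# are arbitrary choices for illustration.

biases = (0..100).map { |i| i / 100.0 }   # candidate values of P(heads)
heads, tails = 7, 3                       # arbitrary example data

# Unnormalized posterior = likelihood of the data under each hypothesis.
unnormalized = biases.map { |b| (b**heads) * ((1.0 - b)**tails) }

total     = unnormalized.sum
posterior = unnormalized.map { |u| u / total }   # proper: sums to 1
[/code]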

It’s also supposed to be good, but it’s a little rough because Jaynes died before he could finish it, and some of the material in there is the editor’s best guess as to what Jaynes would have written.

There’s a uniform measure on the real numbers, and that’s a perfectly good prior as long as you can guarantee that you’ll get an actual probability distribution for your posterior. The use of improper priors is pretty widespread, although there is some controversy about it.

So I decided to write a little Ruby program implementing the “determine if a coin is biased” thing, to check my understanding of the theory.

I start out by creating an array where each element represents a different hypothesis about how biased the coin is. Then after every coin flip, I use Bayes’ Theorem to adjust all the hypotheses. So whenever the coin lands tails, all the hypotheses which state that the coin is biased towards tails become a little more plausible. At any time, the sum of the probabilities for all the hypotheses is 1, and the weighted sum of all the hypotheses is my current best guess as to how biased the coin is.

I can choose to initialize the array with either a flat probability distribution (so all hypotheses are equally likely) or with a strong prior assumption that the coin is fair or close to fair. Actually, I keep track of both these arrays in each run of the program.

(Note that if I were to simply assign a probability of 1.0 to the assumption that the coin is fair, then that value would stay at 1.0 even if the coin landed heads a thousand times in a row. No amount of evidence can change your mind on something which you consider to be axiomatically true. And when I set the prior for the case that the coin will always land heads to 1.0, and then tell the program that the coin landed tails, I get a divide-by-zero error. :D)
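For anyone who wants to play along, the core of it looks roughly like this (simplified; the real thing also tracks the second, strong-prior array in exactly the same way):

[code]
# Simplified core of the program: a grid of hypotheses about P(heads),
# updated with Bayes' Theorem after every flip of a simulated fair coin.

NUM_HYPOTHESES = 101
biases = (0...NUM_HYPOTHESES).map { |i| i / (NUM_HYPOTHESES - 1.0) }

# Flat prior: every hypothesis starts out equally likely.
probs = Array.new(NUM_HYPOTHESES, 1.0 / NUM_HYPOTHESES)

def update!(probs, biases, came_up_heads)
  # P(hypothesis | flip) is proportional to P(flip | hypothesis) * P(hypothesis)
  probs.each_index do |i|
    likelihood = came_up_heads ? biases[i] : 1.0 - biases[i]
    probs[i] *= likelihood
  end
  total = probs.sum            # zero only with a degenerate prior, as noted above
  probs.map! { |p| p / total } # renormalize so the probabilities sum to 1
end

50.times { update!(probs, biases, rand < 0.5) }   # 50 flips of a fair coin

# Best guess for P(heads): the probability-weighted average of the hypotheses.
best_guess = biases.each_with_index.map { |b, i| b * probs[i] }.sum
puts format("estimated P(heads) = %.4f", best_guess)
[/code]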

And indeed, what you see is that although the two arrays do eventually converge, it can take a while. After 50 runs of the program, in which the coin has landed heads 26 times, the array which started out with a flat probability distribution now assigns a value of 0.88 to the chance that the coin is fair, while the array which started out with a strong preference (0.71) for that assumption now has it at 0.98. The program’s confidence that the next flip will be heads is 0.5116 based on the first array and 0.5021 based on the second one.

It’s quite fun to play with, and to watch how the combination of the chosen priors and the actual results of the coin tosses affects the program’s constantly-updated “beliefs” in different ways. What’s also fun is how the array with the probability distributions “keeps track of” the ratio between heads and tails implicitly, without that ratio actually appearing in the program code explicitly anywhere.