Frequentism vs Bayesianism--what's the practical distinction?

I’ve always been confused by discussions (okay, arguments) between Frequentists and Bayesians about the interpretation of probability. I can see that they are assigning different meanings to probabilities. But I have not been able to figure out what the practical upshot is of this difference. I would listen and follow as closely as I could, and invariably come away with the idea that while their theories of probability are meaningfully distinct, they nevertheless both lead to exactly the same practical outcomes when you actually sit down and try to solve a problem in the world using probabilities. But when I say this, they insist it’s not so. And then I stop being able to understand what they say.

There’s a video at the bottom of this link that purports to explain the difference, but frustratingly, when the guy does say something about an actual different answer that Bayesians and Frequentists would give for a particular problem, he then goes on to say “Frequentists are probably kind of mad at me right now because they actually can do what the Bayesian does, in effect, using such-and-such technique. But the way Bayesians do it is more natural.”

“More natural.” That’s the kind of phrase I hear in these discussions that makes me suspect there is no practical distinction–that, like with the interpretation of Quantum Physics, the theories, while distinct, don’t make any actual difference to the answer that will be given for any individual problem. For me, this makes the problem fun but not interesting if we’re trying to figure out what we can actually do with probabilities.

Well, can anyone here help me with this? Can I understand the real, significant difference between the two camps without going and getting another degree or something?

The philosophical differences between the frequentist and Bayesian approaches to probability are complicated and nuanced (and are not fully agreed upon by either camp) but the practical differences are clear: the frequentist approach rests upon testing a proposed fit, which is either rejected (“The null hypothesis is rejected.”) or provisionally retained (“We fail to reject the null hypothesis.”) based upon the degree of fit of sample data to the presumed distribution. There is no intervening likelihood or degree of truth. The Bayesian approach, on the other hand, treats a proposed mechanism as a degree of belief, and asks how valid that belief is shown to be in practice after evaluating empirical data.

Each has its place in the toolbox of a statistician. For instance, if evaluating measurement or process errors that are presumably probabilistically randomly distributed on a normal (Gaussian) distribution, the frequentist approach is clearer and more rigorous; you can see whether the data fits a normal distribution to an expected level of confidence (e.g. 95% probability with 50% confidence), and if it doesn’t, you likely have some kind of bias in your measurement or process. An example is flipping a coin (Bernoulli trials); when you flip a coin for a given number of trials you can explicitly calculate the likelihood of a given head-to-tail ratio within a certain confidence level or interval, and if the observed trials fall outside of that expectation then the null hypothesis is rejected, i.e. there is some kind of bias in the coin.
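
To make the coin example concrete, here’s a minimal sketch in Python (the 62-heads-in-100-flips data is made up, and I’m assuming scipy is available):

[code]
# Frequentist coin check: a two-sided binomial test of the null
# hypothesis "the coin is fair" (p = 0.5), on hypothetical data.
from scipy.stats import binomtest

heads, flips = 62, 100
result = binomtest(heads, flips, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")

# Conventional decision rule at the 5% significance level:
if result.pvalue < 0.05:
    print("Reject the null: the coin looks biased.")
else:
    print("Fail to reject the null: no evidence of bias.")
[/code]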

On the other hand, if you are dealing with a process that does not necessarily have a clear expectation of fitting a specific probabilistic model, the Bayesian approach is preferable. You begin with a degree of belief about a specific probabilistic model (the prior distribution), compare it against the data from your trials, and then modify the expectation (the posterior distribution) to match the observed results. This can support the empirical development of distributions which are not categorically definable and for which an explicit test of fit would not be feasible, and as such it is a better way to approach multivariate distributions in which parameters are clearly not independent.
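
Here’s the same hypothetical coin treated the Bayesian way, as a sketch: a conjugate Beta prior over the bias, updated to a posterior (the flat Beta(1, 1) prior is just one possible starting belief):

[code]
# Bayesian coin check: the prior degree of belief about the bias p
# is a Beta distribution, and the posterior after the flips is again
# a Beta (the conjugate update: add heads to a, tails to b).
from scipy.stats import beta

a0, b0 = 1, 1            # flat Beta(1, 1) prior: no initial opinion on p
heads, tails = 62, 38    # same hypothetical flips as above

posterior = beta(a0 + heads, b0 + tails)
print(f"posterior mean of p  : {posterior.mean():.3f}")
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
print(f"P(p > 0.5 | data)    : {posterior.sf(0.5):.3f}")
[/code]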

The frequentist approach is good for scenarios in which the reality can be represented closely by theory, e.g. measurement theory, quantum and statistical mechanics, mechanical tolerance stackups. The Bayesian approach is good in situations where clear hypotheses cannot be stated in terms of a collection of independent variables, e.g. modeling the distribution of Thai restaurants in San Francisco and then testing it for significance against various parameters. The Bayesian approach has come to dominate data science in the “Big Data” domain because it can give results even when a clear hypothesis is not readily definable, and it is therefore useful in exploratory statistics. It can also be a good approach with sparse data, especially for long-tail events (those that would occur at the extrema of a frequency distribution), in terms of evaluating the likelihood of such events occurring compared to belief. However, it doesn’t give definitive assessments of just how probable an event is to occur (compared to a presumed distribution); just how close the degree of belief is to the measured reality or expectation.

In practice, human reasoning processes are fundamentally Bayesian, i.e. we believe a proposition and then modify our future predictions based upon how well events matched the expectation, so in terms of replicating intuition the Bayesian approach is clearly the way to go, which is why it is dominant in machine learning and synthetic intelligence. If you have a sufficiently large set of measurements of an explicit distribution, both approaches will converge toward the same result.

Stranger

In many situations the difference between the two approaches ends up being very simple: the distinction between the maximum-likelihood value and the expected value. The frequentist points to the mode of a probability distribution; the Bayesian points to the mean taken over the distribution. (Note that if the distribution is a symmetric bell shape, they give the same answer.)

This is not a complete answer, just an observation of what the two approaches reduce to in many practical situations.
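
A sketch of the observation in Python, using Beta distributions purely as convenient examples (nobody’s real posterior): on a skewed distribution the mode and the mean disagree; on a symmetric one they coincide.

[code]
# Mode (the maximum-likelihood-style point estimate) vs. mean (the
# Bayesian expected value) for a skewed and a symmetric distribution.
from scipy.stats import beta

for a, b in [(2, 8), (5, 5)]:
    mode = (a - 1) / (a + b - 2)   # closed-form Beta mode (valid for a, b > 1)
    mean = beta(a, b).mean()       # closed-form Beta mean, a / (a + b)
    print(f"Beta({a},{b}): mode = {mode:.3f}, mean = {mean:.3f}")
# Beta(2,8): mode = 0.125, mean = 0.200  (skewed: they disagree)
# Beta(5,5): mode = 0.500, mean = 0.500  (symmetric: they agree)
[/code]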

Obligatory XKCD: Frequentists vs. Bayesians

A Bayesian starts with a prior belief and modifies it to a posterior as data is acquired. The more data there is, the more the posterior will resemble the data alone, and the Bayesian view will match the observed frequency. In many cases the Frequentist view is exactly that of a Bayesian with a diffuse prior.

In some cases, Bayesian analysis is required. For example, a Frequentist view of something that has not happened yet assigns it a zero probability. Often this is not a good viewpoint.
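
A quick sketch of both points, with hypothetical counts: Laplace’s rule of succession (the posterior mean under a flat prior) never assigns zero to an unseen event, and it converges to the raw frequency as data accumulates.

[code]
# Raw frequency k/n vs. Laplace's rule of succession (k + 1) / (n + 2),
# which is the flat-prior Bayesian posterior mean for a binary event.
def freq_estimate(k, n):
    return k / n

def laplace_estimate(k, n):
    return (k + 1) / (n + 2)

for k, n in [(0, 10), (0, 1000), (620, 1000)]:
    print(f"k={k:4d}, n={n:5d}: "
          f"frequency = {freq_estimate(k, n):.4f}, "
          f"rule of succession = {laplace_estimate(k, n):.4f}")
# The unseen event (k = 0) gets frequency 0 but a small nonzero
# Bayesian estimate; with plenty of data the two nearly coincide.
[/code]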

Are there not clear practical differences? For example, traditional frequentist statistics makes frequent (er…) use of p-values P(observation | hypothesis) in significance testing, while Bayesian statisticians insist this is not the value one is actually interested in and one must rather compute the converse probabilities P(hypothesis | observation) from their priors, which for the Bayesian is meaningful but which the frequentist insists is meaningless (thus the resort to the mere p-values…).

[For my money, both are right, about both being wrong: in typical contexts, the Bayesians are right that P(observation | hypothesis) is not the value one is interested in, and the frequentists are right that P(hypothesis | observation) is not meaningful… But nevermind me.]
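
[To put numbers on the p-value point, here’s a sketch with entirely hypothetical rates, in the spirit of the xkcd strip above; it’s just Bayes’ theorem applied to significance testing:]

[code]
# P(observation | hypothesis) vs. P(hypothesis | observation): a test
# that fires 5% of the time under the null can still leave the null
# probably true, if real effects are rare a priori. Made-up numbers.
p_h1     = 0.01   # prior: 1% of tested hypotheses are real effects
p_sig_h1 = 0.80   # power: P(significant | real effect)
p_sig_h0 = 0.05   # significance level: P(significant | null true)

p_sig = p_sig_h1 * p_h1 + p_sig_h0 * (1 - p_h1)
p_h1_given_sig = p_sig_h1 * p_h1 / p_sig

print(f"P(significant | real effect) = {p_sig_h1:.2f}")
print(f"P(real effect | significant) = {p_h1_given_sig:.2f}")  # about 0.14
[/code]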

Nope.

That’s a tall charge.

First: Okay, I’m not a statistician, but whenever I’ve seen attacks on “Bayesianism” in the past, that charge of “meaningfulness” is seldom (in my experience) the complaint. The attack is almost always that the unconditional prior is chosen arbitrarily. Hence the joke: “The prior was pulled from their posterior”. The worst examples of this sort of attack have always been, “Look, guys, I deliberately chose a bad prior! Look at how terrible the results can be from that, derpy derpy derp!!” (I’ve seen similar examples, too, of Bayesians cherry-picking examples where a frequentist approach is doomed to fail. Instinctively, I’m Bayesian to my bones, but that’s never seemed like a fair way to do things in either direction. You use the tool that’s fit to the task.)

Second: P(hypothesis | observation) can be interpreted meaningfully so easily that to regard it as “not meaningful”, without further comment, is just… highly ill-advised. If somethin don’t work for you, then it don’t work for you. That’s no skin off anybody’s back. But talking about meaningfulness means you’re getting downright philosophical on everybody else, and when you do that, you’re going to collide with the views of people who have spent large amounts of their life thinking about this very topic.

I’m not anything close to a statistician, but I could go on for weeks about this stuff.

You must have some highly specific notion of “meaningful” in mind if it does not permit such charges as “But those priors are arbitrary!” to be phrased in ordinary language as “But those numbers are meaningless!”.

I don’t mean literally meaningless: any calculational ritual anyone performs anywhere has some literal meaning, the meaning prescribed to it by usage, the formula of the ritual. But, yes, I would dispute the sensibility of supposing that uncertainty is generally best coded probabilistically.

That is, I’ve never been sold on the idea now eating the world that all uncertainty is best thought of as numerical probabilities (i.e., proportions in the range [0, 1]). Certainly, in some cases, there is a clear frequency model of uncertainty in which case the assignment of probabilities is readily interpreted in its relation to real-world quantities. When this frequentist interpretation of probability is available, its sense is clear, and the accuracy of a claim can be gauged. Does the claimed frequency match the empirical frequency?

The Bayesian of course goes beyond this and says probabilities needn’t be based in frequencies, but reflect belief credences.

Are they speaking about credences of people’s beliefs as they actually are? For I look at my beliefs, and they don’t act anything like how probabilities (again, meaning numerical proportions in the range [0, 1]) act.

For example, why couldn’t it be that there were three uncertain propositions A, B, and C of which I had no reason to find A more or less plausible than its negation ~A, nor than B, nor than C, though A, B, and C are exclusive and mutually exhaustive? Why couldn’t I have total ignorance about a situation?

It seems to me, in fact, this is very often how my uncertainty works; i.e., with “Knightian” uncertainty as the usual kind, and only very particular uncertainties having a reasonable frequency-like interpretation (here, I say “frequency-like” just in reference to the mathematics of probability, which is, and developed as, the mathematics of frequency/proportion/etc., whether or not it has any other interpretation).

[If my beliefs do, in fact, have the structure of probabilistic numbers, they have it in a way which is inaccessible to me upon introspection, at which point I don’t know what it amounts to that they’re supposedly my beliefs…]

Well, of course I’ll be told, Bayesian probabilities aren’t meant to model my beliefs as they are, but rather, as they should be, were I truly, properly rational… But why SHOULD my beliefs have this structure, either?

“But what of your betting odds?”. What of them? I don’t walk around with particular betting odds attached to every proposition. There are, in fact, propositions of which I would toss up my hands and say “I dunno. I don’t really want to bet either way at most odds, because I just don’t know and am risk-averse”. Most propositions are like this, for me. Is this not the ordinary human way? There is not a secret fixed price buried in my brain for everything. (Similarly, I can quibble with von Neumann–Morgenstern utility arguments.)

“But you’ll be susceptible to a Dutch book!”. No, I won’t. Because, like I said, I don’t take most bets. Bam! Dutch book proof. You are only susceptible to Dutch books if you choose to set hand-forcing prices on each bet. Why would you do that? Don’t do that. That’s foolish.

“But Cox’s theorem!”. Cox’s theorem is contrived from the start with the presumption that one must model uncertainty as just the sort of numbers I haven’t been sold on. OF COURSE if plausibilities are all linearly ordered, one could naturally represent them with numbers in [0, 1], by comparison to suitable coin flip bets (for which frequency interpretations are available). But the A, B, C example above illustrates plausibilities not well-modeled as points along a totally ordered continuum, thus scuttling Cox from the start.

I’m willing to hear an argument for why rationality demands I reason by assigning propositions numbers and updating those numbers in accordance with the particular mathematical rules of frequency, but I, too, have been thinking philosophically and professionally about what logic is and demands for a very large amount of my life, so I don’t quake at the mere fact that I disagree with others (though admittedly others have lived longer than me…).

[Going back to “meaningfulness”, again, as I said, when the frequentist interpretation of probability is available, its sense is clear, and the accuracy of a claim can be gauged. Does the claimed frequency match the empirical frequency? But in the Bayesian interpretation of probability, there’s no criterion of truth to provide any sense to a claim about probabilities beyond “Is this in fact the number that the model produced?”. Which becomes tautologous. “Do you in fact believe what you said you believed?”. Yes, you can ask about calibration in the sense of “Does this model announce p% confidence claims correctly p% of the time?”, but that doesn’t really give a sense to any particular claim that a particular proposition ought be treated as having a particular sort of probability.

I do not dispute that Bayesian models can in practice usefully carry out actions of desirable sorts with high frequency, incidentally. Let me be clear about that. But I do dispute this is because there’s some magic secret sauce in the particular rules of modelling and updating propositional truth values according to the particular mathematics of [0, 1] frequencies. All kinds of non-Bayesian models can also carry out actions (make predictions, or whatever) of desirable sorts with high frequency as well.]

I also do not dispute the charge that I say ill-advised things on occasion! Let that be clear.

(But, again, I dispute that I have not earned the right to collide with the views of others on “downright philosophical” matters… Not that I am alone in any of my positions on these contentious issues, individually. For example, whether or not one agrees with them on this point, I believe that, yes, many of the historical founders of frequentist statistics wrote explicitly on their consideration of the probability P(hypothesis | observation) as “meaningless”, contra the Bayesians. I will find quotes and cites shortly [but first, having stayed up all night for no good reason, I will go to bed…])

I’m not sure what you’re saying here, but it’s not right.

The more bells and whistles a problem has, the more likely someone is to slip into a mixture of Bayesian and frequentist thinking. For a clean example, let’s take the simplest parameter estimation problem one can have.

Say you put a cat on a weighing scale and extract one data point, w, which might have a value like 9.2 lb. So, what can you now say about the cat’s actual weight?

Both a Bayesian and a frequentist would study the weighing apparatus and weighing procedure to establish the function f(w|w[sub]true[/sub]) – the probability density for measuring a value w as a function of the (unknown) true weight w[sub]true[/sub]. This function describes everything there is to know about the experiment itself, and all that’s left is to hold our data point up against f in some way.

(1) The Bayesian would talk about the probability density function for w[sub]true[/sub]. He’d combine f with a prior pdf for w[sub]true[/sub] and construct a posterior pdf. He could then answer a question like “What’s the most likely value for w[sub]true[/sub]?” Note that it makes perfect sense to talk about probabilities of w[sub]true[/sub] here.

(2) The frequentist would say that makes no sense. w[sub]true[/sub] isn’t a random quantity at all. It’s an unknown quantity, but it has some fixed value. So, he talks about probabilities of obtaining certain experimental outcomes for various assumptions about the unknown-but-not-random value of w[sub]true[/sub]. He could construct a confidence interval that has a certain probability of containing w[sub]true[/sub]. He cannot answer the question “What’s the most likely value for w[sub]true[/sub]?” If he does answer it, he’s being Bayesian and doesn’t realize it.

There are other “practical” questions that one approach can answer and the other can’t, but I’ll stop with this core one for the moment.
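
For concreteness, here’s a sketch of the cat example, assuming (hypothetically) a known Gaussian scale error of 0.5 lb and, for the Bayesian, a flat prior. With those particular choices the two intervals happen to coincide numerically, but note that only the Bayesian analysis licenses the last line:

[code]
# The cat on the scale, both ways, with a hypothetical known error.
from scipy.stats import norm

w, sigma = 9.2, 0.5   # one reading, and the (assumed known) scale error

# Frequentist: a 95% confidence interval for w_true. The *procedure*
# covers the fixed-but-unknown w_true in 95% of repeated experiments.
print(f"95% confidence interval: ({w - 1.96 * sigma:.2f}, {w + 1.96 * sigma:.2f})")

# Bayesian: with a flat prior the posterior for w_true is N(w, sigma^2),
# so one may speak directly of probabilities *of* w_true.
posterior = norm(loc=w, scale=sigma)
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval : ({lo:.2f}, {hi:.2f})")
print(f"P(w_true > 9.0 | data) : {posterior.sf(9.0):.3f}")
[/code]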

I don’t know what this means.

“Generally best” is also slippery. For a mathematician like you, something that is generally true could mean true in all cases. Or in regular English, generally true can mean most often true, but with definite exceptions.

To be clear, then, I’m not saying that a Bayesian approach is always best in all possible cases. I said the same thing above: it’s wise to be flexible and use whichever tool is most appropriate to the given job. But in my experience, a Bayesian approach is very often the most helpful approach for a wide variety of questions.

I am certainly not.

You ever see the Lego Movie?

It’s actually one of the deepest children’s movies I’ve watched. A problem with a lot of art is that it tries to repackage old, tired ideas in a new format, and the Lego Movie gives a self-referential wink to this. (Here’s FILM CRIT HULK on the flick, with spoilers in the analysis, if you can stand the all caps.) One of the refrains of the movie is:

“I know that sounds like a cat poster, but it’s true.”

There are ideas that are very obvious or platitudinous, to the point of schmaltz or even outright tedium, but the ideas are nevertheless true and important. Junior year, there was a poster on my high school English teacher’s wall that was the same sort of thing: “If you have a choice to make, and don’t make it, that is itself a choice.” I don’t mean to be tedious here, but I really need to drive home the platitude because it’s important. When you’re faced with a bet between A, B, or C, and you refuse to make any bet at all, that is also a choice.

I know you know this. But I need to hammer it because there is always a cost involved with literally every choice we make. When we choose to turn left at this instant, we cannot simultaneously turn right. The price of the path we take at this moment is the paths we cannot simultaneously take. For many, many of our choices, this cost is small. There is no particular loss in choosing not to bet on A or B or C at all. No big deal. You’re risk averse. You’re not knowledgeable about the topic. You’re not interested in the topic. You have more important uses of money. The reward is not very good. All of us face these kinds of bets every day, and not only is it no big deal not to bet, refusing to bet is pretty clearly the best choice to make out of our set of possible choices.

But the existence of that cost, and just as important the conscious awareness that that cost exists, necessarily means that there is a margin at which we would change our minds about the choice we make.

The way you’ve posited the hypothetical is to make the point that there are bets you don’t care about. But you’re not engaging the margin when you say your choice is “no bet”, and the margin is what’s actually interesting about this problem. What cost of staying out makes you change your mind? What if the choice is a billion dollars and the ability to bring reliable plumbing and mosquito-resistant bed netting to the developing world? Saying “risk averse” is not going to cut it here. Even people who are risk averse will eventually face a cost high enough that it changes the choice that they want to make.

That is the margin you need to engage.

Which brings me to this:

I’m taking this out of context. This is from the previous section where you’re talking about how human beings don’t operate by Bayesian reasoning. (Which is true.)

But you know as well as I do that if you’re faced with an opportunity cost that is sufficiently high that you’re induced to make a bet, instead of choosing to stay out, you are never going to start your analysis from the possibility that A and ¬A are equally plausible, while B and ¬B are equally plausible, while also C and ¬C are equally plausible, when A, B, and C are mutually exclusive. If the cost of non-participation is high enough that you’re induced into a bet instead of a non-bet for your choice, you will be more rigorous than that.

The likeliest form of your bet is going to be deciding first what you think you want. You will decide your utility function. Of course, you’re not naive, and you know as well as anyone that this won’t be a real reflection of your innermost desires, that your conscious self is less than fully informed about what you want. But that’s exactly why you will think carefully about your goals. You will try to put your wants into some sort of workable context, because you don’t know exactly what you want. This “utility function” won’t necessarily follow the vNM axioms, but in your case, I bet it’s pretty close.

The next step you are very likely to take if the stakes are high enough is considering, as carefully as you can, which of the mutually exclusive possibilities is most likely to be true. I don’t know how well read you are on the topic of real-world forecasting, but I imagine if you are sufficiently versed in the habits of skilled forecasters, you’re going to follow their example. (The Tetlock book is excellent.) Taking their advice would mean getting your “base rates” from previous historical experience as much as possible.

And if you’re like most everyone else who consciously and rigorously sorts through these sorts of considerations, what you’re going to find is that Knightian Uncertainty collapses right down into the Bayesian probabilities. It’s not a separate thing. The possibility of major ignorance generalizes (in the mathematical sense) right into the Bayesian probabilities, with the result that we see more uniform-looking distributions because of an inability to distinguish between plausible choices.

All of this is especially clear with respect to something like stocks, because the no-bet choice is, in fact, a major fucking bet. People who think they’re not making any bet at all with their investments are, in fact, putting 100% of their bet on the stability of their native currency. A person who actually took Knightian Uncertainty seriously would have an extremely diversified portfolio, including stocks, bonds, commodities, both domestic and also (crucially!) internationally with a range of investments across many different countries and currencies.

And this gets me right down to the key issue.

I see, too often, Knightian Uncertainty used as an empty excuse for inaction, most especially in those cases when the “no bet” choice is not only a major bet, but one of the riskiest bets that a person could actually make. I don’t know if that’s the case with you. Your examples above were all, of course, perfectly legitimate. But part of their legitimacy is that they weren’t pushed far enough. Physicists fire up particles near to the speed of light to bang them against a wall. The extreme situation brings clarity. But yours was not a High Energy Hypothetical. You don’t discuss the margin at which your decision changes, and the margin – again – is what’s actually interesting about this issue.

When people think about this margin carefully, they almost all of them turn Bayesian in my experience, if only provisionally for the sake of the single issue.

I’m totally open to more possibilities, but I’d want to see the reasoning behind those exceptions.

I’m not convinced that rationality demands pure Bayesian thinking to the exclusion of all other possibilities. But I am convinced that any sufficiently robust thinking along these lines will turn out to be at least approximately Bayesian in appearance.

I’d be willing to see alternatives, of course. But any idea along these lines needs to have particles flung at it at near the speed of light. We need to have High Energy Hypotheticals to see those situations at the margin where one choice becomes a different choice.

The frequentist method you describe there is perfectly practical. It’s a good procedure.

But your subsequent criticism of a Bayesian probability is philosophical, which means my criticism here must be likewise. You don’t actually know that the empirical frequency matches the claimed frequency. You might have written the claimed frequency down wrong. Your instruments measuring the empirical frequency might be miscalibrated, in a way that doesn’t match any previous instrumental error. You might not be doing an experiment at all, but lying about all the data in order to get a publication.

These kinds of considerations are, usually, irrelevant to the practical purpose at hand. But they still exist. We sweep them under the rug, and rightfully so, because the opportunity cost is generally low enough that it’s not worth the attention.

You move from there to a “criterion of truth” by which to judge a Bayesian probability, which is of course entirely the main issue. This particular sentence: “that doesn’t really give a sense to any particular claim that a particular proposition ought be treated as having a particular sort of probability”.

Ought be treated as a probability! What a phrase.

This post is long enough, but this is exactly the issue. If you can explore a bit for me how you might personally act on the margin from “no bet” to “bet”, then we can move from there to the point about the best treatment for them, which will by necessity be another long post.

That bit was just a wording thing. It was just that you seemed to be particularly objecting, “Don’t say Bayesian probabilities lack for ‘meaningfulness’. The frequentist complaint isn’t that Bayesian reasoning is ‘meaningless’; the complaint is that it uses arbitrary priors”, and I was clarifying that part of what my use of the word “meaningless” (and its variants) in this context was meant to cover was indeed such issues as the arbitrariness of the prior.

Bayes’ Theorem is certainly true, and one ignores it at one’s peril, but a purely Bayesian interpretation of probability, without any frequentist aspects, is completely useless. A purely Bayesian interpretation can do nothing without a prior, and if the prior is bad enough, the results will be nonsense. So, clearly, a prior should be chosen with some care… except that Bayesian methods provide no way whatsoever of producing a prior. Bayes’ Theorem cannot ever give you a probability without first having a probability.
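
To sketch that prior-dependence concretely (the priors and the 3-heads-in-4-flips data are all made up):

[code]
# Same sparse data, three different priors, three different answers:
# Bayes' theorem transforms a prior into a posterior, but it cannot
# manufacture the prior itself.
from scipy.stats import beta

heads, tails = 3, 1   # hypothetical data: 4 coin flips

priors = {"flat Beta(1, 1)":         (1, 1),
          "fair-coin Beta(50, 50)":  (50, 50),
          "tails-heavy Beta(1, 10)": (1, 10)}

for name, (a0, b0) in priors.items():
    post = beta(a0 + heads, b0 + tails)
    print(f"{name:24s}: posterior mean = {post.mean():.3f}")
# With 4 flips the prior dominates; with thousands of flips the three
# posteriors would converge toward the observed frequency.
[/code]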

Now, as to what Indistinguishable says, that there can be notions of “likelihood” to which a number cannot be assigned, here’s my favorite example:

Take Goldbach’s conjecture (or the twin primes conjecture, or the Collatz conjecture, or any of a variety of other unsolved math problems). Ask any mathematician whether the Goldbach conjecture is true, and they’ll say “probably”. But with what probability? How can any number be assigned to that probability, and how can such a number be interpreted? If I say that it’s 99% likely to be true, does that mean that given 100 sets of all positive integers, the Goldbach conjecture will be true in 99 of them?

Yes, even the frequentist accepts Bayes’ Theorem as certainly a true fact about frequencies (and proportions more generally). Of that there is no dispute; only as to whether the mathematics of frequencies/proportions/etc should also generally be used to model truth values under uncertainty.

Yup, although I’ll note that all of those are propositions that admit the possibility of undecidability from ordinary axiom systems, such that no definitive resolution may ever come down one way or the other within those frameworks. We can go even further and even consider examples which are decidable (in that there is a straightforward process to eventually obtain finite resolution to either true or false), but which simply have not yet been decided: e.g., uncertainty as to whether chess between perfect players is a win for white, win for black, or draw, or uncertainty as to whether the product of the first googol many primes is 1 less than a prime, or such things.

Well, if you properly analyze the situation, you get the same answer both ways.

The main difference is that if you use Bayesian methods, then your internet post with the answer must include at least two references to “Bayes” or “Bayesian”, and a tone implying that non-Bayesian methods are wrong. Long, vague, philosophical digressions are optional, but must include sweeping statements about ‘frequentist’ thinking.

As a practical matter? Well, if you’re looking at the internet, a post that includes ‘Bayes’ or some variant means the poster isn’t completely ignorant about probability, and so is slightly more likely to be correct; but it will be so much longer that you’re probably going to find the right answer just as quickly by reading only posts without ‘Bayes’ in them.