I have been reading about Bayesian statistics, Markov chains, and Monte Carlo simulations for a long time, but I have only the vaguest grasp of what these terms really mean. Reading Wikipedia quickly gets confusing as the articles all seem to assume some base level of knowledge, and trying to click the links that explain the individual terms just takes me down a rabbit hole.

Can anyone provide a for-the-layman summary? Or, point to a site that does so? I’m not mathematically illiterate, but calculus was a long time ago and I never took any courses in statistics.

This covers kind of a broad range of levels - MCMC is much more advanced and technical than basic concepts in Bayesian statistics.

You can approach it from the computer science direction rather than the pure math direction if you don’t want to spend a lot of time on Khan Academy. For MCMC, graph theory is probably the place to start to help you conceptualize the ideas.

The first video in this playlist, “Seven Bridges of Konigsberg,” will give you a good start on graphs.

Or you can try this video from PBS’s now-defunct Infinite Series about Random Walks.

To attempt a summary: Bayesian networks are a type of probabilistic graphical model, and a Markov chain is a type of random process, or “walk,” in which the next step depends only on the current state. Where a Markov chain terminates can broadly be thought of as the place that random walk tends to end up.
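To make that “random walk” idea concrete, here’s a minimal sketch with a made-up two-state weather chain (the states and transition probabilities are invented for illustration): the next state depends only on the current one, and over many steps the walk settles into a predictable long-run behavior.

```python
import random

def simulate_chain(steps=100_000, seed=0):
    """Random walk on a tiny two-state Markov chain: the next state
    depends only on the current one (probabilities are made up)."""
    rng = random.Random(seed)
    p_sunny_next = {"sunny": 0.9, "rainy": 0.5}  # P(next is sunny | current)
    state, sunny_steps = "sunny", 0
    for _ in range(steps):
        state = "sunny" if rng.random() < p_sunny_next[state] else "rainy"
        sunny_steps += state == "sunny"
    return sunny_steps / steps  # long-run fraction of time spent in "sunny"
```

No matter where the walk starts, it ends up spending about 5/6 of its time in “sunny” — that long-run distribution is the “place the walk tends to end up.”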

It is almost impossible to accurately convey how they work without using math as a language. For example, very hard problems like computing the mass of the proton (using MCMC and Lie groups) are unsolvable if you try to solve for every variable exactly. But if you have a good model of where things may end up, a “random walk” can narrow things down and shrink the problem to a tractable size.

A lot of well-known problems in graph theory are NP-complete decision problems. An overly simplified meaning of NP-complete is that these are problems that will probably never be solved by brute force, because the time required grows so fast with the size of the input.

The *NP-complete* travelling salesman problem is: “Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?”
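A brute-force solver makes the blow-up concrete: it tries all (n−1)! possible tours. At 4 cities that’s 6 tours; at 20 cities it’s over 10^17. The distance matrix below is made up for illustration.

```python
from itertools import permutations

def shortest_tour(dist):
    """Brute-force TSP: try every ordering of cities 1..n-1, starting and
    ending at city 0. O((n-1)!) tours -- fine for a handful of cities,
    hopeless for forty."""
    n = len(dist)
    best_len, best_tour = None, None
    for perm in permutations(range(1, n)):
        tour = (0, *perm, 0)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if best_len is None or length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

dist = [  # symmetric toy distance matrix, invented for illustration
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
```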

It may not be intuitive, but everyday tasks like Google searches are actually closely related to that problem. It may be worth your time to dig into graph theory and/or probability theory if the topic interests you. Think of those topics as the mathematical foundations of statistics; those roots are what will help you learn about these subjects.

Here’s a simple way to think of a Monte Carlo analysis: say you want to find the value of Pi. Draw a one-foot by one-foot square on the wall. Then draw a one-foot diameter circle inside that square. Now, throw a million darts at random into the square. Some will be inside the circle, some will not.

The area of the square is obviously one square foot. The area of the circle is Pi/4. By comparing the number of darts inside the circle to the total number thrown, we can solve for Pi: the fraction landing inside the circle approaches Pi/4.
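That dartboard experiment is only a few lines of code. Here’s a sketch following the description above (the dart count and seed are arbitrary):

```python
import random

def estimate_pi(n_darts=1_000_000, seed=42):
    """Throw darts uniformly at a 1x1 square and count hits inside the
    inscribed circle (radius 1/2, centered at (1/2, 1/2))."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_darts):
        x, y = rng.random(), rng.random()
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            inside += 1
    # circle area / square area = (Pi/4) / 1, so Pi ~= 4 * inside / n_darts
    return 4 * inside / n_darts
```

With a million darts the estimate typically lands within a few thousandths of the true value; the accuracy improves only as the square root of the number of darts, which is typical of Monte Carlo methods.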

Okay, thanks for the replies, and I will check out the videos.

The Monte Carlo method is useful when a system has so many variables and so many possible outcomes that it’s difficult to analytically calculate the probability distribution of the outcome.

For example, consider predicting a Presidential election based on polls from every state. There are 50 states, each of which can vote for R or D (OK, simplifying here), so the number of possible outcomes is 2^50, which is just over a quadrillion (10^15). The “correct” (analytical) way to predict the outcome is to calculate the probability of every one of those possible outcomes, and add up all the outcomes that end up with R winning vs D winning. But that would take forever.

Instead you do a Monte Carlo simulation. For every simulation run, the computer randomly chooses an outcome for each state according to the polls. I.e. if polls show R has a 70% chance of winning state A, then the computer picks R 70% of the time. And you can easily add other parameters too - e.g. maybe there is a chance of systematic error in the polls, say a 20% chance that R does 1% better than the polls in all states. You can run this simulation as many times as you want, and it will still be a lot quicker than calculating the probability of EVERY outcome.
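A sketch of that simulation, with everything (the number of states, win probabilities, electoral votes, and the 20%/1% systematic-error rule) as placeholder numbers:

```python
import random

def simulate_election(win_prob, electoral_votes, n_runs=10_000, seed=1,
                      shift_prob=0.2, shift=0.01):
    """Monte Carlo election forecast: win_prob[i] is R's poll-based chance
    of winning state i. All numbers here are illustrative placeholders."""
    rng = random.Random(seed)
    total = sum(electoral_votes)
    r_wins = 0
    for _ in range(n_runs):
        # Systematic polling error: with probability shift_prob, R does
        # `shift` better than the polls in every state at once.
        bump = shift if rng.random() < shift_prob else 0.0
        r_votes = sum(ev for p, ev in zip(win_prob, electoral_votes)
                      if rng.random() < min(p + bump, 1.0))
        r_wins += r_votes * 2 > total
    return r_wins / n_runs  # estimated probability that R wins overall
```

Ten thousand runs is a trivial amount of work compared to enumerating all 2^50 outcomes, and the answer converges on the same number.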

Another example from my own line of work in optical design. Nothing is made perfectly, so we need to analyze how perfectly each optical element needs to be fabricated and positioned (i.e. tolerances). If I were designing a telescope with 2 mirrors, I could try to plot the imperfection of each mirror (positional error, tilt error, curvature error, etc) vs. the impact on image quality. But what if I’m designing a telescope with 3 mirrors + 10 lenses? There’s no way I could exhaustively calculate the combined effect of all errors. Which is why the optical design software has a Monte Carlo simulation capability - I define the tolerances for each mirror and lens, and the software runs a simulation. For each simulation run, each mirror/lens is deformed, tilted and displaced randomly within the defined range. This gives me a good idea of how the finished telescope is likely to perform, assuming it’s manufactured to those tolerances. If it’s unacceptable, I’d tighten the tolerances and try again (and increase the cost estimate, because I’ll have to pay more to the machine shop or optics supplier to achieve those tolerances).
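The core loop of such a tolerancing run is easy to sketch. This toy version is not any real optical-design package’s API — the parameter list and performance function are stand-ins — but the random-sampling skeleton is the same:

```python
import random

def tolerance_monte_carlo(nominal, tolerances, performance, n_runs=1000, seed=7):
    """Perturb each design parameter uniformly within +/- its tolerance
    and evaluate a caller-supplied performance function for each trial."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        trial = [x + rng.uniform(-t, t) for x, t in zip(nominal, tolerances)]
        results.append(performance(trial))
    return results  # distribution of performance across simulated builds
```

Real software does far more per trial (ray-tracing the perturbed system, re-optimizing compensators, and so on), but the output is the same kind of thing: a distribution of as-built performance that tells you whether the tolerances are tight enough.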

Bayesian statistics, loosely, is a type of statistics that treats probability as “degree of belief” in a thing, and under that paradigm, Bayes’ rule describes the way you update belief based on observations.

The hand-wavy version of Bayes’ rule is:

(Amount you believe in something after observation) = (Amount you believe in something before observation) * (The probability of the observation if the thing is true) / (The probability of the observation in any case)

Basically, if you see something that is very likely if X is true, and not very likely if X is not true, it should increase your belief in X.

Conversely, if you see something that is not very likely if X is true, and likely otherwise, it should decrease your belief in X.

(And if you see something that is just as likely whether X is true or not, then your level of belief in X doesn’t change).

I don’t think the term “belief” is appropriate here. It’s all about joint probabilities.

Let’s say there is a blood test for Ebola that’s 99% accurate. You haven’t been to any countries with an Ebola outbreak, or been around anyone who came back from those countries recently. You take the test anyway, and it’s positive! Classic (non-Bayes) probability will suggest there’s a 99% chance you have Ebola. Bayes statistics take into account the fact that you have (say) 1 in a million chance of you getting Ebola in the first place, and correctly tell you that it’s far more likely that the test result is wrong.
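Plugging the post’s numbers into Bayes’ rule takes one line of arithmetic:

```python
def posterior(prior, p_pos_if_sick=0.99, p_pos_if_healthy=0.01):
    """Bayes' rule: P(sick | positive test), for a 99%-accurate test."""
    p_positive = prior * p_pos_if_sick + (1 - prior) * p_pos_if_healthy
    return prior * p_pos_if_sick / p_positive

p = posterior(1e-6)  # one-in-a-million prior
# p comes out near 0.0001: the positive result is almost certainly a false alarm
```

The false positives from the 999,999-in-a-million healthy people vastly outnumber the true positives from the one-in-a-million sick person, which is why the posterior stays tiny.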

Plus, Bayesian updating is properly run iteratively. Suppose p(having terminal amnesia) in the general population is 1% and you test positive. We run the stats and come up with a 17% chance you have terminal amnesia. Your second test comes up positive too, but now I use p(YOU have terminal amnesia) = 0.17 as the prior, not the 0.01 from the original test.
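The chaining looks like this. Note the post doesn’t state the test’s accuracy, so the 95% figure below is an assumption chosen to roughly reproduce the 17%:

```python
def update(prior, sens=0.95, false_pos=0.05):
    """One Bayesian update after a positive test. The 95% accuracy is an
    assumed illustrative figure, not given in the post."""
    p_positive = prior * sens + (1 - prior) * false_pos
    return prior * sens / p_positive

first = update(0.01)    # ~0.16: start from the 1% population rate
second = update(first)  # ~0.78: yesterday's posterior is today's prior
```

Each positive result feeds the previous posterior back in as the new prior, so repeated positives push the probability up fast.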

Also, Bayesian stats are great for discussing type 1 and type 2 errors. Run a Bayesian calculation of p(you have breast cancer given a positive mammogram) and the result is non-intuitive. Then ask: do we want more false positives or more false negatives?

The best explanation I’ve seen for Bayesian statistics

In most introductory statistics courses, they focus on the question “If X is true, what’s the probability that I would have gotten the observations I did?”. But that’s probably not really the question you want to answer. What you really want to know is “Given the observations I know I got, what’s the probability that X is true?”. And to get that, you have to have what’s called a prior: an estimate of what the probability would be if you had no data at all.

In the comic, we want to know whether the Sun has gone nova. If it hasn’t, then there’s only a low probability that the device will lie to us and say that it has. But on the other hand, we have a prior expectation: Even without seeing any data at all, we can say that it’s a really, really low probability that the Sun has gone nova. So, even with the device saying that it has, because our prior had such a low value, we can still say that it probably hasn’t gone nova. To put a precise number on the probability, what we’d use is called Bayesian statistics.

Of course, statistics isn’t actually a duality between the Bayesian and the frequentist interpretations. To actually do anything, you have to use both, and they’re inseparable. Because after all, where do you get your priors from?

Okay, I watched the videos. The Konigsberg Bridge one was comprehensible, but it’s not clear to me how that relates to Bayesian statistics and I didn’t want to watch all ten of her videos to find out.*

The Random Walk one was good, and the same mathematician also had one about Markov chaining which I watched. I think I’m starting to have a glimmer on those.

Chronos, thanks for the XKCD. So Bayesian analysis involves baking in some presumptions/preconceptions–does that not risk producing an outcome predicated on the assumptions? Begging the question, in its original meaning. I’ve seen references to a “Bayesian Trap”–is that what’s being described by the term?

*I also spent some time for laughs watching a video in which someone “solves” the Konigsberg Bridge problem…by adding a bridge.

Of course anything can be misused. Bayesian statistics will give you the wrong answer if your estimate of the prior is way off. But even then, it’s generally better than just going with the non-Bayesian analysis. E.g. in my Ebola test example, let’s say the doctor told you “it’s most likely a false test result” but didn’t know that your wife had just come back from Congo, so your real prior (probability you had Ebola) was 1/1000 instead of 1 in a million. Even then, the conclusion is more right than wrong.

You may need to set the “style” on the bottom left of the page to “Straight Dope v3.7.3” to read this.

But this is an example of a programming challenge where they give you a description of an “absorbing Markov chain” without telling you what it is. While it may still take some thinking to see how it works, there is enough information there to figure out the logic without knowing the math.

```

For example, consider the matrix m:
[
[0,1,0,0,0,1],  # s0, the initial state, goes to s1 and s5 with equal probability
[4,0,0,3,2,0],  # s1 can become s0, s3, or s4, but with different probabilities
[0,0,0,0,0,0],  # s2 is terminal, and unreachable (never observed in practice)
[0,0,0,0,0,0],  # s3 is terminal
[0,0,0,0,0,0],  # s4 is terminal
[0,0,0,0,0,0],  # s5 is terminal
]
So, we can consider different paths to terminal states, such as:
s0 -> s1 -> s3
s0 -> s1 -> s0 -> s1 -> s0 -> s1 -> s4
s0 -> s1 -> s0 -> s5
Tracing the probabilities of each, we find that
s2 has probability 0
s3 has probability 3/14
s4 has probability 1/7
s5 has probability 9/14
So, putting that together, and making a common denominator, gives an answer in the form of
[s2.numerator, s3.numerator, s4.numerator, s5.numerator, denominator] which is
[0, 3, 2, 9, 14].

```
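For what it’s worth, the math behind that challenge fits in a short script: normalize each transient row into probabilities, then solve (I − Q)B = R exactly with fractions. This is my own sketch of one way to do it, not the challenge’s reference solution:

```python
from fractions import Fraction

def absorption_probabilities(m, start=0):
    """Probability of ending in each terminal (all-zero-row) state,
    starting from `start`. Rows hold unnormalized transition weights."""
    n = len(m)
    trans = [i for i in range(n) if any(m[i])]      # transient states
    term = [i for i in range(n) if not any(m[i])]   # terminal states
    # A = I - Q over the transient states; R = transitions into terminals.
    A = [[(1 if i == j else 0) - Fraction(m[i][j], sum(m[i])) for j in trans]
         for i in trans]
    R = [[Fraction(m[i][j], sum(m[i])) for j in term] for i in trans]
    # Exact Gauss-Jordan elimination on [A | R], so no floating-point error.
    k = len(trans)
    for col in range(k):
        piv = next(r for r in range(col, k) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        R[col], R[piv] = R[piv], R[col]
        d = A[col][col]
        A[col] = [x / d for x in A[col]]
        R[col] = [x / d for x in R[col]]
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
                R[r] = [a - f * b for a, b in zip(R[r], R[col])]
    return dict(zip(term, R[trans.index(start)]))

m = [[0, 1, 0, 0, 0, 1],
     [4, 0, 0, 3, 2, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]]
probs = absorption_probabilities(m)
# probs[3] == Fraction(3, 14), probs[4] == Fraction(1, 7), probs[5] == Fraction(9, 14)
```

Using exact fractions rather than floats is what lets you read off the [0, 3, 2, 9, 14] common-denominator answer directly.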

This is not a good example of a distinction between frequentist and Bayesian statistics. Frequentist statistics still has the concept of conditional probability (and Bayes’ rule even) and would just as well come up with P(has ebola|tests positive for ebola) is small.

Frequentists would just interpret that statement something like, “if I repeated the test a large number of times, and only considered those cases where the outcome is a positive test for ebola, in only a small fraction of those cases would the patient actually have ebola”. Bayesians would say something like, “the positive test for ebola increases my expectation that the patient has ebola, but not enough to overcome my low prior probability that the patient has ebola”.

The fact that a seemingly accurate test (only 1% false positive and false negative rate) would result in such low probabilities of successful true detection is a paradox of intuition not matching reality, not a flaw in either interpretation of statistics.

I recall a brief review of basic statistics in the back of the particle physics handbook, discussing things like defining confidence intervals. I may as well quote directly from the introduction:

Long ago I opened a thread on this board, asking about the difference between Bayesian and Frequentist statistics. People seemed to agree that my layperson’s summary of the two approaches was basically correct (despite me being very much a layperson myself), and Pasta in particular provided a lot of very comprehensive explanations and details. You may find it useful.