I have been reading up on this after being introduced to it this election season by Nate Silver of fivethirtyeight.com. But not having much statistical knowledge, I am a bit baffled by the mechanics of the Monte Carlo technique.
I understand that there is a set of algorithms based on some observations and rules about the environment. In 538’s case, I think it is demographic analysis, donation history, vote history, etc. Then there are variable inputs like current polling. But these inputs are defined, not purely random; they are the source information.
My understanding: at some point, a random distribution of possible results is introduced and then weighed against the algorithms and the inputs to see if it fits the prediction. Is that a proper assessment of what is going on? If so, how/where in the analysis does this random element come about?
I know that the system for 538 is not public, but even an example of a simpler, similar model would help me to grasp this. I understand, for example, the “Battleship” model that is used as an intro on the wiki page for Monte Carlo, but it’s too simple; for the life of me, I can’t see how that would apply to a more complex system like Nate’s.
If you link to the page you’re talking about, I’ll take a look and maybe I can see what they’re doing, but here’s my best guess:
Usually what they mean is that they take the results of their polls as the actual values of public support for candidate X (let’s say we found 39% of the public liked our candidate). Then they use a random number generator to run a simulation of a poll (so let’s say our poll had a sample size of 100; then we generate 100 random numbers 0-1, and count every value below .39 as a person who said they support our candidate). This will give them the result of their simulated poll (so in our example, if 37 of our 100 numbers were less than .39, the result of our pretend poll was 37%). Now, since you’re using a computer you can do, say, a million of these simulated polls. This gives you the chance that the actual value of support for the candidate was within some region (returning to our example, if we do a million simulated polls, and 900,000 of them are above 37%, we can say that there’s a 90% chance that the actual support for the mysterious candidate X is above 37%, based purely on our simulations and the one actual real poll that we started with).
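Since I’m not a statistician, take this with a grain of salt, but here’s roughly what that looks like in code. A minimal sketch, assuming the 39% figure and 100-person sample from the example above; the run count is scaled down from a million so plain Python finishes quickly, and the names are just illustrative:

```python
import random

POLL_RESULT = 0.39   # the one real poll, treated as the actual support level
SAMPLE_SIZE = 100    # respondents per simulated poll
NUM_POLLS = 100_000  # a million in the example; scaled down to run quickly

def simulate_poll():
    """One pretend poll: each respondent backs X with probability POLL_RESULT."""
    supporters = sum(1 for _ in range(SAMPLE_SIZE)
                     if random.random() < POLL_RESULT)
    return supporters / SAMPLE_SIZE

results = [simulate_poll() for _ in range(NUM_POLLS)]
above_37 = sum(1 for r in results if r > 0.37) / NUM_POLLS
print(f"Share of simulated polls above 37%: {above_37:.2%}")
```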
This is a simplified case, but you can see how you could expand it to find the chance that a desired value was within some range even if that value depended on a mess of different input polls (such as the “demographic analysis, donation history, vote history” mentioned in your OP).
-Simplicio (who isn’t a statistician, but has to read a lot of papers that use statistical methods)
Actually, I strongly suspect that what 538 does is more like drawing numbers from a Gaussian distribution centered on the poll numbers with a standard deviation equal to the statistical margin of error of the polls. It’s a little more complicated than that, though, since he also uses correlations between the states: The current run has it as impossible that McCain will win Florida without also winning Ohio, for instance.
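A sketch of what that might look like, with made-up numbers: the poll averages and the size of the shared “national swing” term below are invented, and the swing term is just one simple way to induce the kind of correlation between states I’m describing (538’s actual correlation model isn’t public). In this toy setup, Ohio leans more toward McCain than Florida does, so a McCain-wins-Florida-but-not-Ohio outcome almost never comes up:

```python
import random

# Invented numbers: Obama's poll average in each state.
poll_avg = {"Ohio": 0.48, "Florida": 0.52}

SWING_SD = 0.03  # shared "national swing": moves every state together
LOCAL_SD = 0.01  # smaller state-specific error on top of the swing

def simulate_election():
    """One simulated election, with errors correlated across states."""
    swing = random.gauss(0, SWING_SD)
    return {state: avg + swing + random.gauss(0, LOCAL_SD)
            for state, avg in poll_avg.items()}

runs = 100_000
odd_combo = 0
for _ in range(runs):
    outcome = simulate_election()
    if outcome["Florida"] < 0.5 and outcome["Ohio"] >= 0.5:
        odd_combo += 1
print(f"McCain takes Florida but not Ohio in {odd_combo} of {runs} runs")
```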
I think the bolded claim, that there’s a 90% chance that the actual support for candidate X is above 37%, is glib (which is to say, an incorrect interpretation of the result).
In fact, as for the whole method described: Huh? You
1) Start with a poll whose results you take as the actual value of public support for candidate X.
2) Then run a bunch of simulations with it to determine the chance that the actual value of support for candidate X was within some region
?
If you actually made the assumption you claimed to in step 1), then step 2) should trivially boil down to “Yep, the actual value of support is within region R if and only if R contains the results of our original poll.”
I mean, the method you’ve described sounds like this: you start with some data “The polls say D”. And you know that the reality may or may not actually match the poll results. So you want to figure out the probability that the reality is within the interval I, given that the polls say D. And to do this you… instead calculate the probability that the poll results would fall within the interval I, given that the reality is D, and pretend that this is the number we want? That seems so wrong; a total bait-and-switch.
(I am also very much not a statistician. But everything in so-called “frequentist” statistics always seems so terribly backwards to me. Not that “Bayesian” statistics isn’t strewn with landmines of its own (e.g., the problem of the priors), but at least they seem to keep a more coherent view of how probabilistic modelling works)
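To make that distinction concrete, here is a toy simulation; the uniform prior over the true support level is an arbitrary assumption, chosen only so that “the probability that the reality is within the interval, given the poll” is a well-defined thing to estimate:

```python
import random

SAMPLE_SIZE = 100
RUNS = 200_000

# Simulate (truth, poll) pairs: the true support level is drawn from an
# assumed uniform prior, then a 100-person poll is run against that truth.
pairs = []
for _ in range(RUNS):
    truth = random.random()
    supporters = sum(1 for _ in range(SAMPLE_SIZE)
                     if random.random() < truth)
    pairs.append((truth, supporters))

# Condition on the data we actually observed (a poll reading of 39/100),
# then ask how often the underlying truth exceeds 37%.
truths_given_39 = [t for t, s in pairs if s == 39]
p = sum(1 for t in truths_given_39 if t > 0.37) / len(truths_given_39)
print(f"P(truth > 37% | poll said 39%) ≈ {p:.2f}")
```

Note that this conditions on the observed poll result and looks at the distribution of the truth, which is the reverse of holding the truth fixed and looking at the distribution of poll results.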
I appreciate any discussion about 538’s model specifically and about Monte Carlo generally, but I am most interested in how the randomness comes about.
Simplicio, you have me a bit confused, because I would think that a random distribution between the upper and lower bounds would eventually just deliver the same result as the target, not really telling you anything.
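For what it’s worth, that intuition can be checked directly. Reusing the 39%/100-respondent setup from Simplicio’s example (the variable names are mine): the average of the simulated polls does converge to the input value, exactly as you’d expect, but the method isn’t after the average; it’s after the spread around it, which is what puts error bars on the one real poll:

```python
import random
import statistics

POLL_RESULT = 0.39
SAMPLE_SIZE = 100

results = []
for _ in range(100_000):
    supporters = sum(1 for _ in range(SAMPLE_SIZE)
                     if random.random() < POLL_RESULT)
    results.append(supporters / SAMPLE_SIZE)

print(f"mean:  {statistics.mean(results):.4f}")   # ~0.3900: matches the target
print(f"stdev: {statistics.stdev(results):.4f}")  # ~0.0488: the informative part
```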