I have been reading up on this after being introduced to it this election season by Nate Silver of fivethirtyeight.com. But not having much statistical knowledge, I am a bit baffled by the mechanics of the Monte Carlo technique.
I understand that there is a set of algorithms based on some observations and rules about the environment. In 538’s case, I think it is demographic analysis, donation history, vote history, etc. Then there are variable inputs like current polling. But these inputs are defined, not purely random; they are the source information.
My understanding: at some point, a random distribution of possible results is introduced and then weighed against the algorithms and the inputs to see if it fits the prediction. Is that a proper assessment of what is going on? If so, how/where in the analysis does this random element come about?
I know that the system for 538 is not public, but even an example of a simpler, similar model would help me to grasp this. I understand, for example, the “Battleship” model that is used as an intro on the wiki page for Monte Carlo, but it’s too simple; for the life of me, I can’t see how that would apply to a more complex system like Nate’s.
If you link to the page you’re talking about, I’ll take a look and maybe I can see what they’re doing, but here’s my best guess:
Usually what they mean is that they take the results of their polls as the actual values of public support for candidate X (let’s say we found 39% of the public liked our candidate). Then they use a random number generator to run a simulation of a poll (so let’s say our poll had a sample size of 100; then we generate 100 random numbers 0-1, and count every value below .39 as a person who said they support our candidate). This will give them the result of their simulated poll (so in our example, if 37 of our 100 numbers were less than .39, the result of our pretend poll was 37%). Now, since you’re using a computer you can do, say, a million of these simulated polls. This gives you the chance that the actual value of support for the candidate was within some region (returning to our example, if we do a million simulated polls, and 900,000 of them are above 37%, we can say that there’s a 90% chance that the actual support for the mysterious candidate X is above 37%, based purely on our simulations and the one actual real poll that we started with).
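Since I’m not a statistician, take this with a grain of salt, but here’s roughly what that looks like in code. A minimal sketch, assuming the 39% figure and 100-person sample from the example above; the run count is scaled down from a million so plain Python finishes quickly, and the names are just illustrative:

```python
import random

POLL_RESULT = 0.39   # the one real poll, treated as the actual support level
SAMPLE_SIZE = 100    # respondents per simulated poll
NUM_POLLS = 100_000  # a million in the example; scaled down to run quickly

def simulate_poll():
    """One pretend poll: each respondent backs X with probability POLL_RESULT."""
    supporters = sum(1 for _ in range(SAMPLE_SIZE)
                     if random.random() < POLL_RESULT)
    return supporters / SAMPLE_SIZE

results = [simulate_poll() for _ in range(NUM_POLLS)]
above_37 = sum(1 for r in results if r > 0.37) / NUM_POLLS
print(f"Share of simulated polls above 37%: {above_37:.2%}")
```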
This is a simplified case, but you can see how you could expand it to find the chance that a desired value was within some range even if that value depended on a mess of different input polls (such as the “demographic analysis, donation history, vote history” mentioned in your OP).
-Simplicio (who isn’t a statistician, but has to read a lot of papers that use statistical methods)
Actually, I strongly suspect that what 538 does is more like drawing numbers from a Gaussian distribution centered on the poll numbers with a standard deviation equal to the statistical margin of error of the polls. It’s a little more complicated than that, though, since he also uses correlations between the states: The current run has it as impossible that McCain will win Florida without also winning Ohio, for instance.
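A sketch of what that might look like, with made-up numbers: the poll averages and the size of the shared “national swing” term below are invented, and the swing term is just one simple way to induce the kind of correlation between states I’m describing (538’s actual correlation model isn’t public). In this toy setup, Ohio leans more toward McCain than Florida does, so a McCain-wins-Florida-but-not-Ohio outcome almost never comes up:

```python
import random

# Invented numbers: Obama's poll average in each state.
poll_avg = {"Ohio": 0.48, "Florida": 0.52}

SWING_SD = 0.03  # shared "national swing": moves every state together
LOCAL_SD = 0.01  # smaller state-specific error on top of the swing

def simulate_election():
    """One simulated election, with errors correlated across states."""
    swing = random.gauss(0, SWING_SD)
    return {state: avg + swing + random.gauss(0, LOCAL_SD)
            for state, avg in poll_avg.items()}

runs = 100_000
odd_combo = 0
for _ in range(runs):
    outcome = simulate_election()
    if outcome["Florida"] < 0.5 and outcome["Ohio"] >= 0.5:
        odd_combo += 1
print(f"McCain takes Florida but not Ohio in {odd_combo} of {runs} runs")
```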
I think the bolded claim, that there’s a 90% chance that the actual support for candidate X is above 37%, is glib (which is to say, an incorrect interpretation of the result).
In fact, as for the whole method described: Huh? You
1) Start with a poll whose results you take as the actual value of public support for candidate X.
2) Then run a bunch of simulations with it to determine the chance that the actual value of support for candidate X was within some region
?
If you actually made the assumption you claimed to in step 1), then step 2) should trivially boil down to “Yep, the actual value of support is within region R if and only if R contains the results of our original poll.”
I mean, the method you’ve described sounds like this: you start with some data “The polls say D”. And you know that the reality may or may not actually match the poll results. So you want to figure out the probability that the reality is within the interval I, given that the polls say D. And to do this you… instead calculate the probability that the poll results would fall within the interval I, given that the reality is D, and pretend that this is the number we want? That seems so wrong; a total bait-and-switch.
(I am also very much not a statistician. But everything in so-called “frequentist” statistics always seems so terribly backwards to me. Not that “Bayesian” statistics isn’t strewn with landmines of its own (e.g., the problem of the priors), but at least they seem to keep a more coherent view of how probabilistic modelling works)
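To make that distinction concrete, here is a toy simulation; the uniform prior over the true support level is an arbitrary assumption, chosen only so that “the probability that the reality is within the interval, given the poll” is a well-defined thing to estimate:

```python
import random

SAMPLE_SIZE = 100
RUNS = 200_000

# Simulate (truth, poll) pairs: the true support level is drawn from an
# assumed uniform prior, then a 100-person poll is run against that truth.
pairs = []
for _ in range(RUNS):
    truth = random.random()
    supporters = sum(1 for _ in range(SAMPLE_SIZE)
                     if random.random() < truth)
    pairs.append((truth, supporters))

# Condition on the data we actually observed (a poll reading of 39/100),
# then ask how often the underlying truth exceeds 37%.
truths_given_39 = [t for t, s in pairs if s == 39]
p = sum(1 for t in truths_given_39 if t > 0.37) / len(truths_given_39)
print(f"P(truth > 37% | poll said 39%) ≈ {p:.2f}")
```

Note that this conditions on the observed poll result and looks at the distribution of the truth, which is the reverse of holding the truth fixed and looking at the distribution of poll results.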
I appreciate any discussion about 538’s model specifically and about Monte Carlo generally, but I am most interested in how the randomness comes about.
Simplicio, you have me a bit confused, because I would think that a random distribution between the upper and lower bounds would eventually just deliver the same result as the target, not really telling you anything.
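For what it’s worth, that intuition can be checked directly. Reusing the 39%/100-respondent setup from Simplicio’s example (the variable names are mine): the average of the simulated polls does converge to the input value, exactly as you’d expect, but the method isn’t after the average; it’s after the spread around it, which is what puts error bars on the one real poll:

```python
import random
import statistics

POLL_RESULT = 0.39
SAMPLE_SIZE = 100

results = []
for _ in range(100_000):
    supporters = sum(1 for _ in range(SAMPLE_SIZE)
                     if random.random() < POLL_RESULT)
    results.append(supporters / SAMPLE_SIZE)

print(f"mean:  {statistics.mean(results):.4f}")   # ~0.3900: matches the target
print(f"stdev: {statistics.stdev(results):.4f}")  # ~0.0488: the informative part
```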