# dumb statistics question regarding analyzing polls

Need some statistics refresher…

Caveat: I don’t want to bring in discussion of the merits of specific polls or lack thereof, methodologies, flawed sampling, or any discussion about the 2008 election, etc.

With the various polls, you usually have a given sample size, the percentages of who gave what answer, and a margin of error usually represented as a 95% confidence interval. Of course, there is weighting and adjusting involved, but I ignore those for the purposes of this question.

My questions are:

1. http://en.wikipedia.org/wiki/Margin_of_error suggests that a margin of error is normally distributed; i.e., it is not a flat probability line where the “actual data” is just as likely to be on the edge of the MOE as it is in the middle. Is it a safe assumption that a poll that is representative of the population has a MOE that looks like a Gaussian curve, and does a cite for the answer exist?

2. Does any formalized way exist to analyze the movements in polls, whether one poll or multiple? E.g., let’s say 4 polls taken over the same period of time each show movement of 1 point over their respective previous poll. Is there a way to determine the probability that the movement is significant in one poll, and can we show that the probability that the movement is significant increases when combining multiple polls?

Here is an interview with Frank Newport, head of the Gallup Organization, about the reliability of political polls.

http://www.npr.org/templates/story/story.php?storyId=3930565

Whoa! You entitle your thread ‘dumb statistics question’ and then start talking about Gaussian distribution? What, in your opinion, would constitute a smart question?

I’m having a really hard time answering the first question but re: the second question I am under the impression that the type of analysis you’d need to analyze such data is called ‘time series analysis’. Unfortunately, all I know about time series analysis is the type of data you can analyze with it, and nothing else.

ETA: actually, time series analysis usually requires that data be collected for the *same* cases (respondents, in this case) at different points in time. This can be done with polls, but usually the samples in polls are independent, in which case you’d need to do some sort of mean comparison in order to compare the outcomes of two differently timed polls.
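To make the “mean comparison” idea concrete, here is a minimal sketch of a two-proportion z-test for movement between two independent polls. The poll numbers (47% then 48%, 1,000 respondents each) and the function name `movement_z_test` are made up for illustration, and the sketch assumes simple random samples with no weighting, as the OP stipulated.

```python
import math

def movement_z_test(p_before, n_before, p_after, n_after):
    """Two-sided z-test for a change in a proportion between two independent polls."""
    diff = p_after - p_before
    # Standard error of the difference between two independent sample proportions
    # (unpooled form; an assumption of this sketch, not something from the thread).
    se = math.sqrt(p_before * (1 - p_before) / n_before
                   + p_after * (1 - p_after) / n_after)
    z = diff / se
    # Two-sided p-value under a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, se, z, p_value

# Hypothetical numbers: a candidate moves from 47% to 48%, 1000 respondents per poll.
diff, se, z, p_value = movement_z_test(0.47, 1000, 0.48, 1000)
print(f"movement = {diff:+.3f}, SE = {se:.4f}, z = {z:.2f}, p = {p_value:.2f}")
```

With samples of this size, a one-point move in a single poll is well inside the noise, which is why the question of combining polls matters.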

ah-ha, thanks for the code words, I’ll look through time series analysis and see if I can’t derive a partial answer to #2. Doubt it’ll go to the point of combining multiple polls, though…

If you’re polling a large random sample each time, then you should have an accurate estimator of the population response each day, so you should be OK with sampling. Of course, you do have to be careful about actually getting a random sample.

OP, you should do some reading over at FiveThirtyEight if you haven’t already for a very statistical view of election forecasting.

I don’t believe that it is true that a margin of error is normally distributed. It is always assumed that the true population data could be at any point within the margins. I didn’t see where in that Wiki article the contrary is stated, although it was longer than I have time to read thoroughly. Could you point it out specifically?

Multiple polls by different organizations, using different questions, different methodology, different weighting, and different assumptions cannot be combined by any method to yield a reliable result.

Margin of error is distributed normally, i.e. Gaussian.

Right.

It is important to note that a normal distribution is used as a default assumption for ease of calculation, and that the MOE need not be normal in every case. But when I Googled for conditions under which a MOE wouldn’t be normally distributed, I couldn’t find any plausible situations that didn’t also rule out a representative sample of the population. Hence my q #1.

That page reprints the Wiki page word for word. Neither “Gaussian” nor “normally” appears on that page.

For sure, any margins of error that you see reported on polls that you see in the news assume normality of the underlying sampling distribution. This is because the polls pick their samples (random and large enough) to justify normality. As to how to evaluate multiple polls across time, I’ve heard a neat podcast (NPR?) that indicated you should not really look so much at the actual numbers in the results. You should check to see which candidate leads (by whatever margin) in the most polls. The reason is that each poll can have its inherent problems or flaws, but when you look across polls these flaws sort of get smoothed out. So when the election gets close, see which candidate leads in the most polls in a particular state to predict who will win that state. This has apparently been a very good predictor in the past.

If you’re working in the standard sampling situation, then the error of your estimator, which is just the difference between your estimate and the true value, will be (approximately) normally distributed by the central limit theorem.

You’re hitting on one of my pet peeves here. While it’s true this problem is completely intractable using the methods that are taught in an introductory statistics course, it’s also true that there are statistical methods that are not taught in introductory statistics courses. In particular, the techniques of meta-analysis can be used here (and yes, I know that the article I’m linking to doesn’t specifically mention the meta-analysis of polls. It’s an overview, not an encyclopedic reference.).
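As a rough illustration of what a meta-analytic approach could look like for the OP’s second question, here is a sketch of fixed-effect (inverse-variance) pooling of the movement seen in several polls. Everything specific here is an assumption for the example: the four identical hypothetical polls (47% then 48%, 1,000 respondents each), the helper names, and above all the premise that the polls are independent and measure the same underlying quantity. It ignores differences in wording, weighting and methodology between pollsters, which a real analysis would have to model.

```python
import math

def movement_and_se(p_before, n_before, p_after, n_after):
    """Movement between two independent polls and its standard error."""
    diff = p_after - p_before
    se = math.sqrt(p_before * (1 - p_before) / n_before
                   + p_after * (1 - p_after) / n_after)
    return diff, se

def pool_fixed_effect(movements):
    """Inverse-variance weighted average of (diff, se) pairs, one per pollster."""
    weights = [1.0 / se ** 2 for _, se in movements]
    total = sum(weights)
    pooled = sum(w * diff for w, (diff, _) in zip(weights, movements)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return pooled, pooled_se, z, p_value

# Hypothetical: four pollsters each show the same one-point move.
one_poll = movement_and_se(0.47, 1000, 0.48, 1000)
single_p = math.erfc(abs(one_poll[0] / one_poll[1]) / math.sqrt(2))
pooled, pooled_se, z, p_value = pool_fixed_effect([one_poll] * 4)

print(f"single poll: p = {single_p:.2f}")
print(f"four polls : movement = {pooled:+.3f}, SE = {pooled_se:.4f}, p = {p_value:.2f}")
```

Pooling four equally sized polls halves the standard error, so the same one-point move gets a smaller p-value, although in this made-up case it is still far from conventional significance.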

Look at the image. That’s a normal curve.

The image is a pretty picture and nothing more. In fact, it says specifically “In other words, for each sample size, one is 95% sure that the “true” percentage is in the region indicated by the corresponding segment.” It does not say anything more about the distribution within that region.

It’s been many moons since I was a number cruncher and my math is admittedly rusty. But I still don’t see anything in the article you linked to that addresses this issue. It’s true that the central limit theorem means you can calculate the margin of error for a normally distributed sample. But that’s not the same thing, as far as I understand, as saying that the margin of error is itself normally distributed. That would be the same as claiming that a politician polling 55% +/- 4% would have a much higher - and known - probability of being at 55% than at 54%, at 54% than at 53%, at 53% than at 52%, and at 52% than at 51%. That seems odd to me, and it is never presented that way.
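For what the 55% +/- 4% example would look like if the error really is normal, here is a small sketch. It assumes the 4% is a 95% margin of error (so the standard error is 4%/1.96) and uses the normal approximation. Strictly speaking, the curve describes how estimates scatter around the true value; reading it the other way round, as relative plausibility of true values around the estimate, is an extra assumption of this sketch.

```python
import math

estimate = 0.55
moe = 0.04
se = moe / 1.96  # standard error implied by a 95% margin of error (assumed)

def relative_density(value):
    """Normal density at `value`, relative to the density at the estimate itself."""
    z = (value - estimate) / se
    return math.exp(-0.5 * z * z)

for pct in (55, 54, 53, 52, 51):
    print(f"{pct}%: {relative_density(pct / 100):.2f}")
```

Under those assumptions the density falls off steadily toward the edge of the interval, and at 51% it is roughly 0.15 of its value at 55%.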

It does say something about the distribution. That’s what the height represents in the curve. I mean, just look at it… If you can identify any curve as the classic normal Gaussian bell curve, that’s it.

What’s odd about that? That seems perfectly intuitive to me.

Here are some other random links I can dig up:
http://matthewyglesias.theatlantic.com/archives/2007/03/margin_of_error.php#comment-148289

http://cs.wellesley.edu/~cs199/lectures/25-point-estimation-confidence-intervals.html

Yes, that’s exactly what the central limit theorem states. That is not at all the same statement as that the probability of a sample’s accuracy is normally distributed within its margin of error. That’s the issue at hand.

It may be true. However, nothing in any of the cites has said so specifically. Your cites merely repeat what we’ve already agreed to concerning the CLT.

BTW, in math, anything that appears intuitive is almost certainly wrong.

It’s trivial to come up with a distribution that’s not even remotely normal but whose histogram looks very much like a bell curve at casual inspection. As always, proof by picture isn’t.

(Since I know you’re going to ask me, try this one: generate a bunch of independent standard normal values. For each one, flip a fair coin. If it comes up heads, add 1; otherwise, add -1. The histogram will look very much like a normal histogram, but any of your standard normality tests will give you very low p-values if you have a decent sample size.)
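A quick way to try the coin-flip construction is sketched below. Instead of a packaged normality test it uses a simple moment check: the sample excess kurtosis, which is close to 0 for normal data and whose standard error under normality is about sqrt(24/n). That check, the sample size and the seed are choices made for this sketch, not part of the original suggestion.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n = 100_000
# A standard normal draw plus a fair coin flip worth +1 (heads) or -1 (tails).
sample = [random.gauss(0.0, 1.0) + random.choice((1.0, -1.0)) for _ in range(n)]

mean = sum(sample) / n
m2 = sum((x - mean) ** 2 for x in sample) / n   # second central moment (variance)
m4 = sum((x - mean) ** 4 for x in sample) / n   # fourth central moment
excess_kurtosis = m4 / m2 ** 2 - 3.0            # 0 for a normal distribution

# Under normality, the sample excess kurtosis has standard error of about sqrt(24 / n).
z = excess_kurtosis / (24.0 / n) ** 0.5
print(f"variance = {m2:.3f}, excess kurtosis = {excess_kurtosis:.3f}, z = {z:.1f}")
```

In theory this mixture has variance 2 and excess kurtosis -0.5, so with a sample this large the statistic sits many standard errors away from 0 even though a histogram of the data is a single smooth hump.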

Briefly: suppose each person you poll with a yes/no question answers yes with probability $p$, and no otherwise. Polls of different people are independent, and your selection procedure really is random. If you encode each yes as 1 and each no as 0, your estimate for $p$ is $p' = \frac{1}{n}\sum_{i=1}^{n} I_i$, where $I_i$ is your encoding of the $i$th answer. By the CLT this quantity is approximately normally distributed for large $n$, and by the law of large numbers it’s a consistent estimator for $p$. The error is $p - p'$, and since $p$ is a constant, the error is also approximately normally distributed.
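The argument above can be checked by simulation. The sketch below repeats a hypothetical poll many times and looks at where the error lands relative to the usual 95% margin of error. The true proportion, sample size, number of repetitions and seed are all made up for illustration. If the error were spread evenly across the interval, about half of the in-range polls would fall in the inner half of the MOE; a roughly normal error puts noticeably more of them there.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

p_true = 0.55   # hypothetical true share answering "yes"
n = 1000        # respondents per poll
polls = 2000    # number of simulated polls

se = math.sqrt(p_true * (1 - p_true) / n)
moe = 1.96 * se  # the usual 95% margin of error

errors = []
for _poll in range(polls):
    # Each respondent independently answers "yes" (1) with probability p_true.
    yes = sum(1 for _resp in range(n) if random.random() < p_true)
    errors.append(p_true - yes / n)  # the error p - p'

inside = sum(1 for e in errors if abs(e) <= moe) / polls
inner_half = sum(1 for e in errors if abs(e) <= moe / 2) / polls

print(f"MOE = {moe:.4f}")
print(f"share of polls within the MOE      : {inside:.3f}")
print(f"share of polls within half the MOE : {inner_half:.3f}")
```

For a normal error, about 95% of polls should land within the MOE and roughly two thirds within half of it, so errors near the middle are more common than errors near the edge.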

I know you put a smiley after this, but as a mathematician I still can’t let this slide.

That is possibly the least accurate statement in all of GQ. Ever.

That’s not really the gist of the CLT. It does not presume that the sampling distribution for the sample proportion (which is what the polls are reporting) is normally distributed. In fact, the underlying population distribution can be any shape at all. The theorem guarantees that the sampling distribution will be very close to a normal distribution provided that the sample was obtained randomly and the sample size was large enough. And all the good polls do their best to assure that randomness and adequate sample size are achieved.

EM: “BTW, in math, anything that appears intuitive is almost certainly wrong.”

I dunno, much of the frequentist tradition in statistics and a good part of probability isn’t especially intuitive. Then again Exapno Mapcase said something a little different.

It is intuitively obvious that .99999~ can never equal 1.

I rest my case.