dumb statistics question regarding analyzing polls

dre2xl · September 24, 2008, 3:41pm

Need some statistics refresher…

Caveat: I don’t want to bring in discussion of the merits of specific polls or lack thereof, methodologies, flawed sampling, or any discussion about the 2008 election, etc.

With the various polls, you usually have a given sample size, the percentages of who gave what answer, and a margin of error usually represented as a 95% confidence interval. Of course, there is weighting and adjusting involved, but I ignore those for the purposes of this question.

My questions are:

1. Margin of error - Wikipedia suggests that a margin of error is normally distributed; i.e., it is not a flat probability line where the “actual data” is just as likely to be on the edge of the MOE as it is in the middle. Is it a safe assumption to make that a poll that is representative of the population has a MOE that looks like a Gaussian curve, and does a cite for the answer exist?
2. Does any formalized way exist to analyze the movements in polls, whether one poll or multiple? E.g., let’s say 4 polls taken over the same period of time show movement of 1 point over their respective previous poll. Is there a way to determine the probability that the movement is significant in one poll, and can we show that the probability that the movement is significant increases when combining multiple polls?

Morgenstern · September 24, 2008, 4:10pm

Here is an interview with Frank Newport, head of the Gallup Organization, about the reliability of political polls.

http://www.npr.org/templates/story/story.php?storyId=3930565

Svejk_1 · September 24, 2008, 4:30pm

Whoa! You entitle your thread ‘dumb statistics question’ and then start talking about Gaussian distribution? What, in your opinion, would constitute a smart question?

I’m having a really hard time answering the first question but re: the second question I am under the impression that the type of analysis you’d need to analyze such data is called ‘time series analysis’. Unfortunately, all I know about time series analysis is the type of data you can analyze with it, and nothing else.

ETA: actually, time series analysis usually requires that data be collected for the *same* cases (respondents, in this case) at different points in time. This can be done with polls, but usually the samples in polls are independent, in which case you’d need to do some sort of mean comparison in order to compare outcomes from two differently timed polls.
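
A minimal sketch of that kind of comparison, assuming two independent simple random samples and ignoring weighting: a two-proportion z-test on the change between polls. The function name and the poll numbers are made up for illustration, and the unpooled standard error is just one common choice.

```python
import math

def two_poll_z_test(p1, n1, p2, n2):
    """Two-sided z-test for a change in support between two independent polls."""
    # Standard error of the difference of two independent sample proportions
    # (unpooled form; an assumption of this sketch).
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p2 - p1) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: 48% of 1000 respondents, then 49% of 1000 a week later.
z, p_value = two_poll_z_test(0.48, 1000, 0.49, 1000)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

With samples of this size, a 1-point move gives a z well below 1, so on its own it is not distinguishable from sampling noise.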

dre2xl · September 24, 2008, 4:35pm

ah-ha, thanks for the code words, I’ll look through time series analysis and see if I can’t derive a partial answer to #2. Doubt it’ll go to the point of combining multiple polls, though…

ultrafilter · September 24, 2008, 4:51pm

If you’re polling a large random sample each time, then you should have an accurate estimator of the population response each day, so you should be OK with sampling. Of course, you do have to be careful about actually getting a random sample.

OP, you should do some reading over at FiveThirtyEight if you haven’t already for a very statistical view of election forecasting.

Exapno_Mapcase · September 24, 2008, 5:01pm

I don’t believe that it is true that a margin of error is normally distributed. It is always assumed that the true population data could be at any point within the margins. I didn’t see where in that Wiki article the contrary is stated, although it was longer than I have time to read thoroughly. Could you point it out specifically?

Multiple polls by different organizations, using different questions, different methodology, different weighting, and different assumptions cannot be combined by any method to yield a reliable result.

muttrox · September 24, 2008, 5:06pm

Margin of error is distributed normally, i.e. Gaussian.

dre2xl · September 24, 2008, 5:46pm

Right.

It is important to note that a normal distribution is used as a default assumption for ease of calculation, and that the MOE may not necessarily be normal, but when I tried Googling for conditions where a MOE wouldn’t be normally distributed, I couldn’t find any plausible situations that didn’t preclude a representative population sample. Hence my q #1.

Exapno_Mapcase · September 24, 2008, 5:59pm

That page reprints the Wiki page word for word. Neither “Gaussian” nor “normally” appears on that page.

nivlac · September 24, 2008, 6:36pm

For sure, any margins of error that you see reported on polls that you see in the news assume normality of the underlying sampling distribution. This is because the polls pick their samples (random and large enough) to justify normality. As to how to evaluate multiple polls across time, I’ve heard a neat podcast (NPR?) that indicated you should not really look so much at the actual numbers in the results. You should check to see which candidate leads (by whatever margin) in the most polls. The reason is that each poll can have its inherent problems or flaws, but when you look across polls these flaws sort of get smoothed out. So when the election gets close, see which candidate leads in the most polls in a particular state to predict who will win that state. This has apparently been a very good predictor in the past.

ultrafilter · September 24, 2008, 7:56pm

If you’re working in the standard sampling situation, then the error of your estimator, which is just the difference between your estimate and the true value, will be (approximately) normally distributed by the central limit theorem.

You’re hitting on one of my pet peeves here. While it’s true this problem is completely intractable using the methods that are taught in an introductory statistics course, it’s also true that there are statistical methods that are not taught in introductory statistics courses. In particular, the techniques of meta-analysis can be used here (and yes, I know that the article I’m linking to doesn’t specifically mention the meta-analysis of polls. It’s an overview, not an encyclopedic reference.).
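
As a rough sketch of the simplest meta-analytic approach (inverse-variance, fixed-effect pooling): it assumes the polls are independent simple random samples that all measure the same underlying change, which is exactly the assumption being disputed above, and it ignores weighting and house effects. The helper names and the numbers are hypothetical.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * x for w, x in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

def change_and_se(p_before, p_after, n):
    """Change between two independent polls of size n, with its standard error."""
    se = math.sqrt(p_before * (1 - p_before) / n + p_after * (1 - p_after) / n)
    return p_after - p_before, se

# Hypothetical: four pollsters, each with n = 1000, each showing 48% -> 49%.
changes, ses = zip(*[change_and_se(0.48, 0.49, 1000) for _ in range(4)])
pooled, pooled_se = pool_fixed_effect(changes, ses)
z = pooled / pooled_se
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"single poll SE = {ses[0]:.4f}, pooled SE = {pooled_se:.4f}")
print(f"pooled change = {pooled:.3f}, z = {z:.2f}, p = {p_value:.2f}")
```

With four equally sized polls the pooled standard error is half that of a single poll, so the same 1-point move is stronger evidence than it is in any one poll, though in this made-up case it still falls short of the usual 5% threshold.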

muttrox · September 24, 2008, 8:03pm

Look at the image. That’s a normal curve.

Exapno_Mapcase · September 24, 2008, 8:39pm

The image is a pretty picture and nothing more. In fact, it says specifically “In other words, for each sample size, one is 95% sure that the “true” percentage is in the region indicated by the corresponding segment.” It does not say anything more about the distribution within that region.

It’s been many moons since I was a number cruncher and my math is admittedly rusty. But I still don’t see anything in the article you linked to that addresses this issue. It’s true that the central limit theorem means you can calculate the margin of error for a normally distributed sample. But that’s not the same thing, as far as I understand, as saying that the margin of error is itself normally distributed. That would be the same as claiming that a politician polling 55% +/- 4% would have a much higher - and known - probability of being at 55% than at 54%, at 54% than at 53%, at 53% than at 52%, and at 52% than at 51%. That seems odd to me, and it is never presented that way.

muttrox · September 24, 2008, 8:59pm

It does say something about the distribution. That’s what the height represents in the curve. I mean, just look at it… If you can identify any curve as the classic normal Gaussian bell curve, that’s it.

What’s odd about that? That seems perfectly intuitive to me.

Here’s some other random links I can dig up:

http://cs.wellesley.edu/~cs199/lectures/25-point-estimation-confidence-intervals.html

Exapno_Mapcase · September 24, 2008, 9:09pm

Yes, that’s exactly what the central limit theorem states. That is not at all the same statement as that the probability of a sample’s accuracy is normally distributed within its margin of error. That’s the issue at hand.

It may be true. However, nothing in any of the cites has said so specifically. Your cites merely repeat what we’ve already agreed to concerning the CLT.

BTW, in math, anything that appears intuitive is almost certainly wrong.

ultrafilter · September 25, 2008, 12:16am

It’s trivial to come up with a distribution that’s not even remotely normal but whose histogram looks very much like a bell curve at casual inspection. As always, proof by picture isn’t.

(Since I know you’re going to ask me, try this one: generate a bunch of independent standard normal values. For each one, flip a fair coin. If it comes up heads, add 1; otherwise, add -1. The histogram will look very much like a normal histogram, but any of your standard normality tests will give you very low p-values if you have a decent sample size.)
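
A minimal sketch of that experiment using only the Python standard library. The Jarque-Bera statistic stands in for "your standard normality tests"; that choice, the seed, and the sample size are assumptions of this sketch, and the p-value relies on the large-sample chi-square approximation.

```python
import math
import random

def jarque_bera(xs):
    """Jarque-Bera statistic and p-value (chi-square with 2 degrees of freedom)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    return jb, math.exp(-jb / 2.0), skew, excess_kurt

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000
# Standard normal draw plus a fair coin flip of +1 or -1.
sample = [rng.gauss(0.0, 1.0) + rng.choice((1.0, -1.0)) for _ in range(n)]

jb, p_value, skew, excess_kurt = jarque_bera(sample)
print(f"skew = {skew:.3f}, excess kurtosis = {excess_kurt:.3f}")
print(f"Jarque-Bera = {jb:.1f}, p-value = {p_value:.3g}")
```

The mixture is symmetric, so the skew stays near zero; what gives it away is the excess kurtosis, which is -0.5 in theory for this construction versus 0 for a true normal.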

Briefly: suppose each person you poll with a yes/no question answers yes with probability p, and no otherwise. Polls of different people are independent, and your selection procedure really is random. If you encode each yes as 1 and no as 0, your estimate for p is p’ = (1/n) * sum(I_i, 1 <= i <= n), where I_i is your encoding of the ith answer. By the CLT this quantity is approximately normally distributed for large n, and by some other theorem (the law of large numbers), it’s a consistent estimator for p. The error is p - p’, but since p is a constant, this is also approximately normally distributed.
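
A quick simulation of that setup, as a sketch only: the true support of 55%, the poll size of 600 (which gives a margin of error of about 4 points), and the number of repeats are all hypothetical. It repeats the same idealized poll many times and checks how the errors fall relative to the usual 95% margin of error, 1.96 * sqrt(p(1-p)/n).

```python
import math
import random

def simulate_poll_errors(p, n, trials, seed=0):
    """Errors (truth minus estimate) from repeated simulated yes/no polls."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # Each respondent answers yes independently with probability p.
        yes = sum(1 for _ in range(n) if rng.random() < p)
        errors.append(p - yes / n)
    return errors

p, n, trials = 0.55, 600, 5000   # hypothetical true support, poll size, repeats
moe = 1.96 * math.sqrt(p * (1 - p) / n)   # usual 95% margin of error
errors = simulate_poll_errors(p, n, trials)

inside = sum(1 for e in errors if abs(e) <= moe) / trials
inner_half = sum(1 for e in errors if abs(e) <= moe / 2) / trials
print(f"margin of error = {moe:.3f}")
print(f"share of polls within the MOE: {inside:.3f}")
print(f"share within half the MOE: {inner_half:.3f}")
```

Close to 95% of the simulated polls should land within the margin of error, and around two thirds within half of it. If the error were spread evenly across the interval, the inner half would hold only about half of those polls, so the errors do pile up toward the middle.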

Lance_Turbo · September 25, 2008, 4:03am

I know you put a smiley after this, but as a mathematician I still can’t let this slide.

That is possibly the least accurate statement in all of GQ. Ever.

nivlac · September 25, 2008, 5:02am

That’s not really the gist of the CLT. It does not presume that the sampling distribution for the sample proportion (which is what the polls are reporting) is normally distributed. In fact, the underlying population distribution can be any shape at all. The theorem guarantees that the sampling distribution will be very close to a normal distribution provided that the sample was obtained randomly and the sample size was large enough. And all the good polls do their best to assure that randomness and adequate sample size are achieved.

Measure_for_Measure · September 25, 2008, 8:45am

EM: “BTW, in math, anything that appears intuitive is almost certainly wrong.”

I dunno, much of the frequentist tradition in statistics and a good part of probability isn’t especially intuitive. Then again Exapno Mapcase said something a little different.

Random link: Why the Diageo/Hotline/National Journal/Atlantic Monthly daily tracking poll is useless.
Ditto for the Gallup 3 day poll.

Exapno_Mapcase · September 25, 2008, 5:21pm

It is intuitively obvious that .99999~ can never equal 1.

I rest my case.
