Thanks for giving me so much to work with, guys. Work is slow this morning, so I can give this the attention that I think it deserves.
elucidator brings up the following point:
It is not reasonable to infer the same unless you already have a strong prior belief of the same. If you believe that Bush voters are better educated, make more money, and are more intelligent, you would be more likely to doubt this survey on methodological grounds. Things always seem reasonable when they confirm your beliefs.
As far as I am concerned, if it is not demonstrable scientifically, it is not reasonable.
I know you are just being amusing with your characterization of the orthodox church of statistics. However, I find this sort of thing unhelpful. The consequences of building a bridge with faulty equations are obvious and terrible. It is not that engineers are privy to some secret knowledge or arbitrary methodology: they build bridges a particular way because that way works. The same is true of statistics: it is done in particular ways because the conclusions can be proven analytically. There is also no single Orthodox Institution: a Bayesian statistician might come along and take some big issues with my interpretation.
He would nevertheless agree that this study is crap.
Mtgman somehow thinks I am arguing against the utility of surveys in general, and that I do not believe inferences can be made from representative samples.
Of course inferences can be made from representative samples. However, tools are necessary to draw those inferences. To return to the bridge example, suppose I build a toll plaza on either side of a river. The toll plazas may be required for the functioning of the bridge, but you cannot cross the river without the bridge itself. The authors of the PIPA study built only the toll plazas and never constructed the bridge.
Mr2001 wants to get me talking some more.
Good question. In a nutshell, I will build a study and give you a little insight into some of the tools a researcher ought to use. Please forgive the upcoming notation; it is a necessary evil. And please don’t get sidetracked by challenging the qualitative argument I am about to make, as it is immaterial to the point.
Suppose I believe that ignorance drives a vote for Bush. I believe that Bush’s campaign spin machine has intentionally kept the electorate in the dark about the real state of the world, that is, we are losing in Iraq, no WMDs, no connection with al Qaeda, etc. I believe that only ignorant people could possibly be favorably disposed towards Bush given the real state of the world, and therefore, ignorant people vote for Bush. I intend to test this theory quantitatively.
I have a limited amount of funds, so I decide I can only survey 968 people. This is fine with me. The equations I will use to calculate the magnitude of the effect of ignorance on electoral choice have lovely asymptotic properties, so 968 is close enough to infinity for my purposes. I can make robust inferences with a sample that size.
Now I need a model. A model in this context is a mathematical representation of electoral behavior. I suppose that there is a True Model out there somewhere in the universe which can explain with perfect clarity and predictive power what drives an electoral choice. I am going to do the best I can to estimate the true model. I believe that the true model looks something like this:
V[sub]i[/sub] = B[sub]0[/sub] + B[sub]1[/sub]X[sub]1i[/sub] + B[sub]2[/sub]X[sub]2i[/sub] + … + E[sub]i[/sub]
Where:
V[sub]i[/sub] is the ith voter’s choice to vote for Kerry or Bush (0 or 1)
B[sub]0[/sub] is a constant
B[sub]1[/sub]X[sub]1i[/sub] is a coefficient (beta) multiplied by the value of an independent variable for the ith voter.
E[sub]i[/sub] is the disturbance term (epsilon), that captures all of the randomness in the world not explained by the model that acts on the ith voter
NB: This is NOT in fact the model I would use, as it expresses a linear, unbounded relationship between the dependent and independent variables. What I really need is a tool that constrains the output of the above function to lie between 0 and 1, since those are the only choices the dependent variable permits. I would probably use tools called probit or logit analysis, but since it is often helpful to express the model linearly, I am sticking with the above for now.
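To make that concrete, here is a minimal sketch in Python of what fitting a logit looks like. Everything in it is invented for illustration: the variable names, the “true” coefficients, and the data are hypothetical stand-ins, not the PIPA data or my actual specification.

[code]
# A minimal sketch of fitting a logit model, with invented data.
# The variables (male, educ, answered_right) and the "true"
# coefficients are hypothetical stand-ins, not the PIPA data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 968  # the hypothetical sample size from above

male = rng.integers(0, 2, n)             # 1 = male
educ = rng.integers(10, 21, n)           # years of education
answered_right = rng.integers(0, 2, n)   # 1 = correct on a factual question

# Fabricate a "true" relationship so the example runs end to end.
latent = -1.0 + 0.5 * male + 0.05 * educ - 1.2 * answered_right
vote = (latent + rng.logistic(size=n) > 0).astype(int)  # 1 or 0, as above

X = sm.add_constant(np.column_stack([male, educ, answered_right]))
result = sm.Logit(vote, X).fit(disp=0)
print(result.summary())  # coefficients and standard errors for each X
[/code]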
There can be as many X’s as you want, although if the number of independent variables exceeds the size of your sample, you will have problems. In this study, I would throw in quite a lot of independent variables in order to control for many sociological factors. I could code yes/no answers to factual questions as 1 or 0 and stick them in as independent variables. I would also code MALE as 1 or 0 and education level as an integer (how to code education is hotly debated, but not important here), divide the country into regions and code those as variables, and so on.
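A hedged illustration of what that coding step might look like, with made-up raw answers; the column names and categories are inventions, not the actual survey items:

[code]
# Hypothetical coding of raw survey answers into 0/1 model inputs.
import pandas as pd

raw = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "region": ["northeast", "south", "west"],
    "wmd_found": ["no", "yes", "no"],  # answer to a factual question
})

coded = pd.DataFrame({
    "male": (raw["sex"] == "male").astype(int),
    "wmd_wrong": (raw["wmd_found"] == "yes").astype(int),  # 1 = incorrect belief
})
# Regions become 0/1 dummy variables (one dropped to avoid perfect collinearity).
coded = coded.join(pd.get_dummies(raw["region"], prefix="region",
                                  drop_first=True, dtype=int))
print(coded)
[/code]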
Then, I would break out my stats package, feed in the numbers, and it would spit out some useful information. It would tell me the magnitude of the effect of each independent variable, expressed as the size of its coefficient, and it would tell me the standard errors associated with all of my estimated parameters. These statistics, when properly interpreted, tell me the probability that a coefficient of that size could have been estimated randomly. If there is a low probability that the coefficient could have been generated randomly from some distribution, then the effect the coefficient expresses is probably significant.
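For the curious, here is roughly what “properly interpreted” amounts to for a single coefficient: a two-sided Wald-style z-test. The numbers below are invented:

[code]
# Rough sketch: turning a coefficient and its standard error into a
# significance test (a two-sided Wald z-test; the numbers are invented).
from scipy import stats

beta, se = -1.2, 0.3           # hypothetical coefficient and standard error
z = beta / se                  # test statistic under H0: beta = 0
p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
print(f"z = {z:.2f}, p = {p:.4f}")
[/code]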
Ok, so suppose that ignorance on a few policy issues has a coefficient with some magnitude and statistical significance. I report my coefficients and standard errors, and then I do some interpretation. The question that should be on everyone’s mind is, what happens to the quantity of interest when you control for some of the independent variables?
In this context, “controlling” for an independent variable means holding it constant while you change something else. In the math world, it means you take partial derivatives. Basically, when you estimate the model, it tells you how strong the effects of the independent variables are on the outcome, the dependent variable.

So I would take a hypothetical voter. Suppose he is white, male, college educated, makes $50k per year, lives in the Pacific Northwest, and answered correctly on all of the factual policy questions. Suppose he voted for Kerry. What is interesting is to see what happens if you keep his sociological variables constant and change all of his policy answers. To do this, you assign values to his sociological and political X variables, recalculate the above equation, and see what happens to his vote choice.

Alternatively, you can do the same thing for a high school dropout who lives in the southwest and makes $10k per year. What happens when you keep his sociological variables the same but change his answers to the policy questions? These are the kinds of things that interest social scientists: what happens to the quantities of interest when you change the independent variables.
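Here is a self-contained sketch of that exercise. The coefficients are invented stand-ins for what a real fit would estimate; the point is only the mechanics of holding variables constant while flipping the policy answer:

[code]
# Counterfactual sketch: hold the sociological variables constant and
# flip the policy answer. The coefficients below are hypothetical.
import numpy as np

def logit_prob(x, beta):
    """Predicted probability of choosing Bush under a logit model."""
    return 1.0 / (1.0 + np.exp(-x @ beta))

# Order: constant, male, years of education, answered_right.
beta = np.array([-1.0, 0.5, 0.05, -1.2])  # invented estimates

informed = np.array([1, 1, 16, 1])  # college-educated male, correct answer
ignorant = np.array([1, 1, 16, 0])  # same voter, incorrect answer

print(f"P(Bush | informed) = {logit_prob(informed, beta):.3f}")
print(f"P(Bush | ignorant) = {logit_prob(ignorant, beta):.3f}")
[/code]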
The big question, then, is why do these tools allow us to infer population preferences from a sample?
- These tools quantify the effects of the independent variables and, via test statistics, provide a means to assess the probability that those estimates are close to the truth.
- You can use these tools to generate out-of-sample predictions that can be meaningfully tested (see the sketch just after this list).
- The model generates quantitative hypotheses that can be tested rigorously. SentientMeat did a little “classical hypothesis testing” above. Rather than wave your hands with percentages, this method allows the researcher to test rigorously whether an independent variable has an effect on a dependent variable.
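On the second point, here is a minimal sketch of out-of-sample testing under the same kind of invented setup as before: fit the model on half of the fabricated data and score its predictions on the other half. Again, nothing here is real data.

[code]
# Out-of-sample sketch: fit on one half of invented data, predict the
# other half, and score the predictions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 968
educ = rng.integers(10, 21, n)
answered_right = rng.integers(0, 2, n)
latent = -0.5 + 0.04 * educ - 1.0 * answered_right
vote = (latent + rng.logistic(size=n) > 0).astype(int)

X = sm.add_constant(np.column_stack([educ, answered_right]))
train, test = slice(0, n // 2), slice(n // 2, n)

fit = sm.Logit(vote[train], X[train]).fit(disp=0)
pred = (fit.predict(X[test]) > 0.5).astype(int)
print(f"out-of-sample accuracy: {(pred == vote[test]).mean():.3f}")
[/code]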
Model specification is where the science of quantitative analysis becomes an art. Every model requires you to make certain assumptions about how both the data and the real world behave. Sometimes these assumptions are reasonable; sometimes they are much less so. Including the wrong independent variables, or failing to include the right ones, can also seriously bias the estimators. Finally, the way the data are coded implies a host of assumptions that can be challenged. The art is in getting the most bang out of the most innocuous assumptions that you can.
The real kicker here is that when you specify and test a model, you can come to a real conclusion about what forces actually drive the results. From what you know about the relationship between the dependent and the independent variables, you can make inferences about the entire population.

In the PIPA survey, the researchers made no effort whatsoever to specify a model to explain the relationship between the dependent and the independent variables. We don’t even know whether there is one. There are random forces that drive stuff in the world, and from the results of the study, we do not know whether those stochastic forces correlate with the dependent variable. If the epsilons correlate with the vote choice, you’re pretty much fucked: it means you have left something very significant out of your model that is biasing your estimators. Since PIPA does not show us any of this analysis, we simply cannot conclude that ignorance has anything to do with vote choice, since for all we know, something else that correlates with ignorance actually drives vote choice.
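To see why correlated epsilons are fatal, here is a toy simulation of the omitted-variable problem. Everything in it is invented; the point is the mechanism, not the magnitudes:

[code]
# Toy simulation: the "true" effect of ignorance on vote choice is zero,
# but an omitted confounder drives both, so the naive estimate is biased.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

confounder = rng.normal(size=n)                    # some unmeasured trait
ignorance = 0.8 * confounder + rng.normal(size=n)  # correlated with it
vote = 1.0 * confounder + rng.normal(size=n)       # ignorance plays no role

# Naive regression of vote on ignorance alone (confounder hides in epsilon):
beta_naive = np.cov(ignorance, vote)[0, 1] / np.var(ignorance, ddof=1)
print(f"estimated effect of ignorance: {beta_naive:.3f}  (true effect: 0.0)")
[/code]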
I hope this helps. If you are technically inclined, here are some excellent notes by my first grad school quant teacher. He sometimes plays a little fast and loose with the notation, but the information is well presented nonetheless.
Please feel free to assail me with questions.