# Statistics nerds: bit of help, please

In 2000, a random sample of adults in a particular population finds that 20% answer “yes” to the question, “Have you ever done X”.

In 2003, doing X is made illegal.

In 2012, another random sample finds that only 10% answer “yes” to the question “Have you ever done X”.

The immediate problem I see with this pair of statistics is that everyone who has “ever” done X as of 2000, by definition has also “ever” done X as of 2012. Which means that theoretically, the 2012 figure would have to be at least as high as the 2000 figure.

Now obviously in a random sample you aren’t going to be asking the exact same people, so there’s some room for discrepancy. And 12 years is enough time for at least some of the older population in your 2000 sample to die and be replaced by younger people. But still, a 50% drop in 12 years seems to me to be a little too big to be explained by a natural population turnover.

So basically what I’m wondering is, looking at those two figures, would it be a viable interpretation that behaviour has actually changed from 2000 to 2012, or would you have to assume that something else is going on (e.g. that in 2012 people are lying because they don’t want to admit to criminal activity)? Assume for the sake of argument that the basic methodologies of the two surveys are sound and identical and that the sample sizes are large enough to be taken as representative.

Hope I’ve made this all clear. I have no training in statistical evaluation and I want to be sure that I’m not missing something.

IANAStatistician, but I believe you are correct. And it’s a pain to deal with when creating studies that rely on self-reporting: people lie. People especially lie when the questioned behavior is illegal or perceived to be immoral.

Start with the assumption that all models are wrong, some are just less wrong than others, so don’t worry if a “correct” answer is not achievable. Just concern yourself with being clear on the question you are trying to answer, and the degree of confidence you have in that answer (given the limitations of the data) and what you intend to do with the result you get.
If it is merely for interest then “near enough is good enough”. If you are basing major policy and committing millions on the back of it, then best invest in the resources to make a good job of it.

The situation you describe certainly is a problem and it may not necessarily have a satisfactory answer.

One potential way of looking at this (if you want to know by how much the proportion of “people doing X” has really changed) is to see if there is a representative cohort of people who were not exposed to the pre-2003 conditions and were only active in the 2003-2012 illegality period. This could be the 18-20 group (or perhaps an even narrower and younger one). Then, if the data from 2000 can be stratified and the same age grouping used for comparison, that is probably as good as you are going to get (and ultimately even that may not be good enough for your purpose).

Yeah, you need to take into account the part of the population that isn’t there anymore (people who died) and the part of the population that is ‘new’. Given the size of the sample(s) you can calculate the confidence intervals and see if the proportions really are different (or at least different enough to claim they are). SE in these cases is calculated by SQRT(p*q/n). With a sufficiently small n, there might not be a significant difference between the two proportions at all.
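To make the SE formula concrete, here is a minimal sketch in Python. The OP doesn’t give the sample sizes, so the n = 1000 figures below are purely hypothetical:

```python
import math

def proportion_ci(p, n, z=1.96):
    """95% confidence interval for a sample proportion, using SE = sqrt(p*q/n)."""
    q = 1 - p
    se = math.sqrt(p * q / n)
    return (p - z * se, p + z * se)

# Hypothetical sample sizes of 1000 for each survey
low_2000, high_2000 = proportion_ci(0.20, 1000)  # roughly (0.175, 0.225)
low_2012, high_2012 = proportion_ci(0.10, 1000)  # roughly (0.081, 0.119)
```

With samples that large the two intervals don’t overlap, which is the “enough so, to claim they are different” situation; with much smaller samples the intervals widen and can overlap.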

All Greek to me, I’m afraid.

Unfortunately, the figures aren’t broken down by age so I can’t tell whether the decline is only among post-2003 adults or spread out all over the place.

When we only have a sample and want to say something about the whole population, we typically take a certain confidence level - in frequentist statistics - and determine a range of the possible percentage (or average). To do this you need the standard error (SE), which is calculated by the formula I gave above. As you can see, this depends on the sample size.

So if you do this for both samples, you might get that the range for the first sample is 15-25 percent and for the second one 4-16 (all fictional). In that case you wouldn’t be too sure about there actually being a difference between the percentages in 2000 and 2012. You can also get the confidence interval of the difference between the two (so starting from the 10-point difference), but I don’t remember the exact formulas for that:).
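The poster doesn’t remember the formula for the interval around the difference; the standard two-sample version adds the variances of the two proportions. A sketch, again with hypothetical sample sizes of 1000 each:

```python
import math

def diff_ci(p1, n1, p2, n2, z=1.96):
    """95% CI for the difference between two independent sample proportions.
    SE of the difference = sqrt(p1*q1/n1 + p2*q2/n2)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return (diff - z * se, diff + z * se)

# 20% in 2000 vs 10% in 2012, assuming (hypothetically) n = 1000 in each survey
low, high = diff_ci(0.20, 1000, 0.10, 1000)
```

If the whole interval sits above zero, the drop is statistically significant at that confidence level; if it straddles zero, you can’t rule out that the two surveys just caught sampling noise.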

There are multiple difficulties with the survey described in the OP, noted in the OP and in other posts above, and a serious statistician would have to make some suitable adjustments in the formulas to account for them:

- Some people from the earlier population dying out (attrition).
- Some new people entering the population (recruitment).
- Confounding due to the formerly legal activity now becoming illegal.
- People lying, denying that they did the illegal thing even if they did.
- Extra confusion from people who did it while it was legal but are now afraid to admit it.
Of course, a lot of polling is done for political or marketing purposes, where it’s questionable whether the pollster (or his client) wants an honest answer anyway (especially political polling).

When asking questions about illegal or other socially unsupported activities, it is especially important to convince the sample subjects, one way or another, that their answers are utterly, totally, completely, anonymous and confidential. This is nearly impossible to do.

I’ll write another post discussing a known (but lousy, IMHO) way to accomplish that.

When polling about an illegal (or otherwise unacceptable or embarrassing) activity, it is essential to convince the subject that their answers are anonymous and confidential.

There is a well-established technique for doing this. But it’s just complicated and non-obvious enough that I am very skeptical whether many people would understand it, and therefore it would not be convincing to many people.

Say the question is: Did you ever have anal sex with a horse?

Gather a roomful of subjects, pass out the questionnaire to everybody, and instruct them as follows:
To assure you that your answers are completely anonymous and confidential, we ask you to follow this unusual procedure: Everybody pull a coin out of your pocket. Now flip it. Note if it lands heads or tails, but don’t tell anybody.

If it lands heads, then answer “Yes” to the question, whether that is true or not.

If it lands tails, then give the true answer to the question.
Now suppose, hypothetically, that the subjects all became identified and all their answers became known to the Justice Department. Still, there could be (theoretically) no consequences to anyone who answered “Yes” because anyone who answered “Yes” could claim that he did so simply because his coin landed heads, not because it was the real answer.

The statistician, knowing that in expectation 50% of the people gave unconditional “Yes” answers and 50% gave (presumably) true Yes or No answers, will have a formula to filter that out and determine the actual percentages of true Yes and true No answers. (No, I don’t know the formula myself.)
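For this particular coin-flip design the unscrambling formula is actually simple: half the room is forced to say Yes, so the observed Yes fraction is 0.5 + 0.5 × (true rate), which rearranges to true rate = 2 × observed − 1. A sketch (the 55% figure below is purely illustrative):

```python
def true_yes_rate(observed_yes_fraction):
    """Unscramble the coin-flip randomized-response design:
    observed = 0.5 (forced yes) + 0.5 * true, so true = 2 * observed - 1."""
    return 2 * observed_yes_fraction - 1

# If 55% of the room answered "Yes", the estimated true rate is 10%
est = true_yes_rate(0.55)
```

Note that sampling noise can push the raw estimate slightly below zero when the true rate is near zero; that’s a known quirk of this class of estimators, not a bug in the arithmetic.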

Still, as I said, I am skeptical that the majority of the math-illiterate masses would understand the workings of this procedure, or trust the “authorities” who are conducting the survey.

But what do the letters in the formula represent?

p = proportion (percentage) of ‘success’ (here, the people who admit to doing it)
q = 1 - p (so the rest)
n = the number of respondents in the sample
Senegoid is right of course, but that isn’t so much statistics as research methodology. The way you present it here, the surveys have already been conducted and you just have the data to work with. But it never hurts to know the limitations of your data.