I have a database of 28,000 records and take a sample of 32 records. I find that 5/32 of the records contain an error, for an error rate of about 16%.
Now, I calculate the confidence interval for this sample at 95% and find that it is +/-17.3%. Is it correct to say that I am 95% confident that the true error rate is somewhere between 0% and 33.3%? The part that is bugging me is that the lower end of the confidence interval would put the error rate at less than 0%.
Hmmm. That’s the one I used to get +/-17.3%. I have the sample size set at 32 and the population at 28,000 and the percentage at 50%. Is there something I should be doing differently?
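As a sketch of where the +/-17.3% presumably comes from: plugging p = 0.5 into the usual normal-approximation margin-of-error formula, with a finite-population correction for the 28,000 records, reproduces that figure, while using the observed 5/32 gives a narrower margin. This is an assumption about what the online calculator does, shown here in Python for illustration.

```python
import math

def margin_of_error(p, n, N=None, z=1.96):
    """95% margin of error for a sample proportion, with an optional
    finite-population correction for a population of size N."""
    m = z * math.sqrt(p * (1 - p) / n)
    if N is not None:
        m *= math.sqrt((N - n) / (N - 1))  # finite-population correction
    return m

print(margin_of_error(0.5, 32, N=28000))     # ~0.173, the +/-17.3% figure
print(margin_of_error(5 / 32, 32, N=28000))  # ~0.126, i.e. roughly 3% to 28%
```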
Okay, I see. What we are trying to demonstrate, partly, is that 32 is a pretty small sample to use in making a case that the error rate is really as high as 16% or more.
Is there any problem with using the 15.6% figure, since it came from that small sample in the first place?
Thanks. The other question we are trying to answer is whether it is really fair to say that there is evidence that the population has an error rate of 10% or greater, based on that sample.
Is it true that, since the lower bound of the confidence interval would be about 3%, it hasn’t really been reliably shown that the real error rate is greater than 10%? This may be a more subjective call.
I asked a related question, also driven by the desire to profile data error rates in a database. Here is the thread. **ultrafilter** was also the primary contributor to that thread.
If the real error rate was 10%, the probability of getting 5 or more errors out of a sample of 32 works out to [fires up Mathematica] about 21%. To put it another way, about one time out of five, a population with an error rate of 10% could produce the result you’ve seen here. I wouldn’t be terribly confident in excluding such a low error rate myself, given your data.
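For reference, the same tail probability can be reproduced without Mathematica; a minimal Python sketch, assuming scipy is available:

```python
from scipy.stats import binom

# P(5 or more errors in a sample of 32) if the true error rate were 10%
n, p0, k = 32, 0.10, 5
print(binom.sf(k - 1, n, p0))  # survival function: P(X >= 5), roughly 0.21
```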
Thanks, I was wondering if I was using the right distribution.
We are trying to avoid having to go back and check all 28,000 records by hand, and if we can make the case that the error rate hasn’t been shown to be 10% or more within reasonable bounds, we won’t have to.
Let’s temporarily ignore the actual numbers and talk about the underlying paradox.
Namely:
A) It is possible that a given sampling survey could have a number of error records that results in a confidence interval that contains 0.
B) If the error rate is actually 0, it is impossible to get the number of errors observed in part A.
This appears to be a contradiction.
The problem is the imprecise terminology. The interval you are estimating is not actually a “confidence interval”; it is instead a “prediction interval.”
It doesn’t actually give a region where there is a 95% probability that the true error rate falls. Instead it says: if the error rate I estimated were correct and I repeated the experiment, what range of values am I likely to get? So in your example (assuming your calculations were correct), the interval says: if the true error rate were 16% and I drew lots of 32 records at a time, then 95% of the time my estimate of the error rate based on that sample would be between 0% and 33%.
A true confidence interval is instead the region such that, if the true value were outside this region, there would be only a 5% chance of getting the results you observed by chance. In the case of truly normal data with a fixed variance, the two regions are equivalent, and so are often used interchangeably. But in your case, where the variance depends on the error rate (with a 0% error rate you are guaranteed to get 0 errors), the two regions aren’t equivalent.
In order to compute an interval in which we are 95% confident the actual error rate lies, we would need to use a Bayesian confidence interval. But to do that we would need to have some grasp of the underlying distribution of possible error rates, which is too much to ask for.
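To make the distinction concrete, here is a rough sketch, in Python with scipy, of the construction described two paragraphs up: scan candidate error rates and keep only those under which observing 5 errors out of 32 would not be too unlikely in either tail. Note that an error rate of 0% is excluded automatically, which dissolves the apparent paradox.

```python
from scipy.stats import binom

n, k, alpha = 32, 5, 0.05   # 5 errors observed in 32 records, 95% level

# Keep a candidate error rate p only if the observed count would not be too
# surprising in either direction (each tail probability above alpha/2).
candidates = [i / 10000 for i in range(10001)]
kept = [p for p in candidates
        if binom.sf(k - 1, n, p) > alpha / 2   # P(X >= 5 | p)
        and binom.cdf(k, n, p) > alpha / 2]    # P(X <= 5 | p)

print(min(kept), max(kept))  # roughly 0.05 to 0.33 -- the lower end stays above zero
```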
Okay, then. What do you think of the sample size they used to come up with the estimated error rate of 15.6% (5/32)? Could we argue that it was too small to be a reasonable sample of 28,000 records?
The 28,000 records are largely immaterial to the size of the sample you desire. A confidence interval for a proportion is largely independent of the population size, so long as the population is sufficiently larger than the sample (the textbook we use recommends 20x greater).
However, a good rule of thumb is to make sure the sample is large enough so that you see at least 10 errors and 10 good records, so, to that extent, your sample is too small.
With that, it would seem like checking another 32 would work, but that’s probably placing way too much reliance on what you’ve already checked. Rather, there are well established methods for determining a good sample size.
How large does it need to be? That depends on two things: first, the size of the margin of error you desire, and second, the proportion of records with errors. The size you need is n = p*(1-p)*(1.96/m)^2, where p is the proportion with errors and m is the margin you’d like. I’ll take p = .20 and m = .04 to be somewhat conservative. In this case, we find n is about 385. You’d be well served by going back and checking several hundred records for errors.
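A quick check of that arithmetic in Python; the second call plugs in the p = .10, m = .05 combination that comes up in the next post.

```python
import math

def sample_size(p, m, z=1.96):
    """n = p*(1-p)*(z/m)^2, rounded up: sample size for margin of error m."""
    return math.ceil(p * (1 - p) * (z / m) ** 2)

print(sample_size(0.20, 0.04))  # 385, the conservative choice above
print(sample_size(0.10, 0.05))  # 139, essentially the 138 discussed below
```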
Since we don’t know the real proportion with errors, would it make sense to use .10 to calculate a reasonable sample size, since that is the “allowable” proportion? By “allowable,” I mean that’s the proportion we can’t exceed if we are not going to be required to go back and check all 28,000 records. That gives a sample size of 138 with m = .05.
And, if we used that sample size of 138, what would it mean if the error rate in that sample turns out to be more than 10%?
It would mean that you’re at much greater risk of accepting the data as good, even though it should be thrown out.
The problem is that by understating the value of p and running a low sample size, you’re at a high risk of drawing an incorrect conclusion. You should estimate conservatively to reduce this risk.
More specifically, I’m assuming that as an outcome, you’d rather go through all the records than tolerate an error rate which is too high (say 13% or 14%). However, the power of your test is pretty low in these instances, meaning you risk exactly that. In fact, if the actual percentage of bad records is 13%, your power is only around one-third, meaning you’ll make the wrong decision for two-thirds of the possible samples you could draw. If the actual percentage is 15%, the power is still fairly low, running around .6.
The typical rule of thumb is that the power should be .8 or greater. If the actual proportion of errors is 15%, then around 250 records should be enough. My earlier suggestion of 385 records will (coincidentally) give a power of 80% for a true error proportion of 14%.
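A sketch of how these power figures can be reproduced. This assumes a one-sided test of a 10% error rate at the 5% level and the normal approximation to the sample proportion, so the exact numbers quoted above may differ slightly.

```python
import math
from scipy.stats import norm

def power(p_true, n, p0=0.10, alpha=0.05):
    """Approximate power of a one-sided test of H0: error rate <= p0."""
    # Reject H0 when the sample proportion exceeds this critical value.
    crit = p0 + norm.ppf(1 - alpha) * math.sqrt(p0 * (1 - p0) / n)
    # Probability of exceeding it when the true error rate is p_true.
    return norm.sf((crit - p_true) / math.sqrt(p_true * (1 - p_true) / n))

print(power(0.13, 138))  # ~0.34 -- about one third
print(power(0.15, 138))  # ~0.60
print(power(0.15, 250))  # ~0.80
print(power(0.14, 385))  # ~0.80
```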
Also, as an aside, you are randomly selecting these samples, right? If you’re not doing that, none of the calculations are valid.