I have a database of 28,000 records and take a sample of 32 records. I find that 5/32 of the records contain an error, for an error rate of about 16%.
Now, I calculate the confidence interval for this sample at 95% and find that it is +/-17.3%. Is it correct to say that I am 95% confident that the true error rate is somewhere between 0% and 33.3%? The part that is bugging me is that the lower end of the confidence interval would put the error rate at less than 0%.
Hmmm. That’s the one I used to get +/-17.3%. I have the sample size set at 32 and the population at 28,000 and the percentage at 50%. Is there something I should be doing differently?
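As a sketch of where the +/-17.3% presumably comes from: plugging p = 0.5 into the usual normal-approximation margin-of-error formula, with a finite-population correction for the 28,000 records, reproduces that figure, while using the observed 5/32 gives a narrower margin. This is an assumption about what the online calculator does, shown here in Python for illustration.

```python
import math

def margin_of_error(p, n, N=None, z=1.96):
    """95% margin of error for a sample proportion, with an optional
    finite-population correction for a population of size N."""
    m = z * math.sqrt(p * (1 - p) / n)
    if N is not None:
        m *= math.sqrt((N - n) / (N - 1))  # finite-population correction
    return m

print(margin_of_error(0.5, 32, N=28000))     # ~0.173, the +/-17.3% figure
print(margin_of_error(5 / 32, 32, N=28000))  # ~0.126, i.e. roughly 3% to 28%
```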
Okay, I see. What we are trying to demonstrate, partly, is that 32 is a pretty small sample to use in making a case that the error rate is really as high as 16% or more.
Is there any problem with using the 15.6% figure, since it came from that small sample in the first place?
Thanks. The other question we are trying to answer is whether it is really fair to say that there is evidence that the population has an error rate of 10% or greater, based on that sample.
Is it true that, since the lower bound of the confidence interval would be about 3%, it hasn’t really been reliably shown that the real error rate is greater than 10%? This may be a more subjective call.
I asked a related question, also driven by the desire to profile data error rates in a database. Here is the thread. **ultrafilter** was also the primary contributor to that thread.
If the real error rate was 10%, the probability of getting 5 or more errors out of a sample of 32 works out to [fires up Mathematica] about 21%. To put it another way, about one time out of five, a population with an error rate of 10% could produce the result you’ve seen here. I wouldn’t be terribly confident in excluding such a low error rate myself, given your data.
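For reference, the same tail probability can be reproduced without Mathematica; a minimal Python sketch, assuming scipy is available:

```python
from scipy.stats import binom

# P(5 or more errors in a sample of 32) if the true error rate were 10%
n, p0, k = 32, 0.10, 5
print(binom.sf(k - 1, n, p0))  # survival function: P(X >= 5), roughly 0.21
```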
Thanks, I was wondering if I was using the right distribution.
We are trying to avoid having to go back and check all 28,000 records by hand, and if we can make the case that the error rate hasn’t been shown to be 10% or more within reasonable bounds, we won’t have to.
Let’s temporarily ignore the actual numbers and talk about the underlying paradox.
Namely:
A) It is possible that a given sampling survey could have a number of error records that results in a confidence interval that contains 0.
B) If the error rate is actually 0, it is impossible to get the number of errors observed in part A.
This appears to be a contradiction.
The problem is the imprecise terminology. The interval you are estimating is not actually a “confidence interval”; it is instead a “prediction interval.”
It doesn’t actually give a region where there is a 95% probability that the true error rate falls. Instead it says: if the error rate I estimated were correct and I repeated the experiment, what range of values am I likely to get? So in your example (assuming your calculations were correct), the interval says: if the true error rate were 16% and I drew lots of 32 records at a time, then 95% of the time my estimate of the error rate based on that sample would be between 0% and 33%.
A true confidence interval is instead the region such that, if the true value were outside this region, there would be only a 5% chance of getting the results you observed by chance. In the case of truly normal data with a fixed variance, the two regions are equivalent, and so are often used interchangeably. But in your case, where the variance depends on the error rate (with a 0% error rate you are guaranteed to get 0 errors), the two regions aren’t equivalent.
In order to compute an interval in which we are 95% confident the actual error rate lies, we would need to use a Bayesian confidence interval. But to do that we would need to have some grasp of the underlying distribution of possible error rates, which is too much to ask for.
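To make the distinction concrete, here is a rough sketch, in Python with scipy, of the construction described two paragraphs up: scan candidate error rates and keep only those under which observing 5 errors out of 32 would not be too unlikely in either tail. Note that an error rate of 0% is excluded automatically, which dissolves the apparent paradox.

```python
from scipy.stats import binom

n, k, alpha = 32, 5, 0.05   # 5 errors observed in 32 records, 95% level

# Keep a candidate error rate p only if the observed count would not be too
# surprising in either direction (each tail probability above alpha/2).
candidates = [i / 10000 for i in range(10001)]
kept = [p for p in candidates
        if binom.sf(k - 1, n, p) > alpha / 2   # P(X >= 5 | p)
        and binom.cdf(k, n, p) > alpha / 2]    # P(X <= 5 | p)

print(min(kept), max(kept))  # roughly 0.05 to 0.33 -- the lower end stays above zero
```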
Okay, then. What do you think of the sample size they used to come up with the estimated error rate of 15.6% (5/32)? Could we argue that it was too small to be a reasonable sample of 28,000 records?
The 28,000 records are largely immaterial to the size of the sample you desire. A confidence interval for a proportion is largely independent of the population size, so long as the population is sufficiently larger than the sample (the textbook we use recommends 20x greater).
However, a good rule of thumb is to make sure the sample is large enough so that you see at least 10 errors and 10 good records, so, to that extent, your sample is too small.
With that, it would seem like checking another 32 would work, but that’s probably placing way too much reliance on what you’ve already checked. Rather, there are well established methods for determining a good sample size.
How large does it need to be? That depends on two things: first, the size of the margin of error you desire, and second, the proportion of records with errors. The size you need is n = p*(1-p)*(1.96/m)^2, where p is the proportion with errors and m is the margin you’d like. I’ll take p = .20 and m = .04 to be somewhat conservative. In this case, we find n is about 385. You’d be well served by going back and checking several hundred records for errors.
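A quick check of that arithmetic in Python; the second call plugs in the p = .10, m = .05 combination that comes up in the next post.

```python
import math

def sample_size(p, m, z=1.96):
    """n = p*(1-p)*(z/m)^2, rounded up: sample size for margin of error m."""
    return math.ceil(p * (1 - p) * (z / m) ** 2)

print(sample_size(0.20, 0.04))  # 385, the conservative choice above
print(sample_size(0.10, 0.05))  # 139, essentially the 138 discussed below
```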
Since we don’t know the real proportion with errors, would it make sense to use .10 to calculate a reasonable sample size, since that is the “allowable” proportion? By “allowable,” I mean that’s the proportion we can’t exceed if we are not going to be required to go back and check all 28,000 records. That gives a sample size of 138 with m = .05.
And, if we used that sample size of 138, what would it mean if the error rate in that sample turns out to be more than 10%?
It would mean that you’re at much greater risk of accepting the data as good, even though it should be thrown out.
The problem is that by understating the value of p and running a low sample size, you’re at a high risk of drawing an incorrect conclusion. You should estimate conservatively to reduce this risk.
More specifically, I’m assuming that as an outcome, you’d rather go through all the records than tolerate an error rate which is too high (say 13% or 14%). However, the power of your test is pretty low in these instances, meaning you risk exactly that. In fact, if the actual percentage of bad records is 13%, your power is only around one-third, meaning you’ll make the wrong decision for two-thirds of the possible samples you could draw. If the actual percentage is 15%, the power is still fairly low, running around .6.
The typical rule of thumb is that the power should be .8 or greater. If the actual proportion of errors is 15%, then around 250 records should be enough. My earlier suggestion of 385 records will (coincidentally) give a power of 80% for a true error proportion of 14%.
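A sketch of how these power figures can be reproduced. This assumes a one-sided test of a 10% error rate at the 5% level and the normal approximation to the sample proportion, so the exact numbers quoted above may differ slightly.

```python
import math
from scipy.stats import norm

def power(p_true, n, p0=0.10, alpha=0.05):
    """Approximate power of a one-sided test of H0: error rate <= p0."""
    # Reject H0 when the sample proportion exceeds this critical value.
    crit = p0 + norm.ppf(1 - alpha) * math.sqrt(p0 * (1 - p0) / n)
    # Probability of exceeding it when the true error rate is p_true.
    return norm.sf((crit - p_true) / math.sqrt(p_true * (1 - p_true) / n))

print(power(0.13, 138))  # ~0.34 -- about one third
print(power(0.15, 138))  # ~0.60
print(power(0.15, 250))  # ~0.80
print(power(0.14, 385))  # ~0.80
```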
Also, as an aside, you are randomly selecting these samples, right? If you’re not doing that, none of the calculations are valid.