Statistical sampling for beginners.

What’s the basic formula for doing random sampling? For example, if I’m testing canned food for botulism, out of X cans, how many (Y) would I need to sample to be 99% sure that none were contaminated?

How many cans are we talking about? Since this is destructive testing, it could get ugly unless we are talking 100K+ units, at which point you are looking at something like 12K test samples for 99% confidence, if I recall correctly.

Most of the time, testing in food production is handled prior to packaging, since packaging materials and product can each be tested before they are married up.

This might help:

For the OP’s example, it would depend on the prior assumption of how common botulism is.

Oh, so the probability of finding what you’re looking for factors in too. I.e., a fifty-fifty chance per sample vs. a one in ten thousand chance. I guess I’m really a beginner!
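
To put rough numbers on that: if each can is independently contaminated with probability p, the chance that n sampled cans all come up clean is (1 - p)^n, so n has to be large enough to push that below 1 minus your confidence. Here is a minimal Python sketch, assuming independent cans and a test that never misses (the function name is just for illustration):

```python
import math

def cans_to_sample(p_contaminated, confidence):
    """Smallest n such that, if the true contamination rate is p_contaminated,
    at least one bad can turns up in the sample with probability >= confidence.
    Assumes independent cans and a perfectly accurate test."""
    if not (0 < p_contaminated < 1 and 0 < confidence < 1):
        raise ValueError("both arguments must be strictly between 0 and 1")
    # Solve (1 - p)^n <= 1 - confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_contaminated))

# Compare a fifty-fifty rate, one in a hundred, and one in ten thousand.
for p in (0.5, 0.01, 0.0001):
    print(p, cans_to_sample(p, 0.99))
```

A fifty-fifty rate needs only a handful of cans, while a one in ten thousand rate needs tens of thousands, which is why the prior assumption matters so much.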

Actually, come to think of it, in testing for botulism the samples aren’t likely to be independent, either, so we’d want to know how dependent they are. Like, if botulism tends to crop up at the batch level, with a large number of cans (how many?) filled from each batch, then you only need to test one can from each batch.

One issue no one has yet addressed is sample confidence; that is, the confidence you have that the distribution you see in your sample of Bernoulli trials is representative of the population as a whole. For a 50% confidence level, this means that if you pull the same size sample again, you’ll get an answer within the original sample distribution half the time. At a 90% confidence level, you’ll get an answer within the original sample distribution 9 times out of 10. The reliability this provides depends on the binomial distribution (a discrete distribution that the normal, or Gaussian, continuous distribution approximates well for large samples) and assumes that the sampling is with replacement (i.e. that you aren’t skewing the remaining population by potentially removing a lot of positives or negatives). In reality, with Bernoulli trials you typically are not replacing tested units, but when the population is large relative to the sample, the covariance is minimal and it has essentially no impact on the result.
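
For anyone who wants to see how much the with-replacement assumption matters, here is a small Python sketch comparing the chance of drawing zero contaminated cans with replacement (binomial) and without replacement (hypergeometric). The lot sizes and contamination counts are made-up numbers for illustration only, and the function names are my own.

```python
import math

def p_miss_with_replacement(lot_size, bad, n):
    """Chance that n draws WITH replacement all come up clean (binomial)."""
    return (1 - bad / lot_size) ** n

def p_miss_without_replacement(lot_size, bad, n):
    """Chance that n draws WITHOUT replacement all come up clean
    (hypergeometric): choose all n cans from the clean ones."""
    if n > lot_size - bad:
        return 0.0  # more draws than clean cans, so a bad one must appear
    return math.comb(lot_size - bad, n) / math.comb(lot_size, n)

# Same 1% contamination rate and same sample size, two very different lot sizes.
for lot_size, bad in ((500, 5), (100_000, 1_000)):
    with_r = p_miss_with_replacement(lot_size, bad, 230)
    without_r = p_miss_without_replacement(lot_size, bad, 230)
    print(lot_size, round(with_r, 4), round(without_r, 4))
```

With the small lot, the two answers differ noticeably because the sample is nearly half the population; with the large lot, they are almost identical.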

Here is a useful calculator for selecting sample size to predict a minimum reliability at a given confidence level, or alternately, determine the minimum reliability achieved for a given confidence level by testing a certain number of samples. In the example provided, obtaining a 99% reliability at a 90% confidence level requires 230 successful trials and no failures. If failures are allowed, the equivalent requirement is 388 samples with one failure, or 531 samples with two failures. At 50% confidence, the numbers to achieve 99% reliability are 69, 168 and 268, respectively, so you can see how changing the confidence level makes a difference.
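
If you would rather compute these yourself than rely on a calculator, here is a minimal Python sketch. It assumes independent binomial trials and simply searches for the smallest sample size at which seeing that few failures would be unlikely enough if the true reliability were any lower than the target. The function name is my own, not taken from any particular calculator.

```python
import math

def min_samples(reliability, confidence, failures=0):
    """Smallest n such that observing at most `failures` failures in n
    independent trials demonstrates `reliability` at `confidence`.
    Assumes a binomial model (independent, identically distributed trials)."""
    if not (0 < reliability < 1 and 0 < confidence < 1) or failures < 0:
        raise ValueError("reliability and confidence must be in (0, 1), failures >= 0")
    q = 1 - reliability  # per-trial failure probability at the reliability bound
    n = failures + 1
    while True:
        # P(at most `failures` failures in n trials) if reliability is exactly at the bound
        cdf = sum(math.comb(n, k) * q**k * reliability**(n - k)
                  for k in range(failures + 1))
        if cdf <= 1 - confidence:
            return n
        n += 1

for conf in (0.90, 0.50):
    print(conf, [min_samples(0.99, conf, f) for f in (0, 1, 2)])
```

With zero failures this reduces to the familiar n = ln(1 - C) / ln(R), rounded up, which gives the 230 and 69 figures for 99% reliability at 90% and 50% confidence.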

Note that this assumes that the population follows the binomial distribution, which is reasonable only for a very limited set of circumstances. In general, the Weibull distribution is used to represent component reliability; however, extensive testing is needed to determine the shape and characteristic life parameters for the Weibull distribution.
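
For reference, the two-parameter Weibull reliability function is R(t) = exp(-(t/eta)^beta), where beta is the shape parameter and eta is the characteristic life. A minimal Python sketch follows; the parameter values are placeholders for illustration, not fitted to any real data.

```python
import math

def weibull_reliability(t, shape, char_life):
    """Probability that a unit survives past time t under a two-parameter
    Weibull model with shape `shape` (beta) and characteristic life
    `char_life` (eta)."""
    if t < 0 or shape <= 0 or char_life <= 0:
        raise ValueError("t must be >= 0; shape and char_life must be > 0")
    return math.exp(-((t / char_life) ** shape))

# Placeholder parameters: shape 1.5, characteristic life 1000 hours.
for hours in (100, 500, 1000):
    print(hours, round(weibull_reliability(hours, 1.5, 1000.0), 4))
```

Whatever the shape, reliability at t equal to the characteristic life is always exp(-1), about 36.8%, which is what makes eta a convenient scale parameter.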

Stranger