We have to prove 95% success with 90% confidence when testing our product. One co-worker argues that 20 successful tests give you 100% confidence. Apparently, it is not that simple. (The logic is that the next 20 tests could all be failures.) So someone suggested the binomial distribution method, for which there are Excel spreadsheet calculators online. The results show 45 successful tests are needed, and 1 failure pushes this number to 77 successful tests. This demands a ton of manpower (and womanpower) to physically carry out 45 tests per section of a product that is miles long, divided into manageable sections. In short, each section would need 45 successful test iterations!?! Is there a better way? Any way to take credit for (presumably) successful testing of prior sections to reduce subsequent sectional tests?
Anyway, something bugs me about this approach for my case. The examples always refer to a number of samples (n) from a large population (N), and, as I read it, Wikipedia confirms this is the approach when that criterion is met. In actuality, I have no idea how many samples (n) are needed*, and the population (N) is infinite, isn’t it? *In all the examples, “n” is the number of samples taken, not required, as in my specific case. My point being that, in the former case, “n” seems arbitrary; but in my case, “n” seems crucial to meeting the required 95% success rate with 90% confidence.
There’s something comically paradoxical in all this immutable mathematicae. If 1 failure occurs in 45 tests, then what confidence can I have in 77 tests??? Or, are the experts hedging their bets by not pushing for 90 tests?
Your first coworker’s logic is obviously nonsensical. If you haven’t tested every single product (with a perfect test), then how can you ever be perfectly confident that all of them are good? There is always some nonzero probability that the bad product was outside your sample. I’m not sure what the comment about 20 subsequent failures means.
If 95% of your products are good, and you sample n products at random, then the probability that all n are good is 0.95^n. For example, if you sample 20 products, then the probability that all 20 are good is 0.95^20 ~ 36% (~1/e, incidentally).
If you sample 45 products, then the probability that you see zero failures is 0.95^45 ~ 10%. So I think your math is right.
If you sample 77 products, then the probability that you see zero failures is 0.95^77 ~ 2%. The probability that you see one failure is 0.95^76*0.05^1*(77 choose 1) ~ 8%. Summing those, the probability that you see zero or one failures is again ~10%, so I think your math is again right.
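The arithmetic above is easy to check with a few lines of Python (a sketch using only the standard library; `binom_pmf` is just a helper name, not anything from the thread):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k failures in n trials, each failing with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# With a true 5% failure rate:
p_zero_of_45 = binom_pmf(0, 45, 0.05)                              # ~0.0994
p_le_one_of_77 = binom_pmf(0, 77, 0.05) + binom_pmf(1, 77, 0.05)   # ~0.0973

print(round(p_zero_of_45, 4))    # 0.0994
print(round(p_le_one_of_77, 4))  # 0.0973
```

Both acceptance plans leave just under a 10% chance of passing when the true failure rate is exactly 5%, which is where the "90% confidence" comes from.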
If you test a population more than once, then think carefully about how the tests interact. You can’t e.g. keep resampling until you pass, and then call the batch good; all of the observations that you’ve made must be considered.
In your binomial distribution, N is your sample size, and n is the number of good or bad products in your sample. Neither is infinite. The experts aren’t making this up, and there is no paradox. This is solid math that models yield very well in many (but not all) real-life situations.
There are a lot of questions here and I’m not sure I fully understand all of them but I’ll do my best.
TLDR: Unfortunately it’s even worse than you thought.
Let’s start with the basics of the binomial distribution and how to calculate a required sample size. Using Wikipedia’s nomenclature, n is the number of tests performed and k is the number of failures.
The first step is to work out what combination of successes and failures will demonstrate with high confidence that the error rate isn’t any higher than 5%. This requires more than just doing enough runs to have a calculated error rate less than 5%. Using your co-worker’s suggestion you would estimate 0/20, or a 100% success rate. But suppose the true error rate was actually 7%: you could still get 0/20 failures; in fact you would get it about 23% of the time. So you can’t show these 20 runs to your client as a guarantee of a 5% error rate; he would say that you were just a bit lucky.
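That 23% figure is a one-liner to verify (the 7% is the hypothetical error rate from above):

```python
# If the true failure rate were 7%, how often would all 20 tests still pass?
p_all_pass = 0.93 ** 20
print(f"{p_all_pass:.1%}")  # 23.4%
```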
What you want to be able to tell your client is that if your error rate were 5.001%, there was only a 10% chance that you would have gotten results this good (this is your 90% confidence). This is where your calculations of 0/45 or 1/77 come from. These are derived by noting that if you have a 5% failure rate, then there is only a 9.9% chance that you will not observe any failures in 45 runs. Similarly, there is a 9.7% chance that you will observe 1 or fewer failures out of 77 runs with a 5% error rate.
But unfortunately it doesn’t end there. Suppose you actually have a 3% error rate. If you do 45 runs, you will have around a 75% chance of seeing at least one failure, so most of the time you won’t succeed. So in order to have a good shot at actually succeeding with your demonstration you will need many more runs. In order to have something like a 73% chance of success, you would probably need to run something like 353 tests (and hope that no more than 12 come up as failures). If you actually have a much lower failure rate than 3%, then you can get away with fewer. With a 1% true error rate, 45 samples will be enough 64% of the time, and 77 samples will be enough 82% of the time.
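This kind of plan can be found by brute-force search: pick the largest acceptance number that still gives 90% confidence at the 5% limit, then grow n until the chance of actually passing at the true (better) rate is acceptable. A sketch, with made-up helper names (`plan`, `binom_cdf`) and the 3%/73% figures from above as the assumed targets:

```python
from math import comb

def binom_cdf(c, n, p):
    """Probability of at most c failures in n trials, each failing with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def plan(p_limit=0.05, conf=0.90, p_true=0.03, power=0.73, n_max=1000):
    """Smallest (n, c) such that passing with <= c failures demonstrates the
    failure rate is below p_limit at the given confidence, while the chance
    of passing at the true rate p_true is at least `power`."""
    for n in range(1, n_max + 1):
        # largest acceptance number c that still gives the required confidence
        c = -1
        while binom_cdf(c + 1, n, p_limit) <= 1 - conf:
            c += 1
        if c >= 0 and binom_cdf(c, n, p_true) >= power:
            return n, c
    return None

print(plan())  # a (tests required, failures allowed) pair in the region of (353, 12)
```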
Now your final question seems to relate to the fact that there are sections of a long product (a railroad?). How to proceed depends on several factors.
1) Does the required 5% error rate refer to a particular section, or to the product as a whole (so that a failure of any one section of many will lead to a failure of the whole)?
If it’s only 5% per section, then you only need to test 45 (or whatever number) times in total across sections; you don’t need to test each section that many times.
If the 5% failure rate refers to the entire product, then that means you are going to need a much lower than 5% error rate on each section (roughly 0.05/(# sections)), which could massively increase your required testing. Alternatively, test the entire product end-to-end 45 (or whatever number) times.
2) Is there any reason to believe that certain sections may have higher error rates than other sections?
If not, then you can probably just repeatedly test the same section and assume equal failure rates elsewhere. Otherwise, you will have to spread your testing across the multiple sections.
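To see why the whole-product case is so much harsher, here is the per-section budget under independence (a sketch; the 10-section count is made up for illustration):

```python
# If the whole product must fail at most 5% of the time and, say, 10
# independent sections must all pass, each section's allowed failure rate is:
sections = 10
exact = 1 - (1 - 0.05) ** (1 / sections)  # exact, assuming independent sections
rule_of_thumb = 0.05 / sections           # the 0.05/(# sections) approximation
print(exact, rule_of_thumb)               # both about 0.5% per section
```

The rule of thumb 0.05/(# sections) is very close to the exact independence calculation for small rates, which is why it is a reasonable planning number.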
This was probably not the answer you were hoping for, but I hope this helps.
True, and that tradeoff is intrinsic to this kind of sampling: the closer your actual defect rate gets to the limit, the more you’ll have to sample to determine whether it’s above or below. With luck, whoever set the specification considered that tradeoff (of increased inspection effort vs. increased effort maintaining the process at a better yield than you might actually need) in some intelligent way.
The OP might search the words “acceptable quality level” or “acceptance sampling”. This is a pretty established thing, with lots of books, tables, software, etc. to help.
It sounds to me like you need a better way of thinking of your problem.
If you are only testing a portion of your product - then that only tells you something about that section - unless there is some sort of correlation between that piece and the others.
I’m not sure what you are testing that is miles long, but if it was something like a pipeline - you’d have to come up with a metric for testing the individual segments - as well as the joint and construction.
I think you are on the right track with your skepticism of your coworker, and you can easily construct simulations in Excel to sort of understand what you need (the great thing about simulations is you can test things where you don’t necessarily 100% understand the math, as long as your assumptions are correct).
If the individual pieces are not correlated with each other, you’d need to multiply the scores you give the individual pieces together to come up with an overall score, but that assumes zero error would come from the joining (if any) of the individual sections.
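The same kind of simulation is easy in Python as well as Excel. A minimal sketch (the name `simulate_pass_rate` and the 45-test/5% example are just for illustration):

```python
import random

def simulate_pass_rate(n_tests, max_failures, true_fail_rate, trials=100_000):
    """Fraction of simulated campaigns that pass: run n_tests independent
    tests, each failing with probability true_fail_rate, and count the
    campaign as a pass if at most max_failures tests fail."""
    passes = 0
    for _ in range(trials):
        failures = sum(random.random() < true_fail_rate for _ in range(n_tests))
        passes += failures <= max_failures
    return passes / trials

random.seed(1)
# At a true 5% failure rate, a 0-failures-in-45 campaign should pass
# roughly 10% of the time, matching the exact binomial calculation.
print(simulate_pass_rate(45, 0, 0.05))
```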
Sometimes I find a nice long walk to try and think of different ways to describe your problem can be helpful. There is usually more than one way to skin a cat.
I followed your logic - until the second calc above. You started with 0.95^n, and then inserted extra terms: 0.96^76*0.05^1*(77 choose 1) = huh? After the basic equation, where did the extra terms come from, and what does “77 choose 1” mean? At first, I thought it was a parenthetical, but then I noticed the asterisk meaning it is part of the formula?
Sorry, I mean COMBIN(77, 1) as Excel would write it, the number of ways to choose a combination of 1 item from a set of 77 items.
You have a typo in your copy of that expression; 0.96 should be 0.95. If one item is bad, then there are 76 good items. So that expression is the probability that 76 items are good (95% each), and one item is bad (5% each), multiplied by the number of different ways that one item in that set can be bad. Without the adjustment by COMBIN(77, 1), that would just be the probability that a particular one item (e.g., the first item that you sample) is bad, not that any one item is bad.
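Spelled out numerically, with Python’s `math.comb` playing the role of Excel’s COMBIN (and using the corrected 0.95):

```python
from math import comb

p_good, p_bad = 0.95, 0.05
one_specific_bad = p_good**76 * p_bad         # one particular item of 77 is the bad one
any_one_bad = comb(77, 1) * one_specific_bad  # any of the 77 positions could be the bad one
print(round(any_one_bad, 4))  # 0.0781
```

Here comb(77, 1) is just 77, since there are 77 ways to pick which single item is bad.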
You may want to look up something like acceptance sampling. It sounds like you are trying to ensure that the quality level is greater than 95%, which is equivalent to an acceptable defect level of 5%. It is usually referred to as an acceptable quality level (AQL) of 5%. There are tables, such as MIL-STD 105, that give sample sizes for a variety of situations, based on the AQL.