Need help determining the best way to sample a population.
So I have 1000 automobiles, each with a unique VIN. I have the following data for each car:
A> **Year of manufacture**: 2012, 2013, 2014, 2015 (4 possible values)
B> **Size**: Compact, Sedan, Luxury, Crossover, SUV, Truck (6 possible values)
C> **Manufacturer**: Firm 1, Firm 2, Firm 3 (3 possible values)
A car can be tested and found to be defective or not (two possible outcomes, 1 or 0). The hypothesis is that factors A, B, or C, or some combination of them, affect the test result.
What would be the best way to design a sampling plan (with known confidence) to find out which of A, B, or C affects the outcome?
I am not looking for you to solve the above for me. Just point me to some resources / examples. I have access to Minitab, if needed.
I’m no expert, but the Wikipedia page on logistic regression mentions a bare-minimum “Rule of Ten”: you need to sample at the very least 10k/p cars, where k = 3 is the number of variables and p is the proportion of defective cars. So your answer will depend on a rough estimate of the proportion of defective cars (which makes sense: say none of the cars turn out to be defective; then obviously you learn nothing about the effect of the factors, even if you tested all 1000 cars. On the other hand, if 10% are defective, then you can get away with sampling 300 cars.)
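A quick sketch of that arithmetic, assuming k = 3 counts only the three main-effect variables (interaction terms would push k, and hence the sample size, higher):

```python
# Back-of-the-envelope "Rule of Ten" check: n >= 10 * k / p,
# with k predictor variables and p the estimated defect proportion.

k = 3  # factors: year, size, manufacturer

for p in (0.5, 0.1, 0.01, 0.001):
    n = 10 * k / p
    print(f"defect rate {p:>6.1%}: need at least {n:,.0f} cars")
```

At a 10% defect rate this gives the 300 cars quoted above; at 0.1% it already demands 30,000.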
Following that through: at any modern manufacturing defect rate you’ll need to sample all 1000 cars, and may still need another 10,000 or 100,000 examples before you can make any kind of reliable statistical argument.
Informally: when you’re looking for just a couple of needles known to be somewhere in a haystack, you’re going to have to handle a lot of straw to find them. If the pile is nearly half needles and half straw, you can work with a much smaller sample.
The first two links deal with the probabilities and statistics of your problem. As already noted by others, if your defect rate is small you will need a lot of data to end up with enough defects in your sample to say anything useful.
The third link discusses a complication that you have set yourself up for by not having a specific hypothesis going in. In particular, you are doing a large number of tests simultaneously, giving yourself many independent rolls of the dice to find a spurious signal. You also have many non-independent tests happening simultaneously, which makes a direct analytical treatment difficult (more below).
For the independent cases, you have 4 × 6 × 3 = 72 different categories of car. That’s 72 chances for random fluctuations to give you a result extreme enough to catch your eye. If you want to be 95% confident that you are looking at a true signal in this undirected search, you can’t just apply a standard 95% test to each category and repeat it 72 times, since 1-in-20 of these attempts (5%) are expected to show a signal at the 95% confidence level by chance alone.
The rough correction for this would be to require 99.93% confidence in each test rather than 95% (N.B.: 5%/72 ≈ 0.07%). You also have to consider that you are actually doing 144 tests, not 72, since you are looking for both higher and lower defect rates. (Whether this needs a separate correction depends on how you calculate the confidence level of each individual test; the relevant concept is one-sided vs. two-sided confidence intervals.)
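A minimal sketch of that Bonferroni-style adjustment, assuming the 72 cells and the 5% family-wise error rate discussed above:

```python
# Bonferroni-style correction for running many tests at once.

family_alpha = 0.05   # desired overall false-positive rate
n_tests = 72          # one test per (year, size, manufacturer) cell

per_test_alpha = family_alpha / n_tests
print(f"per-test alpha:      {per_test_alpha:.4%}")   # ~0.0694%
print(f"per-test confidence: {1 - per_test_alpha:.2%}")  # ~99.93%

# Searching for both unusually high AND unusually low rates doubles
# the number of chances to be fooled:
print(f"two-sided per-test alpha: {family_alpha / (2 * n_tests):.4%}")
```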
You also have some non-independent groupings. That is, the 72 categories are mutually exclusive, but there are other ways of grouping the same cars, like “even years vs. odd years” or “Firm 1 vs. Firms 2 and 3”. These also provide more chances of seeing a spurious signal from fluctuations, but because the same data shows up in multiple groupings, the corrections are not straightforward.
All told, this is the sort of problem where I very quickly resort to Monte Carlo methods to estimate anything. Regardless of the test statistic you eventually decide to use, you can simulate a zillion random draws under the null hypothesis (all defect rates equal), or under some other model, to quantify how likely it is that you are looking at a spuriously high (or low) test statistic.
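A minimal sketch of that Monte Carlo idea, assuming an even split of the 1000 cars over the 72 cells and an illustrative 5% null defect rate; the maximum per-cell defect count stands in for whatever test statistic you eventually choose:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 1000 cars spread evenly over 72 cells, sharing a
# common defect rate under the null hypothesis ("all rates equal").
n_cells = 72
cars_per_cell = np.full(n_cells, 1000 // n_cells)  # 13-14 cars each
cars_per_cell[: 1000 % n_cells] += 1               # distribute the remainder
p_null = 0.05                                      # assumed overall defect rate
n_sims = 100_000                                   # "a zillion" random draws

# For each simulated fleet, record the largest defect count seen in any
# single cell. Its spread tells you how extreme a cell must look before
# it is surprising *given that you scanned all 72 cells*.
defects = rng.binomial(cars_per_cell, p_null, size=(n_sims, n_cells))
max_defects = defects.max(axis=1)

for q in (0.5, 0.95, 0.99):
    print(f"{q:.0%} of simulated fleets have a max cell count <= "
          f"{np.quantile(max_defects, q):.0f}")
```

An observed cell above the 99% quantile of this null distribution would be surprising even after accounting for the 72-way search; the overlapping groupings from the previous paragraph can be handled the same way, by recomputing each grouping's statistic on every simulated fleet.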