I have a bunch of automated tests that run on software. Each one takes some amount of time and can return one of a handful of results, mostly Pass and Fail.
The full set of tests takes many hours to run, but what I’d like to find is a subset that is the most useful per unit-time, so I can do something like run tests for an hour, and have the most useful hour’s worth of tests run.
A test is useful if it sometimes fails and sometimes passes. A test is more useful if it fails when other tests don’t; basically, if its results are orthogonal to the results of other tests. A test is also useful if it’s the fastest-running test in a group that tends to fail together, since it’ll find a failure faster than the others.
I have a feeling/hope that there’s a fairly simple statistical method for figuring this out. What should I be looking at?
It is not important that I find an exact answer to this problem, if an exact answer turns out to be intractable. A partial answer or a heuristic will still very likely improve things greatly.
I’m not a statistician so I can only provide some ad-hoc ideas.
I think what I’d do is this: find the test that is most correlated with the remaining tests, and throw it out. Repeat until you have the desired number of tests.
To calculate that, add up the pairwise correlations between test N and each of the other tests. The pairwise correlation between two tests is just the number of runs where they gave matching results divided by the total number of runs.
This doesn’t take running time into account. If that’s a significant factor, you could incorporate it as a weighting factor. You might also consider functions other than summation; a max function might work as well.
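If it helps, here’s a minimal Python sketch of that pruning loop (my own rough version, not a canonical algorithm). It assumes each test’s history is stored as an equal-length string of 'P'/'F' results, and it ignores running time:

```python
# Minimal sketch of the correlation-sum pruning idea. Assumes `results` maps each
# test name to an equal-length string of 'P'/'F' outcomes (one character per run).

def pairwise_correlation(a, b):
    """Fraction of runs in which two tests gave the same result."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def prune_tests(results, keep):
    """Repeatedly drop the test whose summed correlation with the rest is highest."""
    remaining = dict(results)
    while len(remaining) > keep:
        sums = {
            name: sum(pairwise_correlation(outcomes, other)
                      for other_name, other in remaining.items()
                      if other_name != name)
            for name, outcomes in remaining.items()
        }
        most_redundant = max(sums, key=sums.get)
        del remaining[most_redundant]
    return list(remaining)
```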
Another thing: what kind of numbers are we talking about here? There are algorithms that might work better if we’re talking about small numbers. For instance, if there are 10 tests and you want the best 3-test subset, then you can check every possible combination against some metric and pick the best (since 10·9·8 isn’t very much). If there are 1000 tests and you want the best 100, then you need a different algorithm.
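For the small-numbers case, the brute-force check is only a few lines. A hedged sketch; the metric used here (runs in which the subset catches at least one failure) is just one possible choice:

```python
# Brute-force sketch for small test counts: score every k-test subset and keep the best.
# `results` is the same test-name -> 'P'/'F' string mapping as above.

from itertools import combinations

def runs_caught(subset, results):
    """Number of runs in which at least one test in the subset failed."""
    n_runs = len(next(iter(results.values())))
    return sum(any(results[t][i] == 'F' for t in subset) for i in range(n_runs))

def best_subset(results, k):
    return max(combinations(results, k), key=lambda s: runs_caught(s, results))
```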
Perhaps principal component analysis (PCA). I’m not sure how well it would work with pass/fail data, but that may be an area to look into.
PCA will group tests that give pretty much the same result (highly correlated info, whether positively or negatively correlated), and will also separate which groups of tests give the most information (both pass and fail). Using PCA effectively is one part science and two parts art, so it is not a plug and chug type of exercise.
Or simpler may be to just look at the pairwise correlation between the tests.
Denote the outcome of each test by a Bernoulli random variable, and let K denote their covariance matrix. If you sample from a determinantal point process with kernel K, you’re effectively assigning higher probability to subsets of the possible tests where each test is informative and each pair of tests is uncorrelated. See Alex Kulesza’s PhD thesis for the details.
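Exact DPP sampling takes some machinery, so as a rough, hedged stand-in, here is a greedy sketch that just keeps adding the test that most increases the log-determinant of the kernel submatrix; it captures the “informative and mutually uncorrelated” flavor without being a real sampler. It assumes `outcomes` is a runs-by-tests 0/1 numpy array, and the jitter term is my own addition to keep the submatrices well-conditioned:

```python
import numpy as np

def greedy_diverse_subset(outcomes, k, jitter=1e-6):
    """Greedily pick k test indices whose covariance submatrix has a large determinant."""
    K = np.cov(outcomes, rowvar=False)        # test-by-test covariance kernel
    K = K + jitter * np.eye(K.shape[0])       # small ridge so determinants stay positive
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(K.shape[0]) if j not in chosen]
        def logdet_with(j):
            idx = chosen + [j]
            return np.linalg.slogdet(K[np.ix_(idx, idx)])[1]
        chosen.append(max(candidates, key=logdet_with))
    return chosen
```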
It’s hundreds of tests, each test running dozens of times a day (on a different hardware variation). It may grow to thousands of tests and hundreds of pieces of hardware. I have data stretching back at least a year. Longer if I want to go dig through old archived log files.
I’m not sure I understand pairwise correlation. Let’s say, as an example, I have three tests, A, B, and C, and 5 days of results:
A: PPPPF
B: PPFFP
C: PPFFF
Now, I can see that test C isn’t providing any useful information in this case. It’s failing when either of the other tests fails.
My guess is that the pairwise correlations are:
(A,B): 40%
(A,C): 60%
(B,C): 80%
So, averaging those out, A is the least correlated with the others. Is that right?
It seems like I only care when a failure is correlated. That is: when everything passes, I don’t really learn anything about the tests, so maybe I should throw those runs out and only consider the cases where one test passes and one fails, or both fail.
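Maybe something like a Jaccard index on just the failures? A rough sketch of what I mean, using the same P/F strings as my example:

```python
# Failure-only similarity: ignore runs where both tests pass and measure how much
# the two failure sets overlap (a Jaccard index on failures).

def failure_jaccard(a, b):
    fa = {i for i, r in enumerate(a) if r == 'F'}
    fb = {i for i, r in enumerate(b) if r == 'F'}
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0

# With the example above: failure_jaccard("PPPPF", "PPFFF") gives 1/3,
# while failure_jaccard("PPFFP", "PPFFF") gives 2/3.
```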
ultrafilter, I didn’t understand any of that. I think I’m aiming at something a little simpler than a PhD thesis to get started at least.
You’ve got it. My suggestion was to add up the correlations that each test has with every other test. So:
A: (A,B) + (A,C) = 1.0
B: (A,B) + (B,C) = 1.2
C: (A,C) + (B,C) = 1.4
As I said, you throw out the one with the largest correlation sum, in this case C. So that matches your intuitive view of it being the least useful. You can’t really repeat the process with just two tests left, but if you had more you could keep throwing out tests until you got down to the desired number (or runtime).
BTW, the “covariance matrix” (in this case) is just the NxN matrix whose (i, j) entry is the correlation between tests X_i and X_j.
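To make the arithmetic concrete, here are a few lines of Python that reproduce the numbers above (just the matching-fraction definition of correlation, nothing fancier):

```python
results = {"A": "PPPPF", "B": "PPFFP", "C": "PPFFF"}

def corr(a, b):
    """Fraction of runs with matching results."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

pairs = {(s, t): corr(results[s], results[t])
         for s in results for t in results if s < t}
print(pairs)   # {('A', 'B'): 0.4, ('A', 'C'): 0.6, ('B', 'C'): 0.8}

sums = {t: round(sum(v for pair, v in pairs.items() if t in pair), 2)
        for t in results}
print(sums)    # {'A': 1.0, 'B': 1.2, 'C': 1.4} -> C goes first
```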
I wasn’t sure about that based on your description. My second suggestion was going to be the following: iterate through all possible acceptable subsets of your tests, compute for each one the number of trials in which at least one test in the subset failed, and then pick the combination that catches the most failing trials.
You could do this stochastically and probably come up with a decent “good enough” result. Obviously you can’t go through every combination if you have hundreds of tests.
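A dumb random-sampling version of that might look like this (rough sketch only; `runs_caught` is the count of trials in which the subset catches at least one failure):

```python
import random

def runs_caught(subset, results):
    """Trials in which at least one test in the subset failed."""
    n_runs = len(next(iter(results.values())))
    return sum(any(results[t][i] == 'F' for t in subset) for i in range(n_runs))

def random_search(results, k, iterations=10_000, seed=0):
    """Sample random k-test subsets and keep the best one seen."""
    rng = random.Random(seed)
    names = list(results)
    best, best_score = None, -1
    for _ in range(iterations):
        subset = rng.sample(names, k)
        score = runs_caught(subset, results)
        if score > best_score:
            best, best_score = subset, score
    return best, best_score
```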
IIUC, what matters for your criterion is domination more than anti-correlation. Here’s an alternative approach which might have merit:
(1) Assign each failure a weight of one. A KEEP pile, to which you will add tests, is initially empty.
(2) For each test, assign a score equal to (weighted) failures detected per hour.
(3) Place the highest-scoring test (or a few such tests) into the KEEP pile.
(4) Reassign failure weights. Configurations whose failures are already detected by tests in KEEP would get low weights. In the simplest case, weights would be 0 or 1, but it might be better to assign intermediate weights to configurations that only a few KEEP tests detect.
(5) Repeat from (2) until Goldilocks is pleased with the KEEP pile. (A rough sketch of this loop follows.)
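Here is a rough Python sketch of that loop using the simplest 0/1 weights. The names `results` (test name to a 'P'/'F' string, one character per configuration/run) and `runtime_hours` (test name to running time) are just made-up placeholders:

```python
def build_keep_pile(results, runtime_hours, budget_hours):
    """Greedy KEEP-pile selection with 0/1 failure weights."""
    n_runs = len(next(iter(results.values())))
    # (1) every run that failed somewhere starts with weight 1
    weights = [1.0 if any(r[i] == 'F' for r in results.values()) else 0.0
               for i in range(n_runs)]
    keep, spent = [], 0.0
    candidates = set(results)
    while candidates and spent < budget_hours:
        # (2) score = weighted failures detected per hour
        def score(t):
            caught = sum(weights[i] for i in range(n_runs) if results[t][i] == 'F')
            return caught / runtime_hours[t]
        # (3) take the highest-scoring test
        best = max(candidates, key=score)
        if score(best) == 0:
            break                       # nothing left worth adding
        keep.append(best)
        spent += runtime_hours[best]
        candidates.remove(best)
        # (4) zero out the weight of failures the pile now catches
        for i in range(n_runs):
            if results[best][i] == 'F':
                weights[i] = 0.0
    # (5) loop continues until the budget or the candidates run out
    return keep
```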
If a computer fails a test, does it matter how many tests it fails? In other words, is this a situation where any failure of any sort calls for a complete replacement of the failing component, or is a failure on one or two tests just something where you say “Eh, good enough for now, but keep an eye on it in case it gets worse”?
I like septimus’s idea. That fits my intuitive understanding of what I’m looking for. I see there’s a Wikipedia page on stochastic dominance. Any idea where I could learn a bit more about it, with some examples to work through?
Chronos, for the purposes of the subset of tests, if any test fails, that counts as a failure for the whole thing. In the larger body of tests, there are some that we expect to fail occasionally due to circumstances beyond our control, but those tests would not be part of the subset.
The subset is intended as a test before checking in a code change. If anything fails, the developer has to investigate and fix the failure in their code.
I’d treat this as a discrete optimization problem, to find the k of n tests that minimize the number of false-passes (where none of the k tests fail, but at least one of the n tests does) over your historical dataset. This doesn’t feel like a statistics problem to me.
We could find that set by a brute force search, but the search space gets huge really fast. We could start with a greedy search: find the best one test by that metric, and then find the best set of two tests that includes the previous one, and then the best set of three tests that includes the previous two, and so on.
Or, instead of always adding the best choice, at each step, we could score all the tests according to the number of false-passes that they catch, and randomly choose tests to add or drop, with probability weighted by that score. We’d choose the weights with a form like m^a, where m is the number of false-passes caught and a is an exponent. If a is zero, then the algorithm just chooses random tests. If a is big, then the algorithm becomes a greedy search. As we ran the algorithm, we would slowly increase a with each step, until the solution stabilized.
This is guaranteed to find a locally optimal solution. It’s not guaranteed to find a globally optimal solution. The more steps you run, and the more slowly you increase a, the closer your solution will get to globally optimal. This is simulated annealing, a well-studied and generally useful class of algorithms. Many other standard discrete optimization techniques might be applicable here too.
If you want to incorporate test runtime, then we could instead minimize something like a weighted sum of the number of false-passes and the total test running time, just by changing the definition of our score to consider that.
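To make that concrete, here is a hedged sketch of a plain simulated-annealing loop over subsets. The proposal and acceptance rule here are simpler than the m^a weighting described above, and `runtime_penalty` is a made-up knob you’d have to tune:

```python
import math
import random

def objective(subset, results, runtime_hours, runtime_penalty=0.1):
    """Weighted sum of false-passes and total runtime (lower is better)."""
    n_runs = len(next(iter(results.values())))
    false_passes = sum(
        1 for i in range(n_runs)
        if any(r[i] == 'F' for r in results.values())        # some test failed
        and not any(results[t][i] == 'F' for t in subset)    # but the subset missed it
    )
    return false_passes + runtime_penalty * sum(runtime_hours[t] for t in subset)

def anneal(results, runtime_hours, steps=20_000, seed=0):
    rng = random.Random(seed)
    names = list(results)
    current = set(rng.sample(names, max(1, len(names) // 10)))
    cur_obj = objective(current, results, runtime_hours)
    best, best_obj = set(current), cur_obj
    for step in range(steps):
        temp = max(1e-3, 1.0 - step / steps)                 # simple cooling schedule
        candidate = set(current)
        candidate.symmetric_difference_update({rng.choice(names)})  # toggle one test
        if not candidate:
            continue
        cand_obj = objective(candidate, results, runtime_hours)
        # always accept improvements; accept worsenings with a temperature-dependent probability
        if cand_obj <= cur_obj or rng.random() < math.exp((cur_obj - cand_obj) / temp):
            current, cur_obj = candidate, cand_obj
            if cur_obj < best_obj:
                best, best_obj = set(current), cur_obj
    return best, best_obj
```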
I realize I’m jumping in late but I think the OP is looking for the most efficient series of tests. If every problem that is caught by Test A is also caught by Test B, then there’s no point in running Test A when you’re also running Test B. You want to find a series of tests that catch the maximum number of problems with the minimum amount of overlapping catches.
I suspect it should be posed as a “fuzzy” problem, with the solution being a compromise between thoroughness and efficiency.
For example, if you finish with a test repertoire that includes test A, and many defective pieces fail only A (though they also failed other tests which have been excluded from the repertoire), there might be a concern that some inadequacy in A might allow defects to pass unnoticed; after all, your experimental database is not exhaustive. Including one or two other tests might address this problem, even though they would be redundant if your experimental database were considered fully reliable and exhaustive.
I think my approach is more easily adapted to such fuzzy criteria.
If you care about that, then instead of minimizing the number of false-passes, minimize a weighted sum of the number of false-passes, the number of failures with only one test failing, some possibly-nonlinear function of the total number of test fails in each run and over all history, etc. Your objective function will already have a second term if you care about runtime, so that would just be a third. There is no particular need for that objective function to have a simple mathematical form here.
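For example, such an objective might look like this rough sketch, where w_miss, w_single, and w_runtime are made-up weights you’d pick by judgment:

```python
# Hedged sketch of a composite objective with several judgment-call weights.

def composite_objective(subset, results, runtime_hours,
                        w_miss=1.0, w_single=0.25, w_runtime=0.1):
    n_runs = len(next(iter(results.values())))
    misses = single_coverage = 0
    for i in range(n_runs):
        failing = [t for t in subset if results[t][i] == 'F']
        any_fail = any(r[i] == 'F' for r in results.values())
        if any_fail and not failing:
            misses += 1                  # false-pass: the subset missed a real failure
        elif len(failing) == 1:
            single_coverage += 1         # failure caught by only one test in the subset
    runtime = sum(runtime_hours[t] for t in subset)
    return w_miss * misses + w_single * single_coverage + w_runtime * runtime
```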
The fuzziness is captured by your choice of weights in the function to be minimized. That separates the problem of specifying what you want (which requires human judgment) from the problem of achieving it (which is entirely objective; though when an exact solution is computationally infeasible, a bit of judgment is still required on how best to approximate).