Statistics & probability help

I am monitoring a system for an occasional event. Over the last four months, it has occurred 233 times out of 227,000 opportunities. I have recently changed the conditions to try to reduce the incidence of this event. I don’t want to have to wait another four months to know if I have had an effect. If the event now occurs x times in y opportunities, how can I calculate whether the probability of the event has changed in a statistically meaningful way?

At the moment, x = 2 and y = 6,030. The event is occurring at only one third the previous rate, but the sample size is surely too small to be meaningful. But as the sample grows, if the rate stays at 1 in 3,000, then I can have increasing confidence in the reduction in rate. How can I calculate that confidence?

Many thanks.

I’m sure I’ll get corrected in a minute… but I used a chi-squared to compare the two samples.
http://people.ku.edu/~preacher/chisq/chisq.htm

p=.09, which isn’t good enough yet. The standard threshold is 5%: you need p=.05 or less.
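(Here is a minimal sketch of that same chi-squared comparison in Python, for anyone without the web page handy; this is my addition, and it assumes scipy is installed. Turning off Yates’ continuity correction reproduces the p ≈ 0.09 figure.)

    from scipy.stats import chi2_contingency

    # 2x2 table: [failures, non-failures] before and after the change
    before = [233, 227000 - 233]
    after = [2, 6030 - 2]

    chi2, p, dof, expected = chi2_contingency([before, after], correction=False)
    print(chi2, p)   # roughly chi2 = 2.8, p = 0.09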

If we can assume that

  • the occurrence or non-occurrence of an event on one trial has no effect on the occurrence or non-occurrence of events on later trials, and
  • nothing except the change that was made affects the probability of an event occurring,

then you’re drawing from a binomial distribution and the z-test can be applied. Here the test statistic is p1 - p2 and the standard error is sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2). With n1 = 227000, p1 = 233/n1, n2 = 6030 and p2 = 2/n2, I get z = 2.85, and the p-value for the test of H0: p1 = p2 versus Ha: p1 > p2 is about 0.002. In other words, you did something right.
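(For anyone who wants to check that arithmetic, here is a minimal sketch of the same two-proportion z-test in Python; the code is my addition, not ultrafilter’s, and it assumes scipy is available for the normal tail probability.)

    from math import sqrt
    from scipy.stats import norm

    # counts from the thread: 233 failures in 227,000 trials before the change,
    # 2 failures in 6,030 trials after it
    n1, x1 = 227000, 233
    n2, x2 = 6030, 2
    p1, p2 = x1 / n1, x2 / n2

    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = norm.sf(z)       # one-tailed test of Ha: p1 > p2
    print(z, p_value)          # roughly z = 2.85, p = 0.002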

The chi-square test has some issues when the frequencies being considered are small, so it really shouldn’t be used here.

No, I’m pretty sure you would need a Chi-Square test here. The differences in sample sizes require a degree-of-freedom parameter, which the Z-test lacks.

I’d type in the formulas, but don’t have the necessary math symbols in this editor! Do a quick search on Chi-square test tutorial and you’ll probably find it.

You’ll have an answer (yes the odds are the same, or no the odds are different) to some level of confidence that you chose…TRM

The two-proportion z-test for unequal variances (see here) does not require equal sample sizes. However, it does require that p2·n2 > 5, which we don’t have here, so it’s not applicable. But the chi-square test isn’t applicable either for the reasons I mentioned above, so it’s probably best to just use some kind of simulation-based test.
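(If anyone does go the simulation route, here is one minimal sketch of how it might look in Python; this is my own illustration, not a standard recipe. Under H0 both periods share one failure rate, so we estimate it from the pooled data, simulate the second period many times, and see how often it produces a count as low as the 2 observed.)

    import numpy as np

    rng = np.random.default_rng(0)

    n1, x1 = 227000, 233
    n2, x2 = 6030, 2

    # pooled failure rate under H0 (no change in the underlying probability)
    p_pooled = (x1 + x2) / (n1 + n2)

    # simulate the second period 100,000 times and count how often the
    # simulated number of failures is as low as the observed 2 or lower
    sims = rng.binomial(n2, p_pooled, size=100_000)
    p_value = np.mean(sims <= x2)
    print(p_value)   # roughly 0.06 with these numbers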

Thanks for your help, but even the explanations are beyond me. I only have four numbers here. I had hoped I could just plug them into an equation somewhere and out would pop a value for p.

I see there is a CHITEST function in Excel - I will play with that for a bit.

Vague memories from the AP Biology class I took my junior year of high school put me in the chi-square camp. You’d be taking the original results as your norm to be deviated from, and the test will tell you whether (a) any changes you’re seeing are due to random chance vs. some kind of outside influence and (b) whether your sample size is large enough for the results to be statistically useful. This was *counts on fingers* ten years ago, though, so I could be completely misremembering.

According to your OP, the four numbers you have to work with are 233, 227000, x and y, where currently x = 2 and y = 6030, but presumably we can let these grow a bit more.

If you want to use the z-test like ultrafilter is suggesting, wait until x is greater than 5, then go here.

On that page, plug in 227000 for the sample size of group 1, with frequency 233. Plug in y for the sample size of group 2, with frequency x.

Click “Calculate”, and read off the 1-Tail Actual Confidence Level.
Note however that I just found the above page on Google, so I can’t guarantee it’s correct.

Thank you - x will probably be greater than 5 in about a week, and I can certainly wait until then.

Since Amarone can reliably measure all events, can he or she not treat this as a census, rather than a sample, go to his or her boss and say “Dude, I fixed the problem and reduced it by X percent”, so he or she can collect his or her well deserved “Attaboy” or “Attagirl” as the case may be?

amarone can measure all the events that have happened so far.

But what if I (male) am the boss and am trying to ascertain if my employees have fixed/reduced the problem?

Whether you are boss or minion, you still need to take into account that statistical significance (p-values) only says something about the relation between a random sample and the population it was drawn from; to be exact, it refers to the likelihood that the difference found was caused by chance (meaning that it does not actually exist in the population at large). Since neither set of cases is a sample (let alone a random one), any statement concerning the significance of the difference found is essentially meaningless.

If you measure all the men and all the women in the world and find that the men are taller on average by .0000000000001 of an inch, then that difference is real. It may be small, but it is real, and there’s no point in qualifying it by saying it is or is not statistically significant.

As long as your sample for the second period is small, you can run the following test:

  • We can safely assume that the probability of occurrence (p0) in the first period is 233/227,000 ≈ 0.00103 (the confidence interval for this estimate is very small).
  • Assuming that the probability of occurrence in the second period is the same (this is the H0 hypothesis), the number of occurrences follows a Poisson distribution with parameter p0·n, where n is the number of trials in the second period (6,030 here); p0·n ≈ 6.19, which means you expect about 6.19 occurrences on average.

From this distribution, the probabilities of 0, 1, 2, 3 and 4 occurrences are about 0.002, 0.013, 0.040, 0.081 and 0.125 respectively. Summing the probabilities of 0, 1 and 2 occurrences gives a probability of about 0.054 that you obtain 2 occurrences or fewer. In other words, it is not so surprising that you obtained only 2 occurrences, so you can’t reject H0 (though it is a near thing at the usual 5% level). No significant reduction yet.
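(A quick sketch of that calculation in Python, so it can be rerun as the sample grows; this is my addition, assuming scipy for the Poisson CDF.)

    from scipy.stats import poisson

    p0 = 233 / 227000            # failure rate from the first period
    n, x = 6030, 2               # trials and failures in the second period

    lam = p0 * n                 # expected failures under H0, about 6.19
    p_value = poisson.cdf(x, lam)    # probability of x or fewer failures
    print(lam, p_value)              # roughly 6.19 and 0.054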

Indeed, you can’t apply a z-test yet: it would be incorrect because of the low number of occurrences (and it would actually have told you that you could reject H0). Like others before me, I would suggest that you roughly triple your sample size for the second period. You should be safe then. At that point, you could also compute the confidence interval for the second probability. You compute p = x/n (the number of occurrences in n trials). Then the confidence interval is p ± 1.96·sqrt(p(1-p)/n). You can check whether the original probability p0 is in this interval. If not, you can conclude that there was a significant reduction.
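(A sketch of that confidence-interval check, again my own code rather than anything from the thread; the numbers in the example call are hypothetical, and the normal approximation only makes sense once x is reasonably large, say above 5.)

    from math import sqrt

    p0 = 233 / 227000        # original failure rate

    def rate_changed(x, n, z=1.96):
        """Return True if p0 lies outside the approximate 95% CI for x/n."""
        p = x / n
        half_width = z * sqrt(p * (1 - p) / n)
        return not (p - half_width <= p0 <= p + half_width)

    # hypothetical future numbers: 6 failures in 25,000 operations
    print(rate_changed(6, 25000))   # True here, i.e. a significant reduction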

As I understood, amarone wants to use the events that have occurred to date as a sample of the total set of events that will occur in the future. This seems perfectly legitimate to me – like flipping a coin 20 times to see if (on future flips) it can be expected to come up heads as often as tails. (Yes, taking the first 20 flips isn’t a random sample of future events, but this doesn’t matter if we’re making the assumption that the events are independent of each other and the probability of each outcome is unchanged from flip to flip.)

Yes. I have to believe there is a legitimate way of determining the probability that the small second sample represents a material difference from the first set of data.

To elaborate on the circumstances:

  • a system typically performs 2,000 operations a day. Over a 4-month period, we were seeing a failure rate of about 2 per day. The numbers were as I gave before: in 227,000 operations we saw 233 failures. This was Jan 1 - April 30 this year.

  • we made a change to the system on May 1 to try to reduce the failure rate. The operations being performed are still the same, at about the same rate. In the first four days of May we had a failure rate of 2 in 6,030 operations. Adding in yesterday’s results, those numbers are now 2 and 9,570.

I am sure there must be a statistical way of determining the confidence level that we have improved the system.

Some posters have responded that I need to have x > 5 before I can run tests (“x” is currently 2). That does not seem right. If we continue to run the system and get only 2 failures in the next 200,000 operations, then clearly we have improved the system dramatically. If the test for that does not work because x is still not > 5, then I posit that this must be the wrong test.

I was actually having a similar conversation with a professor last week, and he pointed out that using a cutoff like this would bias the sample, since you’re stopping exactly after one of the events of interest. It’d probably be better, if you want to use a criterion like this, to wait until you get 5, then figure the average interval between events and run that many more trials.

The original criterion was actually that you take enough trials that your original error rate would give you an expectation of 5 errors. If in your original data you got 50 errors in 1000 runs, then you would expect 5 errors in 100 runs. If you then make a change, and do 100 runs after the change with no errors, then you probably can confidently conclude that you’ve reduced the error rate.
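(For the numbers in this thread, that rule of thumb works out to only a few thousand operations; the two-line check below is my own arithmetic, not from the thread.)

    from math import ceil

    p0 = 233 / 227000
    print(ceil(5 / p0))   # 4872 operations for an expectation of 5 failures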

That makes more sense - and I am already there. With over 9,000 operations now behind us, I would have expected 9 failures by now, and we have had 2. I will go follow the link.

Confidence level = 98.8%

Good point, I hadn’t thought of that.