Help with Statistical Significance

Chessic_Sense · June 30, 2011, 4:13pm

I’m trying to see if a certain process if effective in changing outcomes or if it’s failing to effect a desired change.

When the process does not occur, I score 17 out of 39, or 43%.
When the process is occurring, it results in 328 out or 807, or 40%.

I want to know if this difference is statistically significant. It’s been so long since I took Stat 101 that I don’t remember what I’m supposed to do. I know I need a null hypothesis (that the process doesn’t work) and that I need to do calculate a p-value, but I don’t remember how it works when I’m not sure what the “real” distibution should be. That is, since my null sample is so low (n=39), I can’t be sure that 43% is a valid estimation of the null effect.

I vaguely recall having to combine the two samples to get an average and assuming that’s more accurate, but I don’t remember what the rules are for that procedure. I can’t see how that wouldn’t result in the latter sample dominating the average. With that technique, I’d always come out with a result that closely reflected the latter average.

So how do I deal with this possible sample error in the “base” probability? Do I just run with it? Do I add my results together and then test the 40% result against it?

Paging statsman…

Swords_to_Plowshares · June 30, 2011, 4:18pm

What’s your alpha? How sure do you want to be?

Chessic_Sense · June 30, 2011, 4:22pm

Can’t I decide that after the calculation? The difference between 90% or 95% CI isn’t really important. Is there somewhere in the calculation that you plug this in, and thus need to know it now? If so, let’s go with .1.

UTejas · June 30, 2011, 4:42pm

It sounds like you are looking for a Binomial Test.

Using scipy.stats in Python to run the binomial test, I entered:



import scipy
from scipy import stats
scipy.stats.binom_test(17.,39.,328./807.)

The test is for 17 successes out of 39 where the expected p for success is (328/807).

The returned value is 0.745, so you cannot reject the null hypothesis at 90% confidence.

ultrafilter · June 30, 2011, 4:47pm

You don’t need to specify alpha before you do the calculations, but you do need the value to decide whether your results are significant. If you don’t have any strong reason not to use it, alpha = .05 is fine for most tests.

This page describes the test for a difference in proportions quite nicely. In addition to a null hypothesis, you need an alternative hypothesis. After that, it’s a question of plugging stuff into formulas.

One important caveat is that the test is only strictly valid if you pick your alternative hypothesis before you see any data. Otherwise, you’re using the same data to generate a hypothesis and to confirm it, which leads to bad juju. It’s probably not so horrible here, but you can get in some real trouble doing that sort of thing in more complicated scenarios.

cjepson · June 30, 2011, 4:56pm

Here’sa good place to do such calculations. In your example, “Group 1” would be the observations during which the process occurs and “Group 2” would be the observations during which it does not occur. “Outcome 1” is success and “Outcome 2” is failure. Enter the appropriate numbers of observations in the four cells (328, 479, 22, and 17, starting from the top left cell and going clockwise). If you choose Fisher’s Exact Test, you get the same p-value that UTejas got (nowhere near significant).

This assumes, by the way, that the observations are independent of one another.

Chessic_Sense · June 30, 2011, 5:14pm

But that result is backwards, isn’t it? The expected p isn’t 328/807; it should be 17/39. The 17 is for a control group. The 328 is the experimental group. If you reverse the numbers, then you get .047. That is, there’s a 4.7% chance that we’d only get 323 successes out of 807 trials.

But the real difference is whether expected p is .40 or .43. That makes all the difference, and that’s where I’m lost. Can you explain why you said the expected p is 328/807?

Swords_to_Plowshares · June 30, 2011, 5:17pm

The way I did it was to treat the 2 different proportions as p1 and p2 (so .43 and .4) and see if the difference between them was statistically significant.

Null hypothesis: p1 - p2 = 0

Alternate hypothesis: p1 - p2 > 0

Test statistic: (.43-.4)/(sqrt(.43*.4*(1/39+1/807))) = .441

z value for 90% confidence: 1.39

Since the test statistic is less than the z value we fail to reject the null hypothesis.

Buck_Godot · June 30, 2011, 5:50pm

Actually neither is exactly right. Both the 17/39 and the 328/807 are estimates of the true control probability and process probability respectively. Both of these estimates are likely to include some error, so treating either as a true fixed value is incorrect. So the question is whether the observed difference is larger than the difference of the two errors. That said, given the large numbers of samples in the process set, it is not likely to be very erroneous, so you could probably get away with treating it like as constant as UTejas did.

Plowshares method is a little bit better, but assumes normality. Still since your sample size is large enough the binomial probabilities will approach a normal distribution.

The Fisher exact test proposed by cjepson is probably the correct way to go as it makes no distributional approximations.

There is also the issue of whether you want to use a 1-sided or 2-sided test. This depends on what you conclude if the process probability was substantially larger than the control. But this is moot since it appears that even with a 1-sided test and quite large alpha you have no significance.

UTejas · June 30, 2011, 5:55pm

Yes, I am sorry for playing a bit fast and loose; I should have been more clear. The n=807 value is large enough that the problem is essentially reduced to that of a binomial distribution.

For smaller sample sizes, you would want Fisher’s Exact Test.

Swords_to_Plowshares · June 30, 2011, 6:24pm

Buck Godot is right. I probably should not have assumed normality. Still, no matter what test you do I am pretty sure the null hypothesis is always going to be upheld because the margin is just so large.

Chessic_Sense · June 30, 2011, 7:26pm

I don’t understand. Suppose there were a statue in town that people say is a blessing. While the statue has been there, 43% of the 39 babies that have been born have had a spectacular singing voice as adults. A traveler comes along, says “I don’t believe in your witchcraft” and destroys the statue. After 807 babies are born, only 40% have a beautiful voice.

Now let’s suppose that the statue really is a blessing. The true probability really was .43 and now it really is .40. According to the math, no statistician would ever be able to prove it, because no matter how many babies are born from here on out, it just strengthens the case that the statue was NOT a blessing, because it keeps making it seem like the “true mean” was .40 all along.

In other words, if my process really does improve things (i.e. lowers the bad from .43 to .4) then as time goes on, it’s only going to strengthen your argument that the process does nothing. If my process really doesn’t improve things, then we’d expect to see the rate creep back up to .43. How can that be? Is there really no way, without eliminating the process, that we can prove it’s effective? With the statue destroyed, are we doomed to “never know for sure”?

UTejas · June 30, 2011, 7:43pm

Chessic_Sense:

I don’t understand. Suppose there were a statue in town that people say is a blessing. While the statue has been there, 43% of the 39 babies that have been born have had a spectacular singing voice as adults. A traveler comes along, says “I don’t believe in your witchcraft” and destroys the statue. After 807 babies are born, only 40% have a beautiful voice.

Now let’s suppose that the statue really is a blessing. The true probability really was .43 and now it really is .40. According to the math, no statistician would ever be able to prove it, because no matter how many babies are born from here on out, it just strengthens the case that the statue was NOT a blessing, because it keeps making it seem like the “true mean” was .40 all along.

In other words, if my process really does improve things (i.e. lowers the bad from .43 to .4) then as time goes on, it’s only going to strengthen your argument that the process does nothing. If my process really doesn’t improve things, then we’d expect to see the rate creep back up to .43. How can that be? Is there really no way, without eliminating the process, that we can prove it’s effective? With the statue destroyed, are we doomed to “never know for sure”?

The problem is that 39 samples is really not all that big. You really have two categories, 17 in one and 22 in the other. You are trying to resolve a difference that is, quite frankly, fairly small (3%). There is no way to be sure unless you’re willing to state that the 43% is absolutely correct and beyond reproach–that is to say, it is the true value. Then, as you note, with 807 samples that come back as 40%, you can be 90% confident that something has changed.

Failing that type of assumption (a very poor one indeed), you are comparing a small difference in two small samples. Just from eyeballing it, it seems quite unlikely that any difference would be statistically significant. After only 39 babies, no statistician could be confident that the true percentage is 43%. Due to the discrete nature of your data, even a single different baby–16/39 or 18/39–would be a change of 2.3%!

nivlac · June 30, 2011, 7:57pm

Of course you did assume normality. What do you think a z-value represents?
As stated by the OP, if you make the necessary assumptions about the observations being from a random sample, it is a simple t-test for the difference between two proportions. Since the question was about statistical significance in the difference, it’s a two-sided test (and you use a pooled proportion in the standard error calculation). The t-statistic is 0.37 which has a p-value of over 70%. If the interest was whether there was a statistically significant drop in the proportion, then it’s a one-sided test and the p-value is still over 35%. Those p-values are much too high to conclude that there was any statistically significant difference in the proportions. So, no, your process did not lower the proportion.

cjepson · June 30, 2011, 8:20pm

Chessic_Sense:

I don’t understand. Suppose there were a statue in town that people say is a blessing. While the statue has been there, 43% of the 39 babies that have been born have had a spectacular singing voice as adults. A traveler comes along, says “I don’t believe in your witchcraft” and destroys the statue. After 807 babies are born, only 40% have a beautiful voice.

Now let’s suppose that the statue really is a blessing. The true probability really was .43 and now it really is .40. According to the math, no statistician would ever be able to prove it, because no matter how many babies are born from here on out, it just strengthens the case that the statue was NOT a blessing, because it keeps making it seem like the “true mean” was .40 all along.

In other words, if my process really does improve things (i.e. lowers the bad from .43 to .4) then as time goes on, it’s only going to strengthen your argument that the process does nothing. If my process really doesn’t improve things, then we’d expect to see the rate creep back up to .43. How can that be? Is there really no way, without eliminating the process, that we can prove it’s effective? With the statue destroyed, are we doomed to “never know for sure”?

Keep in mind that in traditional tests of significance, the criterion is a p-value of .05. That means that we are not going to declare the null hypothesis (i.e., the hypothesis that the statue is not a blessing) unsupported unless there is a 5% or less chance that we would have seen as much of a difference as we did, if that hypothesis were true. That’s a pretty stringent criterion, which means that even if you have a “real” effect, it may not be found statistically significant. The reason for this is that the purpose of statistical testing is to distinguish real effects from apparent effects, and if your effect size is small (e.g., 40% vs. 43%), you need strong evidence to support the claim that it’s real and not due to chance. If you’re comparing proportions as in this example, that basically means you need a huge sample size… which means a lot more than 39 babies in the “statue” condition.

So, as long as we’re using traditional significance testing, you’re pretty much right… if we’re not able to collect any more data from the “statue” condition than those 39 babies, and the true difference is only 3 percentage points, then it’s going to be pretty hard to “prove” that the difference wasn’t due to chance, or to some other factor besides the statue.

ultrafilter · June 30, 2011, 8:29pm

Everything up to here is correct.

But this is dead wrong. The change of process did not make a statistically significant difference in the outcomes, but that is most emphatically not the same as no difference.

OP, I assume that part of the rationale for a new process was to save money. If you take the total amount you’ve spent on the 807 new outcomes with a 40% mean, and recalculate what you would’ve spent with a 43% mean, you may be able to argue that it’s a good change that way.

nivlac · July 18, 2011, 6:36pm

That’s just a nitpick. If you read my whole post, when I said “your process did not lower the proportion”, I obviously meant “your process did not lower the proportion a statistically significant amount”. And I don’t agree with the suggestion that the OP argue for spending money on a statistically insignificant lowering of the proportion because anyone astute in management can argue that there’s not enough confidence that the change will hold up if you go forward with the new process. I’ve worked on decisions of this sort throughout my career. To make a scientific conclusion based on data and then to go against that conclusion makes no sense.

JWT_Kottekoe · July 18, 2011, 10:48pm

The proper way to do this calculation is given above, but it is not necessary, since the result is insignificant by inspection. Under standard assumptions, the random error in 39 trials is about sqrt(39) or about 16%. You are talking about a 3% change!

Uncertain · July 19, 2011, 2:26am

There is not good evidence that the process has changed anything. Indeed, you can’t rule out, with any confidence, that it has raised the proportion. You can get a confidence interval on the odds ratio though.

Topic		Replies	Views
A quick stats computation Factual Questions	7	651	November 13, 2008
Tough Math and Statistics question Factual Questions	6	950	February 24, 2004
Stats: Correct Use of Binomial Distribution Factual Questions	8	1621	May 4, 2016
Stats question: t-tests and p-values. Factual Questions	10	4315	December 22, 2010
stat. significance test possible? Factual Questions	16	889	February 17, 2003

Help with Statistical Significance

Related topics