Statistical Method

Anduril · October 16, 2006, 3:05pm

What would be the best statistical method to compare the difference of means between two unpaired groups with different N whose data consists only of 100 or 0? I tried googling for the answer but I can’t find the answer (or more likely, my ignorance truly runs deep that I can’t recognize the answer when I see it). Hope somebody can help me.

Napier · October 16, 2006, 3:50pm

I think this would be the version of Student’s T-test that is for unpaired data.

Anduril · October 16, 2006, 4:58pm

Will it matter, though, if the data consists of only 100s and 0s? I’ve only used the t-test on data that is more variable. I’m wondering if a non-parametric test would be more appropriate and if it is, which one?

Anduril · October 16, 2006, 5:00pm

BTW, thanks for replying.

ultrafilter · October 16, 2006, 5:04pm

As far as I know, the t test assumes that the populations are normally distributed, so it may not be appropriate here. I’m not familiar with any other tests that might be.

Revenant_Threshold · October 16, 2006, 5:17pm

Yes, you need to be able to assume normal distribution to use a t test. I think a non-parametric equivalent would be a Mann-Whitney U… but I could easily be wrong.

OldGuy · October 16, 2006, 5:52pm

“Best” is not clearly defined. By convention, the word “best” in statistics means minimum variance, as in best linear unbiased estimator. But you need to be more specific. A constant has a variance of zero, so any constant cannot be beaten on the “best” criterion alone. It’s the unbiasedness that makes it lose. That is, what is the best estimate of the mean of a population with a sample of 1, 2, 5? Well, you can’t beat 10, since you know 10 = 10 for sure, so it has no variance. OTOH, it’s not likely a very good guess in this case. The best linear unbiased estimator is 8/3 in this case, the sample mean.

In any case, your data is one of two numbers. If you assume that each data point is independent of the others (and you have to assume so or tell us how they aren’t), then one parameter characterizes each population: the probability p that you get 100 (vs 0). The mean value of the population would be 100p (= 100*p + 0*(1-p)), so it conveys the same information as p.
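As a quick check of that point, here is a minimal sketch showing that the sample mean of 0/100 data is just 100 times the sample proportion of 100s. The scores are made up for illustration; they are not Anduril's data.

```python
from statistics import mean

# Hypothetical audit scores: each one is either 100 or 0.
scores = [100, 0, 100, 100, 0, 100, 0, 100]

p_hat = scores.count(100) / len(scores)  # sample proportion of 100s
sample_mean = mean(scores)               # ordinary sample mean

print(p_hat, sample_mean)
print(sample_mean == 100 * p_hat)  # the mean carries the same information as p_hat
```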

You’re asking then: Is p for one population different from (call it) q for the other? So what you want to look up is sampling from a binomial distribution.

Suppose you observe n 100s out of N in one population and m out of M in the other. Your point estimates are phat = n/N and qhat = m/M. The estimated variances of these estimates are var(phat) = phat*(1-phat)/N and var(qhat) = qhat*(1-qhat)/M. If N and M are large enough, then you can rely on the central limit theorem to say that phat and qhat are approximately normal, and use a standard Z test on their difference.

If N and M are not quite so large, you can get away with using a t-test. I’m not positive there is a theoretical justification for this like the central limit theorem, though. Look up testing the difference of means of two samples of unequal size.
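The Z test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `two_proportion_z` and the counts (120 of 200, 150 of 300) are hypothetical, and it uses the unpooled variances phat*(1-phat)/N and qhat*(1-qhat)/M exactly as written in the post.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(n, N, m, M):
    """Two-sided Z test for the difference phat - qhat (unpooled variances)."""
    phat, qhat = n / N, m / M
    # Standard error of phat - qhat; this is zero (and the test undefined)
    # if both samples are all 100s or all 0s.
    se = sqrt(phat * (1 - phat) / N + qhat * (1 - qhat) / M)
    z = (phat - qhat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 120 of 200 audits scored 100 in one group, 150 of 300 in the other.
z, p_value = two_proportion_z(n=120, N=200, m=150, M=300)
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

Many textbooks pool the two samples to estimate a common p under the null hypothesis instead; with samples this size the two versions usually give very similar answers.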

Anduril · October 16, 2006, 6:23pm

Thanks for the answer, OldGuy. In other words, what you’re saying is, if N and M are large enough, I should use Z test and if it’s small enough, t-test would suffice, correct? N or M ranges between 150 - 300, so can I use the Z-test then?

ultrafilter · October 16, 2006, 7:35pm

It looks like the two-sample unpooled t-test (formulae here) might be most appropriate given the information you have.

OldGuy · October 16, 2006, 7:41pm

If N and M are large enough, then a normal or Z test is theoretically justified. If they aren’t quite so large, many people would use a t-test and not bat an eye, though I’m not sure this can be theoretically justified. As for which to use, use the t-test; if N and M are large enough it converges to the Z test in any case. Use the Z test (by necessity) if your t-test tables don’t have entries for that large a number of degrees of freedom.
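To see how close the two statistics are at these sample sizes, here is a rough sketch that computes the unpooled (Welch) t statistic on raw 0/100 scores and compares it with the Z statistic built from the binomial variance formulas. The data and the helper name `welch_t` are hypothetical; only the statistic and degrees of freedom are computed, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Unpooled (Welch) two-sample t statistic and approximate degrees of freedom."""
    vx = variance(x) / len(x)  # sample variance (N-1 denominator) divided by N
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical audit data: 120 of 200 scores are 100 in one month, 150 of 300 in the other.
group1 = [100] * 120 + [0] * 80
group2 = [100] * 150 + [0] * 150

t, df = welch_t(group1, group2)

# Z statistic from the binomial variance formulas, on the proportion scale.
phat, qhat = 120 / 200, 150 / 300
z = (phat - qhat) / sqrt(phat * (1 - phat) / 200 + qhat * (1 - qhat) / 300)

print(f"t = {t:.4f} (df about {df:.0f}), z = {z:.4f}")
```

The t statistic is slightly smaller than z only because the sample variance divides by N-1 rather than N, and with several hundred degrees of freedom the t and normal reference distributions are nearly indistinguishable.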

Now a real proper (Bayesian) test would ask the question Prob{p > q | N, n, M, m}, but to do this you would need a prior joint distribution for p and q.
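One way to make that concrete is a small Monte Carlo sketch. It assumes independent uniform Beta(1, 1) priors on p and q, which is a choice made here for illustration and not something specified in the thread; with that prior, the posteriors are Beta(1 + n, 1 + N - n) and Beta(1 + m, 1 + M - m). The function name and the counts are hypothetical.

```python
import random

def prob_p_greater_q(n, N, m, M, draws=50_000, seed=0):
    """Estimate Prob{p > q | N, n, M, m} by sampling from the two posteriors.

    ASSUMPTION: independent uniform Beta(1, 1) priors on p and q.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        p = rng.betavariate(1 + n, 1 + N - n)  # posterior draw for p
        q = rng.betavariate(1 + m, 1 + M - m)  # posterior draw for q
        hits += p > q
    return hits / draws

# Hypothetical counts: 120 of 200 in one group, 150 of 300 in the other.
posterior_prob = prob_p_greater_q(n=120, N=200, m=150, M=300)
print(f"Prob(p > q | data) is roughly {posterior_prob:.3f}")
```

A different prior would give a different answer, which is exactly the dependence the post is pointing at.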

phungi · October 16, 2006, 7:50pm

0 or 100 translates to binary (dichotomous) data, which is a kind of nominal data: you have two independent groups (group 1 and group 2) and a choice of 2 responses. Based on your description of the data, I recommend a Chi-Square test. FWIW, I teach undergraduate and graduate statistics.

you_with_the_face · October 16, 2006, 8:03pm

If I were you, I’d forget about means. Just convert the data to proportions and use a chi-square test.
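For reference, a chi-square test on the resulting 2x2 table (group by score) can be sketched with the standard library alone. The counts and the function name `chi_square_2x2` are hypothetical, and this is the plain Pearson statistic without a continuity correction. The p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal.

```python
from math import erfc, sqrt

def chi_square_2x2(n, N, m, M):
    """Pearson chi-square test of homogeneity for two proportions (1 df, no correction)."""
    observed = [[n, N - n], [m, M - m]]   # rows: groups; columns: 100s, 0s
    row_totals = [N, M]
    col_totals = [n + m, (N - n) + (M - m)]
    total = N + M
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null; zero (test undefined) if a column is empty.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 df: P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 120 of 200 audits scored 100 in one group, 150 of 300 in the other.
chi2, p_value = chi_square_2x2(n=120, N=200, m=150, M=300)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

For a 2x2 table this statistic equals the square of the pooled two-proportion Z statistic, so the chi-square test and a two-sided Z test on the proportions are the same test in different clothing.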

Anduril · October 16, 2006, 8:13pm

The data comes from random audits from the population from two separate time periods. (say: January scores - Group 1, February Scores - Group 2) Would chi-square be applicable in this instance?

you_with_the_face · October 16, 2006, 8:17pm

Yes.

The null hypothesis will be that there is no difference between the proportion of 100s in the January audits and the proportion of 100s in the February audits.

A two-sided alternative hypothesis will be that there is a difference between them (i.e., a disproportionate number of 100s in one group versus the other).

Anduril · October 16, 2006, 8:55pm

Thanks to everyone…this is the reason I love this message board so much
