Statistical Method

What would be the best statistical method to compare the difference of means between two unpaired groups with different N whose data consists only of 100 or 0? I tried googling for the answer but I can’t find it (or more likely, my ignorance runs so deep that I can’t recognize the answer when I see it). Hope somebody can help me.

I think this would be the version of Student’s T-test that is for unpaired data.

Will it matter, though, if the data consists of only 100s and 0s? I’ve only used the t-test on data that is more variable. I’m wondering if a non-parametric test would be more appropriate and if it is, which one?

BTW, thanks for replying.

As far as I know, the t test assumes that the populations are normally distributed, so it may not be appropriate here. I’m not familiar with any other tests that might be.

Yes, you need to be able to assume normal distribution to use a t test. I think a non-parametric equivalent would be a Mann-Whitney U… but I could easily be wrong.

“Best” is not clearly defined. By convention, the word “best” in statistics means minimum variance, as in best linear unbiased estimator. But you need to be more specific. A constant has a variance of zero, so no estimator can beat a constant on the “best” criterion alone. It’s the unbiasedness that makes it lose. That is, what is the best estimate of the mean of a population with a sample of 1 2 5? Well, you can’t beat 10, since you know 10 = 10 for sure, so it has no variance. OTOH, it’s not likely a very good guess in this case. The best linear unbiased estimator is 8/3 in this case, the sample mean.

In any case, your data is one of two numbers. If you assume that each data point is independent of the others (and you have to assume so or tell us how they aren’t), then one parameter characterizes each population – the probability p that you get 100 (vs 0). The mean value of the population would be 100p (= 100*p + 0*(1-p)), so it conveys the same information as p.
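To make that concrete, here is a small sketch with made-up scores (the list below is hypothetical, not the poster's data). It only checks that the sample mean of 0/100 data is 100 times the share of 100s.

```python
from statistics import mean

# Hypothetical audit scores: every value is either 100 or 0.
scores = [100, 0, 100, 100, 0, 100, 100, 100, 0, 100]

# Share of 100s in the sample (the estimate of p).
p_hat = scores.count(100) / len(scores)

# Ordinary sample mean of the raw scores.
sample_mean = mean(scores)

print("share of 100s:", p_hat)
print("sample mean:  ", sample_mean)
print("100 * share:  ", 100 * p_hat)
```

So comparing means and comparing proportions are the same question on a different scale.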

You’re asking then: Is p for one population different from (call it) q for the other? So what you want to look up is sampling from a binomial distribution.

Suppose you observe n out of N 100s in one population and m out of M in the other. Your point estimates are phat = n/N and qhat = m/M. The estimated variances of these estimates are var(phat) = phat*(1-phat)/N and var(qhat) = qhat*(1-qhat)/M. If N and M are large enough, then you can rely on the central limit theorem (the large-sample result that matters here) to say that phat and qhat are approximately normal, and use a standard Z test on their difference: Z = (phat - qhat) / sqrt(var(phat) + var(qhat)).
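A minimal sketch of that Z test, using only the Python standard library. The function name and the counts (120 of 150 and 210 of 300) are made up for illustration, and it uses the unpooled variances given in the post above rather than a pooled estimate.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(n, N, m, M):
    """Z statistic and two-sided p-value for phat - qhat (unpooled variances)."""
    p_hat, q_hat = n / N, m / M
    # Standard error of the difference: sqrt(var(phat) + var(qhat)).
    # This is zero if both samples are all 0s or all 100s; not handled here.
    se = sqrt(p_hat * (1 - p_hat) / N + q_hat * (1 - q_hat) / M)
    z = (p_hat - q_hat) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 120 of 150 audits scored 100 in one group, 210 of 300 in the other.
z, p_value = two_proportion_z(n=120, N=150, m=210, M=300)
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```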

If N and M are not quite so large, you can get away with using a t-test. I’m not positive there is a theoretical justification for this, the way the central limit theorem justifies the Z test, though. Look up testing the difference of means of two samples of unequal size.

Thanks for the answer, OldGuy. In other words, what you’re saying is, if N and M are large enough, I should use the Z test, and if they’re smaller, the t-test would suffice, correct? N and M range between 150 and 300, so can I use the Z test then?

It looks like the two-sample unpooled t-test (formulae here) might be most appropriate given the information you have.
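For what it's worth, the unpooled (Welch) t statistic is easy to compute on the raw 0/100 scores. This is only a sketch with hypothetical data (120 of 150 and 210 of 300 scoring 100), and the helper name is made up. The standard library has no t distribution, so it stops at the statistic and the Welch-Satterthwaite degrees of freedom, which you would then take to a t table or a stats package.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Unpooled two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of each sample mean (sample variance uses N - 1).
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical raw scores: each audit is either 100 or 0.
group1 = [100] * 120 + [0] * 30
group2 = [100] * 210 + [0] * 90

t, df = welch_t(group1, group2)
print(f"t = {t:.3f}, df = {df:.1f}")
```

With degrees of freedom in the hundreds, the t critical values are very close to the normal ones.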

If N & M are large enough, then a normal or Z test is theoretically justified. If they aren’t quite so large, many people would use a t-test and not bat an eye, though I’m not sure this can be theoretically justified. As for which to use: use the t-test; if N and M are large enough, it converges to the Z test in any case. Use the Z test (by necessity) if your t-test tables don’t have entries for that large a number of degrees of freedom.

Now a real proper (Bayesian) test would ask the question Prob{p > q | N, n, M, m}, but to do this you would need a prior joint distribution for p and q.
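Here is one way that could look, purely as a sketch. It assumes independent uniform Beta(1, 1) priors on p and q, which is my assumption and not something stated in the thread; under that prior the posteriors are Beta(n + 1, N - n + 1) and Beta(m + 1, M - m + 1), and Prob{p > q} can be estimated by simulation. The counts and the function name are hypothetical.

```python
import random

def prob_p_greater_q(n, N, m, M, draws=50_000, seed=0):
    """Monte Carlo estimate of Prob{p > q | N, n, M, m}.

    ASSUMPTION: independent uniform Beta(1, 1) priors on p and q.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(draws):
        p = rng.betavariate(n + 1, N - n + 1)  # draw from the posterior of p
        q = rng.betavariate(m + 1, M - m + 1)  # draw from the posterior of q
        hits += p > q
    return hits / draws

# Hypothetical counts: 120 of 150 vs 210 of 300.
prob = prob_p_greater_q(n=120, N=150, m=210, M=300)
print(f"Estimated Prob(p > q) = {prob:.3f}")
```

A different prior would give a different answer, which is exactly the point of having to specify one.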

0 or 100 translates to binary (dichotomous) data, which is a kind of nominal data… since you have independent groups (group 1 and group 2) and a choice of 2 responses. Based on your description of the data, I recommend a Chi-Square test. FWIW, I teach undergraduate and graduate statistics.

If I were you, I’d forget about means. Just convert the data to proportions and use a chi-square test.
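A rough sketch of the chi-square test on the 2x2 table of counts, standard library only. The counts and function name are hypothetical, and no continuity correction is applied. With 1 degree of freedom the upper-tail probability can be written with the complementary error function, so no tables are needed.

```python
from math import erfc, sqrt

def chi_square_2x2(n, N, m, M):
    """Pearson chi-square test (no continuity correction) for two groups x {100, 0}."""
    observed = [[n, N - n], [m, M - m]]
    row_totals = [N, M]
    col_totals = [n + m, (N - n) + (M - m)]
    total = N + M
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if both groups share the same proportion of 100s.
            # Fails with a zero division if a whole column is empty.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 120 of 150 scored 100 in group 1, 210 of 300 in group 2.
chi2, p_value = chi_square_2x2(n=120, N=150, m=210, M=300)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

On a 2x2 table this statistic is the square of a two-proportion Z statistic computed with a pooled variance, so the two approaches are asking the same question.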

The data comes from random audits from the population from two separate time periods. (say: January scores - Group 1, February Scores - Group 2) Would chi-square be applicable in this instance?

Yes.

The null hypothesis will be that there is no difference in the proportion of 100’s between audit results in January and audit results in Feb.

A two-sided alternative hypothesis will be that there is a difference between them (i.e. a disproportionate number of 100’s in one group versus another).

Thanks to everyone…this is the reason I love this message board so much :smiley: