I have a population of people and they have two discrete characteristics. Let’s say they’re either Virginian or not, and either tall or not tall. I need to know if tall people should validly be suspected of being Virginians.
So we’ve taken a sample. Unfortunately, it’s not random: subjects volunteered. Many were Virginians. We know that there’s a correlation between Virginians and tall people.
But is it significant or not? I vaguely recall doing a binomial test, but I’m not sure if that’s called for here. I also remember something about putting the numbers in a 2x2 square and doing some math to figure it out. I can’t google it because I don’t know what it’s called.
Note: We’re not really interested in Virginians as a whole…just those that volunteer for these sorts of things. So we’re cool with our self-selected sample, right? I mean, we only care about predicting the characteristics of future volunteers.
A binomial test is used to determine whether a given population deviates from a known probability. Say, determining whether a coin is fair by seeing if the proportion of heads is 50%. In your case you have no known probability but want to see if tallness is significantly more common in one group than in another. For this, the two recommended tests are the Chi-squared test and the Fisher’s exact test. In general the Fisher’s exact test is better, but the Chi-squared test is easier to calculate if you have a large sample size. Here is an online calculator for both tests.
ETA: These tests are conditioned on the number of cases in each of the two categories, so even if you are biased for Virginians or for tall people (say your survey comes from a big and tall store in Richmond) the test will work, provided you aren’t biased for tall (or short) Virginians.
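For anyone who would rather compute it than use a calculator, here is a rough sketch of both tests on a 2x2 table, using only the Python standard library. The function names and the counts are made up for illustration and are not from the survey discussed here. The Chi-squared version is the plain Pearson statistic with no continuity correction, and it assumes none of the row or column totals is zero.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Add up every table that is at least as unlikely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction), 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail p-value is erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


# Made-up counts, purely for illustration:
#                 tall   not tall
# Virginian         12      588
# non-Virginian      2      398
p_fisher = fisher_exact_two_sided(12, 588, 2, 398)
stat, p_chi2 = chi_squared_2x2(12, 588, 2, 398)
print(f"Fisher p = {p_fisher:.4f}; chi-squared = {stat:.2f}, p = {p_chi2:.4f}")
```

With so few tall people in the table, the Chi-squared approximation is the shakier of the two, which is one reason to prefer the Fisher result when counts are small.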
We’re skewed, but we’re not sure if we’re biased. We’re definitely biased for Virginians, but we don’t know about tall people. The bar is set pretty high. Only about 1% of people are “tall” by our metric. Is that a problem?
Cerberus - While we’re examining a self-selected group of Americans, we’re only interested in survey respondents. We don’t care about everybody else. We don’t care if Virginians are tall, just if our survey respondents are (and thus, will be) tall Virginians. So in that sense, I think we’re actually sampling 100% of the defined population. Right?
Let’s say you define some cut-off height as tall or not tall. This could be the mean height of some larger population from which the Virginians (Vs) were selected. From the V sample, count how many are tall based on your pre-defined tall definition. You can then calculate the probability that the observed number of tall Vs is as large as it is under the null hypothesis that the Vs are no different (or no taller, for a one-sided test) than the general population. This p-value can be calculated using the binomial distribution. Then you can either reject or not reject the null hypothesis based on the p-value (reject if the p-value is smaller than your significance level, etc.). That’s one way to do it.
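As a rough sketch of that calculation: the one-sided p-value is the upper tail of a binomial distribution. The function names and the numbers below are hypothetical, and the sketch assumes the population rate p0 is known exactly and lies strictly between 0 and 1.

```python
from math import exp, lgamma, log


def binom_pmf(i, n, p0):
    """P(X = i) for X ~ Binomial(n, p0), computed in log space to avoid overflow."""
    log_coef = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_coef + i * log(p0) + (n - i) * log(1 - p0))


def binomial_upper_tail(k, n, p0):
    """One-sided p-value: P(X >= k) when each of n people is tall with probability p0."""
    return sum(binom_pmf(i, n, p0) for i in range(k, n + 1))


# Hypothetical numbers: 12 tall people among 600 Virginians,
# with an assumed population rate of 1% tall.
p_value = binomial_upper_tail(12, 600, 0.01)
print(f"one-sided p-value = {p_value:.4f}")
```

Reject the null hypothesis if the printed p-value is below whatever significance level you picked beforehand.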
None of this should be a problem from a theoretical point of view, although with only 1% of cases being tall, you will need either a large sample size or a very large effect in order to detect a difference.
In general if you have N total samples, and p < 50% are Tall and q < 50% are non-Virginian, then the size of the effect you are able to detect will be more or less proportional to pqN.
This would require assuming that your estimate for that percentage in the null population was exactly correct. Since you are likely estimating it from the finite set of non-Vs in your sample, it will not be completely accurate, and you need to take this extra variability into account in your test. To do it right you really need to use either the Chi-squared or Fisher test.
No. The original question was about the relationship of two populations: Tall People and Virginians. The leap you want to take is to address the question with a sample.
Random samples allow for the incorporation of sampling variation; non-random sampling does not.
You can certainly address the issue in terms of the single sample used, but the answer only applies to that sample.