The question the OP asked was what Joe and Mary each knew about their own “population,” not about the over-all population. Perhaps he should have said their samples and talked about Joe’s and Mary’s sub-samples from them. But leaving terminology aside, the OP asked what Joe knows about the proportion in his set of 1000 vs. what Mary knows about her set of only 50. Unless both sets come from the same population, I don’t think the question really makes sense.
I agree with you that Joe and Mary have the same information about the over-all population since they have samples of the same size. But if they are Bayesian and have a prior about the population, we know that in Joe’s larger set, the mean value is likely closer to the true mean than in Mary’s. If they draw no samples from their sets, Joe’s ex ante estimate of 10% is more accurate than Mary’s ex ante estimate of 10%. By the time they draw 50 from their sets, Mary knows exactly what her set’s mean is while Joe is still unsure. So somewhere along the line, as they draw more and more, Mary’s estimate of her set’s proportion becomes the more accurate one, but that happens after 30.
Joe has a population of 1000 individuals. Mary has a different population of 50 individuals. Computing the standard errors of their sample means and deciding whose is bigger is a textbook frequentist statistics problem that most students in an introductory class for non-majors would be capable of solving.
The fundamental issue here is that there are two different problems that you seem to be confusing:
[ol]
[li]Taking two independent samples of size 30 from two distinct populations and making an inference about the means of those populations.[/li][li]Taking two subsamples of size 30 from samples drawn from a large population and making an inference about the overall population mean.[/li][/ol]
Which one of those do you actually have in mind?
Neither one. Taking a sample of 1000 and a sample of 50 from a single very large population with a known frequency, then taking sub-samples of 30 from each and trying to infer what the frequencies are in the 1000-person and 50-person samples, and how accurate those estimates are. That’s the accuracy the OP asked about, at least as I read it.
I understand, but OP calls both Joe’s 1000 and Mary’s 50 populations. And unless they’re somehow related (e.g., from the same super-population) I don’t know how anything meaningful can be said about the relative accuracies.
The disconnect here is that OldGuy is assuming that the samples of size 1000 and 50 are drawn from a population with an a priori known mean. (In his left-handedness example, the base population has mean 10%.) In this case, before either Mary or Joe makes any measurements, Joe has a better estimate of his sample’s mean (N=1000) than Mary does (N=50) just by assuming the mean is the same as the base population.
This is true, but it doesn’t relate to the OP. The OP did not stipulate any such priors for Joe or Mary.
Consider taking samples from both populations and computing their means. These are both random variables, so they have standard deviations. At the level of this thread, it’s appropriate to identify the accuracy of an estimator with its standard deviation. That’s how you compare the accuracy of these estimators.
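To sketch what that comparison looks like in practice (hypothetical groups and numbers, not anything from the OP): fix the two groups, repeatedly draw size-30 samples without replacement from each, and compare the standard deviations of the resulting sample means.

```python
import random
import statistics

random.seed(0)
# Two hypothetical fixed groups (sizes 1000 and 50); values are illustrative only.
pop_joe = [random.gauss(10, 5) for _ in range(1000)]
pop_mary = [random.gauss(10, 5) for _ in range(50)]

def sd_of_sample_mean(pop, n=30, reps=20000):
    # Standard deviation of the mean of a size-n sample drawn without replacement.
    return statistics.stdev(
        statistics.mean(random.sample(pop, n)) for _ in range(reps)
    )

sd_joe = sd_of_sample_mean(pop_joe)
sd_mary = sd_of_sample_mean(pop_mary)
print(sd_joe, sd_mary)
```

With these sizes the finite-population correction actually makes Mary’s sample mean the less variable one, since her 30 draws cover most of her group of 50.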
It’s mathematically true, but it’s not clear to me that there’s any statistical content here. When do you ever need to make inferences about the value of a parameter in some sample from a population where the parameter is known without examining the sample?
Right. The point OldGuy is making is true but has no mathematical bearing on the OP and no practical bearing on real life inference situations. I agree with your posts in this thread, but thought maybe I could repair the disconnect a bit.
Just for fun I considered the problem that OldGuy is proposing. In this case, Joe has a sample of 1000 people drawn from an infinite population where the probability of possessing a certain trait is 10%. Mary has an independent sample of 50 people drawn from the same population. If both people subsample 30 people from their sample and use the proportion with the trait in their subsamples as an estimate of the prevalence of the trait in their respective subsamples, whose estimate has a lower standard deviation?
This is essentially impossible to handle analytically, so I wrote some code to do it:
# Draw a set of N from an infinite population with trait probability p,
# subsample n without replacement, and return the subsample proportion.
f <- function(N, p, n) { mean(sample(runif(N) < p, n, replace = FALSE)) }
sem <- sd(replicate(1e5, f(50, 0.1, 30)))    # Mary: set of 50
sej <- sd(replicate(1e5, f(1000, 0.1, 30)))  # Joe: set of 1000
sem / sej
This puts Mary’s standard error very close to Joe’s, and I don’t think that any reasonable test would reject the null hypothesis that they’re equal.
You can’t use classical statistics to answer the OP. T-tests essentially assume an infinite population from which you are sampling. Draw a sample of size N from a population with a normal distribution and you have a T distribution for the mean with N-1 degrees of freedom. But this is only literally true if the population is infinite. It is approximately true if the population is much bigger than N. If the population is only a bit bigger than N, it is not true. The easiest way to see this is if the population is size exactly N, then you obviously know the population mean (and everything else about the population’s distribution) exactly.
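The finite-population correction makes that concrete. A minimal sketch (generic numbers, nothing from the OP beyond the set sizes): the standard error of a sample mean drawn without replacement is sigma/sqrt(n) times sqrt((N - n)/(N - 1)), and it shrinks to zero as n approaches N.

```python
import math

def fpc_se(sigma, N, n):
    # Standard error of the mean of n draws without replacement from a
    # population of size N with standard deviation sigma.
    return sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))

print(fpc_se(5, 50, 30))  # well below the infinite-population value 5/sqrt(30)
print(fpc_se(5, 50, 50))  # 0.0: sampling the whole population leaves no uncertainty
```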
The OP said the populations were size 50 and 1000. A sample of 30 out of a population of 50 is not going to give you a T distribution. And since the population sizes are vastly different, I don’t think any classical test is going to work to compare the accuracy. You need something Bayesian – you need a prior. If the priors are different for the two populations, comparing them isn’t going to make a lot of sense. The only sensible question I could think the OP was asking was if the two populations had the same priors.
Now yes I did assume I knew the mean of the populations exactly. That was the simplest prior to work with just to illustrate that the OP’s original intuition was incorrect. But I’m pretty sure you could assume many other priors in which the population means were not known, but were tight enough that Joe would have an ex ante estimate of his mean that was more accurate than Mary’s.
That may not be the question the OP was asking, but I can’t think of any other question he really might have been.
Without running the test, I’m pretty sure you’re correct that those standard errors are going to be quite close, but that’s not quite the statistic I proposed.
When Joe and Mary see n out of 30, Joe’s estimated mean for his population is not n/30 but (n + 0.1×970)/1000. Mary’s is (n + 0.1×20)/50.
Doing a Monte Carlo for this with 1000 repetitions, I get an s.e. of .95% for Joe and 2.7% for Mary. It’s a bit too late tonight to check that I’ve done it completely accurately, but I’ll check tomorrow. I’ll also try to find the sample size at which it’s a tie.
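For anyone who wants to check, here is a sketch of that Monte Carlo in Python, under my reading of the setup: an infinite population at 10%, an estimator that fills in the unseen members of the set at the prior mean, and error measured against each set’s realized proportion.

```python
import random
import statistics

def trial(set_size, subsample=30, p=0.1, prior_mean=0.1):
    # Draw a set from an infinite population with trait probability p,
    # subsample without replacement, and apply the fill-in-the-prior estimator.
    traits = [random.random() < p for _ in range(set_size)]
    true_prop = sum(traits) / set_size
    n = sum(random.sample(traits, subsample))
    est = (n + prior_mean * (set_size - subsample)) / set_size
    return est - true_prop  # error relative to this set's realized proportion

random.seed(1)
se_joe = statistics.stdev(trial(1000) for _ in range(20000))
se_mary = statistics.stdev(trial(50) for _ in range(20000))
print(se_joe, se_mary)
```

With these settings I’d expect numbers in the same ballpark as the figures quoted above (just under 1% for Joe, around 2.7% for Mary), modulo simulation noise.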
I guess it’s a good thing that I never mentioned a t-test or any kind of hypothesis test at all in relation to the OP then. What the hell are you responding to?
Why? What justifies this choice of estimators? What’s the probability model? Why are you so hung up on interpreting these two groups as samples when they’re specified to be populations?
I admit to skimming OldGuy’s posts and merely being baffled by his conclusions. On reread, here’s a paraphrase of the sub-problem he’s solving:
Mary knows that 10% of all people are left-handed, and that these left-handed people are distributed more-or-less evenly throughout the population. She picks 30 people at random and discovers that 15% of them are left-handed. If she picks 20 other people at random, distinct from the first 30, how many of them should she expect to be left-handed?
Answer: 10%. She knows the answer a priori, and it’s not the happenstance 15% of her first random sample.
This is trivially true, and might be relevant in some other discussion. I don’t know why he thought this was related to the question OP was asking.
I’m gonna stick up a bit for OldGuy here (us ???Guys gotta stick together).
The OP’s question is, IMO, ill-posed. The way I read it, he had a sound conceptual question in his head, but his constructed examples / explanations didn’t really say what he thought they did and therefore don’t support the answer to his actual conceptual question.
So the various folks answering have to decide whether to elaborate the examples as given (OldGuy’s approach), or answer the question the OP probably meant to ask, but didn’t really (everybody else’s approach).
The question is not ill-posed. The question is literally something we could’ve put on the midterm back when I was teaching with the expectation that most of them would solve it in under five minutes. It’s really that straightforward.
OK I guess my example is causing confusion so let me change the example to be much more like a classical problem, but it will make the same point. Joe typically will be much more confident about his population mean.
To be classical let’s assume that the x’s in each population are normally distributed with an unknown mean and variance. Mary and Joe look at their samples of 30, and for convenience both see a mean of 10 and a variance of 25. They assume this is representative of each data point in their population. (Or we could assume only the population means are unknown and the variances are known to be 25 if we want a Z test rather than a t test.)
Mary knows 30 x’s in her population and does not know 20 x’s in her population. She assumes each of the unknown x’s has a mean of 10 and a variance of 25. Her estimate of the population mean is E[(300 + x[sub]1[/sub] + … + x[sub]20[/sub])/50]. (The 300 is the total of her 30 seen observations: 30×10.) Naturally her point estimate is 10. The variance of her estimate is the variance of the sum of 20 unknown x’s divided by 50[sup]2[/sup]. Assuming independence, this is var[(x[sub]1[/sub] + … + x[sub]20[/sub])/50] = 20×25/50[sup]2[/sup] = 0.2.
Joe does not know 970 of his 1000 x’s. The variance of his estimator is 970*25/1000[sup]2[/sup] = 0.02425. Joe is much more confident about his population mean than is Mary.
At some point as the sample sizes increase, Mary will become more confident. Clearly if they both sample 50, Mary knows her population mean exactly. Mary is more confident when (50-n)*var/50[sup]2[/sup] < (1000-n)*var/1000[sup]2[/sup]. This occurs for n = 48. Note that this does not depend on the population variances provided they are the same.
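A quick sketch of that arithmetic (same assumptions as above: per-observation variance 25, sets of 50 and 1000, the unseen members supplying all the variance):

```python
def est_var(N, n, var=25):
    # Variance of the estimate of a size-N set's mean after seeing n members:
    # (N - n) unknown observations, each with variance var, divided by N^2.
    return (N - n) * var / N**2

print(est_var(50, 30))    # 0.2 for Mary
print(est_var(1000, 30))  # 0.02425 for Joe

# Smallest subsample size at which Mary's variance drops below Joe's.
crossover = next(n for n in range(1, 51) if est_var(50, n) < est_var(1000, n))
print(crossover)  # 48
```

Note that `var` cancels from both sides of the inequality, so the crossover at 48 doesn’t depend on it, as long as the two populations share the same per-observation variance.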
That’s not a standard statistical procedure. OldGuy, if you have anything that says that what you’re doing is reasonable–a textbook, a journal article, anything at all–then please post it, because right now I can only assume that you’re bullshitting in a vain attempt to cover up the fact that you have no idea what you’re talking about.
OK obviously we’re not communicating. I assure you that that calculation is completely standard for a finite population. Most classical statistics assumes infinite populations, or approximately that the sample is a small fraction of the population. That is exactly the problem here. Mary’s population is very close in size to her sample.
Note that if Mary samples 50, she knows the population mean (and everything) exactly. Her sample mean is not t-distributed with 49 degrees of freedom. It is exact.