I understand that, when you take a sample of a population, the mean of the sample is an approximation of the mean of the population, and the standard deviation of the sample is an approximation of the standard deviation of the population.
But how close (or accurate) is the approximation? It would seem to me that it depends on the ratio of the size of the sample to the size of the population.
As an example, let’s say a population contains 1000 data points. Joe takes a random sample of 30 data points from the population, computes the mean of the sample, and computes the standard deviation of the sample. Does his mean value approximate the mean of the population? I suppose so. But given that the sample size is only 3% of the population, I must assume there’s quite a bit of uncertainty in the value. Same goes for the standard deviation.
As another example, let’s say a population contains 50 data points. Mary takes a random sample of 30 data points from the population, computes the mean of the sample, and computes the standard deviation of the sample. Does her mean value approximate the mean of the population? I would expect so. And given that the sample size is 60% of the population, I would assume it’s quite close. Same goes for the standard deviation.
So while both Joe and Mary have the same sample size (30), I would expect Mary’s mean value and standard deviation value to be much closer to the “real” values of the population compared to Joe’s values.
Is the above correct? Is there a way to mathematically handle this?
Yes, there are ways to handle it, depending on exactly what the data represents. As a simple case, suppose the data is left-handedness, so each data point is a 1 (yes) or 0 (no). Joe knows that of the 30 people he sampled, n are lefties, and, assuming independence in the original sample, he knows nothing more about the remaining 970 people, so the proportion of lefties is between n/1000 and (n+970)/1000 for sure. On the other hand, when Mary sees m lefties, she knows for sure the proportion is between m/50 and (m+20)/50.
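For concreteness, here is a minimal Python sketch of those 100%-certain bounds (the choice of 3 lefties out of 30 is just an example number, not something from the thread):

```python
# Definitive (100%-certain) bounds on the population proportion of lefties,
# given only what was observed in the sample.
def definitive_bounds(lefties_seen, sample_size, population_size):
    unseen = population_size - sample_size
    low = lefties_seen / population_size              # every unseen person is right-handed
    high = (lefties_seen + unseen) / population_size  # every unseen person is left-handed
    return low, high

# Example: both Joe and Mary happen to see 3 lefties in their samples of 30.
print(definitive_bounds(3, 30, 1000))  # Joe:  (0.003, 0.973)
print(definitive_bounds(3, 30, 50))    # Mary: (0.06, 0.46)
```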
So Mary has much tighter definitive bounds than Joe. On the other hand, Joe could well have tighter confidence bounds than Mary. The reason is that Joe has a much bigger unknown group than Mary does, and the bigger an unknown group is, the more likely it is to be close to average.
Suppose that in the world (or whatever), the fraction of lefties is known to be p, and suppose that Joe’s and Mary’s populations were drawn representatively. Assuming all those things statisticians assume, like independence, the N people each of them has not sampled would have an expected number of lefties of pN and a variance of p(1-p)N. So Joe would estimate there are n + 970p lefties in his population, with a variance of p(1-p)·970, and Mary would estimate m + 20p, with a variance of p(1-p)·20 (her unsampled remainder is 20 people).
In terms of proportions, Joe would estimate (n + 970p)/1000 with a variance of p(1-p)·970/1000^2, and Mary would estimate (m + 20p)/50 with a variance of p(1-p)·20/50^2 = p(1-p)/125. The standard deviations are the square roots.
To put in some numbers, suppose p = 10% and Joe and Mary both get it just right, with 3 lefties each out of 30. Both of their estimates would be 10%. Joe’s standard deviation would be about 0.93%; Mary’s would be about 2.7%.
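Here is a small Python sketch of that arithmetic, under this post’s assumption that the world-wide fraction p is known and every unsampled person is an independent Bernoulli(p) draw (the framing that later replies take issue with):

```python
import math

# Estimate of the population proportion when the unsampled remainder is
# treated as independent Bernoulli(p) draws, per the assumption above.
def prior_based_estimate(lefties_seen, sample_size, population_size, p):
    unseen = population_size - sample_size
    estimate = (lefties_seen + unseen * p) / population_size
    std_dev = math.sqrt(unseen * p * (1 - p)) / population_size
    return estimate, std_dev

p = 0.10
print(prior_based_estimate(3, 30, 1000, p))  # Joe:  (0.10, ~0.0093)
print(prior_based_estimate(3, 30, 50, p))    # Mary: (0.10, ~0.0268)
```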
Mary is less sure, in the sense just described, because her population of 50 is itself a smaller draw from the wider world, and so is more likely to be unrepresentative of it than Joe’s 1000 are.
This might seem counter-intuitive, but note that if Fred’s population is the entire world, he knows for sure the answer is 10% without sampling anyone at all.
This makes no sense. Joe and Mary have the same sample size, so the standard error for their estimates is the same. Not only that, but after you adjust for the finite population, Mary’s standard error is smaller than Joe’s. Mary is more sure of the population mean than Joe.
There is, but this isn’t the kind of question that statisticians are really concerned about. I mean, if you can sample 30 out of a population of 50, why bother getting a statistician involved when you could just measure the other 20 and know the mean for sure?
So most statistical techniques are based on the assumption of a relatively large population – in many cases on the assumption of a perfectly bell-curve-shaped population, which more or less implies a near-infinite one.
Suppose you have some statistic that you know (for some magical reason) is normally distributed with mean zero and variance 1, and you want to verify that the mean is in fact zero. You do that by taking n samples {x_1, … , x_n} and computing the average A = (x_1 + x_2 + … + x_n)/n.
Because you know (again, by magic) that the true mean of the distribution is zero, you expect A to be close to zero, but since you are taking a finite number of samples and there is a bit of jitter, it would be surprising if it were exactly zero. If you repeat the process of collecting samples and computing A over and over again, you get different values of A, which form a distribution: with the variance known, that distribution is normal with mean zero and variance 1/n. In practice the variance isn’t known, so you standardize A by the sample standard deviation instead, and that standardized average follows a Student’s t-distribution.
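A quick simulation of that repeated-sampling picture (just a sketch; n = 30 and the trial count are example numbers):

```python
import numpy as np

# Repeatedly draw n values from N(0, 1), compute the average A each time,
# and look at the spread of the resulting A values.
rng = np.random.default_rng(0)
n, trials = 30, 100_000

samples = rng.standard_normal((trials, n))
A = samples.mean(axis=1)

# With the variance known to be 1, A is normal with mean 0 and std 1/sqrt(n).
print(A.std(), 1 / np.sqrt(n))  # both come out around 0.18 for n = 30

# Standardizing by the *sample* std instead gives a Student's t with n-1 dof.
t_stat = A / (samples.std(axis=1, ddof=1) / np.sqrt(n))
print(t_stat.std(), np.sqrt((n - 1) / (n - 3)))  # t std = sqrt(dof/(dof-2)), ~1.04
```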
So long as your population is significantly larger than your sample, your confidence interval depends almost entirely on the size of your sample, and almost not at all on the size of the population. Assuming it’s a good random sampling, a sample of 1000 out of a billion is just as good as 1000 out of a million.
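A quick check of that claim, using the finite population correction that comes up further down the thread (p = 0.5 and n = 1000 are just example numbers):

```python
import math

# Standard error of a sample proportion with the finite population correction:
# sqrt(p(1-p)/n) * sqrt((N - n)/(N - 1)). For N much bigger than n the second
# factor is essentially 1, so the population size barely matters.
def std_err(p, n, N):
    return math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

p, n = 0.5, 1000
print(std_err(p, n, 10**6))  # ~0.0158 (1000 out of a million)
print(std_err(p, n, 10**9))  # ~0.0158 (1000 out of a billion)
```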
That’s really the kicker. If you have a small sample compared to the size of the population, and poor experimental design, it is a lot easier to come up with a biased sample than it is if you sample half of the entire population.
I think this was the most important concept I learned in the one-quarter Intro to Statistics course I took in college, and it’s also the concept most misunderstood by the public.
When trying to estimate w_mean and w_stddev – say, the mean and standard deviation of weight in a population – Mary (the one with the smaller population) will get better estimates, especially of the standard deviation.
I think OldGuy compared totals, e.g. the absolute number of lefthanders, rather than the frequency of left-handedness. Obviously such statistics would be proportional to population and aren’t what OP intended.
This is true. I believe the standard rule of thumb for “significantly larger” is that the sample is less than 5% of the total population. That agrees with this PDF I found which appears to be a section out of a textbook, and it discusses the “finite population correction factor” that is used when the sample is not less than 5% of the population.
To the OP: This general topic (estimating a population parameter, such as the mean or standard deviation, based on a random sample) is covered in the chapter on “Estimation” or “Confidence Intervals” in any decent statistics textbook.
I started with totals, but the final results were for proportions.
That’s why I called it counter-intuitive. The basic idea is that a large sample is more likely to have a fraction closer to the truth than a small sample. As Chronos pointed out, standard statistics assume that the population is much (infinitely) larger than the sample. Here it is not. A proportion in a small sample more easily differs from the truth than does the proportion in a large sample.
You’re confused about the difference between samples and populations, but your error is subtle enough that I’m having a hard time articulating exactly what it is. Regardless, your conclusion is incorrect: having more data in your sample and having a higher proportion of the population in your sample both decrease your uncertainty. Mary is more certain about the proportion of lefthanders than Joe is.
I did a simple Monte Carlo simulation of OP’s exact problem with Joe’s 1000-sized and Mary’s 50-sized populations both selected randomly from the same Gaussian distribution, and random 30-sized samples taken. I ran 10 million trials (100,000 samplings of 100 populations), calculating the mean squared errors of their estimates (for mean and std dev) for their respective populations.
Assuming my code was correct (:eek:), the MSE of Joe’s estimate of the mean was 1.94 times Mary’s; the MSE of his estimate of the std-dev was 1.80 times Mary’s.
This seems intuitive to me. Note that if the players took samples 66.7% bigger, Mary’s MSEs would become zero!
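Not the poster’s actual code, but here’s a sketch of what such a simulation might look like; the exact ratios will depend on details such as how the standard deviations are computed and how many trials are run:

```python
import numpy as np

# For each trial: build a population by drawing from a standard Gaussian, take
# a 30-person sample without replacement, and record the squared errors of the
# sample mean and sample std against that population's own mean and std.
rng = np.random.default_rng(0)
n_pops, samples_per_pop, n = 100, 1_000, 30

def mean_squared_errors(pop_size):
    se_mean, se_std = [], []
    for _ in range(n_pops):
        pop = rng.standard_normal(pop_size)
        true_mean, true_std = pop.mean(), pop.std(ddof=0)
        for _ in range(samples_per_pop):
            s = rng.choice(pop, size=n, replace=False)
            se_mean.append((s.mean() - true_mean) ** 2)
            se_std.append((s.std(ddof=1) - true_std) ** 2)
    return np.mean(se_mean), np.mean(se_std)

joe = mean_squared_errors(1000)
mary = mean_squared_errors(50)
print(joe[0] / mary[0], joe[1] / mary[1])  # ratio of Joe's MSE to Mary's (mean, std)
```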
Of course, when you’re talking about sample sizes significant in comparison to the size of the entire population, you also have to be careful about whether you’re choosing with or without replacement. If I pick one random person out of 50, and then do it again, and repeat 30 times, I’m extremely likely to have sampled some people several times each. And if I repeat it 50 times, it’s almost a guarantee. In that case, even though my sample is the same size as the population, I still have some amount of sampling error.
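To put a rough number on that (a small sketch, not from the thread): the expected number of distinct people you see when drawing k times with replacement from a population of N is N·(1 - (1 - 1/N)^k).

```python
# Expected number of distinct people seen when drawing k times with
# replacement from a population of N people.
N = 50
for k in (30, 50):
    distinct = N * (1 - (1 - 1 / N) ** k)
    print(k, distinct)  # ~22.7 distinct people for k = 30, ~31.8 for k = 50
```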
What you say is correct if the populations are the same size: then increasing the sample size increases the accuracy. But according to the OP the populations are not the same size. Joe’s population, from which he sampled, is 1000 (a small town); Mary’s population is only 50 (a tiny town). If both of these belong to a super-population – the country, the world, etc. – and we know that the fraction of left-handers in the super-population is 10%, then Joe’s larger population will have a tighter distribution about 10% than Mary’s does. That means Joe has an inherent advantage in estimation: he can be more accurate about his population than Mary can about hers, with the same sample size, because he has a prior with a smaller variance.
As we increase the sample size, at some point Mary becomes more accurate. Obviously once the sample size hits 50, Mary knows her population exactly and Joe does not. Also, as I said in my original reply, if you want the 100% confidence bound, then of course Mary’s is smaller than Joe’s, as there are fewer people (20 vs. 970) she doesn’t know about. But other confidence bounds, like 67% (roughly one standard deviation), favor Joe, as I computed.
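To put rough numbers on that, here is a sketch under this post’s assumption that each population is itself an independent draw from a super-population in which the fraction of lefties is p = 10%:

```python
import math

# If a population of N people is itself an independent draw from a
# super-population where the fraction of lefties is p, that population's own
# fraction has standard deviation sqrt(p*(1-p)/N) around p.
p = 0.10
for name, N in (("Joe's population", 1000), ("Mary's population", 50)):
    print(name, math.sqrt(p * (1 - p) / N))  # ~0.0095 for Joe, ~0.042 for Mary
```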
If you take n observations without replacement from a population of N individuals and you’re trying to estimate proportion whose true value is p, the standard error of the sample mean is equal to sqrt(p(1 - p)/n) * sqrt((N - n)/(N - 1)). That first factor is the standard error for an infinite population, and the second factor is the appropriate correction for the finite population.
Joe and Mary both have n = 30 and p = 0.1, which means that the first factor is roughly 0.055. However, for Joe, N = 1000, so the second factor is roughly 0.985. For Mary, N = 50, so the second factor is roughly 0.638. That makes Joe’s total standard error roughly 0.054, and Mary’s roughly 0.035.
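A quick check of that arithmetic in Python:

```python
import math

# Standard error of the sample proportion with the finite population correction.
p, n = 0.1, 30
for name, N in (("Joe", 1000), ("Mary", 50)):
    se = math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))
    print(name, round(se, 3))  # Joe 0.054, Mary 0.035
```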
The error you’re making is in assuming that the population means are random variables. They’re not; they’re fixed values. There’s no sense in which they have a variance around 10%. If you want to regard the populations as samples from the larger population, then you can do that, but at that point Joe and Mary have both observed 30 individuals from the same population, so their standard errors are equal. There’s never a point at which Joe gets more information from observing a smaller fraction of the population.