I understand that, when you take a sample of a population, the mean of the sample is an approximation of the mean of the population, and the standard deviation of the sample is an approximation of the standard deviation of the population.
But how close (or accurate) is the approximation? It would seem to me that it depends on the ratio of the size of the sample to the size of the population.
As an example, let’s say a population contains 1000 data points. Joe takes a random sample of 30 data points from the population, computes the mean of the sample, and computes the standard deviation of the sample. Does his mean value approximate the mean of the population? I suppose so. But given that the sample size is only 3% of the population, I must assume there’s quite a bit of uncertainty in the value. Same goes for the standard deviation.
As another example, let’s say a population contains 50 data points. Mary takes a random sample of 30 data points from the population, computes the mean of the sample, and computes the standard deviation of the sample. Does her mean value approximate the mean of the population? I would expect so. And given that the sample size is 60% of the population, I would assume it’s quite close. Same goes for the standard deviation.
So while both Joe and Mary have the same sample size (30), I would expect Mary’s mean value and standard deviation value to be much closer to the “real” values of the population compared to Joe’s values.
Is the above correct? Is there a way to mathematically handle this?
Yes, there are ways to handle it, depending on exactly what the data represents. As a simple case, suppose the data is left-handedness, so each data point is a 1 (yes) or 0 (no). Joe knows that of the 30 people he sampled, n are lefties, and, assuming independence in the original sample, he knows nothing more about the remaining 970 people, so the proportion of lefties is between n/1000 and (n+970)/1000 for sure. On the other hand, when Mary sees m lefties, she knows for sure the proportion is between m/50 and (m+20)/50.
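For concreteness, here is a minimal Python sketch of those 100%-certain bounds (the choice of 3 lefties out of 30 is just an example number, not something from the thread):

```python
# Definitive (100%-certain) bounds on the population proportion of lefties,
# given only what was observed in the sample.
def definitive_bounds(lefties_seen, sample_size, population_size):
    unseen = population_size - sample_size
    low = lefties_seen / population_size              # every unseen person is right-handed
    high = (lefties_seen + unseen) / population_size  # every unseen person is left-handed
    return low, high

# Example: both Joe and Mary happen to see 3 lefties in their samples of 30.
print(definitive_bounds(3, 30, 1000))  # Joe:  (0.003, 0.973)
print(definitive_bounds(3, 30, 50))    # Mary: (0.06, 0.46)
```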
So Mary has much tighter definitive bounds than Joe. On the other hand, Joe could well have tighter confidence bounds than Mary. The reason is that Joe has a much bigger unknown group than Mary does, and the bigger an unknown group is, the more likely it is to be close to average.
Suppose that in the world (or whatever), the fraction of lefties is known to be p, and suppose that Joe’s and Mary’s populations were drawn representatively. Assuming all those things statisticians assume, like independence, the N people each of them has not sampled would have an expected number of lefties of pN and a variance of p(1-p)N. So Joe would estimate there are n + 970p lefties in his population, with a variance of p(1-p)·970, and Mary would estimate m + 20p, with a variance of p(1-p)·20 (her unsampled remainder is 20 people).
In terms of proportions, Joe would estimate (n + 970p)/1000 with a variance of p(1-p)·970/1000^2, and Mary would estimate (m + 20p)/50 with a variance of p(1-p)·20/50^2 = p(1-p)/125. The standard deviations are the square roots.
To put in some numbers, suppose p = 10% and Joe and Mary both get it just right, with 3 lefties each out of 30. Both of their estimates would be 10%. Joe’s standard deviation would be about 0.93%; Mary’s would be about 2.7%.
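Here is a small Python sketch of that arithmetic, under this post’s assumption that the world-wide fraction p is known and every unsampled person is an independent Bernoulli(p) draw (the framing that later replies take issue with):

```python
import math

# Estimate of the population proportion when the unsampled remainder is
# treated as independent Bernoulli(p) draws, per the assumption above.
def prior_based_estimate(lefties_seen, sample_size, population_size, p):
    unseen = population_size - sample_size
    estimate = (lefties_seen + unseen * p) / population_size
    std_dev = math.sqrt(unseen * p * (1 - p)) / population_size
    return estimate, std_dev

p = 0.10
print(prior_based_estimate(3, 30, 1000, p))  # Joe:  (0.10, ~0.0093)
print(prior_based_estimate(3, 30, 50, p))    # Mary: (0.10, ~0.0268)
```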
Mary is less sure, in the sense just described, because her population of 50 is itself a smaller draw from the wider world, and so is more likely to be unrepresentative of it than Joe’s 1000 are.
This might seem counter-intuitive, but note that if Fred’s population is the entire world, he knows for sure the answer is 10% without sampling anyone at all.
This makes no sense. Joe and Mary have the same sample size, so the standard error for their estimates is the same. Not only that, but after you adjust for the finite population, Mary’s standard error is smaller than Joe’s. Mary is more sure of the population mean than Joe.
There is, but this isn’t the kind of question that statisticians are really concerned about. I mean, if you can sample 30 out of a population of 50, why bother getting a statistician involved when you could just measure the other 20 and know the mean for sure?
So most statistical techniques are based on the assumption of a relatively large population – in many cases on the assumption of a perfectly bell-curve-shaped population, which more or less implies a near-infinite one.
Suppose you have some statistic that you know (for some magical reason) is normally distributed with mean zero and variance 1, and you want to verify that the mean is in fact zero. You do that by taking n samples {x_1, … , x_n} and computing the average A = (x_1 + x_2 + … + x_n)/n.
Because you know (again, by magic) that the true mean of the distribution is zero, you expect A to be close to zero, but since you are taking a finite number of samples and there is a bit of jitter, it would be surprising if it were exactly zero. If you repeat the process of collecting samples and computing A over and over again, you get different values of A, which form a distribution: with the variance known, that distribution is normal with mean zero and variance 1/n. In practice the variance isn’t known, so you standardize A by the sample standard deviation instead, and that standardized average follows a Student’s t-distribution.
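A quick simulation of that repeated-sampling picture (just a sketch; n = 30 and the trial count are example numbers):

```python
import numpy as np

# Repeatedly draw n values from N(0, 1), compute the average A each time,
# and look at the spread of the resulting A values.
rng = np.random.default_rng(0)
n, trials = 30, 100_000

samples = rng.standard_normal((trials, n))
A = samples.mean(axis=1)

# With the variance known to be 1, A is normal with mean 0 and std 1/sqrt(n).
print(A.std(), 1 / np.sqrt(n))  # both come out around 0.18 for n = 30

# Standardizing by the *sample* std instead gives a Student's t with n-1 dof.
t_stat = A / (samples.std(axis=1, ddof=1) / np.sqrt(n))
print(t_stat.std(), np.sqrt((n - 1) / (n - 3)))  # t std = sqrt(dof/(dof-2)), ~1.04
```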
So long as your population is significantly larger than your sample, your confidence interval depends almost entirely on the size of your sample, and almost not at all on the size of the population. Assuming it’s a good random sampling, a sample of 1000 out of a billion is just as good as 1000 out of a million.
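A quick check of that claim, using the finite population correction that comes up further down the thread (p = 0.5 and n = 1000 are just example numbers):

```python
import math

# Standard error of a sample proportion with the finite population correction:
# sqrt(p(1-p)/n) * sqrt((N - n)/(N - 1)). For N much bigger than n the second
# factor is essentially 1, so the population size barely matters.
def std_err(p, n, N):
    return math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

p, n = 0.5, 1000
print(std_err(p, n, 10**6))  # ~0.0158 (1000 out of a million)
print(std_err(p, n, 10**9))  # ~0.0158 (1000 out of a billion)
```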
That’s really the kicker. If you have a small sample compared to the size of the population, and poor experimental design, it is a lot easier to come up with a biased sample than it is if you sample half of the entire population.
I think this was the most important concept I learned in the one-quarter Intro to Statistics course I took in college, and it’s also the concept most misunderstood by the public.
When trying to estimate w_mean and w_stddev – say, the mean and standard deviation of weight in a population – Mary (the one with the smaller population) will get better estimates, especially of the standard deviation.
I think OldGuy compared totals, e.g. the absolute number of lefthanders, rather than the frequency of left-handedness. Obviously such statistics would be proportional to population and aren’t what OP intended.
This is true. I believe the standard rule of thumb for “significantly larger” is that the sample is less than 5% of the total population. That agrees with this PDF I found which appears to be a section out of a textbook, and it discusses the “finite population correction factor” that is used when the sample is not less than 5% of the population.
To the OP: This general topic (estimating a population parameter, such as the mean or standard deviation, based on a random sample) is covered in the chapter on “Estimation” or “Confidence Intervals” in any decent statistics textbook.
I started with totals, but the final results were for proportions.
That’s why I called it counter-intuitive. The basic idea is that a large sample is more likely to have a fraction closer to the truth than a small sample. As Chronos pointed out, standard statistics assume that the population is much (infinitely) larger than the sample. Here it is not. A proportion in a small sample more easily differs from the truth than does the proportion in a large sample.
You’re confused about the difference between samples and populations, but your error is subtle enough that I’m having a hard time articulating exactly what it is. Regardless, your conclusion is incorrect: having more data in your sample and having a higher proportion of the population in your sample both decrease your uncertainty. Mary is more certain about the proportion of lefthanders than Joe is.
I did a simple Monte Carlo simulation of OP’s exact problem with Joe’s 1000-sized and Mary’s 50-sized populations both selected randomly from the same Gaussian distribution, and random 30-sized samples taken. I ran 10 million trials (100,000 samplings of 100 populations), calculating the mean squared errors of their estimates (for mean and std dev) for their respective populations.
Assuming my code was correct (:eek:), the MSE of Joe’s estimate of the mean was 1.94 times Mary’s; the MSE of his estimate of the std-dev was 1.80 times Mary’s.
This seems intuitive to me. Note that if the players took samples 66.7% bigger, Mary’s MSEs would become zero!
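Not the poster’s actual code, but here’s a sketch of what such a simulation might look like; the exact ratios will depend on details such as how the standard deviations are computed and how many trials are run:

```python
import numpy as np

# For each trial: build a population by drawing from a standard Gaussian, take
# a 30-person sample without replacement, and record the squared errors of the
# sample mean and sample std against that population's own mean and std.
rng = np.random.default_rng(0)
n_pops, samples_per_pop, n = 100, 1_000, 30

def mean_squared_errors(pop_size):
    se_mean, se_std = [], []
    for _ in range(n_pops):
        pop = rng.standard_normal(pop_size)
        true_mean, true_std = pop.mean(), pop.std(ddof=0)
        for _ in range(samples_per_pop):
            s = rng.choice(pop, size=n, replace=False)
            se_mean.append((s.mean() - true_mean) ** 2)
            se_std.append((s.std(ddof=1) - true_std) ** 2)
    return np.mean(se_mean), np.mean(se_std)

joe = mean_squared_errors(1000)
mary = mean_squared_errors(50)
print(joe[0] / mary[0], joe[1] / mary[1])  # ratio of Joe's MSE to Mary's (mean, std)
```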
Of course, when you’re talking about sample sizes significant in comparison to the size of the entire population, you also have to be careful about whether you’re choosing with or without replacement. If I pick one random person out of 50, and then do it again, and repeat 30 times, I’m extremely likely to have sampled some people several times each. And if I repeat it 50 times, it’s almost a guarantee. In that case, even though my sample is the same size as the population, I still have some amount of sampling error.
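To put a rough number on that (a small sketch, not from the thread): the expected number of distinct people you see when drawing k times with replacement from a population of N is N·(1 - (1 - 1/N)^k).

```python
# Expected number of distinct people seen when drawing k times with
# replacement from a population of N people.
N = 50
for k in (30, 50):
    distinct = N * (1 - (1 - 1 / N) ** k)
    print(k, distinct)  # ~22.7 distinct people for k = 30, ~31.8 for k = 50
```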
What you say is correct if the populations are the same size: then increasing the sample size increases the accuracy. But according to the OP the populations are not the same size. Joe’s population, from which he sampled, is 1000 (a small town); Mary’s population is only 50 (a tiny town). If both of these belong to a super-population – the country, the world, etc. – and we know that the fraction of left-handers in the super-population is 10%, then Joe’s larger population will have a tighter distribution about 10% than Mary’s does. That means Joe has an inherent advantage in estimation: he can be more accurate about his population than Mary can about hers, with the same sample size, because he has a prior with a smaller variance.
As we increase the sample size, at some point Mary becomes more accurate. Obviously once the sample size hits 50, Mary knows her population exactly and Joe does not. Also, as I said in my original reply, if you want the 100% confidence bound, then of course Mary’s is smaller than Joe’s, as there are fewer people (20 vs. 970) she doesn’t know about. But other confidence bounds, like 67% (roughly one standard deviation), favor Joe, as I computed.
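To put rough numbers on that, here is a sketch under this post’s assumption that each population is itself an independent draw from a super-population in which the fraction of lefties is p = 10%:

```python
import math

# If a population of N people is itself an independent draw from a
# super-population where the fraction of lefties is p, that population's own
# fraction has standard deviation sqrt(p*(1-p)/N) around p.
p = 0.10
for name, N in (("Joe's population", 1000), ("Mary's population", 50)):
    print(name, math.sqrt(p * (1 - p) / N))  # ~0.0095 for Joe, ~0.042 for Mary
```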
If you take n observations without replacement from a population of N individuals and you’re trying to estimate proportion whose true value is p, the standard error of the sample mean is equal to sqrt(p(1 - p)/n) * sqrt((N - n)/(N - 1)). That first factor is the standard error for an infinite population, and the second factor is the appropriate correction for the finite population.
Joe and Mary both have n = 30 and p = 0.1, which means that the first factor is roughly 0.055. However, for Joe, N = 1000, so the second factor is roughly 0.985. For Mary, N = 50, so the second factor is roughly 0.638. That makes Joe’s total standard error roughly 0.054, and Mary’s roughly 0.035.
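A quick check of that arithmetic in Python:

```python
import math

# Standard error of the sample proportion with the finite population correction.
p, n = 0.1, 30
for name, N in (("Joe", 1000), ("Mary", 50)):
    se = math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))
    print(name, round(se, 3))  # Joe 0.054, Mary 0.035
```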
The error you’re making is in assuming that the population means are random variables. They’re not; they’re fixed values. There’s no sense in which they have a variance around 10%. If you want to regard the populations as samples from the larger population, then you can do that, but at that point Joe and Mary have both observed 30 individuals from the same population, so their standard errors are equal. There’s never a point at which Joe gets more information from observing a smaller fraction of the population.