Very Basic Math That you Still Don't Get

Rather than clutter up this thread, I thought I’d start a new one along similar lines.

In statistics, sampling is done in order to make meaningful inferences about a given population. From my understanding, the more samples that are taken, the better (or more accurate) the inferences one can make regarding the population in question.

Now, in the statistics classes I have taken, I always heard that a good rule of thumb is to take at least 30 samples before attempting to make any kind of inference on a given population.

My questions are:

  • On what basis is the “at least 30 samples” a good rule of thumb? Experience in doing statistics? Normal distribution? What?

  • If one is trying to make a meaningful inference on a given population, and one does not know whether the population is normally distributed (or at least one doesn’t want to make that initial assumption), how many (minimum) samples are necessary to make some kind of inference?

IANAstatistician, but how can such a general statement be made without tying it to the size of the population? Sounds pretty ridiculous to me.

Maybe you’re misremembering, and it was some percentage of the size of the population.

I realize that you’re only defining a “rule of thumb”, but still, it’s gotta be different for populations in the thousands vs populations in the millions, no?

In stats, 30 is the cutoff for several things, and is a good rule of thumb for most samples, but as you said, the more the better.

From what I remember from statistics classes and a logic class, 30 is a bit low, but not too far off.

Actually, it all depends on how accurate and precise you want to be. For example, if you want to be within 5% of the “actual” statistic 95% of the time, you will need a sample of a certain size. There are complicated equations to calculate this, but they all have an upper limit. That is, twice the population does not mean you need twice the sample size.

Again, I’m recalling a class I took 9 months ago, but I believe that for the accuracy and precision stated above, you only need a sample size of 70. That’s right, even if you want to poll the entire US, you only need to call 70 people (at random, of course). We used a rule of thumb that went like this:

sample size = (70 * population size)/(70 + population size)

Notice that the sample size approaches 70 as the population approaches infinity. Again, this specific rule only applies for the accuracy and precision above. There are tables (sorry I don’t have a cite yet) that give you the “magic” number for different accuracies and precisions.

OK, here is a page that explains this in the context of doing a phone survey and here is a calculator that can do the math for you.

Apparently I was a little off as this site says you need a sample size of 385 to be 95% sure that you are within 5%. Oh, well - I’ll dig out my notes tonight.
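That 385 comes from the standard large-population formula for estimating a proportion, n₀ = z²·p(1−p)/e², using the worst case p = 0.5, and a finite population correction then shows why the total population size barely matters. Here is a minimal sketch; the 8,000 and 285 million figures are just illustrative:

```python
import math

# Sample size for estimating a proportion to within a margin of error e,
# at a confidence level whose two-tailed z-value is z.
# The worst case p = 0.5 maximizes p*(1-p).
def base_sample_size(z=1.96, p=0.5, e=0.05):
    return math.ceil(z**2 * p * (1 - p) / e**2)

# Finite population correction: shrinks the required sample slightly
# for small populations, but barely matters for large ones.
def corrected_sample_size(n0, population):
    return math.ceil(n0 / (1 + (n0 - 1) / population))

n0 = base_sample_size()                         # 385 for 95% confidence, +/- 5%
town = corrected_sample_size(n0, 8_000)         # a small town
usa = corrected_sample_size(n0, 285_000_000)    # the whole US
print(n0, town, usa)                            # 385 368 385
```

Note how the required sample for the whole US is essentially the same as for a town of 8,000, which is the point a later poster makes.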

It does depend on the size of the population as a whole. If, for example, you have a country with 1,000,000 people in it of which, say, 60% agree with a bill and 40% oppose it, it’s certainly not good to do a poll among only 30 people; you could be unlucky enough to pick a sample with 20 naysayers in it, destroying your result and suggesting that a majority of the population opposes (there’s a formula to calculate the probability of this unlucky event, but I dunno what it looks like). The bigger your sample is, the lower this probability.
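The probability of that unlucky sample is a binomial tail. A minimal sketch, assuming independent draws with a fixed 40% "no" proportion (a fine approximation when the population dwarfs the sample):

```python
from math import comb

# P(at least k "no" answers in a sample of n) when the true
# "no" proportion is p -- the upper tail of the binomial distribution.
def tail_prob(n, k, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability that a sample of 30 from a 40%-opposed population
# contains 20 or more opponents (a false majority).
p_unlucky = tail_prob(30, 20, 0.4)
print(p_unlucky)  # small: well under 1%
```

So the "unlucky" misleading sample is possible but rare even at n = 30; larger samples drive it toward zero.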

I don’t know the numbers for the US, but in Germany with its 82 million population, polls usually have samples of 1,000-2,000 people asked in order to determine the will of the people.

One area where this gets really important (involving hundreds of millions of dollars) is the field of determining viewer ratings for TV programs. Statisticians have been spinning their wheels trying to find a way to determine the ratings as exactly as possible without having to survey hundreds of thousands of people.

30 samples or more: You can use the given (tabulated) z-values because, as the sample gets bigger, the sampling distribution of the mean tends toward a normal distribution and the sample mean approximates the population mean.

Less than 30 samples: Use the t-values instead of the z-values, since with a small sample the t-distribution is a better approximation of the sampling distribution of the mean.
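A small sketch of the z vs. t distinction: the z critical value can be computed from the standard normal, while the t critical value (hardcoded below from a standard t-table, df = 29) is larger because the t-distribution has fatter tails:

```python
from statistics import NormalDist

# Two-tailed critical value from the standard normal (z) distribution.
def z_critical(confidence=0.95):
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

z = z_critical(0.95)  # about 1.96

# For small samples the t critical value is larger. From a t-table
# (df = 29, two-tailed 95% confidence):
t_29 = 2.045

# The penalty for a small sample: wider confidence intervals.
print(z, t_29, t_29 > z)
```

As the degrees of freedom grow, the t critical value shrinks toward the z value, which is why the distinction stops mattering around n = 30.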

Minimum number of samples needed: I think it depends on the confidence level you want (how close to the true mean you’d like your estimate to be, X% of the time), and how much Type I error you are willing to accept.

I hope someone with better knowledge of statistics will correct me if I’m totally wrong (or even partially). I’m just a student currently taking the class.

Let n be number of samples, s be the standard deviation, B is your “within” value, and z[sub]a/2[/sub] be the z-value.

Then n= (s[sup]2[/sup] * z[sub]a/2[/sub][sup]2[/sup] ) / B[sup]2[/sup]
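Karl’s formula as code, a minimal sketch with made-up illustrative numbers (s = 15, B = 2 are hypothetical, not from any real study):

```python
import math
from statistics import NormalDist

# n = (s^2 * z_{a/2}^2) / B^2 : samples needed to estimate a mean
# to within B of the truth, given standard deviation s.
def samples_for_mean(s, B, confidence=0.95):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{a/2}
    return math.ceil((s**2 * z**2) / B**2)

# Hypothetical example: estimate a mean score whose standard
# deviation is 15, to within 2 points, with 95% confidence.
print(samples_for_mean(15, 2))  # 217
```

Halving the "within" value B quadruples the required sample, since B is squared in the denominator.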

Sorry for being unclear. When I said “sample size” I meant “number of truly random, viable responses”. No phone surveyor is going to get 385 honest, unbiased, randomly selected answers on his first 385 tries. Mathematically, though, this would be enough. Quoting the above link:

Well, I am a statistician…

To quote Hinkle, Wiersma, & Jurs; Applied Statistics for the Behavioral Sciences, p. 312

This means that the sample size is dependent on what you want to find. For example, if I want to see if a specific reading program increases test scores, I might need a different sample size than if I want to see if a specific drug has negative side effects. If what I want to find is hard to see, I need a bigger sample size than if the effect is easy to see.

The sample size is also dependent on the test/hypothesis tested. A one sample T-test has a different sample size equation for the same effect size than a 6 cell Chi squared test. Also, if the T-test is one tailed or two tailed the calculations are different (I think there is a difference between the two groups = two tailed test vs. I think that group B will be higher = one tailed test.)

In general, Karl’s equation is close. The equations all have the population error variance (sigma squared) and the squared difference between the tested and critical z scores in the numerator and the squared effect size in the denominator. Standardizing the effect size removes the population error variance. The z[sub]a/2[/sub] in Karl’s equation (alpha divided by two) is only used for two-tailed tests of H[sub]0[/sub].

The equation for the two-sample case in a one-tailed test of the hypothesis is as follows (provided that my coding and the {sym} code work):

        2[sym]s[/sym][sup]2[/sup](Z[sub][sym]b[/sym][/sub] - Z[sub][sym]a[/sym][/sub])[sup]2[/sup]
n = -----------------------------
        (effect size)[sup]2[/sup]

where [sym]s[/sym][sup]2[/sup] is the population error variance,
Z[sub][sym]b[/sym][/sub] is the standard score in the sampling distribution with H[sub]a[/sub] corresponding to z[sub]a[/sub] for a given power (from look-up tables),

Z[sub][sym]a[/sym][/sub] is the critical value of the test statistic in the sampling distribution associated with H[sub]0[/sub] for a one-tailed test at a given [sym]a[/sym], and ES is the effect size, as determined by the study.
(Definitions from same text.)

I chose the two-sample case for a one-tailed test because it is most common. If I wish to determine if Method A is better than Method B, I’m only looking for one outcome; I’m not too concerned about Method B being better than Method A. For example, I want to determine if taking Zinc ameliorates cold symptoms faster than not. I have one desired proposed outcome; I don’t suspect that Zinc will make the cold symptoms WORSE than doing nothing at all. This is a one tailed test.

To carry out this study, I’ll need two groups. One group will take the Zinc and one will take nothing. Hence, I have a two-sample, one-tailed test.

(In the real world, researchers would use multiple groups and a double blind methodology.)

Now, does this make any sense whatsoever?
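Spritle’s two-sample, one-tailed formula can be sketched numerically. This uses the common textbook sign convention where both z-values are taken positive (equivalent to the Z[sub]b[/sub] - Z[sub]a[/sub] form when Z[sub]a[/sub] is negative); the effect size of 0.5 standard deviations for the hypothetical Zinc study is illustrative, not from any real data:

```python
import math
from statistics import NormalDist

# Per-group sample size for a two-sample, one-tailed test:
# n = 2 * sigma^2 * (z_power + z_alpha)^2 / ES^2
# (both z's taken positive, the usual textbook convention)
def per_group_n(effect_size, sigma=1.0, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_power = NormalDist().inv_cdf(power)      # from the power requirement
    return math.ceil(2 * sigma**2 * (z_power + z_alpha)**2 / effect_size**2)

# Hypothetical Zinc study: detecting a medium standardized effect
# (0.5 SD) with 80% power takes about 50 subjects per group.
print(per_group_n(0.5))  # 50
```

This makes Spritle’s point concrete: the harder the effect is to see (the smaller the effect size), the larger the sample each group needs.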

As a former Statistics teacher I feel the need to STRONGLY correct several former posters in the Fight Against Ignorance. Believe it or not, the sample size you need to make inferences for a given population has almost NOTHING TO DO WITH THE SIZE OF THE POPULATION. You heard that right. If I want to conduct a telephone poll for my town of 8000 people and then conduct the same poll for the United States as a whole (285 million+ people), I will need the exact same sample size to reach the same level of confidence in my results. This assumes of course that all responses are truly random and normally distributed. I know this is hard to believe but it is true and thinking otherwise is one of the most common errors that I have seen in statistics students.

In another thread recently Chronos, while talking about approaching the speed of light, said you could go 0.9999999999 (and so on) of the speed of light but you could not reach 0.9999999999 (insert infinite number of nines here). As long as I keep adding one more 9 by hand I’m OK till the end of the universe, but I can’t describe it as an infinite number of 9’s?

Also, I once asked somewhere around here about doing other mathematical functions with infinity. I got some great answers but I can no longer find the thread and I forgot the explanations.

Infinity - Infinity = ??? (should be zero in my head)
Infinity * 0 = ??? (zero in my head again but I think it’s ‘undefined’)
Infinity / Infinity = (I’m thinking 1 but ‘undefined’ tickles my memory)
Infinity / 0 = ??? (isn’t that always undefined?)

My other big gripe has always been with Imaginary numbers. I guess I’m just hung-up on the semantics of it…to me mathematics is admitting it’s a bogus concept! However, that has recently been discussed on this board and shown to be quite useful so I’ll leave it alone.

highlighting mine

It seems to me the part I highlighted is the hangup. I don’t see how a response could ever be random unless I’m asking the respondent to pick an arbitrary number (even then it probably isn’t random, since most people will probably tend to keep their choice small as opposed to something like 6.83 * 10[sup]28[/sup]).

As far as ‘normally distributed’ we know this can’t be the case in a phone poll of a small population of the United States unless you select randomly from all 250 million…even then I’d have a hard time believing the resulting stat was reliable. What if your poll was about opinions on abortion and your ‘random’ sampling managed to select a disproportionate number from bible belt locations?

It would seem to me a random sampling is a bad idea in such cases, but a sampling based upon some demographics might be better (i.e. 20% polled in the Southeast, 20% in the Northeast, 20% in the Midwest, 20% in the West and 20% in the Northwest). That’s just a guess since I dropped Stat in college, and I imagine you’d have to somehow weight for population densities, but I guess it goes to show how this isn’t exactly intuitive.
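The idea being groped toward here is stratified sampling with population weights: each region’s poll result is weighted by that region’s share of the population, not its share of the sample. A toy sketch; every share and poll number below is made up purely for illustration:

```python
# Post-stratification weighting: weight each region's sample result
# by that region's share of the population.
# All shares and poll results below are hypothetical.
population_share = {"South": 0.38, "Northeast": 0.17,
                    "Midwest": 0.21, "West": 0.24}
pct_in_favor = {"South": 0.40, "Northeast": 0.65,
                "Midwest": 0.50, "West": 0.60}

# Population-weighted national estimate vs. a naive average of regions.
weighted = sum(population_share[r] * pct_in_favor[r] for r in population_share)
unweighted = sum(pct_in_favor.values()) / len(pct_in_favor)
print(weighted, unweighted)
```

The two numbers differ because the regions aren’t equally populous, which is exactly why weighting (rather than equal 20% shares) is needed.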

I was going to add to my already too long post, but decided against it. After reading shagnasty’s post, though, (breathe) I should clarify that my explanation is for determining the number of cases necessary in a group to determine whether a difference truly exists between those groups.

More towards the OP, diminishing error returns gives us very rough estimates of sampling size (the number of cases necessary to extrapolate to a larger population), but unfortunately, things just aren’t that easy.

In my masters work, I had the unfortunate chance (no pun intended) to take not one but two courses on survey sampling and survey errors. I hated both courses with extreme passion. At any rate, I can tell you that Leslie Kish (the Guru of Survey Sampling) has a paperback text book that killed enough trees to create over 600 pages and is chock full of derivations and formulas for determining how to sample populations to minimize error of extrapolations. It covers EPSEM, stratified, rts, random stratified and cluster sampling among other methodologies I’d rather forget. I don’t recall “30” being a tight assumption of sample frame numbers.

As stated above, diminishing returns will give us a number that we can be comfortable with (see RGillen’s most recent post). Notice that most poll reports on news casts state that 704 or so people were polled and that there is a 2.5 point or so (usually percentage point) error.

This is the only one I feel I can answer, because I suck at math, but the thing to remember is that there’s infinity, and then there’s lesser infinity. Which I guess means that although both numbers are frigging huge, one is less huge than the other. On a very very very small scale (compared to infinity) imagine the difference between 999999999 and 999999998. The numbers look like they’re about the same, but in fact there is a difference of one. Same goes with really huge numbers. Unless you can be absolutely sure that the two values of infinity are the same (in which case I think they’d be finite, cuz you can define them) there will be a difference between the two that will not be zero.
And I have to say, I just now psychically predicted the first song my SO’s computer was gonna play…weird…
So someone else may correct me about the infinity stuff, but that’s how I think of it, and what does it matter anyways, if I’m psychic? :slight_smile:

Infinity is not defined as a number in Algebra or Arithmetic, so it’s kind of strange to say things like “Infinity - Infinity”, but I think I can still show why it’s not just 0, in terms of set theory.

How many integers are there? Infinity. So how many numbers do you have if you start with all the integers, and take away all the integers? None, of course. Infinity - Infinity = 0.

How many odd numbers are there? Infinity. So how many numbers do you have if you start with all the integers, and take away all the odds? Infinity - Infinity. But aren’t you left with all the evens, of which there are Infinity? Infinity - Infinity = Infinity.

How many integers are there besides 4, 8, and 11? Infinity. So what happens if you start with all the integers, and take away all the integers except 4, 8, and 11? Infinity - Infinity = 3.

There are twice as many integers as there are odd numbers. Infinity / Infinity = 2.

Bottom line:
You can’t treat Infinity like a regular ol’ number. That’s why Infinity 9’s is qualitatively different than 8 9’s or 80 9’s or 8[sup]8[sup]8[sup]8[/sup][/sup][/sup] 9’s.
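The set examples above can be illustrated with finite truncations. Pairing each natural number n with the even number 2n is a bijection, which is why the integers and the evens are the same "size" of infinity, even though any finite cutoff suggests a ratio of two (this code only illustrates the idea, it is not a proof):

```python
# Pair each natural number n with the even number 2n: a bijection,
# so the naturals and the evens have the same (countable) cardinality,
# even though within any finite cutoff the evens look half as numerous.
N = 1_000
naturals = list(range(N))
evens = [2 * n for n in naturals]

pairing = dict(zip(naturals, evens))
assert len(pairing) == len(naturals) == len(evens)  # nothing left over

# Within a finite cutoff, the evens below N are only about half:
evens_below_cutoff = [n for n in range(N) if n % 2 == 0]
print(len(naturals), len(evens_below_cutoff))  # 1000 500
```

Both views are "right" in their own sense, which is exactly why expressions like Infinity / Infinity have no single defined value.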

Can someone give me a crystal-clear explanation of the significance of the “invention” of zero? I mean, how can that be invented, and what did it change about man’s understanding of mathematics?

*Originally posted by Spritle *

Spritle and everyone else,

Thanks for all the info. If I understand it correctly, to make some kind of meaningful inference on a given population, the sample size selected should be based on the confidence level (how much error one is willing to allow in the sample) and what kind of effect one is looking for. This also assumes that the sampling is truly random and has a normal distribution. Is that basically correct?

originally quoted by eponymous

Spritle or anyone else,

I didn’t see this question specifically addressed - do the same criteria apply? Or would one need to do numerous different samples for the normal distribution assumption to apply? I guess what I’m trying to understand is what to do if you don’t use the normal distribution assumption for a given population. Or is that a valid approach to take?

Aaaaaah…Spritle, thanks for telling me that I’m not so far away (which means at least some fraction of what they give in class stays in my head). I’m only taking an introductory course and that is the equation that we are using right now.

It’s not entirely obvious that math can be done without zero, but in fact, the numbers {1, 2, 3, …} were good enough for most everything way back when. I think the biggest impact of the invention of zero was that it paved the way for negatives, fractions, irrationals, and imaginaries, all of which are very important these days, both in pure math and its applications. It also led to a place-based representation of numbers, which required the memorization of fewer symbols than the Roman system. This also led to the simplification of arithmetic (ever tried doing long division with Roman numerals?).

If you’ve got some time to read, check out this site.

I know this is not actually answering your question, but this book, despite initial appearances, is one of the most interesting books I think I’ve ever read. It deals mightily with the history of zero and its impacts on religion (why it was invented by Arabs but rejected by Europeans and so on), science, philosophy, etc. More than you ever wanted to know about nothing.

I read it too long ago to adequately explain zero’s significance, but probably wouldn’t if I could- the book is one anyone with any interest in the subject should really read for themselves.