Questions about doing statistics on sampled data

We often take data where I work. We’re engineers, and unfortunately we do not have a good grasp of statistical concepts. I vaguely remember taking a stats course as an undergrad in 1988… :frowning:

When we perform multiple measurements on something we will calculate the standard deviation (s[sub]n-1[/sub]), multiply it by 2, and then proclaim, “We believe 95% of the population lies within ±2s of the average.”

We’ve been doing this for years. But over the past few weeks I have been doing some reading on confidence intervals and estimating population parameters. I am now thinking our simple “±2s” technique is wrong. Or at the very least a misapplication.

I want to improve the way we analyze our data. So here is where I am at:

  1. We take data at work. As an example, someone will measure ten resistance values for ten 500 ohm resistors. We usually don’t know anything about the population. Often our sample size is small (e.g. N = 10).

  2. I believe what we are trying to do is estimate the mean of the population and estimate the standard deviation of the population. Would you agree that this is the goal of taking measurements?

  3. Based on what I’ve read, it looks like we should be using the Student’s t-distribution to estimate the mean of the population. (Given a confidence level – for example 95% – the Student’s t-distribution gives a confidence interval for the true mean.) Would you agree with this?

  4. When reading about Student’s t-distribution and confidence intervals, all of the articles I’ve read talk about estimating the mean of the population, not the standard deviation of the population. Yesterday I did some more Google searching and discovered the Chi-Squared distribution can be used to estimate the standard deviation of the population from sampled data. Is this what we should be using? (A rough sketch of what items 3 and 4 might look like in code follows this list.)
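
To make items 3 and 4 concrete, here is a rough sketch of what I think the calculations would look like in Python with SciPy. The ten readings are made up purely for illustration, and the whole thing assumes the measurements are roughly normally distributed:

[code]
import numpy as np
from scipy import stats

# Ten hypothetical resistance readings (ohms) -- made up for illustration only
readings = np.array([499.2, 500.1, 500.8, 499.7, 500.3,
                     499.9, 500.5, 499.4, 500.0, 500.6])
n = len(readings)
xbar = readings.mean()
s = readings.std(ddof=1)          # the s_(n-1) we already compute
conf = 0.95

# Item 3: confidence interval for the population mean (Student's t)
t_crit = stats.t.ppf(0.5 + conf / 2, df=n - 1)
mean_ci = (xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))

# Item 4: confidence interval for the population standard deviation (chi-squared)
chi2_lo = stats.chi2.ppf(0.5 - conf / 2, df=n - 1)   # lower-tail quantile
chi2_hi = stats.chi2.ppf(0.5 + conf / 2, df=n - 1)   # upper-tail quantile
sd_ci = (np.sqrt((n - 1) * s**2 / chi2_hi),
         np.sqrt((n - 1) * s**2 / chi2_lo))

print(f"mean = {xbar:.2f} ohms, s = {s:.2f} ohms")
print(f"95% CI for the mean:    {mean_ci[0]:.2f} to {mean_ci[1]:.2f} ohms")
print(f"95% CI for the std dev: {sd_ci[0]:.2f} to {sd_ci[1]:.2f} ohms")
[/code]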

Am I on the right track here?

Yes, it sounds like you are on the right track.

Bear in mind you also have repeatability to factor in.

If you measure a specific individual resistor 10 times, will you get the same value all 10 times? To what precision?

If you have 2 measuring tools (e.g. ohmmeters) and you measure a set of 10 specific individual resistors first with tool A then again with tool B, how will the results from tool A compare to tool B?

And what does that do to your numbers when in actual production sampling you’ll use each tool to measure 5 separate individuals and combine the two sets of 5 into one set of 10 for further analysis?

How do you keep tool A & B calibrated to a real world standard or to each other?

We have a doper who’s an expert in this area. I’m failing to recall his handle just now. I’ll be back if it comes to me.

You should be quite specific about the question that you are trying to answer. If it is only “What is a good estimate of the mean of the population?”, then you are on the right track.

Example: a population of resistors. Measure 10, get the mean and the standard deviation. Use that to make a confidence interval for the mean. A lot of people just use +/- 2 instead of the t distribution because it is easier.
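
To put a number on “easier”: here is a quick sketch (Python with SciPy) of how far the proper t multiplier sits from 2 at a few sample sizes. The values shown are just the standard two-sided 95% t critical values:

[code]
from scipy import stats

# Two-sided 95% t critical value versus the quick-and-dirty "2"
for n in (5, 10, 30, 100):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"N = {n:3d}: t = {t_crit:.3f}  (vs. 2)")

# N =   5: t = 2.776
# N =  10: t = 2.262
# N =  30: t = 2.045
# N = 100: t = 1.984
[/code]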

Most people don’t care too much about the estimate of the standard deviation. Yes you could use the chi square distribution to make a confidence interval for the standard deviation, but to what end?

What LSLGuy is describing has to do with gage repeatability and reproducibility (gage R&R). Too many organizations use something to take measurements, but they never check whether the measurement system (gage) is measuring what they think that they are measuring.

Which leads directly to GIGO errors. Hence my pointing the issue out.

Creating good output starts with creating good input, then using good processes to A) understand the real meaning of the questions you might ask, B) ask questions relevant to your real needs as actually understood, and C) answer the question you actually asked, not something else nearby.

Actually, even that’s sort of backwards. A & B are prerequisites which guide what data you gather, while C governs what you do with it once you D) have it, and E) understand its limitations.

The confidence interval is not the range that contains x% of your data. It is the range where you can be x% confident that the true population mean is within that range (e.g. point 3 in this pdf).

Your SDx2 works okay for quick calculations; a t-based confidence interval for the mean would be mean plus or minus t * s/sqrt(n), where t is about 1.96 for large samples and somewhat larger for small ones.
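
To illustrate how different those two ranges are, here is a rough sketch (Python with SciPy, ten made-up readings) comparing the SDx2 band to the t-based confidence interval for the mean:

[code]
import numpy as np
from scipy import stats

# Hypothetical readings, for illustration only
readings = np.array([499.2, 500.1, 500.8, 499.7, 500.3,
                     499.9, 500.5, 499.4, 500.0, 500.6])
n, xbar, s = len(readings), readings.mean(), readings.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)

# "Where individual parts fall" -- roughly 95% of the data if it is normal
spread_band = (xbar - 2 * s, xbar + 2 * s)
# "Where the true mean is" -- much narrower, since it shrinks with sqrt(n)
mean_ci = (xbar - t_crit * s / np.sqrt(n),
           xbar + t_crit * s / np.sqrt(n))

print(f"+/- 2s band:         {spread_band[0]:.2f} to {spread_band[1]:.2f}")
print(f"95% CI for the mean: {mean_ci[0]:.2f} to {mean_ci[1]:.2f}")
[/code]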

A t-test would be appropriate. It doesn’t sound like chi-squared would work for your data (ohms being a continuous variable), but correct me if I am wrong.

An important first meta-question is: What problem are you trying to solve by changing your methods?

Reporting twice the standard deviation the way you are isn’t too bad for most “pedestrian” applications. Is the cost of complicating the procedure worth it? What are you aiming to gain or improve?

The points raised up-thread are the tip of the iceberg. I’ve seen many cases where folks have foolhardily moved from “simple, approximate, and perfectly functional statistical analysis” to “massively intricate, technically highly accurate, yet still subject to GIGO so still approximate, but perfectly functional statistical analysis”. Thus: more complexity for no gain (and sometimes for negative gain since they may falsely assume they have a “better” answer.)

There may be reasons for you to move to a more precise approach (what are they?), and there may not be reasons to move, especially if the resulting precision is an illusion due to not having overhauled all the procedures enough. This is more than fighting the hypothetical, as how you go about improving things depends on what you are trying to accomplish by making a change.

Good question. And I am not sure I have a good answer. :frowning:

We perform failure analysis on hardware. Sometimes we take data. And when we take data, we provide the raw data along with the average and standard deviation.

When presenting or discussing the data with the customer, we imply that “±2s covers 95%.” As far as I know, the result isn’t really used for anything. But I am just concerned we are implying the wrong thing. I did some Google searches and discovered there are more “correct” methods for estimating population parameters.

So should we use more elaborate methods like the t-distribution, Chi-Squared, etc.? I don’t know. I guess I am just concerned that, sooner-or-later, someone is going to challenge our results, and (perhaps correctly) point out the misapplication of our current technique.

Take a look at these four distributions:

http://i.imgur.com/4psDfbk.gif

A = Gaussian (normal) distribution
B = something narrower and with wider tails
C = something with a high-side tail
D = something bi-modal

Let’s say you’re concerned with the weight of some widget. Does the production process lead to weights that are distributed as A, B, C, D, or other? My guess is that you typically don’t have that answer for your situation. Here’s how much it may or may not matter for these distributions. (Note that the scale of the x axis doesn’t matter in the above figure, just the “shape” of the distributions. You can slide and linearly rescale the x axis to set the means and standard deviations however you want.)

If you took a large number of samples from each case above and executed your current process, you’d be claiming that 95% of the widgets weigh within +/- 2s of the average. In reality, +/- 2s contains on average this much of the distribution:

A: 95.5%
B: 95.1%
C: 94.7%
D: 99.96%

So for three of these cases, claiming 95% may be perfectly fine for simple QA/QC purposes. (We’ll come back to D.) If you truly knew the underlying distribution, you could calculate a different version of “+/- 2s” to achieve exactly 95%. But a bigger effect is likely to be your small sample size. Taking Distribution A as an example, what if you only collect 50 samples? How much of the true underlying distribution is actually contained within your reported window? Here are ten sets with N=50 each, where I’m listing the actual percentage contained when you claim 95%:

A @ N=50: 95.9%, 94.1%, 93.8%, 93.0%, 96.5%, 94.1%, 95.0%, 94.6%, 95.0%, 94.6%

Maybe that’s good enough for your purposes. What about N=15?

A @ N=15: 96.1%, 96.4%, 91.6%, 95.5%, 93.1%, 93.7%, 94.7%, 94.9%, 86.8%, 98.7%

One of these is all the way down at 86.8%, meaning that in the long run there will be 2.6 times as many “failures” as you predicted for that case.
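
If you want to reproduce this kind of exercise yourself, here is a rough Monte Carlo sketch (Python with SciPy): draw N values from a standard normal, form the +/- 2s window from the sample, and compute how much of the true distribution actually falls inside it:

[code]
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed, just for repeatability

for n in (50, 15):
    coverages = []
    for _ in range(10):
        sample = rng.standard_normal(n)
        xbar, s = sample.mean(), sample.std(ddof=1)
        # True probability mass of N(0,1) inside [xbar - 2s, xbar + 2s]
        cov = stats.norm.cdf(xbar + 2 * s) - stats.norm.cdf(xbar - 2 * s)
        coverages.append(100 * cov)
    print(f"N = {n}: " + ", ".join(f"{c:.1f}%" for c in coverages))
[/code]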

There are similarly simple procedures for estimating the “uncertainty in the uncertainty” which would be able to warn a customer against this possibility in cases where too few samples are taken to get a sufficiently close estimate. (Saying this another way: the customer may have a tolerance they’re trying to meet, but what’s the tolerance on knowing if they’ve met the tolerance, when faced with finite sample size?)
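
One common textbook recipe along those lines is a two-sided normal tolerance interval, where the “2” gets replaced by a factor k that grows as the sample size shrinks. Here is a rough sketch using Howe’s approximation for k (it assumes roughly normal data, and exact tabulated factors differ slightly):

[code]
import numpy as np
from scipy import stats

def howe_k(n, coverage=0.95, confidence=0.95):
    """Approximate two-sided normal tolerance factor (Howe's method)."""
    z = stats.norm.ppf(0.5 + coverage / 2)
    chi2 = stats.chi2.ppf(1 - confidence, df=n - 1)
    return z * np.sqrt((n - 1) * (1 + 1 / n) / chi2)

for n in (10, 15, 50):
    print(f"N = {n:2d}: mean +/- {howe_k(n):.2f}*s to be 95% confident "
          f"of covering 95% of the population")

# N = 10: k ~ 3.38,  N = 15: k ~ 2.95,  N = 50: k ~ 2.38
[/code]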

Getting back to Distribution D: This one is also a perfectly reasonable distribution. Maybe half the widgets come from batch A and half from batch B, or maybe half are produced with Alice on duty and half with Bob on duty. The bimodality of the distribution gives a large standard deviation, and going out twice that distance captures essentially the entirety of the possible values, with only 1 out of every 2,500 falling outside a range that you claim should only miss 1 in 20. Whether that’s harmful or not depends on the context. (E.g., a manufacturer might budget for 5% wastage yet will only have 0.04% wastage, making that part of the budget off by a factor of 125. If waste disposal is expensive, that could be a large budget flaw.)
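
Here is a rough sketch of that bimodal situation using a made-up 50/50 mix of two “batches.” The exact percentage depends on the real shape of the distribution, but it shows how the between-batch gap inflates s so that +/- 2s swallows nearly everything:

[code]
import numpy as np

rng = np.random.default_rng(1)
n_parts = 1_000_000

# Hypothetical widget weights: half from batch A, half from batch B (grams)
batch = rng.integers(0, 2, n_parts)
weights = np.where(batch == 0,
                   rng.normal(10.0, 0.2, n_parts),
                   rng.normal(12.0, 0.2, n_parts))

xbar, s = weights.mean(), weights.std(ddof=1)
inside = np.mean(np.abs(weights - xbar) <= 2 * s)
print(f"s = {s:.2f} g; fraction inside +/- 2s = {100 * inside:.2f}%")  # well above 95%
[/code]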

Don’t forget testing variation either. A lot of times they’ll take 10 samples down to the lab and have some lab-coated expert run duplicate tests according to published methods, using a single meter calibrated with primary standards from the National Institute of Standards and Technology; then in the field they’ll have a half dozen different techs with a half dozen different meters running tests based on the shortcuts the last guy showed them. These tests won’t always agree with each other.