Testing whether sample data are normally distributed


Stan Doubt
04-10-2008, 04:52 PM
I am looking for some good practical guidelines for testing whether data are normally distributed. I have probably tended to err on the side of using nonparametric statistics. My usual approach is to plot a histogram, check for some known "gotchas", and then run the Shapiro-Wilk test built into my statistics package (currently STATISTICA v. 7), except when there are tied (more than one sample with the same result) or censored (below detection limit) data. I also usually try log-transforming the data to see if that makes a difference. Perhaps fortunately, I have not run across much data that appear normally distributed apart from the censored values.

A new project has guidelines that require us to test for normality, specifically citing the Anderson-Darling and Ryan-Joiner tests. Rather than continuing to push buttons and check boxes in a stats package, I'm trying to do this the right way and understand the underlying strengths and weaknesses of these methods. If it makes a difference, these are samples from groundwater monitoring wells.
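In R terms (which I have been meaning to learn anyway), I gather the equivalent workflow would be something like this, with x being a numeric vector of results; hist() and shapiro.test() are both in base R:

# visual check, then Shapiro-Wilk on the raw and log-transformed data
hist(x)
shapiro.test(x)           # H0: the data are normal
hist(log(x))
shapiro.test(log(x))      # H0: the logged data are normal (i.e., the data are lognormal)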

In researching this further, I have looked around online and found the suggested tests plus a few others that are not built into my stats package, including the D'Agostino-Pearson K-squared test. Some of these appear to be available as Excel add-ins or in R, which I have not used but have been meaning to learn.

Anderson-Darling: I have read differing recommendations about appropriate sample sizes for this test, and I'm not sure whether it should be used with censored data.
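If I do end up in R, the nortest package appears to implement this test as ad.test; usage would presumably be as simple as:

library(nortest)   # install.packages("nortest") first if needed
ad.test(x)         # Anderson-Darling test; H0: the data are normal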

The Ryan-Joiner test appears to be available in MINITAB, but I cannot find much more about it; I have not downloaded the trial software or tracked down a reference for the test in the primary literature yet.
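From what I can piece together, it is a correlation-type test along the lines of Shapiro-Francia: the statistic is essentially the correlation between the sorted data and the corresponding normal quantiles. The statistic itself (though not Minitab's critical values, which I don't have) is easy to sketch in R:

n <- length(x)
rj <- cor(sort(x), qnorm(ppoints(n)))   # probability-plot correlation
                                        # (ppoints' plotting positions may differ
                                        # slightly from Minitab's)
rj                                      # values near 1 are consistent with normality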

The D'Agostino-Pearson K-squared (omnibus) test seems very promising where there are tied data, since it is based on sample skewness and kurtosis rather than on the ordered values. I found an Excel add-in called SOLVERSTAT that purports to compute this statistic.
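It also appears to be available in R: if I'm reading the documentation right, the fBasics package provides it as dagoTest, which reports the omnibus K-squared statistic along with its separate skewness and kurtosis components:

library(fBasics)   # install.packages("fBasics") first if needed
dagoTest(x)        # D'Agostino-Pearson omnibus test, plus skewness and kurtosis parts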

ultrafilter
04-10-2008, 08:07 PM
The Jarque-Bera test (http://en.wikipedia.org/wiki/Jarque-Bera_test) comes highly recommended. But if you want to play around with R, here's a nice test to implement:

1. Standardize your data and compute some statistic. Anything other than the sum or sum of squares is good. I'd probably use the skewness or kurtosis (which don't come pre-defined in R, so you'll need to grab the package e1071).
2. Simulate 10,000 samples of the same size as your original sample from a standard normal distribution. For each of them, compute the same statistic.
3. Calculate what proportion of the simulated statistics are further from the mean of the simulated statistics than the statistic on your real data.

At the end, you have the p-value from a two-sided test, and you can treat it as you would any other p-value. If you want a one-sided test, you can modify step 3 appropriately.
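Putting that together, a minimal sketch using kurtosis from e1071 as the statistic (x is your original sample) might look like:

library(e1071)   # provides skewness() and kurtosis()

# Step 1: standardize the data and compute the statistic
z <- (x - mean(x)) / sd(x)
obs <- kurtosis(z)

# Step 2: simulate 10,000 same-size samples from a standard normal,
#         computing the same statistic on each
sims <- replicate(10000, kurtosis(rnorm(length(x))))

# Step 3: proportion of simulated statistics further from their mean
#         than the observed one; this is the two-sided p-value
p <- mean(abs(sims - mean(sims)) > abs(obs - mean(sims)))
p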