I am looking for some good practical guidelines on testing whether data are normally distributed. I have probably tended to err on the side of using nonparametric statistics. I usually plot a histogram, check for some known “gotchas,” and then run the Shapiro-Wilk test built into my statistics package (currently STATISTICA v. 7), except when there are tied (more than one sample with the same result) or censored (below detection limit) data. I also usually try log-transforming the data to see if that makes a difference. Perhaps fortunately, I have not run across much data that appear normally distributed once the censored values are set aside. A new project has guidelines that require us to test for normality, specifically citing the Anderson-Darling and Ryan-Joiner tests. Rather than continuing to push buttons and check boxes in a stats package, I’m trying to do this the right way and understand the underlying strengths and weaknesses of these methods. If it makes a difference, these are samples from groundwater monitoring wells.
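For concreteness, here is a minimal sketch of my current workflow in Python/SciPy (which I gather implements the same Shapiro-Wilk test as `scipy.stats.shapiro`); the concentration values are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical, untied, uncensored well concentrations (made up for illustration)
x = np.array([3.2, 4.1, 2.8, 5.0, 3.9, 4.4, 2.5, 6.1, 3.3, 4.8])

# Shapiro-Wilk on the raw data
w, p = stats.shapiro(x)
print(f"raw data: W = {w:.3f}, p = {p:.3f}")

# Same test after a log transform, since concentrations are often closer to lognormal
w_log, p_log = stats.shapiro(np.log(x))
print(f"log data: W = {w_log:.3f}, p = {p_log:.3f}")
```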
In researching this a bit further, I have looked around online and found the suggested tests plus a few others that are not built into my stats package, including the D’Agostino-Pearson K-squared test. Some of these appear to be available in Excel add-ins or in R, which I have not used but have been meaning to learn.
Anderson-Darling: I have read differing recommendations about appropriate sample sizes for this test, and I’m not sure whether it should be used with censored data.
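In case it helps anyone answer, SciPy also exposes this as `scipy.stats.anderson`; here is a sketch of how I would run it on complete (uncensored) data, again with made-up numbers:

```python
import numpy as np
from scipy import stats

x = np.array([3.2, 4.1, 2.8, 5.0, 3.9, 4.4, 2.5, 6.1, 3.3, 4.8])  # made-up values

result = stats.anderson(x, dist="norm")
print(f"A-squared = {result.statistic:.3f}")
# The test reports critical values at fixed significance levels rather than a p-value
for crit, sig in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > crit else "fail to reject"
    print(f"  {sig}% level: critical value {crit:.3f} -> {verdict} normality")
```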
The Ryan-Joiner test appears to be available in MINITAB, but I cannot find much more about it; I have not yet downloaded the trial software or tracked down a reference for the test in the primary literature.
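From what I have pieced together so far, the Ryan-Joiner statistic is essentially the correlation coefficient of the normal probability plot (closely related to Shapiro-Francia), so it seems I could compute it myself; a hedged sketch using `scipy.stats.probplot`, assuming that reading is right:

```python
import numpy as np
from scipy import stats

x = np.array([3.2, 4.1, 2.8, 5.0, 3.9, 4.4, 2.5, 6.1, 3.3, 4.8])  # made-up values

# probplot returns the ordered data vs. normal quantiles plus a least-squares fit;
# r is the probability-plot correlation, which (as I understand it) is what
# Ryan-Joiner compares against tabled critical values.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")
print(f"probability-plot correlation r = {r:.4f}")
```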
The D’Agostino-Pearson K-squared (omnibus) test seems very promising for use where there are tied data. I found an Excel add-in called SOLVERSTAT that purports to compute this statistic.
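SciPy appears to implement this one too, as `scipy.stats.normaltest`; a minimal sketch with made-up data that include ties, since ties are my usual problem:

```python
import numpy as np
from scipy import stats

# Made-up data with ties (repeated 3.2 and 4.1). The test combines skewness
# and kurtosis, so very small n triggers a warning; n of 20+ is generally suggested.
x = np.array([3.2, 3.2, 4.1, 4.1, 2.8, 5.0, 3.9, 4.4, 2.5, 6.1,
              3.3, 4.8, 3.7, 4.0, 5.2, 2.9, 4.6, 3.5, 4.2, 3.8])

k2, p = stats.normaltest(x)
print(f"K-squared = {k2:.3f}, p = {p:.3f}")
```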