How might I generate a data set that conforms to a certain mean, standard deviation, and an n (number of items in data set)?
I assume that you want to generate normal data, since that distribution is completely specified by the mean and sd. What software are you using? There is a data analysis add-in for Excel that can do it pretty quickly, and SAS and R (the latter is open source freeware) can do it fairly easily too.
Hm. Is using a plugin the only way?
The Excel add-in is free. Depending on your version, you might need the CD handy. Go to Excel Options > Add-Ins and select “Analysis ToolPak.” In Excel 2007 you get to Excel Options by clicking the round Office logo button in the upper left.
Once it’s installed, use the “Random Number Generation” tool.
No. Most software packages have a built-in function for providing uniformly distributed random numbers, and there are several methods for converting uniformly distributed random numbers to normally distributed random numbers.
I’ve used inverse transform sampling, but the Box–Muller transform and the Ziggurat algorithm are also available (Wikipedia has articles on all three).
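To make the uniform-to-normal conversion concrete, here is a minimal Python sketch of the Box–Muller transform. The function name `box_muller` and the mean of 10 and SD of 2 are only illustrative choices, not anything specified earlier in the thread.

```python
import math
import random

def box_muller(mean, sd, rng=random):
    """Return one normal draw built from two uniform draws (Box-Muller)."""
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is defined
    u2 = rng.random()
    # Standard normal value from the two uniforms
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    # Shift and scale to the requested mean and SD
    return mean + sd * z

random.seed(0)  # arbitrary seed, only so the run is repeatable
sample = [box_muller(10, 2) for _ in range(5)]
print(sample)
```

With only a handful of draws the sample mean and SD will wander around 10 and 2 rather than hit them exactly.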
I guess what I’m asking is whether there’s a relatively simple way to do it without software.
(if I have a small n; n < 6)
You might be able to get a book of random number tables - the CRC mathematics manuals still have those (I think - my copy of the CRC manual dates from 1982 or so)
Yes, you’ll need a table of random numbers before you do anything. The RAND Corporation put out a famous one years ago, “One Million Random Digits” if I’m not mistaken. Do you want data from a normal distribution?
Thank you, but no. I’m practicing doing an ANOVA analysis from a set of textbook questions in a book about statistical analysis of experiments (not homework), and I’m just trying to figure out how they calculated sum of squares for S/A without having the individual observations. They only give the mean, SD, the n, and the a (number of test groups), but I thought you needed to have the individual observations to do it. I’m really perplexed by this.
Then you do want normally distributed data: one of the assumptions underlying the standard ANOVA model is that your residuals are normally distributed. You can do it in Excel using the NORMINV function, where the parameters are RAND(), your mean and your standard deviation; that’s much simpler than any method not involving a computer.
How would I create the full data set though? I don’t see how to do that using NORMINV.
I think I understand what you’re trying to do. You’re trying to recreate the data set for which you only have summary statistics? Is that correct?
In that case, you won’t be able to get what you want for a couple of reasons. First, the summary statistics you have are just that, statistics, not parameters. The latter are what you want, but you can’t get those from just looking at the sample results. And even if you could, there is no guarantee that the data set would be exactly the same, point for point, as the one used to calculate the statistics you have. Just as the data are random (i.e., they vary), so are statistics random (i.e., they vary from sample to sample).
Yeah, that’s right. How weird. I’m unsure how they arrived at the answer in the back of the book without the observations.
I do this all the time – generating data for teaching purposes.
In Excel the formula is =round(norminv(rand(),10,2),1)
This gives a normally distributed data value with mean of 10, SD of 2 rounded to one decimal place. Alter as required.
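For anyone without Excel, roughly the same thing can be sketched in Python. This assumes Python 3.8 or later for `statistics.NormalDist`; the helper name `norminv_rand` is made up for this example.

```python
import random
from statistics import NormalDist

def norminv_rand(mean, sd, decimals=1):
    """Rough analogue of =ROUND(NORMINV(RAND(), mean, sd), decimals)."""
    p = random.random()
    while p == 0.0:  # inv_cdf needs 0 < p < 1, and random() can return 0.0
        p = random.random()
    # Inverse transform sampling: push a uniform draw through the inverse normal CDF
    return round(NormalDist(mean, sd).inv_cdf(p), decimals)

values = [norminv_rand(10, 2) for _ in range(6)]
print(values)
```

Each call plays the role of one worksheet cell holding the formula.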
I have used similar formulas to create data sets from rectangular, triangular, skewed, bimodal and other shaped distributions.
One thing to note is that you are in effect sampling from the distribution. There is no guarantee that your sampled data will have a mean and standard deviation exactly equal to the values you specify in your formula. But you can get close, and you can tweak your data afterwards if you really want.
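One possible way to do that tweaking, sketched in Python: standardize the draws, then rescale them so the sample mean and sample SD come out exactly as requested. The function name `exact_sample` is hypothetical, and the sketch assumes the SD you want to match is the sample SD (n - 1 denominator) and that n is at least 2.

```python
import random
import statistics

def exact_sample(mean, sd, n, seed=None):
    """Draw n normal values, then rescale so the sample mean and SD match exactly."""
    rng = random.Random(seed)
    raw = [rng.gauss(0, 1) for _ in range(n)]
    m = statistics.fmean(raw)
    s = statistics.stdev(raw)  # sample SD (n - 1 denominator)
    # Standardize each value, then stretch to the target SD and shift to the target mean
    return [mean + sd * (x - m) / s for x in raw]

data = exact_sample(10, 2, 5, seed=1)
print(data)
print(statistics.fmean(data), statistics.stdev(data))
```

Rounding the values afterwards will move the mean and SD slightly off target again, so round only if near-exact is good enough.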
You’re right, you need individual scores to calculate S/A using either method, as it looks at how each score deviates from its own group mean. However, you only need sums or means to calculate the between-subjects SS.
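As a small illustration of the means-only part, here is a Python sketch of the between-groups sum of squares computed from group means and group sizes. The function name `ss_between` and the three groups are invented for the example, not taken from the textbook problem.

```python
def ss_between(means, sizes):
    """Between-groups SS from group means and group sizes only."""
    n_total = sum(sizes)
    # Grand mean is the size-weighted average of the group means
    grand_mean = sum(n * m for m, n in zip(means, sizes)) / n_total
    # Each group contributes n * (group mean - grand mean)^2
    return sum(n * (m - grand_mean) ** 2 for m, n in zip(means, sizes))

# Made-up example: three groups of five scores each
print(ss_between([10, 12, 14], [5, 5, 5]))  # 40.0
```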
Could it be as simple as this?
Let “sum” denote the sum of the x values and “sumsq” denote the sum of the squares of the x values.
- You know the mean and n, and mean = sum / n. So you can solve for sum without knowing the individual x values.
- You know the standard deviation (sd), and sd = sqrt((sumsq - n * mean^2) / n)
So sumsq = n * sd^2 + n * mean^2
(I’m using ^ to denote raising to the power). So you can solve for the sumsq without knowing the individual x values as well.
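Here is a quick Python sketch of that algebra, checked against a small made-up data set. The `sample_sd` flag is my own addition: the formula as written divides by n, but if the book reports the SD with an n - 1 denominator (an assumption worth checking against the text), the first term becomes (n - 1) * sd^2 instead. Either way, sumsq - n * mean^2 is one group's within-group sum of squares, and adding those up across groups gives the SS for S/A.

```python
import statistics

def sum_and_sumsq(mean, sd, n, sample_sd=False):
    """Recover sum(x) and sum(x^2) from the mean, SD and n alone."""
    total = n * mean
    k = n - 1 if sample_sd else n  # denominator that was used to compute sd
    sumsq = k * sd ** 2 + n * mean ** 2
    return total, sumsq

def ss_within(sds, sizes, sample_sd=True):
    """SS for S/A: add up each group's sum of squared deviations."""
    return sum((n - 1 if sample_sd else n) * sd ** 2 for sd, n in zip(sds, sizes))

# Made-up scores, used only to check the algebra
x = [8, 9, 10, 11, 12]
mean, n = statistics.fmean(x), len(x)
print(sum(x), sum(v * v for v in x))                               # direct
print(sum_and_sumsq(mean, statistics.pstdev(x), n))                 # SD with n
print(sum_and_sumsq(mean, statistics.stdev(x), n, sample_sd=True))  # SD with n - 1
```

All three lines should agree up to floating-point rounding.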
The Amazon customer reviews are great.