I am just wondering what "significant difference" exactly means in statistics.
Suppose I used the dependent t-test on the scores of a pretest and a post-test.
(same respondents tested before and after, say, a program)
I arrived at a t value which is greater than or equal to the critical value from the t-table.
Therefore, I rejected the null hypothesis and I am led to the conclusion that there is a significant difference between the pretest and post-test scores.
Now what? What if there is a significant difference?
Does it just show that there is a difference? Or can I infer more from it?
Can I say that the program affected the scores just because there is a significant difference? How can I tell if it’s a positive or negative effect?
I’ve been self-studying statistics and the textbooks don’t say much about these questions. Hopefully you guys/gals can help. Thank you.
Yes, you have a statistically significant difference… so you’re confident (to the confidence level you chose) that there really is a change. The actual size of the ‘effect’ is the next thing you should be looking at.
If your result is significant, that means you have found convincing evidence that there in fact is a difference between the scores. Without further information, that’s all you can tell.
Assuming a significance level of .05, what this means is: if there were actually no difference in scores, a result like the one you observed (or a more extreme one) would have less than a 5% chance of being seen.
The positive/negative question depends on whether you performed a one-sided or two-sided test. A one-sided test checks for a change in a specific direction, say improvement; a two-sided test merely says the scores changed.
As a warning, a significant result does not mean that the result is meaningful. In other words, it just says there is a difference. It doesn’t say whether that difference is large enough to matter.
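Here’s a quick Python sketch of that one-sided vs. two-sided distinction, with made-up pre/post scores for ten respondents (the `alternative` argument needs scipy 1.6 or newer):

```python
# Paired (dependent) t-test on hypothetical pretest/posttest scores.
import numpy as np
from scipy import stats

pre  = np.array([52, 60, 48, 55, 63, 50, 58, 47, 61, 54])   # made-up scores
post = np.array([58, 64, 50, 60, 66, 55, 59, 52, 65, 57])

# Two-sided test: "did the scores change at all?"
t_two, p_two = stats.ttest_rel(post, pre)

# One-sided test: "did the scores increase?"
t_one, p_one = stats.ttest_rel(post, pre, alternative='greater')

print(f"two-sided: t = {t_two:.2f}, p = {p_two:.4f}")
print(f"one-sided: t = {t_one:.2f}, p = {p_one:.4f}")
```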
Thank you guys.
So a significant difference just tells you that there is indeed a difference.
Not that much information, especially if one is looking for the positive or negative effect of a program.
This question is the trickiest one. You might also, for instance, see the scores increasing just because they took the test a second time, without any influence from the program at all. To rule that out, you’d need to run a control trial without the intervening program. And depending on the precise setup, you might have other confounding variables that you’d also need to control for.
In your example, you gave a test to a group of people on two occasions. The average score on the test was different across the two occasions. This difference, however, could be due to chance. That is, the people might not have really changed on whatever it is the test is supposed to measure; the difference in scores might be the result of something else… for example, if the test is a multiple choice test, people might just have made more lucky guesses on one occasion.

The statistical procedure is meant to tell you how confident you can be that the difference in scores you observed reflects a real change over time in something (hopefully, the thing the test measures), rather than just chance variation. That is the nature of the inference that the test allows you to make.

When you perform the statistical procedure, the result is a “test statistic” with an associated “p-value”. The p-value is the probability that you would get a test statistic of that size or larger purely by chance, i.e. if there were really no change at all. So, the smaller the p-value, the more confident you can be that your result was not due to chance.
As the previous poster pointed out, the statistic doesn’t tell you what the change is due to, it only tells you that it was not due to chance. In general, it’s how the experiment was designed that tells you what the difference is due to.
If you’re just doing something simple like a t-test, then you can tell the direction of the difference just from the observed mean scores at each occasion. If the mean score at post-test is higher than at pre-test, and the t-test is significant, then you can say that the scores increased significantly.
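For example, using the same made-up numbers as in the sketch above, you can read the direction straight off the means:

```python
# Direction of a significant paired difference comes from comparing the means.
import numpy as np
from scipy import stats

pre  = np.array([52, 60, 48, 55, 63, 50, 58, 47, 61, 54])
post = np.array([58, 64, 50, 60, 66, 55, 59, 52, 65, 57])

t, p = stats.ttest_rel(post, pre)
print(f"mean pre  = {pre.mean():.1f}")
print(f"mean post = {post.mean():.1f}")

if p < 0.05:
    direction = "increased" if post.mean() > pre.mean() else "decreased"
    print(f"Significant result (p = {p:.4f}); scores {direction}.")
else:
    print(f"No significant difference (p = {p:.4f}).")
```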
This was insightful. So that’s what the 0.05 and 0.01 are all about. Just to confirm that I am understanding this right: suppose I chose my significance level to be 0.05 and I found out that there is a significant difference. Can I say that there is just a 5% probability that the change between the scores is due to chance? Right?
More or less. What you can actually say is that, if there were really no difference, there would be less than a 5% chance of getting an observed difference this large.
As to the likelihood that your observations were due to chance, that depends on the underlying likelihood that the alternative hypothesis is true. For example, if an experiment says that I can predict dice rolls better than chance with p < 0.05, I am still going to believe that it is more likely that I am just in the lucky 5% than that I am psychic.
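You can see that “lucky 5%” directly with a little simulation: generate data where the null is true by construction and count how often the paired t-test comes out significant anyway (all numbers below are purely illustrative):

```python
# If the null is true (no real change), about 5% of experiments still hit p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
n_respondents = 10
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    # Pre and post drawn from the SAME distribution: the null is true by construction.
    pre  = rng.normal(loc=50, scale=10, size=n_respondents)
    post = rng.normal(loc=50, scale=10, size=n_respondents)
    _, p = stats.ttest_rel(post, pre)
    if p < alpha:
        false_positives += 1

print(f"Significant results under a true null: {false_positives / n_experiments:.1%}")
# Expect something close to 5%.
```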
Cohen’s d is the usual number for this. For a dependent design it is d = (M1 - M2) / SD of the difference scores (as opposed to the SEM). For an independent design it’s d = (M1 - M2) / s_pooled (not s²_pooled!). There’s a quick sketch after the cutoffs below.
The number you get can be interpreted by:
d < .20 = no/minimal effect
d >= .20 but < .50 = small effect
d >=.50 but < .80 = medium
d >= .80 = large
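A quick sketch of the dependent-samples version, reusing the hypothetical pre/post scores from earlier and the cutoffs above:

```python
# Cohen's d for a dependent (paired) design: mean difference / SD of the differences.
import numpy as np

pre  = np.array([52, 60, 48, 55, 63, 50, 58, 47, 61, 54])
post = np.array([58, 64, 50, 60, 66, 55, 59, 52, 65, 57])

diff = post - pre
d = diff.mean() / diff.std(ddof=1)   # ddof=1 -> sample SD of the difference scores

if abs(d) < 0.20:
    label = "no/minimal effect"
elif abs(d) < 0.50:
    label = "small effect"
elif abs(d) < 0.80:
    label = "medium effect"
else:
    label = "large effect"

print(f"Cohen's d = {d:.2f} ({label})")
```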
Err… why? In a dependent design, the participants are essentially serving as their own controls. The pretest is how they (assuming humans) do before the experimental manipulation, and the post-test is how they do after it. All that would do is give you 3 pretests vs. 1 post-test. Error is much reduced compared to an independent design.
The pre and post scores could also represent the means of many, many trials. Statistics is probabilistic. You could do the same study 1000x but you don’t have the money or time. That is why you don’t say “this experiment proved x,” but you can say that it strongly supports the hypothesis.
Every experiment has potential confounds. Those should be controlled for beforehand, not after the experiment. And every experiment should assume that the chance of Type I error is real and try to minimize it, but you can only go so far toward “proving” the effect is real.
Alpha = .01, .05, or .10 is a choice you make at the outset. Most people use .05. While .01 may seem more appealing because it is stricter and reduces Type I error, Type II error goes up, e.g. you might fail to find an effect that actually exists. So .05 is a good compromise.
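A small simulation can make that tradeoff concrete: assume the program really does produce a modest gain (the effect size, n, and SD below are just illustrative assumptions) and compare how often a paired test detects it at .05 versus .01:

```python
# Power of a paired test at alpha = .05 vs .01, under an assumed real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 10_000
n_respondents = 15
true_gain = 4          # assumed real improvement from the program
sd_diff = 8            # assumed SD of the difference scores

detected = {0.05: 0, 0.01: 0}
for _ in range(n_experiments):
    diff = rng.normal(loc=true_gain, scale=sd_diff, size=n_respondents)
    _, p = stats.ttest_1samp(diff, 0)   # paired test = one-sample test on differences
    for alpha in detected:
        if p < alpha:
            detected[alpha] += 1

for alpha, hits in detected.items():
    print(f"alpha = {alpha}: power is roughly {hits / n_experiments:.0%}")
```

Tightening alpha to .01 lowers the false-positive risk but, with everything else fixed, you detect the real effect less often.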
As said, you can tell which direction the effect went (increase or decrease) simply by looking at the means.
By the way, how are you doing your tests? By hand, Excel, software package, etc.?
If you want to show that the program has an effect on test scores, you ought to be comparing the group who took the test before and after the program with a group who took the same tests at the same time with no program. As is, this experiment doesn’t have a control group.
The whole point of a dependent (aka repeated measures, aka paired) design is that that is unnecessary. If 10 participants get you sufficient power, then adding 30 more is probably just going to be a waste of resources. You could do that, but why? The pretest serves as the control, and we can see whether the experimental manipulation works; if there is no significant shift, then the null is retained. If this is done sometimes, it isn’t in any field I know of.
And if a “control” group really is necessary, then go ahead and run the usually inferior independent design… but then what is the point of the dependent design existing at all?
We’re talking about a one-group pretest-posttest design, right? This is a quasi-experimental design that is subject to artifacts such as history and the testing effect. These artifacts are typically removed by doing a pretest-posttest control group design, which is exactly what Chronos and ultrafilter are proposing. All this is pretty basic research design stuff, really. See Wikipedia.
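If you do go to a pretest-posttest control group design, one simple (though not the only) way to analyze it is to compute each person’s gain score and compare the two groups’ gains with an independent t-test. Everything below is made-up data, just to show the shape of the analysis:

```python
# Pretest-posttest control group design, analyzed via gain scores.
import numpy as np
from scipy import stats

# Treatment group: took the program between pretest and posttest.
treat_pre  = np.array([50, 55, 48, 62, 53, 58, 49, 60])
treat_post = np.array([57, 61, 52, 68, 59, 63, 55, 66])

# Control group: took the same tests at the same times, no program.
ctrl_pre  = np.array([51, 54, 47, 61, 52, 57, 50, 59])
ctrl_post = np.array([53, 55, 49, 62, 54, 58, 51, 61])

treat_gain = treat_post - treat_pre
ctrl_gain  = ctrl_post - ctrl_pre

t, p = stats.ttest_ind(treat_gain, ctrl_gain)
print(f"mean gain: program = {treat_gain.mean():.1f}, control = {ctrl_gain.mean():.1f}")
print(f"t = {t:.2f}, p = {p:.4f}")
```

If the program group’s gain is significantly larger than the control group’s, that is much stronger evidence that the program itself (and not retesting, history, etc.) produced the change.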
Thank you for the replies, guys. I’m just self-studying these things. I’ve had a course on elementary stats before, but that was just about mean, median, mode, SD, etc. I believe these kinds of tests will be the ones useful for research, so I’m trying to study them now and make meaning out of textbooks and the internet.
So..
Suppose I conducted a pretest, then the program, then a post-test. Same people.
By doing the t-test I will be able to tell if there is a significant difference.
Then by looking at the means of the scores from the tests, I will be able to say if the change was positive or negative.
I will not be able to confirm, though, whether it was the program that caused the change.
For that I’ll need two groups, one with the program and one without.