Can you help me understand the statistical concepts in this depression study?

I’m looking at a meta-analysis of depression treatments, in which they compare the effects of drugs vs talk therapy for depressed patients across the world.

I’m having trouble understanding one particular part of the abstract (and repeated in the results):

(emphasis mine)

I’ve only had one very basic stats course in college, and Wikipedia only helped a little. Is this a correct understanding:

We looked at a bunch of different studies. At first glance, it seems like no particular treatment is better or worse for depression in general. But for people who suffer from OCD, talking works a lot better than drugs. For people who suffer from dysthymia, we initially thought that drugs are more effective (this 0.30 “g” number, which is apparently some statistical voodoo called Hedges’s g). However, once we adjusted the studies for confounding variables using something called “multivariate meta-regression” (I have no idea how this works, except that they use more variables on curves to try to make dots line up or something…), the effect disappears.
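From what I could piece together on Wikipedia, Hedges's g is apparently not voodoo at all: it's just the difference between the two group means divided by their pooled standard deviation, times a small-sample correction. A rough sketch with made-up symptom scores (nothing from the actual study):

```python
from math import sqrt
from statistics import mean, stdev

def hedges_g(a, b):
    """Hedges' g: difference in means, in pooled-standard-deviation units,
    shrunk slightly by a correction factor for small samples."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                     / (na + nb - 2))
    d = (mean(a) - mean(b)) / pooled_sd      # Cohen's d
    j = 1 - 3 / (4 * (na + nb - 2) - 1)      # small-sample correction
    return d * j

# Made-up improvement scores for two hypothetical treatment groups:
drugs = [5, 6, 7, 8]
talk = [3, 4, 5, 6]
print(round(hedges_g(drugs, talk), 2))  # about 1.35 for these toy numbers
```

So a g of 0.30 in the paper would mean the group means differ by about a third of a standard deviation, which is conventionally read as a smallish effect.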

My questions, specifically:

  1. How exactly does a multivariate meta-regression analysis work?
  2. If the purpose of any analysis is to determine the effects, or at least correlation, between variables, and meta-regression is a better way to do this, why aren’t all studies meta-regressions?
  3. I never understood “statistical significance” despite my stats class. All I got was that if you plug certain numbers into a certain math formula, sometimes the result comes out significant and sometimes not, but insignificant doesn’t mean false/wrong/invalid and significant doesn’t mean true/correct. So for the purposes of this study, when an effect ceases to be statistically significant, does that mean - practically - that it has no noticeable effect on patients over pure chance, kinda like a placebo without the placebo effect?
  1. Running a regression means adjusting the data for a certain variable. Say you want to find out if a pill makes kids grow taller. You have one group that takes the pill and another that does not. Then you find out that the group that took the pill contains more males, and since males grow taller than females, you want to adjust for that. So you run a regression by taking the excess of males in the one group and adjusting for how much taller males typically grow. Then you do the same thing for other variables; doing it for more than one variable makes it multivariate. “Meta” means they are doing it on other people’s already-published studies rather than running a whole new study.
  2. The more regressions you run, the more the data gets massaged, and the less powerful your study is in terms of significance. It is hard to find data and expensive to have lots of subjects, so you do the best you can with the samples you have. And all studies can’t be meta-studies, since someone has to produce the original data sets.
  3. Statistical significance does not mean true or false; it just means that whatever effect is going on could or could not be distinguished from chance by that particular study. The more subjects you have, the smaller the changes need to be in order to be detected. If you get enough subjects and enough studies, then you can get close to a true-or-false designation, depending on the size of the sample versus the size of the population.
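To make the “adjusting” in point 1 concrete, here’s a toy simulation (entirely made-up heights and group sizes): the pill does nothing at all, but the pill group happens to contain more males, so a naive comparison shows a gap that mostly disappears once you compare like with like within each sex and average the two gaps:

```python
import random
random.seed(1)

# Hypothetical data: the pill does NOTHING, but males average 10 cm taller,
# and the pill group has more males - so a naive comparison looks like an effect.
def height(male):
    return 175 + (10 if male else 0) + random.gauss(0, 5)

pill = [(m, height(m)) for m in [True] * 70 + [False] * 30]     # 70% male
no_pill = [(m, height(m)) for m in [True] * 30 + [False] * 70]  # 30% male

def mean(xs):
    return sum(xs) / len(xs)

# Naive comparison: just compare the two group averages.
naive = mean([h for _, h in pill]) - mean([h for _, h in no_pill])

# "Adjusting for sex": compare within each sex, then average the two gaps.
adj = mean([
    mean([h for m, h in pill if m]) - mean([h for m, h in no_pill if m]),
    mean([h for m, h in pill if not m]) - mean([h for m, h in no_pill if not m]),
])
print(round(naive, 1), round(adj, 1))  # naive gap is roughly 4 cm; adjusted gap near 0
```

A real regression does this adjustment for continuous variables too, but the idea is the same: remove the part of the difference that the confounding variable explains.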

Thank you. That clears it up wonderfully :slight_smile:

So 3) Even if a study produces statistically significant results, it only has greater validity if it samples a great enough portion of the population? How is that usually measured, and what is a “big enough” sample size?

Here’s the simple way I think about statistical significance (p value). It’s basically a measure of how likely it is that the result you see could have been due to random chance. A p value of 0.1 means that if you did the experiment over and over, you’d expect to see that much difference - or more - 10% of the time just due to random chance. Therefore, a low p value means that there’s probably something going on that’s not just random chance.

As for sample size, think of it this way: if I flip a coin ten times and get 7 heads, there’s a decent chance that that was just luck - I wouldn’t conclude that the coin is weighted. But if I flip it ten million times and get 7 million heads, then I could be quite certain that the coin is unfair. More replicates = more power, where power is the ability to correctly detect a difference between possibility A and possibility B.
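The coin numbers above can be checked directly with the Python standard library; this is just the binomial tail probability (a one-sided test, taking a fair coin as the null):

```python
from math import comb, sqrt

def p_at_least(n, k, p=0.5):
    """One-sided p-value: chance of k or more heads in n flips of a p-coin."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 7 heads in 10 flips: about 0.17 - entirely unremarkable for a fair coin.
print(p_at_least(10, 7))

# 7 million heads in 10 million flips is another story. The exact sum is
# impractical, but the normal approximation shows how extreme it is:
n, k = 10_000_000, 7_000_000
z = (k - n * 0.5) / sqrt(n * 0.25)
print(z)  # over a thousand standard deviations above fair: p is effectively zero
```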

I should also add that mine is a pretty superficial understanding of statistics. It’s enough for me to get the job done, but I’m no mathematician. I’m sure others can provide more detailed, more correct explanations.

Got it. But a very low p value doesn’t necessarily mean the results are “correct”, right, just that they’re unlikely to have arisen by chance? There could’ve been other unaccounted-for factors, confounding variables, sampling errors, blah blah?

OK, got it. But when is enough, enough? If it was 7000/10000 flips? 700? 70? How do you know when you need to stop, and how big your sample size SHOULD be?

Yes!

The way the p-value really works is that you set up a model of the “null hypothesis”, which describes the distribution of the data if the thing you are trying to test turned out to be false. An example of a null hypothesis might be that the coin has a 50% chance of coming up heads, or that there is no difference between drugs and talk therapy. The p-value then gives the likelihood that you would get the kind of data you got, assuming that model was correct. If the p-value is low, then it was unlikely that you could have gotten these results if the null model were correct, so you conclude that the null model is probably wrong. Then you usually conclude that it was wrong because the coin didn’t have a 50% flip rate, or because there was a difference between talk and drugs, etc. However, there could also be other problems with the null model. It could be that the model assumed the data points were independent when in fact there were dependencies between them, or that there was a bias in sampling, etc., and any of these could cause the unusual data that was observed. So it really comes down to carefully constructing a null model that fails if and only if the thing you are trying to prove is true.

This gets into what is known as study design, which can be quite complicated, but basically falls into three categories:

  1. Unplanned: You take the data you have and analyze it. You then run your test, and if you get a significant p-value, you know you had enough samples. If you don’t, you conclude either that you didn’t have enough samples or that the difference you were testing for was too small. You may even be able to reverse your hypothesis and say something like, “we can demonstrate that no difference greater than x exists”

  2. Fixed sample size design: Here you make an assumption about the effect size you want to detect and design your sample size so that you are likely to be able to detect it. An example would be: “Assuming that the coin is biased to produce heads at least 60% of the time, 173 coin flips is enough that 80% of the time we should be able to reject the possibility of the coin being fair with a p-value < 0.05.” The 80% in that sentence is called the power of the test. You then run the test and see if you get a significant result. If you don’t, you can be fairly (80%) confident that the actual difference is smaller than the effect size (60%) the study was designed to detect.

  3. Adaptive: This is like the fixed sample size design, but you incorporate the idea that you may stop early if you find you already have enough samples to reject the null hypothesis; for example, if 10/10 coins come up heads, you may stop flipping and conclude that the coin is biased. The catch is that, since you are really running a whole series of tests as you go along, you are more likely to run into one that looks significant just by chance, so the repeated testing has to be modeled as part of calculating your p-value. As such, the statistics can get extremely complicated.
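The 173-flip claim in point 2 can be checked exactly with a short script: find the smallest head count that rejects “fair” at p < 0.05, then ask how often a 60%-heads coin reaches it (this toy check gives a power a bit above 80%):

```python
from math import comb

def p_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n = 173
# Smallest number of heads that would reject the fair-coin null at the 5% level:
k_crit = next(k for k in range(n + 1) if p_at_least(n, k, 0.5) <= 0.05)

# Power: how often a 60%-heads coin produces at least that many heads.
power = p_at_least(n, k_crit, 0.6)
print(k_crit, round(power, 2))
```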

The problem with meta-studies is that, since they use whatever data is available rather than collecting it in a controlled manner, there is a high likelihood of covariates that aren’t accounted for in the null model. For example, the way you would really like to do this study is to take a group of patients and, as they come in, randomly assign them to talk therapy or drugs and see how they do. With this design the only systematic difference between the two groups is how they were assigned. Everything else is just random, and so doesn’t really need to be taken into account.

But in a meta-study you may have one study done in a nursing home and another done on a college campus, where the baseline outcomes of the two groups may be entirely different, plus a third study that was done in an inner-city hospital for another purpose and has a strong imbalance between the number receiving drugs and the number receiving talk therapy. As you can imagine, this can get very messy, and it is difficult to prove that everything has been taken into account in the null model. So meta-studies are pretty much always seen as not as good as designed studies. The one advantage they do have is that they tend to have very large sample sizes.


Also, too late to edit: the sentence “we can demonstrate that no difference greater than x exists” should also have a p-value associated with it, so something like

“we can demonstrate that no difference greater than x exists (p=0.05)”

Buck already addressed all this nicely, but I’ll just add my perspective. For my work (biology), the cutoff at which a result is considered “significant” is totally arbitrary, but for most purposes, it’s generally set at p<0.05 - in other words, only a 5% chance that the difference is due to chance.

It’s often pointed out that this means that, on average, we should expect about 1 out of every 20 published experimental results to be wrong, which is sort of true in theory. In practice, however, there are other factors:
- published p-values are often much lower than 0.05.
- most publications require, and most peers will only believe, results that are confirmed with two or more experimental approaches. If you get a good p-value using two completely independent experiment types, odds are good that your result is real.

Most of my work takes the first approach Buck mentioned: test as many samples as time/money/importance of the experiment warrants, and see if you get a significant result. In my case, all it costs is the virtually free labor of one poor little grad student, so we don’t worry too much beforehand about how big our sample size should be. If it turns out we really need a larger sample, then the poor little grad student gets to repeat the experiment on more flies.
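One caveat about the test-then-add-more-samples approach: if you check for significance repeatedly as the data comes in, the false-positive rate creeps well above the nominal 5%, which is why the adaptive designs Buck mentioned need special statistical handling. A toy simulation with a fair coin (hypothetical look schedule, nothing from real data):

```python
import random

def peeking_trial(n_flips=100, looks=(20, 40, 60, 80, 100)):
    """Flip a FAIR coin, but check for 'significance' at several interim looks.
    Returns True if any look would have (naively) rejected the fair-coin null."""
    heads = 0
    for i in range(1, n_flips + 1):
        heads += random.random() < 0.5
        if i in looks:
            # crude two-sided z-test at each interim look
            z = abs(heads - i / 2) / (0.5 * i ** 0.5)
            if z > 1.96:  # nominal 5% threshold
                return True
    return False

random.seed(0)
trials = 10_000
false_pos = sum(peeking_trial() for _ in range(trials)) / trials
print(false_pos)  # well above the nominal 0.05
```

Even though each individual look uses a 5% threshold, the chance of at least one of the five looks crossing it by luck is much higher than 5%.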

However, there are “power calculations” that can give you an idea of what sample size you need to pick up some known difference size at a certain confidence level, if you’re into that sort of thing.

This site has a good sample size calculator and basic explanations of the variables.

This is also the wrong interpretation for other reasons. Despite the tendency to think of it this way, to say that p=0.05 means there is a 95% chance that the hypothesis is correct is not, strictly speaking, right. The p-value gives the probability that the results you got could occur by chance if your null hypothesis were correct.

It would only be true that this represented a 95% likelihood of the alternative being correct if there were, a priori, a 50% chance that the hypothesis was correct. In reality, depending on the state of the science, it could be that none of the hypotheses on the shelf are correct and only the lucky 5% were published. Or it could be that they are all correct but some didn’t have the power to get below 0.05.

Obligatory XKCD to illustrate this point.

If you know the power you want and the population you have, then you plug those into a formula and you get the sample size you need.
Most depression studies don’t work that way, since they involve people filling out surveys and are expensive and time-consuming. So what they do instead is plug their sample size, the population size, and their results into the formula, and it tells them the chance that their results were caused by chance. p should be 0.05 or less, but if you are doing a preliminary study, or something that is really time-consuming to test, you might be able to publish with a higher p-value, depending on the discipline.