This question has been floating around in my brain for a while, and most recently was sparked by this thread where we had a discussion about a water taste test. There are some nice points made there about statistical and testing techniques.

My question is: when no statistical significance is found for a result (i.e. there was no difference in joint pain relief between those who took glucosamine and those who got a placebo), does that really mean that the treatment (glucosamine) had no actual effect for anyone? Is it possible that there were some test subjects who really did get a result (relief) but that they (their “type”–those who can get such an effect) were not present in statistically relevant numbers to affect the statistical results?

I have the feeling that this question is not as clear as I’d like it to be, but I’m hoping someone will understand and have the answer.

Absolutely. "Absence of evidence is not evidence of absence" is the point that is particularly relevant here. Experimental design and statistics are very strong on the idea that inconclusive results do not prove much at all. In fact, it is very hard to disprove something using the scientific method.

Imagine that scientists descend on a town that they believe has 100 cases of a severe, epidemic disease. They administer their treatment; 98 of the people die and two have a full recovery.

That doesn’t sound like a very good treatment to write a successful journal article about. Now flash forward five years, when forensic evidence shows that the 98 people who died had a different variant of the disease, and that the researchers’ treatment was nearly 100% effective against the variant the two recovered patients had.

This type of thing happens all the time in science and medicine. Experimental theory is crucial to the design of good experiments, and the general public and the media tend to have a very poor understanding of how that works.

I can give you an extremely brief introduction to experimental statistics. Let’s say you did an experiment with two groups; one got a drug and one got a placebo. Now it is time to run the statistics to compare the two.

Experimental statistics dictates that your adversary is something called the null hypothesis. The null hypothesis states that the outcomes of the two groups were equal. It is your job as a researcher to show that there was a difference between the groups.
There are many standard statistical tests accepted by academia, and software packages do a good job of running them these days. When you run your results, you are looking for the difference you expected (people got better with the real treatment), but you are also looking for confidence that the result is real. Any two random bunches of data will likely show some differences. To say that the null hypothesis is likely false, you generally need 95% confidence in that result to publish in most reputable journals; higher confidence is better. Statistical tests can report both the difference between the groups and the confidence level that the difference reflects a real effect rather than chance. Note that roughly one experiment in 20 (5%) will show that snake oil X is better than snake oil Y by chance alone. That is where repeated experiments come into play.
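To make that concrete, here is a minimal sketch of one such test, a two-sided permutation test, on entirely made-up pain-relief scores (the numbers and group sizes are hypothetical, chosen only for illustration). The idea is exactly the "any two random bunches of data will differ" point: we shuffle the group labels many times and ask how often chance alone produces a difference as large as the one we saw.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

def permutation_test(treated, control, n_perm=10_000):
    """Two-sided permutation test on the difference of group means.

    Under the null hypothesis the group labels are interchangeable,
    so we reshuffle them many times and count how often a random
    relabeling yields a difference at least as extreme as observed.
    """
    observed = statistics.mean(treated) - statistics.mean(control)
    pooled = treated + control
    n = len(treated)
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm

# Hypothetical pain-relief scores (higher = more relief).
drug    = [5.1, 6.0, 4.8, 5.9, 6.3, 5.5, 4.9, 6.1]
placebo = [4.2, 5.0, 4.6, 4.1, 5.2, 4.4, 4.8, 4.3]

diff, p = permutation_test(drug, placebo)
print(f"observed difference: {diff:.2f}, p-value: {p:.4f}")
```

A small p-value here (well under 0.05) is what a journal would read as rejecting the null hypothesis at the 95% confidence level.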

Let’s say that your highly effective treatment for melanoma shows a difference, but only at the 92% confidence level. What does that mean? It means that you have to do another experiment, because journals will not publish that on principle. Your experimental design is likely flawed in some way, and you need to better identify the people who would benefit from the treatment and then divide them into real-drug and placebo groups. Prior experimental design flaws may mean that you selected people who weren’t appropriate for the treatment at all in the first experiment.

This type of question is coming up a lot more in clinical trials. With genetic testing in general becoming cheaper and more common, it is at least theoretically possible to pre-select your patient population to include only those who (for example) have a particular variant of a gene that you suspect will make them more likely to experience a beneficial effect from your drug, or to experience fewer side effects, or whatever.
Another factor to be aware of is the makeup of the study population with respect to “racial” and/or ethnic background. It is a known fact that the prevalence of various alleles differs between ethnic groups, and this can have a bearing on the effectiveness, or lack thereof, of some drugs. Well-designed studies take this into account so that variable effectiveness between ethnic groups will be obvious if it is present.

The 5% (significant) and 1% (highly significant) rules for statistical testing date from the early days (RA Fisher and friends), and are not magical. There is nothing special about the fractions 1/20 or 1/100 … people like them, they’re easy to explain, they provide a standard level of significance.

The basic, usually unstated, assumption in reports of study results is that the study planning was correct with regard to sample size. Power is the ability of a design to detect a difference in response between groups. Pre-study planning (power/sample-size planning) is done before the study, using hypothesized levels of response, to determine the minimum number of subjects required to detect such a difference.
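As a sketch of what that planning calculation looks like, here is the standard normal-approximation formula for a two-sided, two-sample comparison. The effect size, alpha, and power values below are just the conventional defaults, not from any particular study; an exact t-based calculation would give a slightly larger answer.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sided,
    two-sample test, via the normal approximation:
        n = 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2
    where d is the standardized effect size (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # e.g. 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A "medium" effect (d = 0.5) at the usual 5% level and 80% power:
print(n_per_group(0.5))  # 63 per group
```

Note how quickly the required sample grows as the expected effect shrinks: halving the effect size quadruples the number of subjects, which is exactly why underpowered studies so often report "no significant difference."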

The proper scientific perspective towards the design is neutrality: one doesn’t “root against the null”. From a conservative clinical standpoint, one should, if anything, favor the null hypothesis.

The tests involve both a null hypothesis and an alternative hypothesis. It is important that the appropriate choices are made in this selection.

Tests have prices of admission with regard to admissible designs and the underlying structure of the data. Violations of those assumptions may render the usual interpretation of the tests misleading, distorted, or incorrect.

The number of tests performed, and the relationships among those tests, may require adjustment of the raw test p-values.
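The simplest (and most conservative) such adjustment is the Bonferroni correction, sketched below on made-up p-values for four hypothetical endpoint comparisons: each raw p-value is multiplied by the number of tests, so a result that looked significant on its own may no longer clear the bar.

```python
def bonferroni(p_values):
    """Bonferroni correction: with m tests, multiply each raw
    p-value by m (capped at 1), then compare to the usual alpha."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.01, 0.04, 0.03, 0.20]      # four hypothetical comparisons
adjusted = bonferroni(raw)

# Only the first comparison survives adjustment at alpha = 0.05:
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Less conservative procedures (Holm, Benjamini-Hochberg) exist, but the point is the same: run enough comparisons and some will cross the 5% line by chance, so the raw p-values cannot be taken at face value.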

The selection of the endpoints (the bases for comparison between groups) is important, and the study is generally planned around a single endpoint (the primary endpoint), with a few others of interest (secondary or tertiary endpoints). Use of a posteriori endpoints (endpoints that weren’t part of the planning) is generally reserved for hypothesis generation, unless a post hoc power analysis establishes the basic power of the study with respect to that endpoint.

Even in cases where a significant difference is found, the actual difference in the endpoint might not be clinically meaningful. One must compare the scale of the difference to clinical considerations.

The basic summary result for a test is the p-value. The p-value is a conditional probability: the probability, over hypothetical replications of the study, of observing a difference at least as extreme as the one we actually observed if the null hypothesis is true. P-values sufficiently close to zero suggest that the observed difference would be highly unusual if the null hypothesis were true. This is where the 5% and 1% values come in, as arbitrary standards.
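That replication framing can be checked directly by simulation. The sketch below (group size, trial count, and the known-variance z-test are all choices made for illustration) runs thousands of "experiments" in which the null hypothesis is true by construction, and counts how often a test at the 5% level nonetheless declares a difference: by design, about one time in twenty.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(1)
norm = NormalDist()
n, trials, alpha = 30, 2000, 0.05

# Both groups are drawn from the SAME distribution, so the null
# hypothesis is true and every "significant" result is a false positive.
false_positives = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (mean(a) - mean(b)) / math.sqrt(2 / n)  # z-test, known sigma = 1
    p = 2 * (1 - norm.cdf(abs(z)))
    if p < alpha:
        false_positives += 1

rate = false_positives / trials
print(f"false positive rate: {rate:.3f}")  # hovers around 0.05
```

This is the "one experiment in 20 will show snake oil X beats snake oil Y" point in executable form, and why a single barely-significant result is weak evidence on its own.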

Clinical and logistic considerations can damage a trial: imprecise measurement or misclassification of clinical endpoints; faulty randomization of subjects; excessive or imbalanced drop-out of subjects; non-compliance of subjects with the study protocol; breaking of the study blinds by clinical personnel or study subjects; insufficient follow-up time.

Finally, the study populations used are generally not suitable for easy generalization to the general population. Study eligibility is generally restricted to allow an optimally “clean” interpretation of results. In the current model, Phase IV (post-license) studies serve as a check on the safety of licensed medications as they are used in general populations.

So there was a recent study (the link over here is now broken, unfortunately) that found that only those with the worst pain got any benefit, and only from the combination of Glucosamine + Chondroitin. So I stopped taking my G+C to see what would happen. I sure notice the difference (it’s my thumbs that hurt–genetic or years of guitar playing?). Does the study suggest that I’m being taken in by the placebo effect, or does the stuff really help me? Is there any way to know for sure? Is it possible that taking the supplements will prevent/minimize further damage? It seems clear that nothing will repair the damage already done.