Alex_Dubinsky, sample size is important, sure, but let’s zero in on an actual test design. How should that test be designed, and what would constitute a success or failure?
First, we need to define what is being tested. Just one thing at a time. And let’s pick one claim of the proponents that you feel has a chance of being true.[ol][li]Define just one claim,[/li][li]How can we test it, and[/li][li]What result would constitute a success or failure?[/li][/ol]Let me try with a very hypothetical example. Claim: ear candling improves hearing. Procedure: have N randomly chosen subjects take an audiometer test, then be subjected to an ear candling procedure, then take the audiometer test again. A similar number of controls will do the same but omit the ear candling part (or be subject to a “fake” procedure, whatever that might be).
Now, before this test starts, what amount of hearing improvement would you require as proof that the candling had a positive effect? 10%? 50%? And how many subjects should have this improvement? And would the hearing ability measured be a composite of all frequencies or just one? If one subject had a decrease in hearing acuity, how would that be taken into account?
Let’s take another example. Claim: ear candles remove ear wax. Procedure: divide the patients randomly into two groups and have each patient examined by a doctor for earwax levels. (I don’t know how docs define amount of buildup, but there must be some scale, say 1-10). Half of the patients undergo ear candling, the other half, none (or a fake procedure). Then re-examine all ears without the docs knowing which group a patient is from.
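To make the random split and the "docs don't know which group" part concrete, here is a rough sketch. The function names, patient labels, and seeds are all made up for illustration; it only shows one way the bookkeeping could work, not how any real trial was run.

```python
import random

def assign_groups(patient_ids, seed=None):
    """Randomly split patients into a 'candled' half and a 'control' half."""
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    assignment = {pid: "candled" for pid in shuffled[:half]}
    assignment.update({pid: "control" for pid in shuffled[half:]})
    return assignment

def blinded_codes(assignment, seed=None):
    """Give every patient an opaque code so the examiner never sees the group."""
    rng = random.Random(seed)
    numbers = list(range(1, len(assignment) + 1))
    rng.shuffle(numbers)
    return {pid: f"P{num:03d}" for pid, num in zip(assignment, numbers)}

if __name__ == "__main__":
    # Hypothetical patient labels, for illustration only.
    patients = [f"patient_{i}" for i in range(1, 21)]
    groups = assign_groups(patients, seed=42)
    codes = blinded_codes(groups, seed=7)
    # The examining doctor gets only the codes; the key linking code to
    # group stays with someone who does not do the examinations.
    for pid in patients[:3]:
        print(codes[pid], "(group hidden from examiner)")
```

The point of the separate code table is that the person scoring the earwax works from the codes alone, and the group labels are only joined back in after all scores are recorded.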
If the patients who underwent ear candling end up with 30% less wax on average, would that be a successful result? If not, what value would you use?
Remember that an improvement in such a test would have to be statistically significant, and the success criteria must be defined in advance. So a 3% improvement, which could easily be measurement noise, would not be scientifically convincing, and fishing for a positive result in the stats after the test would not be the way to go, unless you want to be laughed at.
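Here is a minimal sketch of what "defined in advance" plus "statistically significant" could look like in practice. The 30% threshold and the 0.05 cutoff are placeholders I picked for illustration, the data are invented, and the permutation test is just one reasonable choice of significance test, not the only one.

```python
import random

# Hypothetical success criteria, written down BEFORE any data are collected.
MIN_EFFECT = 0.30  # candled group must beat controls by at least this much
ALPHA = 0.05       # significance cutoff

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(treated, control, n_perm=10_000, seed=0):
    """One-sided p-value: how often do randomly shuffled group labels give
    a difference in means at least as large as the one observed?"""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)

def is_success(treated, control, min_effect=MIN_EFFECT, alpha=ALPHA):
    """Success needs BOTH a big enough effect and a small enough p-value."""
    effect = mean(treated) - mean(control)
    return effect >= min_effect and permutation_p_value(treated, control) < alpha

if __name__ == "__main__":
    # Invented numbers: fractional wax reduction per patient.
    candled = [0.40, 0.50, 0.45, 0.55, 0.50, 0.60, 0.45, 0.50]
    control = [0.05, 0.10, 0.00, 0.08, 0.02, 0.10, 0.05, 0.03]
    print("success:", is_success(candled, control))
```

Because both numbers are fixed up front, a result like "3% better than controls" fails the first check no matter how the p-value comes out, and there is no room to go hunting for a friendlier threshold afterward.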
Also remember that in the above examples, we have not controlled for the placebo effect, so your proposed improvement values must be improvements over the control group, not just over no test at all. (It’s entirely possible that all test subjects show an improvement because of the psychological factor that accompanies them dudes in the white coats.) Alternatively, and best of all, we would use 3 groups: ears that are candled, ears that are “fake” candled, and ears that have nothing done to them at all (active ingredient, placebo, controls).
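With three groups, the arithmetic for separating the placebo effect from the candling effect is simple. A small sketch, with a made-up function name and invented numbers, assuming each value is one subject's measured improvement:

```python
def mean(values):
    return sum(values) / len(values)

def three_arm_summary(candled, sham, untreated):
    """Split the total improvement into a placebo part and a candling part."""
    return {
        # What the ritual alone does: fake candling vs. nothing at all.
        "placebo_effect": mean(sham) - mean(untreated),
        # What real candling adds on top of the fake procedure.
        "candling_effect": mean(candled) - mean(sham),
    }

if __name__ == "__main__":
    # Invented improvement scores, for illustration only.
    summary = three_arm_summary(
        candled=[0.5, 0.5], sham=[0.25, 0.25], untreated=[0.0, 0.0]
    )
    print(summary)
```

The number the success criterion should be written against is `candling_effect`, since that is the improvement over the fake procedure rather than over doing nothing.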
Better yet: since each subject is blessed with 2 ears, we could divide the groups by ears, not by bodies. A bonus – we have twice the subjects! 
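A rough sketch of the by-ears idea: pick at random which ear of each subject gets candled, then compare the two ears of the same person. One caveat I am assuming here is that two ears on one head are not independent samples, so the sketch treats each subject as a single paired comparison rather than as two separate subjects. All names and numbers are invented for illustration.

```python
import random

def pick_candled_ear(subject_ids, seed=None):
    """Randomly choose, per subject, which ear gets the real procedure."""
    rng = random.Random(seed)
    return {sid: rng.choice(["left", "right"]) for sid in subject_ids}

def paired_differences(improvement, candled_ear):
    """Candled-ear improvement minus other-ear improvement, per subject.
    `improvement` maps subject -> {"left": x, "right": y}."""
    diffs = []
    for sid, ear in candled_ear.items():
        other = "right" if ear == "left" else "left"
        diffs.append(improvement[sid][ear] - improvement[sid][other])
    return diffs

def sign_flip_p_value(diffs, n_perm=10_000, seed=0):
    """One-sided p-value: if candling did nothing, each difference is as
    likely to be negative as positive, so flip signs at random and count
    how often the mean is at least as large as the observed mean."""
    rng = random.Random(seed)
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    subjects = [f"s{i}" for i in range(1, 11)]
    ears = pick_candled_ear(subjects, seed=1)
    # Invented data: the candled ear improves by 0.4, the other by 0.1.
    data = {s: {e: (0.4 if e == ears[s] else 0.1) for e in ("left", "right")}
            for s in subjects}
    print("p =", sign_flip_p_value(paired_differences(data, ears)))
```

The upside of pairing is that each person serves as their own control, which removes a lot of person-to-person variation from the comparison.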
Your thoughts?