Help me to design an experiment

This is a binomial distribution with p=0.1

Use this online calculator to find the probability of achieving some specified level of success. E.g., the probability of getting 8 successes by accident in 10 trials is vanishingly small.

BTW, there’s about a 1-in-100 chance of scoring more than 3 successes by accident out of 10 trials, which is unlikely but possible, and a 48% chance of scoring 11 or more successes in 100 trials.

Well, there’s no rigidly defined number that means “significant”. For any given chance of a hit and number of trials, you can calculate the likelihood of getting a particular number of hits, so you can say something like “I am 99.5% confident that more than X hits in Y trials will not be achieved by random guessing.” Whether 99.5% (or whatever value) is “significant” is up to the people designing the experiment.
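If anyone wants to run these numbers without hunting down an online calculator, here’s a minimal Python sketch. The 99.5% level, the p = 0.1 hit chance, and the function names are just illustrations of the examples above, not anything prescribed:

```python
from math import comb

def tail_prob(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more hits in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def threshold(n, p, confidence=0.995):
    """Smallest hit count that random guessing reaches with probability
    less than (1 - confidence)."""
    for k in range(n + 1):
        if tail_prob(k, n, p) < 1 - confidence:
            return k

# 10 containers, 1 with water, so a random guess hits with p = 0.1.
print(tail_prob(4, 10, 0.1))   # chance of more than 3 hits in 10 trials by guessing
print(threshold(100, 0.1))     # hits in 100 trials needed for 99.5% confidence
```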

I went through and did a bunch of calculations comparing a 10% and 20% chance of success over a large number of trials (100). I don’t have the numbers in front of me but if I recall correctly, with 10% (random guessing), the odds of seeing 27 or more hits in 100 trials were small - about 0.2%. So you can say it would be pretty unlikely to see 27+ hits in 100 trials by guessing.

However, and it’s an important however, with a testee claiming 20% accuracy, he’d only have about a 5% chance of getting 27+ hits in 100 trials. So the testee is also pretty unlikely to get 27+ hits even if he’s got his claimed 20% accuracy, simply because he’s not much better than random chance. Expecting the testee to hit that mark is a tough sell.

You can get into Bayesian updating, a.k.a. conditional probability, and say “GIVEN that the testee actually got 27+ hits, what is the likelihood that he actually has dowsing abilities with a 20% accuracy as opposed to just picking randomly” - because while either level of accuracy is long odds against getting 27+ hits, if you’re a dowser at 20% that gives you about 25 times the chance of hitting that mark compared to a 10% guesser. So if you do see 27+ hits, based solely on that outcome, it is more likely that the testee has a higher accuracy than random chance.
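Here’s a rough sketch of that comparison in Python. The 50/50 prior and the two-hypotheses-only setup are my own simplifying assumptions, and since the figures above are quoted from memory, the exact numbers this prints may not match them:

```python
from math import comb

def tail_prob(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def posterior_dowser(k, n, p_guess=0.1, p_claim=0.2, prior_claim=0.5):
    """Posterior probability of the claimed accuracy, given k or more hits
    in n trials, when the only alternatives are 'claimed accuracy' and 'guessing'."""
    claim = tail_prob(k, n, p_claim) * prior_claim
    guess = tail_prob(k, n, p_guess) * (1 - prior_claim)
    return claim / (claim + guess)

# How much likelier is 27+ hits in 100 trials for a 20% dowser than a 10% guesser,
# and where does that leave us if we started out 50/50 between the two hypotheses?
print(tail_prob(27, 100, 0.2) / tail_prob(27, 100, 0.1))
print(posterior_dowser(27, 100))
```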

Looked at another way, this goes back to what several of us have pointed out - if your testee’s claimed accuracy is close to random chance, you’ve got to run a lot of trials for the difference to stand out.
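To put rough numbers on “a lot of trials”, the standard normal-approximation sample-size formula gives a feel for it. This is only a sketch; the 0.5% significance level and 90% power are values I’ve plugged in for illustration, not anything from this thread:

```python
from math import sqrt, ceil
from statistics import NormalDist

def trials_needed(p_chance, p_claim, alpha=0.005, power=0.9):
    """Approximate number of trials for a one-sided test at level alpha
    to detect a true accuracy of p_claim (vs. chance p_chance) with the
    given power, using the usual normal approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    num = z_a * sqrt(p_chance * (1 - p_chance)) + z_b * sqrt(p_claim * (1 - p_claim))
    return ceil((num / (p_claim - p_chance)) ** 2)

print(trials_needed(0.10, 0.20))  # claimed accuracy well above chance
print(trials_needed(0.10, 0.11))  # claimed accuracy barely above chance
```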

Honestly I’d be somewhat surprised if a testee only claimed 20% - if you display 10 containers (9 empty and 1 full of water, nothing hidden) and ask the testee to demonstrate I bet you a nickel that they’ll have 100% accuracy over any number of trials. It’ll only change when they can’t actually see which one has water in it.

This was somewhat addressed by the master here. The only specific number I saw in the article is that “largely arid Texas, for example, has aquifers under 81 percent of its surface.”

Keeping with a theme in this thread of defining criteria for success, it depends what you mean by “hard to dig and not find” and, for that matter, “water”. Does a 20% chance of not finding an aquifer in Texas meet your definition of “hard to not find”? Does it have to be a high-producing aquifer to count? Is dampness at the bottom of the hole enough? You’re probably talking about more than 80% of Texas in that case. Hell, they found water on the moon. I’m sure a dowser somewhere is claiming credit for the time his stick pointed up.

It’s about people changing my words, then attacking me for things I never said.

No, the point is that it can skew the results either way. It might work against the testee, or it might work in his favour.

And once again you distort my point. The above is the exact opposite of what I said.

This is not correct.

See for example the scientific criticism of the testing of Uri Geller. One of many things that was criticized was the way in which random targets were selected for Geller to guess. The method used by his testers was to open a dictionary to a random page. The scientists who responded to the original article all agreed that this is a bad thing to do.

[quote]
Isn’t it funny how every single person that you ever discuss this issue with who is familiar with the scientific method tells you that you have no idea what you’re talking about? [/quote]

See, that’s the thing. They have to change my statement beyond recognition before they find anything wrong with it. As you did above, when you implied that I support dowsing.

Oh, I think that most Randi fans are aware that he’s a liar. They are just happy to let him lie, as long as he attacks dowsers.

I’ve pointed out that Randi’s methods should be avoided, because they don’t work.

No, I’ve given solid facts. But just to recap -

Randi conducted a test on dowsers. By chance they should have got about 10%. They actually got 22%. Mathematically, that is a positive result for the dowsers. This is almost certainly the result of some error made by Randi. The nature of the error is not certain, but possibly his method of randomization was part of it.

Other parts of Randi’s methodology exist just to cover his ass when things go wrong. They are just wordgames to try to hide his failures.

My experience is that if you show people that test, most of them will take no notice, but a small number will look and spot that the dowsers got higher than chance, and will consider his wordgames to be dishonest.

This is only one of many times in which he has screwed up. And that is why his methods should not be used.

But, Mr Moderator, that is what I AM doing. My objections to Randi go right to the heart of the OP.

First point: the OP requests advice on how to conduct a test. He has been offered advice on the basis that this is the way that Randi does it. It is thus necessary to point out that Randi’s test went wrong, and got a false result, which he had to fudge. The OP should be informed that this is a very bad way to conduct a test.

Second point: the OP wants to convince a self-professed dowser that it doesn’t work. He seeks advice about how to do this. Randi, however, only seeks to impress people that hate dowsers. He never even tries to get people to change their minds. Following Randi’s methods is a sure way for the OP to fail in his intent.

I do not intend to attack Randi, just to offer advice relevant to the OP.

22% is less than the 35% you were quoting earlier; does that change your feelings at all, or is your opinion independent of the facts?

The Randi test wasn’t the best-designed test, I agree, (note: for a detailed write-up of the test, see here) but the basics could be carried over to the OP’s test.

I see the following problems with the Randi test:

Randi’s test consisted of searches by 16 dowsers for 3 items, where each dowser could at his discretion run between 5 and 10 trials.

I would test only 1 dowser at a time for 1 item at a time, and keep separate statistics for each dowser, and fix the number of trials in advance.

Other than that, I think the test was well-designed, and the underground pipes seem like a good method. Though I do wonder if some sound of running water could have led to the higher results for water (of course, the results could have been due to chance, given that a re-run of the test resulted in the dowsers scoring worse than expected).

The active pipe was chosen by selecting a random number from a bag: rolling dice might be better, but unless the same number kept coming up, I don’t think this would have affected the result.
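If the OP wants to sidestep arguments about numbered slips or dice altogether, a cryptographically strong random choice is easy to script. A minimal sketch, assuming a setup of 10 buried pipes (the pipe count and labels are just placeholders):

```python
import secrets

# Pick which pipe is live for this trial, out of 10 buried pipes.
pipes = list(range(1, 11))
active_pipe = secrets.choice(pipes)

# Record the choice (e.g. in a sealed envelope) before the dowser starts.
print(active_pipe)
```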

Care to expand on the bold section–how his test went wrong?

This is beyond your “point” about statistics (claiming that the result was “positive”)–you seem to be arguing the “positive” result was an artifact of bad experimental design. **What was that flaw in the design?**

Or even expand on the underlined section–how was the result fudged?

To agree with you for once–if your contentions were true, they would be relevant to the OP. If a part of any experiment tended to produce false results, it should not be emulated. That is why I discussed the issue.

That is also why I’ve been pointing to examples of good experimental design–something you have refused to discuss.

If you want to characterize your arguments as “explaining flaws in experimental design,” then support your contentions.

You have not specified either how the experiment failed, or how the results were “fudged.” That is of no use to the OP–it’s simply a bald statement.

You have stated that you feel unqualified to design experiments.

I therefore ask, as an honest question: If you don’t think you’re able to design an experiment, because you’re not trained in science or maths, and you apparently can’t point to specific flaws in an experimental design, why should the OP listen to a word you say? What makes you think you’re qualified to criticize someone else’s experimental design?

I’m not interested in debating Randi, and will not do so in this thread. I am here to discuss experimental design–that is why I’m asking you to back your points up. If you can’t support your contentions about his tests, then your problems seem to be with Randi as a person, and whether valid or not, those are of no relevance to this thread.

And here we have another example of a strawman from a Randi supporter.

See my earlier post.

Note that I described a HYPOTHETICAL case.

Because it gave a result of 22% when you would expect a result of 10%.

I’ve answered all of your questions already. Please read my earlier post.

Hold on now. Let’s be clear–are you talking about a “HYPOTHETICAL” case or a real one? Hypothetical cases don’t have any results.

Also, that’s not a fault in the experiment. That is a conclusion–that the result is not what you expect. It may be a conclusion that suggests a flaw in an experiment–but if your statement was true, you’d expect to be able to identify a flaw in the experimental procedure. Care to point to one, or even suggest one?

Second, if you’re talking about a real case, and real results, cite them.

Your assertion is simply insufficient (especially as you’re not trained in maths) to support the claim that the results were 22%, or that the chance result was 10%, or that that was significant to a relevant level of confidence as determined by a properly performed statistical analysis.

Why do you insist that everyone is a Randi supporter? Please stop doing it to me, I consider it insulting.

As to your hypothetical, I agree that your 35% was a hypothetical case, and withdraw my comment.

He’s switching between discussing a hypothetical case and a real one, and I agree that it’s really confusing. The 22% comes from this case. Note that the overall level of success was 13.5% on 111 trials, with p=10%. The 22% was created by considering only a limited subset of outcomes (which is part of my objection to the Randi test).
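For anyone who wants to check how surprising that overall figure is, here’s a quick sketch. Note that 13.5% of 111 trials works out to roughly 15 hits; that rounding is mine, not a number from the write-up:

```python
from math import comb

def tail_prob(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of doing at least this well (about 15 hits in 111 trials) by guessing at p = 0.1.
print(tail_prob(15, 111, 0.10))
```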

I don’t think any other masochists still reading this thread feel you have answered anything. Your statement that the test gave a result of 22% when 10% was expected is an indication that something might have been wrong with the test, but says nothing about how it failed, which is the only topic that really pertains to the OP.

Here’s an example of what we hope to get from you: say I conduct a test by flipping a coin 10 times. I expect to get 5 heads. But in my test, I get 6 heads. How was my experiment flawed?

The most likely response is that my experimental design did not include a large enough sample to give me a meaningful level of confidence. This is a flaw in the number of coin flips I conducted, or in how I interpret the results given the level of confidence possible from that sample.
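To put a number on the coin example: by my count, 6 or more heads in 10 fair flips happens roughly 38% of the time, so the result is nowhere near surprising. A quick check:

```python
from math import comb

def tail_prob(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(tail_prob(6, 10, 0.5))  # chance of 6 or more heads in 10 fair flips
```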

Other possible flaws in the test could be something mechanical, like the coin was unevenly weighted.

So explain how the dowsing test was flawed (bonus points if you can do it without mentioning Randi’s name). Some suggestions have already been made, including the number of samples or the possibility of hearing water in the pipes. Do you feel the flaw in design was something like this?

It’s confusing, and it does nothing to show what the “problem” is in the real study.

I’m familiar with the study–it’s generally well performed and well analyzed. If **Peter** wants to point out some specific flaw in its methodology, let him go on and do so.

In general, cherrypicking is improper–but I see nothing wrong with the methodology. In the results you cite, the result of the whole test is the one pointed to, and analyzed as falling below expectations. There is no cherrypicking as to the principal analysis or the results.

There is some reference to the sub-divided results–but as I read it, not to suggest it is significant, or even tested for significance (which I agree would be improper, given the design).

The results simply cite the 22% figure to show that even the “best” result achieved fell far below the claims of the dowsers. And that is a very proper statement–it points out that even at their best, cherrypicking results, putting everything in their favor, the dowsers couldn’t achieve anything near the results they claimed.

Peter is referring to someone else testing the subset of results, and claiming they were significant. That was the error–not Randi’s analysis. The second analysis is cherrypicking–and not even effective cherrypicking, since the figure it produces is still, in the context of the claimed abilities, shockingly low.

I mentioned a hypothetical case.

Someone else mentioned a real case, with a 22% result. After they mentioned it I discussed this real case.

Actually, this test is a good example of how the experimental design turns on the type of effect being claimed. The number of runs per dowser (5 to 10) is only enough to evaluate a gross claimed power–one that (for example, as was claimed by the dowsers in question) gets 90% hits.

Five to ten runs per person are not enough to test for a small effect–simply because you just don’t have enough tests to get a significant result as to any one dowser. If you’re looking for an effect that is 1% better than chance (so, on average, getting 11 hits out of 100, instead of the chance result of 10 out of 100), you wouldn’t see any difference over a run of five or ten tests (and would in fact need to run several hundred or thousand to detect any such effect to a reasonable level of confidence).
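A rough illustration of both points, as a sketch only (the 1% significance level and the p = 0.1 chance rate are just the examples used in this thread, not Randi’s actual analysis):

```python
from math import comb

def tail_prob(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power(n, p_true, p_chance=0.1, alpha=0.01):
    """Chance that someone with true accuracy p_true clears the hit count
    that random guessing (p_chance) only reaches with probability < alpha."""
    threshold = next(k for k in range(n + 1) if tail_prob(k, n, p_chance) < alpha)
    return tail_prob(threshold, n, p_true)

print(power(10, 0.90))  # the gross 90%-hit claim: detected almost every time
print(power(10, 0.11))  # 1% better than chance: essentially invisible in 10 runs
```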

Similarly, the overall results are useful and appropriate to rebut the claims made: that there is a dowsing ability that many people, including all the test subjects, share in a relatively similar fashion. If you were testing for a rare and weak power, overall results wouldn’t be appropriate–since you’d only expect (maybe) one out of the sixteen tested to have an actual ability to dowse.

Again, for that, you’d need to run more tests per person.

But those are not flaws with this test–this test is well designed, given what it was looking for–and that was (necessarily) defined by the claims of the test subjects. There is no good reason to complicate the design of a test, or make it more expensive (by having many more runs per person), if the claimed power can be detected by a relatively simple test design.

Further, if the claim is a rare and weak power, analysis of overall results would be inappropriate–since you’d expect an already weak effect by testee #3 to be outweighed by the results from #1, #2, and #4-16, who have no dowsing ability. Hence, a “re-analysis” of all the results (which **Peter** is trying to point to as “proof”) is not a particularly good way to look for a weaker effect–since the experimental design is not particularly suited to detecting weak effects, or ones that only a few of the test subjects share.

Yet another strawman.

Are you not discussing the real case anymore? The one where, as you repeatedly pointed out, Arthur C. Clarke re-analyzed the results? My mistake.

Since, as I think you can clearly see, the mixing of “hypothetical” and real experiments can be confusing, you’d be well advised to be clear about the point you’re actually making. Why don’t you just state it again, to help everyone be clear.

Further, if you are discussing a hypothetical, then the results simply don’t support your arguments–since hypothetical experiments don’t have results. You can’t show a flaw with experimental procedure based on results you make up–since the results are not a result of that experimental procedure.

So what are you actually arguing?

And that’s another strawman. They never stop, do they?

Just to make this clear, I was not objecting to your use of the term “re-analysis.” I fully agree that it was analysed by Arthur C. Clarke, a trained mathematician with proper academic credentials, who is rather more credible than Randi.

My objection was your statement that I am pointing to the results as “proof.”

I do not do any such thing. The results are not proof of dowsing. I do not believe in dowsing. I do not support the dowsers’ claims. I have said this already, and you know it.

The results show that Randi can’t run a test, that is all. And it’s only one example of such; there are plenty of others.

What results of which actual tests are you talking about, Peter? Please be specific. I don’t really like to be in this hijack, but your continued assertions are getting to me, since you haven’t provided ANY cites at all. Please be clear about what you’re stating and don’t mingle imaginary scenarios with actual results.