Well, you can’t absolutely rule out luck. What you can do is come up with an experiment where the likelihood of a successful outcome by random chance is small enough that everyone agrees it indicates some factor besides luck is probably at play.
Let’s suppose you and the dowser agree on 1 in a million - meaning you’re going to do a set of trials such that the likelihood of the dowser getting them all right by guessing is 1 in 10^6. If they can do that, they win the bet; if they fail, they lose.
With one full bottle out of ten, there’s a 0.1 chance of guessing correctly in a given trial. Run six such trials and the likelihood of someone guessing all six correctly is 1 in 10^6.
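If you want to check that arithmetic, here’s a quick sketch in Python. The function name chance_all_hits is just something I made up for illustration, and it assumes every trial is an independent guess with exactly one full container.

```python
from fractions import Fraction

def chance_all_hits(containers: int, trials: int) -> Fraction:
    """Chance of guessing the one full container in every trial.

    Assumes independent trials with exactly one full container each
    (hypothetical helper, not part of any library).
    """
    return Fraction(1, containers) ** trials

print(chance_all_hits(10, 6))  # 1/1000000
```

Using Fraction keeps the answer exact instead of giving a rounded float.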
The important part is that everyone has to agree IN ADVANCE exactly what level of success has to be demonstrated and exactly what series of hits/misses constitutes this. Using my previous example, you’d all have to agree “The dowser will do exactly six trials. In each trial there will be one full container and nine empty containers. If the dowser identifies the full container in a given trial that counts as one hit. The dowser must obtain exactly six hits in six trials to succeed. Any other result is not considered a success.”
The reason for this is to avoid the fallacy of naming the target after you shoot. A simple example:
I flip a (fair) coin 20 times and get this result:
HHHTTHTHTTHHTTTTTHTH
That exact sequence has a (0.5)^20 chance of happening, which is 1 in 1,048,576, or just about 1 in a million.
But I got it by banging on the keyboard randomly. There’s nothing amazing about it - there are 2^20 possible outcomes and that just happens to be one.
On the other hand, if I specified beforehand that I was going to use my telekinesis to make the coin come up in that EXACT pattern of heads and tails in twenty tosses and then we saw that pattern, that would be fascinating.
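Here’s a rough simulation of the difference between naming the target before and after, as a sketch only. The names flip and hit_rates are made up for this example, and I’ve used 5 flips instead of 20 so that hits on a pre-named target actually show up in a reasonable number of runs.

```python
import random
from fractions import Fraction

SEQUENCE = "HHHTTHTHTTHHTTTTTHTH"  # the 20 flips quoted above

# Chance of one specific 20-flip sequence from a fair coin.
p_exact = Fraction(1, 2) ** len(SEQUENCE)
print(p_exact)  # 1/1048576

def flip(n, rng):
    """Return n fair coin flips as a string of H and T."""
    return "".join(rng.choice("HT") for _ in range(n))

def hit_rates(n_flips, runs, seed=0):
    """Compare a target named in advance with one named afterwards."""
    rng = random.Random(seed)
    target = flip(n_flips, rng)  # named BEFORE any flipping
    before = 0
    after = 0
    for _ in range(runs):
        result = flip(n_flips, rng)
        before += (result == target)
        chosen_afterwards = result  # "target" picked after seeing the result
        after += (result == chosen_afterwards)
    return before / runs, after / runs

before_rate, after_rate = hit_rates(n_flips=5, runs=100_000)
print(before_rate, after_rate)
```

The first rate should land close to 1/32 and the second is always 1.0, because a target chosen after the fact matches every time. That’s the whole fallacy in two numbers.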
Is that any help? I don’t know that there’s any commonly accepted probability where experimenters agree “That’s unusual”; one in a million sounds right, but if you want to be very stringent, run more trials. Do nine and it’s one in a billion - I’d be pretty impressed with someone who can do that under properly controlled conditions.
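To see how the number of trials scales with whatever level you agree on, here’s a small sketch. trials_needed is a hypothetical helper, and it assumes the dowser has to get every single trial right with one full container per trial.

```python
def trials_needed(containers: int, one_in: int) -> int:
    """Smallest number of all-correct trials whose chance by pure
    guessing is no better than 1 in `one_in` (illustrative helper)."""
    if containers < 2:
        raise ValueError("need at least two containers per trial")
    trials = 0
    odds = 1  # chance of a perfect run by guessing is 1 in `odds`
    while odds < one_in:
        odds *= containers
        trials += 1
    return trials

print(trials_needed(10, 10**6))  # 6
print(trials_needed(10, 10**9))  # 9
```

It sticks to integer arithmetic so there are no rounding surprises right at the threshold.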
Making sure that the experiment is properly controlled is the big challenge, I think. You have to prevent actual cheating as well as anything that would give the person subtle clues. I’d also make it double-blind so that not only does the testee not know what’s under each container but the person running the experiment doesn’t either.
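One way to get the double-blind part is to have a third person, who never meets the dowser or the experimenter, draw the placements at random and seal them until the guesses are recorded. A minimal sketch of that idea follows; draw_placements and score are names I invented, and the sealing itself is a matter of procedure, not code.

```python
import secrets

def draw_placements(trials: int, containers: int) -> list[int]:
    """Pick which container (0-based index) is full in each trial.

    Uses the secrets module so the draw can't be reproduced from a seed.
    """
    return [secrets.randbelow(containers) for _ in range(trials)]

def score(placements: list[int], guesses: list[int]) -> int:
    """Count hits once the sealed placements are opened."""
    if len(placements) != len(guesses):
        raise ValueError("one guess is required per trial")
    return sum(p == g for p, g in zip(placements, guesses))

placements = draw_placements(trials=6, containers=10)
# ... the dowser's guesses get recorded before anyone sees `placements` ...
```

Under the rule agreed above, the dowser succeeds only if score returns 6 out of 6.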
James Randi’s book “Flim-Flam!” gives details of some dowsing challenges that he’s conducted. I’m sure there are skeptics’ organizations online that have suggested protocols as well. Beyond the basics of having unmarked sources of water for the dowsers to find, Randi had them demonstrate their ability on a plainly visible source of water first. He also had them check the test area before any containers were brought in, to confirm to their satisfaction that nothing in the area would throw them off (so you don’t hear “I didn’t realize there was a bathtub downstairs, that confused me during the trials”).