If a test consists of exactly 25 guesses, one guess per card (i.e., you can’t quit after 15 if you’re ahead):

How many tests would a subject need to undergo to arrive at a statistically meaningful result?

What percentage score (100% being perfect) would one need to average in order to demonstrate a statistically meaningful result indicative of ESP? In other words, would 60% be very impressive? Would 30% be impressive at all?

Does the fact that there are five possible shapes have any bearing on the significance of the results? For instance, if the test were 25 coin flips, one would expect an average score of 50% instead of 20%, and so the range of possible better-than-average scores would be > 50% and <= 100%, correct? Would it be more difficult to establish successful criteria for the coin-toss test than for the five-card test?

Could the test be applied to a single subject, or would the idea of a “control group” (one or more subjects guessing randomly) be an important part of evaluating any results?

For the moment, I’m going to ignore the fact that there are only five of each symbol. We’ll come back to that later.

No. In both cases, the distribution of scores can reasonably be expected to follow a binomial distribution. For the coin, it’s Binomial(25, 1/2), and for the Zener cards, it’s Binomial(25, 1/5).
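For anyone who wants to check these two distributions numerically, here’s a short sketch in plain Python (standard library only; the `binom_tail` helper is my own, not from the thread). It computes the exact chance of beating the mean by the same margin in each game:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Coin: Binomial(25, 1/2), mean score 12.5.
# Zener cards: Binomial(25, 1/5), mean score 5.
print(binom_tail(25, 17, 0.5))  # coin: ~0.054 chance of 17+ heads
print(binom_tail(25, 9, 0.2))   # cards: ~0.047 chance of 9+ hits
```

Note that roughly the same tail probability sits at 17+ heads for the coin and 9+ hits for the cards, so the two tests are about equally hard to “pass” by luck even though the baseline scores differ.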

I’ll answer these questions together because they’re very closely tied. In general, results are said to be statistically significant if there’s a .05 or lower probability that they could have happened by chance, assuming nothing unusual is going on. With a 25-card deck, 9 or more successes is a significant result: a random guesser scores that well only about 4.7% of the time. (8 or more happens by chance about 11% of the time, which doesn’t make the cut.)

So here’s the rub: Get 8,000 people to take your test, and you should expect roughly 400 of them to get 9 or more cards right. Get those 400 people to take a second test, and you expect about 20 of them to get 9 or more right again. After a third round, you should expect 1 person to have passed all three times. Is there any reason to believe that he has ESP and isn’t just lucky?
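The shrinking numbers in that paragraph are just repeated multiplication by the pass rate; a quick sketch, using the conventional 0.05 significance cutoff as the chance a pure guesser passes any one test:

```python
p_pass = 0.05      # chance a pure guesser passes any one test (0.05-level cutoff)
survivors = 8000.0
for test_round in (1, 2, 3):
    survivors *= p_pass
    print(f"expected still passing after round {test_round}: {survivors:g}")
# prints 400, 20, 1
```

The last survivor looks impressive, but the calculation assumed from the start that nobody has ESP; someone was always going to be left standing.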

That’s the deeper problem here. No amount of Zener card testing will ever prove that ESP, and not something else, is responsible for your results. If you can identify another possible cause, you can control for it, but you can never identify all possible causes. What counts as strong evidence for ESP is a subject for debate.

I don’t think a control group is important here. We know the theoretical distribution of the results, so we can test our subjects’ scores directly against it.

So, what happens if we allow for the fact that there are only five of each symbol? We can’t use a binomial distribution any more, because it assumes independent trials. There are more complicated models that will work instead (Markov chains, for instance), but nothing above really changes.
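To see whether the fixed deck composition matters in practice, here’s a small Monte Carlo sketch (my own illustration, not from the thread). It models a subject whose call sheet, like the deck, contains exactly five of each symbol, which is the case where the trials genuinely stop being independent; a subject making unconstrained uniform guesses against a fixed deck would still score exactly Binomial(25, 1/5):

```python
import random
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

SYMBOLS = "ABCDE"
DECK = [s for s in SYMBOLS for _ in range(5)]  # five copies of each symbol

def one_score(rng):
    # Both the deck and the call sheet hold exactly five of each symbol,
    # so individual card-by-card matches are no longer independent trials.
    cards = DECK[:]
    calls = DECK[:]
    rng.shuffle(cards)
    rng.shuffle(calls)
    return sum(c == g for c, g in zip(cards, calls))

rng = random.Random(2024)
runs = 100_000
scores = [one_score(rng) for _ in range(runs)]

print("simulated mean score:", sum(scores) / runs)   # close to 5
print("simulated P(9+):", sum(s >= 9 for s in scores) / runs)
print("binomial  P(9+):", binom_tail(25, 9, 0.2))    # ~0.047
```

The simulated tail comes out in the same neighborhood as the binomial figure, which is the point being made: the dependence complicates the exact model but doesn’t change any of the conclusions above.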

If the subject notices, without seeing me, that I’ve stopped actually looking at the cards and started reading a book of crime scene photos, I’d count that as significant.

How many trials must be run before the results can be attributed to something about that one individual?

If he gets more than 8 cards right after 10 trials, does it mean anything? What about 500 trials? Even if it doesn’t mean he’s clairvoyant, would it mean he’s extraordinarily and unusually “lucky”?

I think the point is that for any one person, taking one test is not significant, because they could be the one lucky person out of 8,000 that ultrafilter referred to. So you would need to have some sense of how many people have ever taken this test to see what would be significant for a single person.

Here’s another way of looking at it. Nailing a 25-card test only demonstrates that, if nothing unusual was going on, there was a 5% chance of an outcome that good or better. (BTW, thank you Ultrafilter; p-values are so often defined incorrectly.) However, if you ran 20 tests, you would, roughly speaking, expect one of them to get nailed. (Each has a 5% chance, and you do 20 of them…) Therefore, your criteria must be much, much stricter to account for the many, many (hundreds of thousands?) times these tests have been administered.

My personal feeling is that you’d have to hit p=0.000000001 or so. Nothing less is going to convince anyone. And you’d better have the most rigorous experimental design imaginable, as there is so much chance of cheating and biases.

For the purposes of this thread, and at the risk of sounding like I’m underestimating the difficulty inherent in such a task, let’s assume this is the case. Indeed, let’s assume that none of the participants are cheating.

I’d say, “no.” But leaving behind ESP for a moment, I wouldn’t expect a single test result to be significant, even if it were 25 successes. I would want all 8000 people to take x number of tests and then to be ranked according to the average number of successes. I would not be surprised if one or more of those 8000x tests resulted in a perfect score; I would, however, be very surprised if one person had a perfect score in all x tests.

But I think I see your point: it’s still a long way from even the best possible score to the conclusion of ESP. If we turn the hypothesis around and attempt to demonstrate that all the subjects in the trial are randomly guessing, is there a mathematical way to evaluate it? We say that after x tests, the average of any subject’s scores will be within a certain expected range. So it seems to me that if the scores stay within that range, we’ve supported, or taken a step toward proving, the hypothesis. But if one subject’s average score is greater than the upper bound of that range, we still haven’t shown the hypothesis to be false; in the end, it’s still a matter of chance.

So if we keep repeating the experiment, and this one subject’s average scores keep beating our expected best, is there a way we can eventually show the hypothesis to be false, to show that, in fact, this person is not guessing?

Let’s say it’s luck: if we keep repeating the experiment, is there a point at which we can start evaluating luckiness? For instance, could we say that, after many rounds of tests, if a person still maintains an average score of 99% successes, that person is luckier than another whose score was lower? It seems like from a mathematical standpoint, since if you kept repeating the experiment forever you’d eventually encounter every possible result, we can’t say that one result is luckier than any other. From a human standpoint, even a skeptical one, a consistent score of 99% would be an astonishing anomaly. But from a mathematical standpoint, the best we could do is say that such a result has a very low probability of happening?
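For scale, here’s what “a consistent 99% average” works out to numerically (a sketch reusing the exact binomial tail; ten tests of 25 cards is an arbitrary illustration):

```python
from math import comb, log10

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Averaging 99% over ten 25-card tests means at least 248 of 250 cards right.
p = binom_tail(250, 248, 0.2)
print(log10(p))   # about -169: a 1-in-10**169 event for a pure guesser
```

So the mathematical answer isn’t just “very low probability” in a vague sense; it’s a number so small that every human who ever lived could take the test every second for the age of the universe without anyone plausibly pulling it off by chance.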

Sequent, you’ve hit right to the essence of statistical reasoning. No, you can never say with mathematical certainty that person A is not just lucky. But you can use inductive reasoning to define just how lucky they would have to be to get that result.

The definition of the p-value is subtle, but very powerful. What you can say is exactly how lucky person A would have to be to get that result. For example, if they got 99% on 300 tests in a row, and the p-value for that worked out to 0.000000000001, this is what you could reasonably say:

“If person A is just lucky, we would expect to see these results 1 in 1 trillion times. That is so unlikely that we conclude person A is not just lucky; there is another factor at work. Because we have set up our test so rigorously and eliminated all other possibilities, we believe that other factor is ESP.”
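You don’t actually need a 300-test streak to reach numbers like that. A sketch of how fast a run of merely “significant” scores compounds (the 0.047 figure is the exact binomial tail for 9+ hits on one 25-card test; the streak lengths are arbitrary):

```python
p_pass = 0.047   # chance a pure guesser gets 9+ hits on one 25-card test

for streak in (3, 6, 9, 10):
    print(streak, "straight passes:", p_pass ** streak)
# nine straight passes is already about a one-in-a-trillion event
```

Each extra pass multiplies the p-value by about 1/21, so even modest results, sustained, become overwhelming evidence that something other than chance is at work (whatever that something turns out to be).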

In science, a p-value of 0.05 is usually taken to be the standard. That is interpreted as: if there were no other factor than chance, I would see results like these 1 in 20 times. That is unlikely enough that I believe these results are not due to chance, but due to [theory]. (Note that if you look at 20 studies published at p = 0.05, you should expect roughly one of them to be a false positive: a result that really was just luck.)