Say you have a test where you have no idea of the sample size. The only data you have is that you’ve had X failures and no successes. What is the best probability you would expect in that situation?
For example, say you’re in the airport watching people walk by. You’ve noticed that 100 people have walked by and you didn’t know anyone. You have no idea how many people are in the airport and no idea how many people you know. What would be the best probability in that situation that you will see someone you do know?
If I guessed 1 in 2, that would be a very bad guess since it’s very unlikely to get 100 failures in a row. If I guessed 1 in 1 billion, that would be more likely to have 100 failures in a row, but there’s probably a better probability that would give 100 failures in a row.
Maybe a more generalized way of asking is to say what is the best probability that would have a 95% chance of having 100 failures in a row? I’m not a statistician so I might not be saying that right.
Your question has no definite answer. The results of your sample size speak for themselves. Until you reach a sample size where you have a few people you do recognize, then for all practical purposes, the probability of knowing anyone in a crowd of people in the airport is exactly zero.
Basically, what you are asking is “How accurate are my results?”. I think that question can only be answered in relative terms, based on the sample size.
I understand that it’s impossible to come up with the exact probability in that instance since the sample size is unknown. But I would think you could figure out what the best probability would be. Not the best guess of the exact probability. The best probability that would give those kind of results. I’m sure I’m not asking this correctly.
This is so that when my wife says, “You know, I’ve seen 100 people walk by and I didn’t know any of them. What’s the chance that I’ll see somebody I know?” I can look smart by saying, “Well dear, given that you’ve had 100 failures, the best probability that we can expect is 1/1000(?), but it’s probably worse than that.”
There is no best probability; however, for repeated independent trials, the expected number of successes is np, where n is the number of trials, and p is the probability of success. So if you’ve had zero successes in k trials, it’s reasonable to assume that p < 1/k.
The simple answer to your wife’s question is “Well dear, given a sample size of 100 with no successes, the chance you’ll see somebody you know is exactly zero.” And this would be quite accurate, based on the test as you performed it.
Well, are you asking for an exact number or not? I can’t tell.
In any case, it seems to me that your question is really “what are the chances that my data are accurate?”, for which there simply cannot be an exact number. One may say “Based on my sample size of 47 million, my results are more accurate than your sample size of 100.” But this is only comparative, not absolute.
You’re trying to invent data that just isn’t there.
Well, if the sample size is not too small, you can use a normal approximation to obtain a 95% confidence interval for the population proportion of
(p - 1.96*sqrt(p(1-p)/n), p + 1.96*sqrt(p(1-p)/n))
where p is the sample proportion and n is the sample size. This does not work well when, as in your example, p = 0, since it gives you an “interval” of zero width: (0, 0).
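To see the collapse concretely, here is a minimal sketch of the normal-approximation interval described above. The function name `wald_interval` is just an illustrative label, not something from the thread.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half_width, p_hat + half_width)

# With zero successes the half-width is zero, so the "interval" is one point.
print(wald_interval(0, 100))

# With a few successes the interval has a sensible width.
print(wald_interval(5, 100))
```

The zero-success case is exactly where this approximation should not be trusted, which is why a different argument is needed for the airport example.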
You could say: suppose the proportion of people at the airport that I know is p. Then the probability that I do not know any of the 100 people who have walked past me is (1-p)^100. This is less than 5% whenever
(1-p)^100 < 0.05
i.e. 1-p < 0.05^(1/100) ≈ 0.97
So if p is greater than 0.03 it is very unlikely that you would not know any of the 100 people. You can be 95% sure that you know fewer than 3% of the people at the airport. This is not as low a percentage as you might guess.
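The same bound can be computed for any number of people, assuming each passer-by is an independent trial with the same unknown p. This is only a sketch; `upper_bound_no_successes` is a made-up name for illustration.

```python
def upper_bound_no_successes(n, confidence=0.95):
    """Largest p for which n straight failures still has probability >= 1 - confidence.

    Solves (1 - p)**n = 1 - confidence for p.
    """
    alpha = 1 - confidence
    return 1 - alpha ** (1 / n)

bound = upper_bound_no_successes(100)
print(f"95% upper bound on p after 100 strangers: {bound:.4f}")

# Sanity check: at the bound, 100 failures in a row happens about 5% of the time.
print(f"P(100 failures in a row at that p): {(1 - bound) ** 100:.4f}")
```

For 100 people this comes out just under 0.03, which agrees with the 3% figure above.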
No, there’s not enough information. How many people do you know? Have you lived like Ted Kaczynski your whole life? The probability is real low.
You could take a stab at an upper bound, but you have to make some guesses that greatly affect the accuracy. The people in the airport - are they in the same town as you? How many people do you know in that town, and how many live there total? If you guess that half the people in the airport live in your town, and you know 200 out of the 300,000 that live there (1 in 1,500), then you could expect to see around 1500 locals, therefore 3000 people total, before seeing one you know. This makes the approximation that the people you know are just as likely to be in the airport as anyone else. And it’s heavily dependent on the proportion of all of the people that you know.
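The arithmetic in that guess can be laid out in a few lines. Every number here is one of the guesses from the post (200 acquaintances, 300,000 residents, half the airport crowd being local), not measured data, and the variable names are only for illustration.

```python
# Guessed inputs from the post above (assumptions, not data).
known_in_town = 200
town_population = 300_000
local_fraction = 0.5  # guessed share of airport crowd who live in your town

# On average you know 1 local in every (town_population / known_in_town).
expected_locals = town_population / known_in_town

# Only a fraction of passers-by are locals, so scale up to total passers-by.
expected_total = expected_locals / local_fraction

print(f"Locals seen before one you know: about {expected_locals:.0f}")
print(f"Total people seen before one you know: about {expected_total:.0f}")
```

Changing any of the three guesses moves the answer a lot, which is the point about accuracy made above.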
As ultrafilter and Jabba quite eloquently pointed out, once you have some successes, then you can perform all kinds of calculations involving probability and confidence and so forth.
But in your example, filmore, you have had zero successes. You can’t make any more assumptions about probability or confidence until you have met at least one person you recognize at the airport.
You might want to look up the “Wald Sequential Probability Ratio Test” or “Sequential Analysis”.
Wald’s test is a sequential likelihood-ratio technique that addresses when to stop collecting data and draw a conclusion based on current results from a series of single test cases.
If you make some reasonable assumptions about the distribution of people you know at the airport, you can probably determine when you’ve observed enough people walk by to decide that the probability you will know someone walking by is less than a certain probability with a certain confidence.
Unfortunately, a nice link with a nice example I found by someone at Indiana University was also marked “Draft, please do not quote at this time.”.
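For a rough idea of how a sequential test of this kind works, here is a sketch of a likelihood-ratio test for yes/no observations. It is not taken from any particular reference. The two hypotheses (p0 = 0.001 meaning "I know almost nobody here", p1 = 0.03 meaning "I know about 3% of people here") and the 5% error rates are assumptions picked purely for illustration, and `sprt_bernoulli` is a made-up name.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of H0: p = p0 against H1: p = p1.

    observations: sequence of booleans (True = recognized the person).
    Returns (decision, number of observations used).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: decide for H1
    lower = math.log(beta / (1 - alpha))   # cross this: decide for H0
    llr = 0.0                              # running log-likelihood ratio
    for i, known in enumerate(observations, start=1):
        if known:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return ("accept H1", i)
        if llr <= lower:
            return ("accept H0", i)
    return ("continue sampling", len(observations))

# A stream of strangers only (hypothetical data).
decision, n_seen = sprt_bernoulli([False] * 200, p0=0.001, p1=0.03)
print(decision, "after", n_seen, "people")
```

The sample size is not fixed in advance: the test keeps going until the evidence crosses one of the two thresholds. How quickly that happens depends entirely on the assumed p0 and p1.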
I’m sure I’m asking this using the wrong terms which is causing lots of confusion. I was talking with someone at work and he said to figure out the probability which would have a 95% chance of having 100 failures in a row. He did it like this:
(1-p)^100 = .95
p = 1 - (.95)^.01
p ~= .0005 = 1/2000
So he says the best probability that you could expect to have a 95% chance at 100 failures in a row is 1/2000. The actual probability of the airport situation is probably much worse since any probability less than that has an even better chance at 100 failures in a row. Like 1/1000000 is very likely to have 100 failures in a row. But 1/2000 would be the largest probability that would give you a 95% chance of 100 failures in a row.
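The coworker’s calculation is easy to reproduce. This is a sketch of that same equation, (1-p)^n = r solved for p; the function name is hypothetical.

```python
def p_for_failure_run(n, run_probability):
    """Largest p for which n failures in a row has probability run_probability.

    Solves (1 - p)**n = run_probability for p.
    """
    return 1 - run_probability ** (1 / n)

p = p_for_failure_run(100, 0.95)
print(f"p = {p:.6f}, roughly 1 in {1 / p:.0f}")
```

Note the choice of 0.95 on the right-hand side: it asks which p makes 100 failures very likely. Putting 0.05 there instead asks which p would make 100 failures surprising, and gives a bound near 3%.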
Anyway, I think that answers what I wanted to know. I wasn’t trying to find out the actual probability of the situation. I wanted to know what the greatest probability would be that would be likely to give 100 failures in a row. Sorry about not explaining it more clearly.
Wait a minute. Don’t you have a sample of 100? That is if you call “not knowing the person who walks by” a success then you have 0 failures in 100 tries. My statistics table says that in a sample of 100 with 0 failures the true failure rate for an infinite population would be no more than 3/100 at 95% confidence.
There is a problem that you can’t be sure your sample is random. That is, all those in the population didn’t have an equal chance of walking past you.
I read the question to mean the OP had not, a priori, decided on a sample size. He just sat there and watched, and realized that at some point 100 people had walked by, and nobody he knew happened along. A subtle point, but crucial. Hence, sequential tests, which allow you to use the data to refine your current estimates until you are confident enough with them.
Hmmm. This discussion sounds familiar.
(Scroll down to KarlGauss’s post.)
In the current thread, Jabba’s right on.
If all you know is that you don’t know any of the n people who have walked by, the only kind of answer you can give is “I am 95% sure that the probability of me knowing the next person to walk by (or any other random person) is less than p.” The other thread tells you how to find p (it’s about 3/n).
If you want to know the probability that you’ll see someone you know, you have to have some estimate of how many more people you’ll see (call this m), and the only answer you can give is of the form “I’m 95% sure that the probability of seeing someone I know is less than q,” where q = 1 - (1-p)^m = 1 - (1-3/n)^m.
This doesn’t help very much if you’ve seen less than half of the total # of people that you will see. For instance, if you’ve seen 100 people and you think you’ll see 50 more, all you can say is “I’m 95% sure that the probability of me seeing someone I know is less than 78%.” In that case, you could think of “seeing 50 people” as 1 trial. You did that twice, and got the result of “not knowing anyone.” If you do it a third time, will you know anyone? It’s hard to tell - you have almost no data.
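Here is a sketch of the formula q = 1 - (1-3/n)^m from this post, using the 3/n approximation for p. The function name is only illustrative, and m (how many more people you expect to see) is something you have to guess.

```python
def chance_upper_bound(n_seen, m_more):
    """95% upper bound on the chance of knowing at least one of the next m_more people.

    Uses p = 3 / n_seen as the 95% bound on the per-person probability,
    given that none of the n_seen people so far were known.
    """
    p = 3 / n_seen
    return 1 - (1 - p) ** m_more

q = chance_upper_bound(100, 50)
print(f"Seen 100, expecting 50 more: q is about {q:.0%}")

# The bound only becomes informative when m_more is small compared with n_seen.
print(f"Seen 100, expecting 5 more: q is about {chance_upper_bound(100, 5):.0%}")
```

With 100 seen and 50 to come this gives the 78% figure quoted above.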
If you want something more precise than this, you’re out of luck. As Jpeg Jones and others have pointed out, you need some successes before you can try to pinpoint a particular value, instead of just saying what it’s less than.