Expected bias on unfair coin

Taken from this thread, and specifically this post.

Quick summary: if you have S successes out of N launches under your belt, then (S+1)/(N+2) is your best estimate of the future probability of success. Obviously this is the same problem as figuring out how biased your coin is based on some number of flips. The problem has been driving me nuts as I can’t find a good explanation of the derivation.

Stranger mentions that it’s a first level Bayesian estimate, and for a while I played around with Bayes’ formula, making no progress. I know how to apply the formula in some circumstances, but in this case I had a hard time even figuring out what the priors and stuff were.

I eventually ran across the wiki page on Maximum Likelihood, and in particular the section on Discrete distribution, continuous parameter space. It solves the same problem–but comes to the conclusion that our best estimate is S/N, not (S+1)/(N+2).

But then I applied an idea I was bouncing around with my earlier Bayes stuff. Instead of finding the maximum for p (by finding the zeroes of the derivative), I figure out the center of mass. Or first moment, I think? Anyway, integrate p*f(p) and divide by the integral of f(p), both from 0 to 1. Lo and behold, it seems to work. For instance, if my likelihood function is p^3(1-p)^4 (3 successes out of 7 trials), then the CoM is .0015873/.0035714, or 4/9. That’s what I’d expect from the (S+1)/(N+2) formula, so I think I’m on the right track here.

However, I didn’t use Bayes. And I don’t know how to integrate p^S(1-p)^(N-S) in the general case (I used Wolfram Alpha for the numerical answer). Is there some easier way to get there? Is my approach even a legitimate alternate method? What’s going on? Thanks.
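(For what it’s worth, the center-of-mass ratio can be checked without Wolfram Alpha. This is just a midpoint-rule integration sketch in plain Python; the step count and the function name L are arbitrary choices of mine.)

```python
# Midpoint-rule sketch of the "center of mass" calculation:
# ratio of the integrals of p*L(p) and L(p) over [0, 1],
# with L(p) = p^3 (1-p)^4 for 3 successes out of 7 trials.
def L(p, s=3, n=7):
    return p**s * (1 - p)**(n - s)

steps = 100_000
dx = 1.0 / steps
mids = [(i + 0.5) * dx for i in range(steps)]
num = sum(p * L(p) for p in mids) * dx   # integral of p * L(p)
den = sum(L(p) for p in mids) * dx       # integral of L(p), ~ .0035714
print(num / den)                          # ~0.4444, i.e. 4/9
```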

I think that the +1 and +2 in that formula are just because the prior being used is 1/2. Certainly the formula gives the correct results in the limits of no data (i.e., where all you have is the prior) and abundant data (where it’s almost the same as s/n).

On the other hand, that formula seems to fail for the case of small but nonzero numbers of flips. If I flip a coin once and get heads, I’m not going to conclude that the coin has a 2/3 chance of coming up heads on the next flip. Any reasonable real-world prior should assume that fair (or nearly-fair) coins are far more common than biased coins, and so after a single flip, I’m still going to expect that the coin is probably fair. My assessment of the probability of a heads should go up slightly, since I’ve ruled out the possibility of a two-tailed coin but not of a two-headed coin, but two-headed coins are so rare that this makes almost no difference.

I think what I’m saying here is effectively that there should be a prior on the prior. The assessment of 1/2 on the first flip, as the formula states, depends only on the assumption that heads-bias is just as common as tails-bias: It could apply just as well to a universe where there were no fair coins, just equal numbers of two-headers and two-tailers. But after that first flip, we need to assume more about how the biases are distributed.

Yeah, I came to more or less the same conclusion in this post, though I think it’s actually a tad more complicated than that.

Right. The 1/2 comes from the assumption of a flat distribution, but it could have come from (as you say) a universe with only a 50/50 mix of double-head and double-tail coins. However, that would give a different formula. A single head would tell us that all future flips will be head–not true of a flat distribution.

It’s also not really fair to apply a flat distribution to the case of rockets, but maybe it’s close enough for small numbers of launches.

If you are told that you are flipping a coin for which both heads and tails are possible, then you can think of it as having seen one heads and one tails before any flips occur. So after seeing S heads and N-S tails, you have “observed” S+1 heads and N-S+1 tails for N+2 total trials. Now your best estimate is (S+1)/(N+2).

If you were not told that both heads and tails are possible, then you do not have this prior information. It is, in some sense, definitional.
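That pseudo-count reading turns directly into a one-liner (a sketch; the function name is mine):

```python
def rule_of_succession(successes, trials):
    # Pretend we had already "observed" one success and one failure,
    # then take the ordinary frequency estimate.
    return (successes + 1) / (trials + 2)

print(rule_of_succession(3, 7))  # 0.444... = 4/9, the thread's example
print(rule_of_succession(0, 0))  # 0.5 -- just the prior, before any data
```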

I don’t doubt the intuitive justification for the answer–I’m just wondering what the actual mathematical justification is, and what assumptions it’s based on. The original claim is that it’s Bayesian in nature, but I seem to have come to the same answer without Bayes.

After a bit more snooping around, I did find this page on Wikipedia. It points out that the expected value of p is (h+1)/(h+t+2), or equivalently (S+1)/(N+2). However, it doesn’t actually do the derivation, nor does it say whether using the expected value is somehow “better” than the so-called “maximum a posteriori estimate”.

(S + 1)/(N + 2) is the posterior mean you get by starting with a Beta(1, 1) prior.

Ok, thanks. I see from Wolfram Alpha that the Beta(1, 1) prior has a constant PDF. I guess that’s equivalent to the expected value calculation I did above, without really knowing what I was doing.

I still don’t quite get how to go from Bayes’ theorem to this formula. The form is obviously similar, but Bayes’ formula is based on events, not parameters. Got any suggestions on reading material?

Any decent probability textbook will cover the density-based version of Bayes theorem. I know that Bean has it (see Section 4.1.10).

I’ll derive a related result with no calculus or anything except two identities about Pascal’s triangle:
C(N+1,S) + C(N+1,S+1) = C(N+2,S+1)
C(N+1,S) / C(N+2,S+1) = (S+1)/(N+2)
This will not be a rigorous answer to the OP, but may suggest a different insight.

Restate the problem with an urn of red and non-red balls. Suppose the urn started with only T = N+1 balls. A priori you assume that N+2 cases (0 red balls, 1 red ball, 2 red balls, …, N+1 red balls) are each equally likely. After N observations, you now know that there were either S+1 or S red balls in the urn initially. Simplest is to reject the now-impossible cases and consider that S and S+1 each had a 50% chance.

In the former case there were
a = C(N+1,S+1)
ways to sequence the N+1 balls. BUT you observed only one of these ways. Your observation therefore had an a priori chance of 1/(2a). In the latter case there were
b = C(N+1,S)
sequences, but again your observation restricts to a single case, which had an a priori chance of 1/(2b).

Following the normal way to convert a priori probabilities to a posteriori, the probability the final ball is red is the ratio (1/2a) / (1/2a + 1/2b) = b / (a + b). Apply the identities mentioned above:
b / (a + b) = C(N+1,S) / ( C(N+1,S+1) + C(N+1,S) )
= C(N+1,S) / C(N+2,S+1)
= (S+1)/(N+2)
I’ve gone through the T=N+1 case because it is simplest. Does this apply also for T=N+2, T=N+3, …, and T=∞? Take a leap of intuition (or calculus), or just a leap of faith as in The Last Crusade with Harrison Ford!
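The T = N+1 bookkeeping above can also be brute-forced (a sketch; exact arithmetic via Fraction, and the helper names are mine). It enumerates every equally likely starting red-count, weights each by the chance of the one observed sequence, and reads off the chance that the leftover ball is red:

```python
from fractions import Fraction

def falling(x, k):
    # falling factorial x * (x-1) * ... * (x-k+1); zero when k > x
    out = 1
    for i in range(k):
        out *= x - i
    return out

def prob_last_red(S, N):
    T = N + 1
    num = Fraction(0)
    den = Fraction(0)
    for reds in range(T + 1):
        prior = Fraction(1, T + 1)                   # uniform over the N+2 cases
        seq = Fraction(falling(reds, S) * falling(T - reds, N - S),
                       falling(T, N))                # one specific observed order
        den += prior * seq
        if reds == S + 1:                            # the leftover ball is red
            num += prior * seq
    return num / den

print(prob_last_red(3, 7))  # Fraction(4, 9), i.e. (S+1)/(N+2)
```

Note that impossible starting counts contribute zero automatically, because the falling factorial hits a zero term, so there is no need to reject them by hand.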

Yes, “Beta(1, 1)” is just a fancy way of saying “uniform between 0 and 1”.

Bayes’ Theorem is usually phrased in a pretty useless way, which ends up being converted to the form that link gives to be useful.

Let’s look at where Bayes’ Theorem comes from. Consider the conditional probability P(A | B). This is, by definition, P(A & B)/P(B). Conjunction being symmetric, we have also that P(B | A) = P(A & B)/P(A), and thus, putting these together, that P(A | B) = P(B | A) P(A) / P(B).

Usually, the conditions in which we want to apply this are where it’s easy to calculate probabilities of A (and of various alternatives to A), and easy to calculate probabilities of B conditioned on A (and on various alternatives to A), but not easy to calculate any other probabilities directly. We can use this to still calculate P(A | B), indirectly…

Except P(B) is also something we don’t know off the bat. We’ll have to calculate it as well, expanding B into B & (A or A2 or A3 or …) = (B & A) or (B & A2) or (B & A3) or …, where A, A2, A3, …, are various alternatives, comprising an exhaustive, exclusive list of possibilities.

We then get that P(B) = P(B & A) + P(B & A2) + … = P(B | A)P(A) + P(B | A2)P(A2) + … . Our Bayes’ Theorem becomes P(A | B) = P(B | A) P(A)/[the sum of P(B | a) P(a) as a runs through A and all its alternatives].

That’s what the link you noted is talking about, using “f(theta | x)” to stand for “P(the quantity we are interested in has the particular value theta | x)” and using “g(theta)” to stand for “P(the quantity we are interested in has the particular value theta)”. The equation they state is then just the above version of Bayes’ Theorem [with different variable names].

Note that all the complication in the denominator is just there to normalize the whole thing. If we just focused on relative frequencies, we could ignore it; the observation would be that the ratio P(A2 | B) / P(A | B) is [P(A2) P(B | A2)] / [P(A) P(B | A)]; i.e., probability-ratios scale according to “p-values” when conditioning on new information.
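Concretely, for the coin problem the “alternatives” can be a grid of candidate biases, and the update above is a few lines of Python (a sketch; the grid resolution is an arbitrary choice of mine):

```python
# Discrete Bayes: the alternatives A, A2, A3, ... are candidate biases p
# on a grid; B is the observed data (S heads in N flips).
S, N = 3, 7
grid = [(i + 0.5) / 10_000 for i in range(10_000)]
prior = [1.0 / len(grid)] * len(grid)          # flat prior over p

unnorm = [pr * p**S * (1 - p)**(N - S)         # P(A) * P(B | A)
          for pr, p in zip(prior, grid)]
total = sum(unnorm)                            # the normalizing denominator
posterior = [u / total for u in unnorm]

mean = sum(p * w for p, w in zip(grid, posterior))
print(mean)  # ~0.4444, matching (S+1)/(N+2)
```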

I guess most of that last post is not pertinent to the particular confusion you were asking about, so let me re-emphasize the one salient point in the middle of it:

A value for a parameter defines a corresponding event (specifically, the event that that parameter takes on that particular value).

I would also say that the idea that, after S successes out of N launches, I should expect going forward a success-to-launch ratio of (S+1)/(N+2) doesn’t reflect anything about the way I live my life, nor do I think it should. One may say that this means that I don’t have a uniform prior in my head, and that’s true, but it’s also true that I don’t always have any particular alternative prior in my head, and I don’t think living a rational life demands I do either. In typical situations of uncertainty, investigation will not uncover any probability distributions lurking within me; I see no need to semi-arbitrarily assign numbers to all propositions in accordance with the laws of proportion and then pretend this reflects hidden structure of my internal beliefs.


That all having been said, rest assured that in the OP, you did do the correct math to determine the mean value for a parameter p, controlling the success probability in N independent trials, conditioned on S successes, starting from a uniform prior on p between 0 and 1.

You had asked about how to carry out the relevant integral (of p^n (1 - p)^m over the range from 0 to 1) more generally. The answer actually has a very nice expression in terms of familiar functions: it is n! m!/(n + m + 1)!. [So the calculated value for S successes out of N trials along the lines of the OP is (the integral of p * p^S * (1 - p)^(N - S)) divided by (the integral of p^S * (1 - p)^(N - S)) = (S + 1)! (N - S)!/(N + 2)! divided by S! (N - S)!/(N + 1)! = (S + 1)/(N + 2), as expected.]
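In code, that closed form makes the whole calculation exact (a sketch using only math.factorial; the function name is mine, and it also reproduces the OP’s Wolfram Alpha number):

```python
from math import factorial

def beta_integral(n, m):
    # integral from 0 to 1 of p^n (1-p)^m, via the closed form n! m! / (n+m+1)!
    return factorial(n) * factorial(m) / factorial(n + m + 1)

S, N = 3, 7
print(beta_integral(S, N - S))    # 0.003571..., the OP's denominator
posterior_mean = beta_integral(S + 1, N - S) / beta_integral(S, N - S)
print(posterior_mean)             # 0.444... = (S+1)/(N+2)
```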

To understand why the integral comes out this way, let us think more generally about convolutions (the convolution at x of a collection of functions is the total of the products of those functions over inputs summing to x; your question was about the particular case where the functions are monomials (restricted to nonnegative inputs) and we’re considering the convolution at 1).

Immediately from that definition, we have that convolution is commutative, associative, linear in each argument, etc. Actually, instead of thinking about convolution, let’s think about “donvolution”: whenever the convolution of f and g is h, let us say that the donvolution of F and G is H, where F, G, H are the integrals starting from 0 of f, g, h, respectively. Donvolution inherits all the same commutativity, associativity, and multilinearity properties that convolution has, but will be somewhat more apropos for our purposes.

Note also that convolving a function with 1 (restricted to nonnegative inputs) is as good as integrating it from a starting point of 0; this means donvolving a function with x is as good as integrating it from a starting point of 0.

Thus, the n-fold donvolution of x with itself is its n-fold integral, x^n/n!. Which means x^n/n! donvolved with x^m/m! must be x^(n + m)/(n + m)!.

Taking derivatives to go down to convolution-land, this means x^(n - 1)/(n - 1)! convolved with x^(m - 1)/(m - 1)! must be x^(n + m - 1)/(n + m - 1)!. Shifting indices and moving denominators around, this tells us that x^n convolved with x^m is x^(n + m + 1) n! m!/(n + m + 1)!. Evaluating this at the point x = 1, we get the result we set out to prove.

[For what it’s worth, we’ve made up a special name for this function of n and m yielding the integral from p = 0 to 1 of p^n (1 - p)^m, namely n! m!/(n + m + 1)!. (Or rather, for a silly shifted version of it.) It’s called the Beta function, which is where the name “Beta distribution” comes from.]

To clarify, I’m using “n-fold integral” in the sense where the 1-fold integral of a function is itself, integrating that yields the 2-fold integral, integrating that yields the 3-fold integral, and so on.

Cool! I like it. One of my favorite aspects of math is coming to the same conclusion in an entirely different way (also, I didn’t know about that second identity of Pascal’s triangle, so that’s neat as well). Maybe I’ll try the T=N+2 case for kicks…

True enough. There is some applicability to gaming, where assuming a uniform prior is usually reasonable. But in general, you’re right. Then again, I don’t sell rocket insurance.

Thanks for all the other stuff–that will require a period of reflection to grok; hopefully I have some time this weekend. I follow the basic flow of the “donvolution” argument, but I am so used to convolution in a signal-processing domain that there are some steps I don’t fully get. In particular, I think you are using integration/differentiation as operators and then using the linearity property to move them around (i.e., conv(D(f), D(g)) = conv(D(D(f)), g), etc.). I’m familiar with the approach but haven’t actually used it, so it’s not exactly intuitive. Anyway, I think I can probably puzzle it out from here. Much appreciated!