Pasta:
You are correct that the formula doesn’t actually use the correct winning percentage, instead using what amounts to the probability of beating a team with a 50% win likelihood. I didn’t bother checking to see whether this worked for a distribution. Of course we have no idea whether uniform is the correct baseline, but then again, for this problem we are going to have to make unwarranted assumptions, so why not. Still, as I show below, there might be some way to use the observed distribution of team winning percentages to figure out what this should be.
**Pleonast**
You are partially right for the reasons suggested by pasta: the win percentage should be versus a team drawn from a distribution rather than versus a team with a 50% record. However, no matter what the distribution is, a team with a win percentage of 60% should be better than a randomly picked team (otherwise its record would be 50% or less), so a 40% team should have less chance of beating it than it would a random team (which it beats 40% of the time). The probability of its victory should therefore definitely be less than 40%.
ultrafilter: You asked for it, so don’t say you weren’t warned:
**Notation:**
I will use **Phi(x;s)** to indicate the cumulative distribution function, evaluated at x, of a normal distribution with mean 0 and variance s, so Phi(0;1)=0.5 and Phi(1.96;1)=0.975.
I will use InvPhi(p;s) to denote the inverse of this function, so InvPhi(0.5;1)=0 and InvPhi(0.975;1)=1.96.
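For anyone who wants to play along, here is a minimal Python sketch of this notation. The function names `phi` and `inv_phi` are my own, and I am assuming the standard library's `statistics.NormalDist`, which takes a standard deviation rather than a variance, hence the square roots.

```python
from statistics import NormalDist

def phi(x, s):
    """CDF at x of a normal distribution with mean 0 and variance s."""
    return NormalDist(mu=0.0, sigma=s ** 0.5).cdf(x)

def inv_phi(p, s):
    """Inverse of phi: the value x such that phi(x, s) == p (needs 0 < p < 1)."""
    return NormalDist(mu=0.0, sigma=s ** 0.5).inv_cdf(p)

print(round(phi(0.0, 1.0), 3))       # 0.5
print(round(phi(1.96, 1.0), 3))      # 0.975
print(round(inv_phi(0.975, 1.0), 2)) # 1.96
```

Note that changing the variance only rescales the quantile: inv_phi(p, s) equals sqrt(s) times inv_phi(p, 1).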
**Assumptions:**
I assume that each team has an underlying average number of points that it scores in a game, and that across teams this average has distribution F. This seems reasonable.
I further assume that if a team with average points x meets a team with average points y, then score1-score2=x-y+z, where z is a symmetric random variable drawn from a distribution G with mean 0. If this is greater than 0, team 1 wins; if it is less than 0, team 2 wins. *This is a big assumption: it assumes that there are no relative strengths or weaknesses or increases in variability between teams.*
I also assume that there is enough history that the observed winning percentage is equal to the exact winning percentage. There may be more calculations that can take this into account, but it’s too much trouble to bother with now. Bayesians can eat my shorts.
For this exercise I will make the final assumption that F is distributed as N(0,1), while G is N(0,s) for some unknown variance s. Small s indicates that luck has little role and a good team will almost always beat a poor team. Large s indicates that the results are largely random and each team will win close to half the time regardless of its opponent. *Another ginormous assumption. The distribution F doesn’t matter too much and only does so relative to G, but assuming that they are related in this way requires some faith. Other models can be assumed, but the calculations are easier for the normal.*
**Results:**
The probability of a team with win percentage A beating a team with win percentage B is
Phi(InvPhi(A;1+s)-InvPhi(B;1+s);s)
where s = [1-Var( InvPhi(p_i;1) ) ] / Var( InvPhi(p_i;1) ) for p_i the observed winning percentages of all teams in the league.
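As a rough sketch of the first formula in Python (the name `win_probability` and the trial values of s are made up for illustration, not estimated from any real league):

```python
from statistics import NormalDist

def win_probability(a, b, s):
    """P(team with win pct a beats team with win pct b) under the normal model.

    s is the variance of the luck term G; team strengths are N(0, 1).
    """
    strength_scale = NormalDist(0.0, (1.0 + s) ** 0.5)  # Phi(. ; 1+s)
    luck = NormalDist(0.0, s ** 0.5)                    # Phi(. ; s)
    x = strength_scale.inv_cdf(a)  # implied strength of team 1
    y = strength_scale.inv_cdf(b)  # implied strength of team 2
    return luck.cdf(x - y)

# A 40% team against a 60% team, for a few hypothetical values of s.
for s in (0.25, 1.0, 4.0):
    print(s, round(win_probability(0.4, 0.6, s), 3))
```

For each of these trial values the result comes out below 0.4, which fits the argument that a 40% team should beat a 60% team less often than it beats a random team.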
**Proof:**
Given a baseline average of x for a team, the probability A of winning against a random team is (F*G)(x), where * denotes convolution: the team wins when x-y+z>0, i.e. when y-z<x, and since z is symmetric, y-z has distribution F*G. For the models we are using this is Phi(x;1+s). So if we observe a team with winning percentage A, the average point count of that team is InvPhi(A;1+s).
If we have a team with point count x vs a team with point count y, then the probability of team 1 winning is P(x-y+z>0)=P(z>y-x)=P(z<x-y)=G(x-y), where the middle step uses the symmetry of z. So if we have a team with record A vs a team with record B, the probability of team 1 winning is Phi(InvPhi(A;1+s)-InvPhi(B;1+s);s).
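If you don't trust the convolution step, a quick Monte Carlo check is easy enough. This is only a sanity-check sketch: the function name, the strength x=0.5, the luck variance s=1 and the number of games are all arbitrary choices of mine.

```python
import random
from statistics import NormalDist

def simulate_win_rate(x, s, games=200_000, seed=0):
    """Fraction of games won by a team of strength x against random opponents."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(games):
        y = rng.gauss(0.0, 1.0)       # opponent strength drawn from F = N(0, 1)
        z = rng.gauss(0.0, s ** 0.5)  # luck drawn from G = N(0, s)
        if x - y + z > 0:
            wins += 1
    return wins / games

x, s = 0.5, 1.0
simulated = simulate_win_rate(x, s)
predicted = NormalDist(0.0, (1.0 + s) ** 0.5).cdf(x)  # Phi(x; 1+s)
print(simulated, predicted)
```

The two printed numbers should agree to within ordinary simulation noise.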
This is ok so far, but we still need to know what s is.
Let us look at all of the teams in the league, and suppose that V(p) is the cumulative distribution of winning percentages. Since the underlying score is distributed as N(0,1) across the teams,
we know that for p~V,
InvPhi(p;1+s)~N(0,1), so
sqrt(1+s) * InvPhi(p;1)~N(0,1), so
InvPhi(p;1)~N(0,1/(1+s)).
So all we have to do is look at the distribution of InvPhi(p;1) for all observed winning percentages p, and we should end up with a normal distribution with variance 1/(1+s), from which we can solve for s. Of course, if we get a distribution that looks nothing like a normal, that will indicate a problem with our assumptions.
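Here is a sketch of that last step in Python, tried on a made-up league rather than real data. The name `estimate_s`, the league size of 5000 and the true value s=1 are all assumptions for the demonstration; a real league has far fewer teams, so the estimate will be much noisier, and a team with a perfect or winless record would make `inv_cdf` raise an error.

```python
import random
from statistics import NormalDist, variance

def estimate_s(win_pcts):
    """Solve Var(InvPhi(p;1)) = 1/(1+s) for s, given win percentages in (0, 1)."""
    std_normal = NormalDist()  # mean 0, variance 1
    scores = [std_normal.inv_cdf(p) for p in win_pcts]
    v = variance(scores)       # sample variance of InvPhi(p_i;1)
    return (1.0 - v) / v

# Synthetic league: strengths from N(0, 1), exact win pct = Phi(x; 1+s).
rng = random.Random(1)
true_s = 1.0
exact_pct = NormalDist(0.0, (1.0 + true_s) ** 0.5)
win_pcts = [exact_pct.cdf(rng.gauss(0.0, 1.0)) for _ in range(5000)]

print(estimate_s(win_pcts))  # should land near true_s
```

I used the sample variance here; since the model says the mean is 0, you could also use the mean of the squared values instead.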