Bayesian updating of probabilities in rating systems due to head-to-head results

I know this is mainly a general information forum and I’m asking what is likely a highly technical question, so maybe I just need direction to a forum where these kinds of questions are more typical. But this is the place for me to get answers to things that I can’t manage to Google, so I’ll start with the place that I know.

So I’m working on a system for predicting the outcome of sporting/gaming events between two individual competitors, where the result is a win for one and a loss for the other and no other information is available about how that outcome occurred. I have implemented the Glicko2 system as a method of rating these players. The system provides a point estimate of each player’s rating and a variance of that estimate, from which one derives the probability of a player winning a match against another player. This probability is my “prior”.
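For concreteness, here’s roughly how I derive that probability; this is a sketch, pooling both players’ deviations the way the original Glicko paper does for predictions (whether that’s the right choice for Glicko2 is an assumption on my part, and the constants are the standard scale conversion):

```python
import math

def g(phi):
    # Glicko g-function: discounts the rating gap by rating uncertainty
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def win_prob(r1, rd1, r2, rd2):
    # Convert display-scale ratings/deviations to the Glicko2 internal scale
    mu1, mu2 = (r1 - 1500.0) / 173.7178, (r2 - 1500.0) / 173.7178
    phi1, phi2 = rd1 / 173.7178, rd2 / 173.7178
    # Expected score for player 1, pooling both players' uncertainties
    return 1.0 / (1.0 + math.exp(-g(math.sqrt(phi1 ** 2 + phi2 ** 2)) * (mu1 - mu2)))
```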

There is also a head-to-head history between these players. Sometimes it corresponds well with their ratings at the times they met previously, other times it does not, and I figure that there should be a way of updating the probabilities that each player wins based on the additional information of their head-to-head results. We would have from Bayes’ formula:

W = win, L = loss
D = head-to-head data

P(W|D) = P(D|W) * P(W) / (P(D|W) * P(W) + P(D|L) * P(L))

P(W) and P(L) are just the prior probabilities. To use the head-to-head data, we need P(D|W) (and P(D|L)): the probability of observing the head-to-head record given that the player will actually win (lose) the next match. Looking through the probability text I kept from school a decade ago, I found that in every such Bayes’ formula problem this probability is simply given. But how do I work it out in this case?
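The mechanical part of the update is trivial once the likelihoods exist; it’s P(D|W) and P(D|L) themselves that are the problem. A sketch of just the mechanical part:

```python
def posterior_win_prob(p_win, p_data_given_win, p_data_given_loss):
    # P(W|D) = P(D|W)P(W) / (P(D|W)P(W) + P(D|L)P(L)), with P(L) = 1 - P(W)
    num = p_data_given_win * p_win
    return num / (num + p_data_given_loss * (1.0 - p_win))
```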

I know the system’s estimate of the probability of each player having won each of their previous matches, so I have an expected number of HTH wins to compare with the actual number of HTH wins, as well as the variance of the distribution of the actual number of HTH wins. With these values I’m expected to find the probability that the HTH data would occur given that the next match goes a certain way. I can see how this is in theory calculable, but I have absolutely no idea how to actually calculate it; I don’t remember that much probability theory, and this may go beyond what I learned anyway. I’m guessing you have to integrate something against something else, but I’m at a complete loss for details.
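For reference, the two summary numbers I mean are just the moments of a sum of independent Bernoulli trials (a Poisson-binomial variable), computed from the per-match win probabilities the system recorded:

```python
def hth_moments(win_probs):
    # win_probs[i]: the system's probability that this player won past
    # HTH match i, evaluated at the ratings of the time
    expected_wins = sum(win_probs)
    variance = sum(p * (1.0 - p) for p in win_probs)
    return expected_wins, variance
```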

I’ve tried to do some research on Bayesian methods, but I haven’t found anything that I could tell for certain was applicable to the situation I was in. To be sure, there is plenty out there regarding Bayesian inference, but if there’s anything regarding how to solve this particular problem it appears to be over my head, and I’m hoping someone here can offer me a bit more guidance as to how to approach this problem or what examples exist somewhere online that are similar enough to what I’m doing to provide the framework for me to make these calculations.

Without using Bayesian inference at all, I decided to model the change in probability as a change in each player’s rating. The change is the standard deviation of the player’s rating (from the Glicko system), multiplied by the number of standard deviations (of the distribution of the head-to-head results) by which the actual number of head-to-head wins differs from the expected number, multiplied again by some coefficient, which I then vary to see which value gives the best historical results. I have no idea if this is remotely statistically valid, but I was able to get the system to make slightly better historical predictions than it made without considering the HTH data. Thinking about how I might justify that calculation, I realized that I should be doing some Bayesian reasoning, and am stuck with the problem presented above.
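In code, my ad hoc adjustment is something like the following (coeff is the free parameter I tune against history; the names are mine):

```python
import math

def heuristic_rating_shift(rating_sd, actual_hth_wins, expected_hth_wins,
                           hth_variance, coeff):
    # How many SDs the actual head-to-head record deviates from expectation
    z = (actual_hth_wins - expected_hth_wins) / math.sqrt(hth_variance)
    # Shift the player's rating by that many of his own rating SDs, scaled
    # by a coefficient fit to maximize historical predictive accuracy
    return coeff * z * rating_sd
```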

I don’t know the Glicko2 system specifically, but doesn’t it already include some technique for doing this? I can’t see how a rating system would be much use without such a technique.

The stats Qs are above my pay grade.

One comment I would make is that not all prior H2H data is created equal. A recent result has greater real-world predictive power than one from 5 years ago. So you’ll need to scale the H2H data by age somehow.

Consider the case where over the last several years player A beat player B five times in a row, then B beat A the next (= most recent) 5 times they met. Their H2H records are each 5/10 = 50%. My money’s all on B though.
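One simple way to do that scaling is an exponential decay by age; the half-life here is a made-up tuning knob:

```python
def decayed_hth_record(results, ages_in_years, half_life=2.0):
    # results[i] is 1.0 for a win, 0.0 for a loss; weight each by its age
    weights = [0.5 ** (age / half_life) for age in ages_in_years]
    return sum(w * r for w, r in zip(weights, results)) / sum(weights)
```

With that weighting, A’s five older wins count for less than B’s five recent ones, so the nominal 50% record tilts toward B.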

I’m having a little trouble following you but I don’t think that you are setting up the problem quite right. Let me try setting it up for you.

  1. you have N players
  2. each player has a true “rating”, R_i, which is unknown, but for which you have a prior distribution with a given mean and variance. (Can we assume it’s normal?)
  3. Given two players with true ratings R_1 and R_2, you have a function W(R_1,R_2) that gives the probability that player 1 wins
  4. You have some data that gives actual wins and losses of a number of head to head competitions.

For now, ignore the data and let’s just concentrate on the probability that a player wins a given match-up. Assuming that the ratings are discrete, the probability that player 1 wins against player 2 is:

P(Player 1 wins) = Sum( W(x,y) * P(R_1=x) * P(R_2=y) ), <<equation 1>>

where the sum is over all possible ratings x and y, and P(R_1=x) is calculated from the prior probabilities for player 1 (similarly for R_2 and player 2). If the ratings are not discrete, the sum becomes an integral.
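As a code sketch, discretizing onto a grid (I’m assuming normal priors and an Elo-style logistic curve for W; both are stand-ins for whatever you actually use):

```python
import math

def discrete_normal(grid, mean, sd):
    # Discretized normal prior over the rating grid, normalized to sum to 1
    w = [math.exp(-0.5 * ((x - mean) / sd) ** 2) for x in grid]
    total = sum(w)
    return [v / total for v in w]

def p_player1_wins(grid, prior1, prior2, W):
    # Equation 1: sum W(x, y) * P(R_1 = x) * P(R_2 = y) over all pairs
    return sum(W(x, y) * p1 * p2
               for x, p1 in zip(grid, prior1)
               for y, p2 in zip(grid, prior2))

# Example: logistic W, ratings on a 10-point grid
grid = [1000 + 10 * i for i in range(101)]
W = lambda x, y: 1.0 / (1.0 + 10.0 ** (-(x - y) / 400.0))
prior1 = discrete_normal(grid, 1550.0, 80.0)
prior2 = discrete_normal(grid, 1500.0, 120.0)
print(p_player1_wins(grid, prior1, prior2, W))
```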

Now, where the Bayesian part comes in is in updating the prior probabilities. Suppose we observe that player 1 did in fact beat player 2; then we update the prior probabilities as follows:

P(R_1=x | Player 1 wins) = P(R_1=x and Player 1 wins) / P(Player 1 wins)
= P(R_1=x) * Sum( W(x,y) * P(R_2=y) ) / Sum( W(z,y) * P(R_1=z) * P(R_2=y) )

where x is fixed, the sum in the numerator is over all y, and the sum in the denominator is over all z and y. If player 1 loses, then you simply replace W(x,y) with 1-W(x,y) (and likewise W(z,y) in the denominator). You can similarly update the priors for player 2. Then you go on to the next game…

Once you have gone through all your data and updated your priors, you can go back and calculate the likelihood of a given player winning against another player from equation 1, using the updated priors.
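Reusing the helpers from the sketch above, the update after an observed win looks like this (note the P(R_1=x) factor in the numerator):

```python
def update_after_win(grid, prior1, prior2, W):
    # Posterior P(R_1 = x | player 1 wins), proportional to
    # P(R_1 = x) * Sum_y W(x, y) * P(R_2 = y)
    unnorm = [p1 * sum(W(x, y) * p2 for y, p2 in zip(grid, prior2))
              for x, p1 in zip(grid, prior1)]
    total = sum(unnorm)            # equals P(player 1 wins), equation 1
    return [u / total for u in unnorm]

# After a loss, use 1 - W(x, y) in place of W(x, y); update player 2's
# prior the same way, then move on to the next game.
```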

Glicko is a generalized rating system like Elo, telling you on average how strong a player is. It does not take into account any opponent-specific advantages or disadvantages; it only has a mean and variance. Since in this game it is reasonably possible to have an advantage or disadvantage relative to the field against a specific player (or type of player, which would show up in the effect on each individual player of that type), I am trying to use the history of those specific match-ups to adjust the probabilities that the opponent-agnostic ratings give.

If the ratings say that X should beat Y 75% of the time, and their relative ratings have been fairly consistent over their careers, but X’s actual winning rate against Y is 50% over a significant number of matches (say 50), then I have evidence that the ratings are not giving me a correct probability for that specific match-up, even though the ratings those probabilities are derived from may be accurate in general.

I thought it would be a straightforward application of Bayes’ theorem to update the prior probability from the rating system with the additional information of previous results drawn from what might be considered the same distribution. Maybe it is in theory, but it might require knowing a probability that is not actually determinable. I think there’s enough information to pin that probability down; I just have no idea what calculation would get me there.

This is certainly within the realm of possibility to add into the calculations, once I know what calculations it is I need to do…

What you’re calculating here is the probability that player 1’s rating is x given that he wins. I’m not interested in updating what I feel player 1’s rating should be. I’m interested in updating the specific probability that he beats player 2 given the head-to-head data, with the prior probability being the one based solely on overall ratings. That rating doesn’t take into account the fact that one has “samples” from trials of p1 v. p2 that might modify how one feels about the specific distribution of the binomial random variable that is the result of p1 v. p2.

I start with the knowledge of their overall ratings so I can make a good guess at the binomial probability, and then I want to modify it based on having samples from the actual (well, presumed actual) probability distribution that we’re interested in determining. That is, I believe the probability that a player wins depends partly on how strong a player he is overall, and partly on how much more or less likely he is to beat this particular opponent compared to a random person from the field, which is what I’m trying to estimate from the prior head-to-head data.

It looks like a straightforward Bayes’ formula problem, except that it asks for a probability I don’t know but might in theory be able to calculate. Whatever theory is required to calculate the probability given in the OP, I don’t know enough of it. It may just be that it’s indeterminate and Bayes’ theorem cannot be applied, but it doesn’t seem like that’s necessarily the case.

So, to be sure I’ve got this straight: In this game, one might have players A, B, and C, who are equally-strong players overall (and who thus have the same rating), but due to details of their playstyles or options, it might be the case that (for instance) A will beat B 60% of the time, B will beat C 60% of the time, and C will beat A 60% of the time. Except you don’t yet know those numbers, but instead want to calculate them based on a combination of ratings and known specific matchups.

Sorry, I’m tired and didn’t have time to read all of it, and I’ve only done Elo-type stuff before (not my main area of expertise by any means), but based on a general scan it seems like you might want something like a Bradley-Terry model:
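In its simplest form, each player i gets a positive strength p_i, and the win probability is just a ratio; a minimal sketch:

```python
def bt_win_prob(p_i, p_j):
    # Bradley-Terry: P(player i beats player j) = p_i / (p_i + p_j)
    return p_i / (p_i + p_j)
```

The strengths are usually fit by maximum likelihood from the full win/loss table.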

After you’ve kicked this around here for a couple of days, you might try stack overflow.

FWIW, that’s also my interpretation of his goal. Not that I can help him / you beyond that; stats is almost entirely outside my domain.

Correct. I have their overall ratings, and I have a sampling of their past head-to-head results which I believe should be weighted more heavily in determining precisely how likely one player is to win in the specific match-up. Given that I am keeping track of how likely those head-to-head results are given their ratings, I think there should be in theory a way of calculating how to update the probability of each player winning that specific match-up based on how (un)likely their head-to-head results have been.

That model does not appear to take into account any aspects of head-to-head results, only overall results, as the end parameters are only the 30 “specific abilities” and not a matrix of opponent-dependent abilities. The only wrinkle there is the Home Team effect, which is not at all similar to the effect I’m looking for.

Thank you for this advice; I was not really familiar with anywhere to get answers to specific questions of this level of technicality. I may have better luck finding someone who’s familiar with this kind of calculation there.

I was intrigued by this problem, but had little to offer OP beyond what he already knew. Intoxicated enough to be garrulous, I’ll post anyway. :stuck_out_tongue:

As glowacks observes, he’ll need some additional parameter to impose a weighting for head-to-head vs. all-league stats. Note that this parameter will vary by game, e.g. between, say, Stratego and checkers. What’s the best among various ways to model that extra parameter? I don’t know. I’ll guess that multiple papers have been written on the topic, but I have no idea whether it’s a “well solved problem.”

So … the task would be to amass lots of data. (C vs. D, etc. would be useful to help tune that per-game weight estimate, even though only A vs. B is targeted.)

Then a model is needed (i.e. how to formulate the probabilities), and you can estimate the probabilities from the data. Not my specialty; yes, Stack Exchange is probably an excellent place to get answers. Please post the response here.

I read some more probability theory and think I have the answer. I needed to approach it not from the formula-centric perspective I had been using, but by thinking about what Bayes’ formula actually means. I worked through a simple Bayes’ formula problem in MIT’s OpenCourseWare materials, and realized that the probabilities I need to compute are specifically “the likelihood that the data would be observed under the hypothesis that the event whose probability you are interested in happens”, and that the denominator shouldn’t be broken out the way I have it, but should simply be the likelihood of the data occurring given what we know of the model and nothing else.

The ratio of these two probabilities is then used as a multiplier to update the prior probability. If the data is more likely to be observed when the “hypothesis” of a player winning is true than when we don’t know whether he will win or lose, the ratio is greater than 1 and the posterior probability is greater than the prior. Vice versa, mutatis mutandis.

So I do think I have the probability of the data, because I have been keeping track of what the players’ relative ratings were at each match. I can either do a computationally intensive calculation of the probability of obtaining the exact number of prior wins given the relative ratings at each particular time, or I can simplify, modeling the win count as a binomial variable whose per-match probability is the expected number of wins divided by the total number of matches. The simplification is somewhat justifiable in that the relative strength of the players does not, on average, change dramatically over their careers. It will, however, make results that track the overall ratings look slightly less probable than they really are: if the per-match probability truly varied above and below that average, extreme win counts would be less likely and central ones more likely. Looking at the likelihood of each match separately seemed computationally ruinous compared to taking the average, so I’ll probably do the simple way first, see how well it works, and then see if the more complicated version is feasible.
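Concretely, the two options look like this; the exact version is just a simple recurrence over matches, so it may turn out cheaper than I feared:

```python
from math import comb

def exact_win_pmf(win_probs):
    # Exact distribution of total wins when each match has its own
    # probability (Poisson-binomial), built up one match at a time
    pmf = [1.0]
    for p in win_probs:
        new = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            new[k] += q * (1.0 - p)   # this match is a loss
            new[k + 1] += q * p       # this match is a win
        pmf = new
    return pmf                        # pmf[k] = P(exactly k wins)

def binomial_approx(win_probs, k):
    # The simplification: one averaged probability for every match
    n, p_bar = len(win_probs), sum(win_probs) / len(win_probs)
    return comb(n, k) * p_bar ** k * (1.0 - p_bar) ** (n - k)
```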

So the denominator likelihood is simply the binomial probability, based on the past data, of the actual number of wins given the expected number, while the numerator likelihood is the same calculation but with the expected number of wins incremented by the prior probability of winning and the actual number of wins either incremented by one or not.

Now I just have to code it.

Well, the above realization I think is mostly accurate, but the exact way of determining the probability I’m looking for - P(D|W) - is definitely wrong above. I’ll have to think more about what exactly it means, but I think I’m on the right track now and have some idea of what is going on as opposed to just staring mystified at the formulas.

Bravo. Keep us posted.

Mind sharing what sport this is? Or which ones you’re targeting?

I have come to the conclusion that I’m going to need some sort of shortcut, because I’m making semi-contradictory assumptions in my model. I’m assuming that there is an opponent-specific adjustment that has to be made, but I’m using data collected from other times, when I know the probability of each player winning was different than it is currently. So I’m effectively assuming one factor of skill, the opponent-specific one, that doesn’t change over time, and another, general strength, that does. That’s a consequence of the model I’m using, which tries to capture different “styles” of players being better against other styles, and models that only through individual player-opponent effects.

I need to make some sort of decision on how exactly I’m going to calculate both P(D) and P(D|W). I would like it to take into consideration the historical relative ratings of the players at the times they met. At the same time, I need the conditioning on W to have an actual effect on the probability, which means providing a way of recasting how likely I think each player was to win at those particular times in light of observing W.

I think the simplest way to go here is to calculate P(D) using the (expected historical wins) / (total matches) approach, and then for P(D|W), adjust the expected number of wins upward by some amount and do the same calculation. In a perfect world the size of the adjustment would be controlled by exactly how long it’s been since those matches, which I have the data for, but coding it to take into account all those time periods and performing all those calculations would probably be too time-intensive. So I’ll probably go with some sort of decay from linearity as the number of matches increases, with the multiplier based on the current variance of the player. This means I’ll be stuck with some curve-shape parameter that I have to fit to the data, which is alright since I had one with the previous approach, but part of the idea of doing this was to find something on more solid ground. Nevertheless, this analysis has shown that I need to make some sort of assumption about how much past results predict current ones.
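As a sketch of what I mean, with “boost” standing in for the (decayed) upward adjustment to the expected wins; all names here are placeholders:

```python
from math import comb

def binom_pmf(n, k, p):
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def hth_adjusted_win_prob(prior_win, n_matches, actual_wins,
                          expected_wins, boost):
    # P(D): binomial with p = expected historical wins / total matches
    p_d = binom_pmf(n_matches, actual_wins, expected_wins / n_matches)
    # P(D|W): same calculation with the expected wins nudged upward
    p_hi = min((expected_wins + boost) / n_matches, 1.0)
    p_d_given_w = binom_pmf(n_matches, actual_wins, p_hi)
    # Caution: nothing here clamps the result to [0, 1]
    return prior_win * p_d_given_w / p_d
```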

As to which game, it’s meant as a general purpose add-on to Glicko2. Yes, there is one particular sport that I’m interested in making predictions for, but it’s totally irrelevant and rather unpopular.

You’re up to three parameters now, IIUC: the weight of arbitrary-vs-specific opponent selection, A’s improvement speed, and B’s improvement speed. Be aware that the more complex your model, the greater the risk of over-fitting, especially if your training data set is small.

Glicko2 has 3 parameters that are supposed to be determined by how well the data fit them: Initial Volatility, Initial Variance, and Time Evolution of Volatility. I’m well aware that overfitting is potentially a problem, and sometimes very small changes in the model can lead to vast changes in the parameters. However, those usually don’t result in large changes to the actual probabilities. The main issue is how well new players are integrated into the system. One is at some level forced to assume that good new players will have the same sort of early success that previous good players had, but this is not a particularly good assumption, as every player who rises to the top has a different path to success. But the whole idea is that it’s all probabilistic in nature, and I’m trying to come up with the most reasonable estimates of the probability that each player wins. There’s no particular reason why some parameters would be logically more likely to work than the ones that best fit the data. Yes, history will probably not follow the same path as it has, but the path it has taken is pretty much the best guess.

There’s no money changing hands here - it’s all just a point of pride at who can predict the outcome of these contests the best, and I see it as a ripe avenue for probabilistic analysis. I’m being very careful to not introduce anything into the model other than the binary results, to see if careful analysis of them can provide more compelling information about the probabilities of each player winning than someone watching the contests closely and observing the details. I have no particular interest in actually being right - I just want to make the best theoretical model I can and see how it stacks up.

Well, that didn’t work so well. It looked like it worked when I used my metrics for how well the parameters were fitting, but the probabilities it spit out were in some cases absurd: someone given a 75% chance by the ratings, with a 60% historical win rate against that opponent, was having his calculated probability of winning revised to around 45%. So something I’m doing is clearly wrong; I’ll have to do more thinking about this.

If you can’t start with the real stats science, you’re in danger of creating algorithmic mud. Things that seem like plausible, sensible tweaks often aren’t.