Another math problem (measuring streaks)

I’m not sure what area of maths this comes under.

But basically I play a lot of chess, and one thing I’ve noticed is that wins and losses tend to come in streaks. This is probably because my ability fluctuates day to day, and also because of the psychology: how wins and losses affect your confidence and frustration level going into the next game.

But how can I measure the “lumpiness” of the data? How do I compare “WWWWLLLL” to “WLWLWLWL”?

Bonus points for including the concept of ELO. Chess servers pair players within a certain ELO range; this should, all else being equal, reduce the likelihood of streaks, since each win puts you into a stronger pool of opponents, and vice versa.

Thinking about it, I can just find the average group size. In WWWWLLLL it’s 4.0, and in WLWLWLWL it’s 1.0. That’s the lumpiness, so then I just need to turn that into a win/loss expectation based on the current group size.
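For what it's worth, the average group size is easy to compute. Here is a minimal Python sketch; the function names are made up for illustration, and it assumes the results are a non-empty string of single-character outcomes.

```python
from itertools import groupby

def run_lengths(results):
    """Lengths of consecutive identical results, e.g. 'WWLW' -> [2, 1, 1]."""
    return [len(list(group)) for _, group in groupby(results)]

def mean_run_length(results):
    """Average size of the groups of identical consecutive results."""
    lengths = run_lengths(results)
    return sum(lengths) / len(lengths)

print(mean_run_length("WWWWLLLL"))  # 4.0
print(mean_run_length("WLWLWLWL"))  # 1.0
```

Note that the average group size is just the number of games divided by the number of groups, so it carries the same information as a count of the groups.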

In general, people tend to underestimate how many streaks turn up in purely random data. If you look at a list of 720 random coin flips, there is roughly a 75% probability that it contains at least eight heads in a row, and the probability of at least eight in a row of either face is higher still. You can’t rely on your own observations of streaks without doing some probability calculations. Reading too much into such streaks is sometimes known as the “hot hand” fallacy.
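Streak probabilities like this can be checked exactly rather than estimated by eye. The following is a sketch of one way to do it, assuming a fair coin and independent flips; the function name and parameters are hypothetical. It tracks the probability of each current streak length, flip by flip, and discards the probability mass once a long enough streak has occurred.

```python
def prob_run_at_least(n_flips, run_length, either_face=False):
    """Exact probability that n_flips fair coin flips contain a streak
    of at least run_length identical results.

    either_face=False counts streaks of heads only;
    either_face=True counts streaks of heads or of tails.
    Assumes run_length >= 2.
    """
    if either_face:
        # A streak of k equal faces is the same as k-1 consecutive
        # "same as the previous flip" events among n-1 comparisons,
        # and those comparisons are themselves independent fair bits.
        n_flips, run_length = n_flips - 1, run_length - 1

    # state[j] = P(no qualifying streak so far, current streak length is j)
    state = [0.0] * run_length
    state[0] = 1.0
    for _ in range(n_flips):
        new_state = [0.0] * run_length
        new_state[0] = 0.5 * sum(state)        # streak broken, start again
        for j in range(run_length - 1):
            new_state[j + 1] = 0.5 * state[j]  # streak extended by one
        # The remaining 0.5 * state[-1] completes a streak and is dropped.
        state = new_state
    return 1.0 - sum(state)

print(prob_run_at_least(720, 8))                    # eight heads in a row
print(prob_run_at_least(720, 8, either_face=True))  # eight of either face
```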

Try e.g. a Wald–Wolfowitz test

ETA: I mean that an obvious difference between WWWWLLLL and WLWLWLWL, even though they have the same number of W’s and L’s, is that they have a different number of “runs.”

I play a lot of backgammon online and have noticed the same phenomenon. With no scientific basis at all, I attribute some of it to the fact that when I am ahead, I tend to take more risks.

You should always establish whether there even is a phenomenon before you start looking for explanations for it. True randomness would have streaks in it. Do you have more streakiness, less streakiness, or about the same amount of streakiness as you would expect? If you have the same amount, there’s nothing to explain. If you have more or less, then you can start looking for explanations, and the explanation you look for will depend on whether it’s more or less.

Sure, but that’s exactly the question. How do we measure streakiness, and in turn, how streaky is my chess history, and so is it a real phenomenon?

Well, just taking your small example: if you flipped 8 coins and won 4 and lost 4, you would expect 5 runs on average (2 × 4 × 4 / 8 + 1 = 5). So WWWWLLLL, with only 2 runs, is streaky, and WLWLWLWL, with 8 runs, is not streaky enough, compared to what you would expect from random results. You need to look at a longer string of results, obviously, but that is one approach.
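The figure of 5 expected runs can be confirmed by brute force. This is only an illustrative sketch with made-up function names: it lists every ordering of 4 wins and 4 losses, treats each ordering as equally likely, and averages the number of runs.

```python
from itertools import combinations, groupby

def count_runs(results):
    """Number of groups of identical consecutive results."""
    return sum(1 for _ in groupby(results))

def expected_runs_by_enumeration(n_wins, n_losses):
    """Average run count over all orderings of the given wins and losses."""
    n = n_wins + n_losses
    run_counts = []
    for win_positions in combinations(range(n), n_wins):
        wins = set(win_positions)
        sequence = "".join("W" if i in wins else "L" for i in range(n))
        run_counts.append(count_runs(sequence))
    return sum(run_counts) / len(run_counts)

print(count_runs("WWWWLLLL"))              # 2
print(count_runs("WLWLWLWL"))              # 8
print(expected_runs_by_enumeration(4, 4))  # 5.0
```

Enumeration is only practical for short sequences, since the number of orderings grows very quickly.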

I am not a professional statistician, but:

DPRK only calculated the mean, so there’s a lack of (formal) arguments for why 2 runs is streaky.

With some assumptions we can make some pretty simple formulas to calculate both expected mean and variance for any number of matches though.

We’ll assume that the ranking system is working pretty well and you have about the same number of wins and losses. And we’ll also assume that the ranking system isn’t so sensitive that the expected outcome trends towards WLWLWL, with you being promoted well above your ability when you win and demoted well below your ability when you lose.

Using the functions in the Wikipedia article linked above and assuming N_+ and N_- both equal 0.5N, we get that the mean and variance of the number of runs, as functions of the number of matches, are:

mean(N) = .5 N + 1
var(N) = (.5 N) (.5 N - 1) / (N - 1) = .25 (N - 1 - 1/(N-1))

If you have a large enough N for the number of runs to be approximately normally distributed, you can then test the observed number of runs against whatever threshold you want to use for your hypothesis test.

You’ll want the standard deviation for that, and
sd(N) = 1/2 * sqrt((N - 1 - 1/(N-1)))

so for DPRK’s example, although it is too short to be normal, that gives a standard deviation of about 1.3, so 2 runs (3 below the expected 5) would be outside of 2 standard deviations, and unusually streaky, if your threshold was “outside of 2 standard deviations”.
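Putting the pieces together, here is a sketch of the runs-test z-score using the general Wald-Wolfowitz mean and variance, so it does not need the "equal wins and losses" assumption. The function name is hypothetical, and it assumes the input contains only 'W' and 'L' (draws dropped beforehand) with at least one of each.

```python
from itertools import groupby
from math import sqrt

def runs_test_z(results):
    """z-score of the observed number of runs (Wald-Wolfowitz runs test).

    Negative values mean fewer runs than expected (streaky);
    positive values mean more runs than expected (alternating).
    """
    n_wins = results.count("W")
    n_losses = results.count("L")
    n = n_wins + n_losses
    runs = sum(1 for _ in groupby(results))
    mean = 2 * n_wins * n_losses / n + 1
    variance = (mean - 1) * (mean - 2) / (n - 1)
    return (runs - mean) / sqrt(variance)

print(round(runs_test_z("WWWWLLLL"), 2))  # -2.29
print(round(runs_test_z("WLWLWLWL"), 2))  # 2.29
```

As noted above, the normal approximation is not trustworthy for a sequence this short, so the z-score only becomes meaningful with a much longer record.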

What I would do is take a large tournament and for each entrant figure out the number of wins after a win and the number of wins after a loss. Average this over all contestants and finally see if the probability of a win after a win is higher than that of a win after a loss.
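For a single player's record, the two conditional win rates could be computed along these lines. This is a rough sketch with a made-up function name; it assumes a string of 'W' and 'L' in playing order in which both a win and a loss are followed by at least one more game.

```python
def win_rates_after(results):
    """Return (P(win | previous win), P(win | previous loss))."""
    pairs = list(zip(results, results[1:]))  # (previous game, next game)
    after_win = [cur for prev, cur in pairs if prev == "W"]
    after_loss = [cur for prev, cur in pairs if prev == "L"]
    p_after_win = after_win.count("W") / len(after_win)
    p_after_loss = after_loss.count("W") / len(after_loss)
    return p_after_win, p_after_loss

print(win_rates_after("WWWWLLLL"))  # (0.75, 0.0)
print(win_rates_after("WLWLWLWL"))  # (0.0, 1.0)
```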

You could also do this for an entire season of a pro league. Not baseball, because teams generally play three in a row with the same opponent and the better team is likely to do WW more often.

Whenever this has been seriously studied, the conclusion is that there is no such thing as a hot hand. Streaks happen, but they do in random coin flips too.

I think some people may have misunderstood the OP a little (clearly my fault, if multiple people did).
I’m mostly talking about win or loss streaks within a single sitting. There’s nothing superstitious about this; I just can have good or bad days.
I play blitz chess, and typically play about 10-15 games of an evening. I sometimes win or lose all of those games.

An example of a particularly bad run might be starting at what is already a low ELO for me, and losing 8 games in a row. I’m usually tearing my hair out by this point.
Then the next day, whatever cloud was over my mind the previous day is lifted and I play at my more typical level, and get a win/loss record of W13 : D1 : L1, say.

Unfortunately I didn’t realize that chess.com has suspended the facility to download all games. I was hoping to do an analysis over thousands of games, but it seems it’s going to take a while just to download a few hundred.

While it’s definitely possible to be in a bad mental spot and end up losing multiple games in a row against equally matched opponents, it’s a lot harder to be in a good mental zone and suddenly get a streak of wins against equally matched opponents. Some of them may be having “off” nights as well, but unless your matches are all played against the same person, it’s unlikely for all of them to have “off” nights.

A 100 point ELO difference is only supposed to represent a 64% win probability. So it seems like either:

  • you’re a much better player (once you get in a good mental zone and don’t sabotage yourself) than your ELO would otherwise suggest

  • the matchmaking system is set up specifically to cluster much better ELO opponent matches for you following any string of victories (and vice versa for defeats).

  • The streaks that you remember are just confirmation bias, and your actual streak occurrence rate is no better than random chance.
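The 64% figure for a 100 point difference comes from the standard Elo expected-score curve. A minimal sketch, with a hypothetical function name (strictly this is an expected score, which counts a draw as half a win):

```python
def elo_expected_score(rating_diff):
    """Expected score for a player rated rating_diff points above the opponent."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400))

print(round(elo_expected_score(100), 2))  # 0.64
print(elo_expected_score(0))              # 0.5
```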

But this is why ELO is a complication here.

To put actual numbers on it; my ELO spends most of the time between 1780 and 1850. But I get occasional spikes and troughs, and the troughs are much deeper than the spikes are tall so the low and high water marks are 1600 and 1950 respectively.

So, what happens is, sometimes I get frustrated for whatever reason and then get a string of losses (and a sore hand, from punching my table).
The next day, my ELO is way down to 1660, say, and the computer is going to match me against weaker opponents on average (e.g. the first game is necessarily against an opponent rated 1560 - 1760 because it will always find someone ± 100). So, if my actual playing level is back to “normal”, then a winning streak becomes much more likely.
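To get a feel for how much an underrated player's streak chances improve, here is a rough sketch. It rests on several simplifying assumptions that are not from the site itself: draws are ignored, every opponent is rated exactly at the player's displayed rating and plays at that strength, the player performs at a fixed true strength, and ratings update by plain Elo with K = 16. The function names are made up.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expected score (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400))

def streak_probability(true_strength, displayed_rating, n_games, k=16):
    """Chance of winning n_games in a row under the assumptions above."""
    probability = 1.0
    for _ in range(n_games):
        # The result depends on true strength, not on the displayed rating.
        probability *= expected_score(true_strength, displayed_rating)
        # A win against an equally rated opponent gains k/2 points,
        # so the next opponent is slightly stronger.
        displayed_rating += k * (1 - expected_score(displayed_rating, displayed_rating))
    return probability

print(streak_probability(1800, 1660, 5))  # playing at 1800 while rated 1660
print(streak_probability(1800, 1800, 5))  # rated where you belong
```

Comparing the two printed values shows how much more likely a five-game winning streak is when the displayed rating has dropped well below the actual playing level.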

But:

…this is possible too.
I’m not going to remember or care about the days where my wins and losses were pretty even and my rating made little movement. And this is frequently the case, as implied by the second sentence of this post. So that’s why I was interested to try to do an analysis.

First of all, a nitpick: Chess rankings (or any other rankings using the same system) are called “Elo”, not “ELO”. It’s not an acronym; the system was developed by a guy named “Elo”.

Second, your Elo ranking really shouldn’t be fluctuating that much. If you’ve been playing for a few years, then any given bad day will only be a fraction of a percent of all of the games you’ve played, and should thus only make a small difference in your ranking, unless the site is doing something like weighting recent games more heavily.

Nitpick: it’s a rating (a number representing your skill level), not a ranking (an ordinal representing how many players are better than you).

I know that the Elo algorithm takes into account how many games have been played. If you create a new account, your rating may move by hundreds of Elo (thanks for the nitpick) in a single game. However, it bottoms out at a 16 point total maximum shift per game.

Let’s say chess.com matches me with someone with the exact same rating as me. I will gain 8 Elo for a win, lose 8 Elo for a loss. A particularly bad sitting can easily mean dropping 80 Elo, and falls of 150 can and have happened over a couple of days.
There’s even a word for it in chess and poker parlance: “tilt”. Being tilted can mean getting too frustrated, too focused on the result, and making bad decisions.
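The 8-point figure matches a plain Elo update with K = 16. A sketch of that arithmetic follows; the function name is hypothetical, K = 16 is inferred from the 16-point maximum shift mentioned above, and the site's real formula may well differ.

```python
def elo_update(rating, opponent_rating, score, k=16):
    """New rating after one game; score is 1 for a win, 0.5 draw, 0 loss."""
    expected = 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400))
    return rating + k * (score - expected)

print(elo_update(1800, 1800, 1.0))  # 1808.0
print(elo_update(1800, 1800, 0.0))  # 1792.0

# Ten straight losses, each against an opponent at the current rating.
rating = 1800.0
for _ in range(10):
    rating = elo_update(rating, rating, 0.0)
print(rating)  # 1720.0
```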

I know nothing about statistics, but after reading one of the above cited articles on the “hot hand” in basketball and other posts, I have a question. There seems to be an emphasis on random event comparisons at the root of the explanations. However, whether or not a player makes his next shot isn’t a random event, is it? The chance of making an individual shot isn’t a random event. I guess you could look at history and determine what percentage of attempted shots under identical circumstances went in, but are the circumstances ever identical? They tried to control for difficulty in one of the studies by using distance. But that, alone, doesn’t determine difficulty.

Bob_2 said he thinks he takes more risks when on a winning streak. By definition, more risks should equal more failures, right? But maybe his riskier moves throw his opponent off and lead to mistakes? I don’t know but it seems to me that activities that require skill, success and randomness don’t belong together. There are too many variables to consider and no way to control for all of them.

ISTM the type of simple statistics we are discussing should be sufficient to distinguish the case that the chances of making a shot are more (or less) likely following a streak than if it were independent. There is the issue that many variables change from game to game, so you would not necessarily want to combine unconditioned data from different games. In the chess case, the assumption might be that while one’s skill can grow better or worse, it changes sufficiently slowly so that if you sit down and play a series of games against a single opponent, each result will be identically distributed (maybe not 50/50 if the two players are of equal skill, but whatever it is will not change over the course of a single match). Elo-type ratings attempt to quantify what those odds actually are, but for a basic runs test we do not need to know them.

The rule of thumb is that people tend to underestimate the number of streaks in a random stream. For instance, try the following experiment. Ask several people to write down a 1000-long stream of Hs and Ts, not by doing anything random themselves but just by writing them down as fast as possible in a way that they will think will look random. Then have someone get a coin and flip it one thousand times. Have them write the results down as a stream of Hs and Ts (for heads and tails). Nearly always the stream with the most streaks of Hs and Ts in it is the one actually derived from flipping a coin, because people don’t realize how many streaks are expected at random. So-called “human random” is not really random.
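One simple way to score the sequences in this experiment is by their longest streak. A small sketch, assuming each sequence is a non-empty string of 'H' and 'T'; the function name and the seed are arbitrary choices for illustration.

```python
import random
from itertools import groupby

def longest_run(sequence):
    """Length of the longest streak of identical consecutive symbols."""
    return max(len(list(group)) for _, group in groupby(sequence))

print(longest_run("HHTHHHT"))  # 3

# Simulated stand-in for 1000 real coin flips (seeded so it is repeatable).
rng = random.Random(2024)
coin_flips = "".join(rng.choice("HT") for _ in range(1000))
print(longest_run(coin_flips))
```

Running the same function on the hand-written sequences gives a direct comparison with the coin.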