A question about chance and averages

I was pondering this scenario the other day:

A group of settlers from the east coast of the United States travel to Kansas circa 1850. Their first 49 days on their new land go swimmingly, but a tornado goes through their settlement on the 50th day. 50 days later, another tornado. 50 days after that, another tornado. So, for the last 150 days there have been 3 tornados, an average of 1 tornado every 50 days. One settler suggests that with each passing day after the last tornado they are “due for another one.” Another settler suggests that with every passing day, a tornado is in fact less likely because every day without a tornado necessarily increases the average number of days per tornado.

So who’s right? Does the number of days without a tornado increase or decrease the likelihood of another tornado coming? What should the settlers expect the day after 49 days of being tornado-free? Am I confusing concepts? Or am I just seriously confused? I would imagine that the second settler is right and more data would have to be collected before the 50 days/1 tornado pattern can be attributed to something other than chance.

I think I should extend this further and assume that this pattern continues for several years and the settlers have not found a reason for the pattern (regular climatic patterns, a vengeful and punctual god, etc.). When does something change from coincidence and chance to a concrete pattern, even if the causality of that pattern is murky?

Let’s use a randomization device instead of weather. Suppose they carry with them a huge barrel of things which are identical from the outside but which can be different inside. Perhaps it’s objects like a drug capsule, so that the barrel is full of tens of thousands of them. Each capsule has either a small black or white piece of cloth inside, but it’s impossible to tell which color is inside before pulling out a capsule and breaking it open. Before leaving they had thoroughly mixed up the capsules in the barrel and agreed that they would randomly pick out one capsule each day for the next few years and do something differently depending on whether the cloth inside is black or white. Someone not in the group doing the travel had made all the capsules and put them in the barrel. The people in the travel group knew that the cloth inside each capsule was either black or white, but they knew nothing about what proportion of the capsules had a black or a white cloth inside.

Suppose, in a way similar to your scenario, that they had been travelling for one hundred and fifty days. There were forty-nine days with a white cloth inside the capsule, then one day with a black cloth, another forty-nine days with a white cloth, then one day with a black cloth, then another forty-nine days with a white cloth, and then one day with a black one. If it were really true that the capsules were completely randomized, the best estimate of the chance that the next cloth is white is 147/150 = 49/50, while the best estimate of the chance that it is black is 3/150 = 1/50. If the cloth found the next day is white, then the best estimate goes to 148/151 white and 3/151 black. If the cloth found the next day is black, then the best estimate goes to 147/151 white and 4/151 black. Assuming the capsules were truly randomly mixed, the pattern of black and white in previous days is completely irrelevant. The ratio of black and white so far is the only thing that matters. There is no such thing as being “due” or “not due” for a black or white cloth.
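
For anyone who wants to check the fractions, here is a minimal sketch of that arithmetic in Python. The function name `running_estimate` is made up for illustration, and it assumes, as the paragraph does, that the draws are independent and that the observed fraction so far is used as the estimate.

```python
from fractions import Fraction

def running_estimate(white, black):
    """Return (P(white), P(black)) estimated as the observed fractions so far."""
    total = white + black
    return Fraction(white, total), Fraction(black, total)

# After 150 days: 147 white, 3 black.
print(running_estimate(147, 3))   # fractions reduce to 49/50 and 1/50
# Day 151 turns out white.
print(running_estimate(148, 3))   # 148/151 and 3/151
# Day 151 turns out black instead.
print(running_estimate(147, 4))   # 147/151 and 4/151
```

Only the two counts go into the function. The order in which the cloths appeared never enters the calculation, which is the point being made above.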

Actually, in the scenario you describe, my reaction if I were in the travel group would be to say that someone is cheating. The people that filled the barrel didn’t put any black cloth inside any capsules. They were all white. There is a confederate of those cheaters in the group. When he pretends to pull out a capsule every fifty days, instead he palms one that he has hidden on his body that contains a black cloth. However, note that this is going against the conditions given above and assuming that those supposed conditions are a lie.

Changing from a randomization device to weather makes it impossible to evaluate what’s going on unless one has a knowledge of how weather works. At that point the question is about weather and not about probability. You’d have to know how tornadoes work and whether there is a pattern to them.

I don’t understand why he says this. Surely he would argue that this makes a tornado more likely, not less.

Assuming tornados are not a limited resource, based on the limited time of observation the chance of getting a tornado on any given day is 1/50. This probability is not affected by how many tornado-less days have passed.

Well let’s stick with 1 tornado per 50 days.

So the chance that the next tornado comes on any day is 1/50 and the chance that it doesn’t is 49/50.

So on day 150 the chances of the next tornado being on day 151 are 1 in 50 (0.02).
The chances of the next tornado being on day 152 are 49 in 2,500 (0.0196)
The chances of the next tornado being on day 153 are 2,401 in 125,000 (0.0192)
and so on.

So it’s always most likely to strike tomorrow at any given point of time.
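
The three figures above follow one formula: the next tornado lands exactly k days from now only if the first k - 1 days are clear and day k is not. A rough sketch, assuming the constant 1-in-50 daily chance used in this post (the name `prob_next_on` is just for illustration):

```python
from fractions import Fraction

P_TORNADO = Fraction(1, 50)  # assumed constant daily chance, taken from the post

def prob_next_on(days_ahead, p=P_TORNADO):
    """Chance that the next tornado arrives exactly `days_ahead` days from now."""
    return (1 - p) ** (days_ahead - 1) * p

for k in (1, 2, 3):
    value = prob_next_on(k)
    print(f"day {150 + k}: {value} = {float(value):.4f}")
```

The values shrink as k grows, so tomorrow is always the single most likely day for the next tornado, even though the chance of a tornado on any day, given that none has come before it, stays at 1/50.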

A change in events doesn’t necessarily mean anything. A weather model that predicts tornadoes might give completely different results than a regression model like the one being applied in this instance.

In the case of rolling a die, every time you don’t roll a 6, it means absolutely nothing - the chance of rolling a six the next time is still 1 in 6. It neither increases the odds (because you are due), nor decreases the odds (because you’ve observed less frequent sixes).
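
The die claim can be checked exactly by brute force rather than taken on faith. The sketch below assumes a fair six-sided die and independent rolls; the function name is hypothetical. It lists every possible sequence of rolls, keeps only those with no six so far, and counts how often the following roll is a six.

```python
from fractions import Fraction
from itertools import product

def chance_of_six_after_misses(misses):
    """Exact P(next roll is 6 | the previous `misses` rolls were all non-6)."""
    faces = range(1, 7)
    matching = sixes = 0
    for rolls in product(faces, repeat=misses + 1):
        if 6 not in rolls[:misses]:      # condition: no six so far
            matching += 1
            if rolls[-1] == 6:           # the following roll is a six
                sixes += 1
    return Fraction(sixes, matching)

for misses in (0, 1, 3, 5):
    print(misses, chance_of_six_after_misses(misses))  # 1/6 every time
```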

In the case of some events, however, we don’t have a formula - all we have is observations, like in the OP. In these cases, there is no guarantee about what the observations mean.

A long streak of tornado-free days may just be pure coincidence, or it may mean that tornadoes are just no longer as likely. If someone was looking back at the history of tornados and saw them spaced out like this (in number of days):

50
50
50
50
85
50
50
50

They might reasonably say, 85 was an outlier - the real average is 50, let’s plan for that.

If the pattern were:

50
50
50
50
50
55
60
65
70

They might reasonably say, hey - these things are becoming less frequent - we should have an expectation that one will occur in 70 days or more (even though the average is much lower).
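
To put numbers on the two lists, here is a quick sketch using Python's standard `statistics` module. The variable names and the `summarize` helper are only for illustration; the gap values are the ones listed above.

```python
from statistics import mean, median

outlier_gaps = [50, 50, 50, 50, 85, 50, 50, 50]
trend_gaps = [50, 50, 50, 50, 50, 55, 60, 65, 70]

def summarize(gaps):
    """Return (mean, median, most recent gap) for a list of day counts."""
    return mean(gaps), median(gaps), gaps[-1]

print(summarize(outlier_gaps))  # mean 54.375, median 50.0, last gap 50
print(summarize(trend_gaps))    # mean about 55.6, median 50, last gap 70
```

The means and medians of the two lists are nearly the same. What separates them is the order of the values, which a plain average throws away.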

When you are in the middle of an abnormal event, you don’t know if you are witnessing an outlier or a change in the trend. The best approach in this instance is to see if any other information indicates a preference for one assumption over the other, but sometimes the answer is just “I don’t know.”

Because ultimately the expectation of how many days isn’t defined by past occurrences, but by some natural conditions. Using the historical data helps us to infer things about the future that we can’t directly model, but that isn’t actually determining what happens in the future, so its use will always have an element of subjectivity.

I can’t remember who said it (I think it might have been Warren Buffett), but I love the quote:

“Historical data is great for predicting the past.”

I’m pretty sure this is a question, not about chance and averages, but about weather and tornadoes. Knowledge of weather patterns and what causes tornadoes in that area would help you answer it, but I don’t see how knowledge of the theory of probability could.

If this is a question about probability the answer is, it makes no difference whatsoever how long it’s been since the last tornado. Your odds are the same the day after the previous one as they are 100 days after the previous one.

If this is a question about forecasting weather events, I have no idea.

I think this gets to the heart of my question. The first settler “knows” that the tornado is not “most likely to strike tomorrow.” Based on the observed pattern, it will strike 50 days after the previous one. The second settler, using a basic average, concludes on the 49th day since the last tornado that there is a 3 in 199 chance (.015) that another tornado will strike the day after. What scenarios call for each model?

This is not so much a probability issue as a modelling/estimation issue. If your model is that the probability of a tornado on any day is constant and the events are independent, then you are simply trying to estimate the probability p that a tornado strikes. In that case, using any reasonable estimation technique I’m aware of, your estimate of p will fall each day there is no tornado and rise each day there is a tornado.
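
As a rough illustration of that last sentence, here are two common estimators of p applied to the settlers' counts. Picking these two is an assumption for the sake of example: the plain observed frequency k/n, and Laplace's rule of succession (k + 1)/(n + 2). The function names are hypothetical.

```python
from fractions import Fraction

def mle(tornado_days, total_days):
    """Observed frequency: tornado days divided by days observed."""
    return Fraction(tornado_days, total_days)

def laplace(tornado_days, total_days):
    """Rule-of-succession estimate, (k + 1) / (n + 2)."""
    return Fraction(tornado_days + 1, total_days + 2)

for estimate in (mle, laplace):
    today = estimate(3, 150)
    quiet = estimate(3, 151)      # one more day, no tornado
    stormy = estimate(4, 151)     # one more day, with a tornado
    print(estimate.__name__, float(today), float(quiet), float(stormy))
```

With either estimator the quiet day pulls the estimate down and the stormy day pushes it up. The second settler's 3 in 199 is simply `mle(3, 199)`.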

But suppose your model is that there is a certain amount of energy in the weather system that has to be dissipated by tornadoes. In that case each tornado uses up some of the energy so that another tornado tomorrow becomes less likely. This is particularly true if the energy has to build up and then release. This is the same prediction, but now it’s causal, not statistical.

If your model is that there are some conditions that favor tornadoes and some that do not, and if these conditions tend to be slowly changing, then when you see a tornado, you should conclude that conditions are right and will likely be right tomorrow, so the probability tomorrow is higher than if you’d not seen one today. Again this is causal, but now the opposite prediction.

Like most statistical questions, the answer depends on which model is more likely.

Actually, if we assume that the “once in fifty days” estimate is good to two sigma (standard deviations) on a single tailed distribution (we would expect one tornado to occur in any given span of fifty days to a 97.7% confidence level) the probability of occurrence on any given day is actually 7.3%. If we assume a three sigma confidence in our data (99.9%) then the probability of occurrence per individual day goes up to 12.3%. If we assume that the occurrence of actual tornados is randomly distributed along a normal distribution (and only occurring at one extreme, hence the use of a single tailed distribution) there is absolutely no influence from day to day; we are just as likely to have 28 days of clear weather and then two tornados in a row as we are to have tornados on each end of a month, or any permutation thereof. The odds of a tornado occurring in the future increase with every successive day of clear weather (although not additively) but the odds of a tornado on any particular day remain constant.

However, if you really had tornados on fifty day centers three times in a row, we would have reasons to doubt the hypothesis that tornados are normally distributed and would probably look for some other mechanism (physical or abstract) to develop a trend prediction. In reality, of course, weather patterns are not at all random, although they are perturbative and chaotic within certain parameters, and therefore difficult to predict from any kind of first principles analysis.

Stranger

I just realized that I completely missed this question, which was the heart of the question posed in the o.p. It is true that with every additional day in the sample population without a tornado, the estimate of probability goes down. For instance, if we have 3 tornados in 199 days there is a 5.5% chance of daily occurrence at a 2 sigma confidence, and 9.4% at a 3 sigma bound.
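
The working behind these percentages is not shown, so the following is only one reading that comes close to them, not necessarily the calculation actually used. It assumes the daily probability p solves 1 - (1 - p)^n = c, where n is the average number of days per tornado (50, or 199/3) and c is the one-tailed normal level at the given number of sigmas. The function name is hypothetical.

```python
from statistics import NormalDist

def daily_probability(days_per_tornado, sigmas):
    """Solve 1 - (1 - p)**n = c for p, with c the one-tailed level at `sigmas`."""
    confidence = NormalDist().cdf(sigmas)       # about 0.977 at 2, 0.9987 at 3
    return 1 - (1 - confidence) ** (1 / days_per_tornado)

for n in (50, 199 / 3):
    for sigmas in (2, 3):
        print(f"n={n:.1f}, {sigmas} sigma: {daily_probability(n, sigmas):.3f}")
```

Under that assumption the results land within about a tenth of a percentage point of the 7.3% and 12.3% quoted for 50 days and the 5.5% and 9.4% quoted for 199 days.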

Whether we would really have this high a confidence in the sample data reflecting the actual distribution is another discussion entirely. Since we don’t have a large enough data set to randomly sample individual subsets to evaluate consistency (always a problem with binomial testing with realistic population sizes), the actual confidence in the “goodness” of the data is always more a best guess with some empirical judgment than a rigorous quantitative estimate.

Stranger