Math/statistics folks: please answer a question about Correlation

Ah, I see what you’re getting at. I’d still be interested in seeing the p-value in that comparison. If there is a significant p, my contention would be that teams that win the World Series are those that are able to rest their best pitchers during September and so may have a lower win percentage late in the season.

Are you only looking at relationships in known data, or do you want a predictive model? In the former case, you don’t need all the data, but in the latter case I think you do. But if you want to predict how all the teams are going to do in the playoffs this year, you have to consider “not playing” as a possible outcome. And if you only train on the teams who did play, there’s no guarantee that you’ll be able to make reasonable predictions about teams who won’t.

I’m looking at the probit model on Wikipedia, and it looks to me like logistic regression might be more appropriate. Thoughts?

Probit allows for a GLM where your response variable is a binary outcome. As my response variable, I’m using “won the World Series”.

A logistic regression is a GLM where, yes, there is a binary variable, but the response is actually a percentage and you have a known ‘n’ for your binomial. I think that’s key. If they ALWAYS played 7 games, regardless of outcome of the first 5 (or whatever), then it would probably be valid.

On further thought, I think that a regular logistic regression can be used here. I’m used to running models with a binary outcome, but I have repeated trials for each independent variable, so I can get a percentage. I thought you needed a different technique if you didn’t have repeated trials.
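
For anyone following along, here is a minimal sketch of that point: an ordinary logistic regression fit directly to 0/1 outcomes, one row per team, with no repeated trials. The numbers in `sept_pct` and `won` are made up for illustration (they are not real baseball data), and `fit_logistic` and `predict` are hypothetical helper names, not from any library. The fit uses plain Newton's method in pure Python.

```python
import math

def predict(b0, b1, x):
    """Fitted probability P(y = 1) at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(x, y, iters=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) to 0/1 outcomes by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate the score vector (g) and the information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = predict(b0, b1, xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system h * step = g by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# ASSUMPTION: invented illustrative data, one row per playoff team.
sept_pct = [0.45, 0.50, 0.52, 0.55, 0.58, 0.60, 0.62, 0.65, 0.68, 0.70]
won = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1]  # 1 = "won the World Series"

b0, b1 = fit_logistic(sept_pct, won)
print("intercept:", b0, "slope:", b1)
print("fitted P(win) at .600:", predict(b0, b1, 0.600))
```

Each row is a single Bernoulli trial, which is just the binomial case with n = 1, so no grouping into percentages is needed. The sketch would fail on perfectly separated data, where the maximum-likelihood estimate does not exist.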

I was talking to one of the students in the statistics department today, and I asked him about this question. He said that while the exact analysis is complicated (you’d need to have a precise definition of momentum and make some assumptions about how baseball games work), there’s really no evidence that regular season records have any bearing on the playoffs.

Forgive my presumption, but. . .

Wouldn’t it be significant (in a baseball sense, if not a mathematical sense) to examine the September win/loss percentage of all World Series Winners, and compare them to all World Series Losers over the same time period? It is, by necessity, a set of thirty pairs of teams, which don’t play each other in September. While it might not be the entire statistical picture, it seems to me to be more baseball relevant than larger samples. We must have nearly a hundred years of that particular statistic, if we need a bigger sample.

Tris
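
One way the winner-versus-loser comparison could be tested, sketched under assumptions: treat each World Series as a matched pair and run an exact sign-flip permutation test on the differences in September win percentage. The function name `paired_sign_flip_pvalue` and the eight pairs of numbers are invented for illustration, not real records.

```python
from itertools import product

def paired_sign_flip_pvalue(winners, losers):
    """Exact two-sided sign-flip test on paired differences.

    Under the null hypothesis that winner/loser labels are exchangeable
    within each pair, every difference is equally likely to be + or -.
    """
    diffs = [w - l for w, l in zip(winners, losers)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate all 2**n sign assignments (fine for small n only).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for float round-off
            extreme += 1
    return extreme / total

# ASSUMPTION: invented September win percentages, one pair per Series.
ws_winners = [0.630, 0.550, 0.590, 0.480, 0.610, 0.570, 0.520, 0.600]
ws_losers = [0.580, 0.600, 0.540, 0.560, 0.590, 0.500, 0.570, 0.610]

p_value = paired_sign_flip_pvalue(ws_winners, ws_losers)
print("two-sided p-value:", p_value)
```

Full enumeration doubles in cost with every extra pair, so for thirty or a hundred pairs one would sample random sign assignments instead of listing them all.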

It would be, perhaps, “significant”, but IMO wouldn’t get to the heart of the issue as well as the WS-winners-versus-all-other-playoff teams comparison. In today’s playoff format, with eight teams, even the WS loser is pretty successful in the postseason. By comparing the WS winner to the WS loser, you’re comparing the most successful team to the second-most-successful team, and even from a Joe Morgan standpoint (that is, from the standpoint of somebody who believes that September success carries into the post-season) you wouldn’t expect that wide a separation. So a failure to find one wouldn’t prove much.

I think what is missing, and probably wouldn’t be that hard, would be to look more at a matchup basis. For example, does a team playing .700 in September have success against a team that played .450 in September (yet still made the playoffs)?
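
A rough sketch of that matchup idea, with assumptions flagged: for each playoff series, record whether the team with the better September record won, then check the tally against a coin flip with an exact binomial test. The `series` list is invented for illustration, and both function names are hypothetical.

```python
from math import comb

def better_september_tally(series):
    """series: list of (sept_pct_of_series_winner, sept_pct_of_series_loser).

    Returns (k, n): k = series won by the team with the better September,
    n = series where the two September records differed.
    """
    k = sum(1 for w, l in series if w > l)
    n = sum(1 for w, l in series if w != l)  # ties carry no information
    return k, n

def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials under p = 0.5."""
    hi = max(k, n - k)
    tail = sum(comb(n, i) for i in range(hi, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # symmetric null, so double one tail

# ASSUMPTION: invented matchups, not real playoff results.
series = [
    (0.700, 0.450), (0.520, 0.610), (0.580, 0.540), (0.490, 0.560),
    (0.630, 0.600), (0.550, 0.550), (0.610, 0.470), (0.500, 0.640),
    (0.660, 0.590), (0.570, 0.530),
]

k, n = better_september_tally(series)
print(k, "of", n, "series won by the better September team")
print("two-sided p-value:", binomial_two_sided_p(k, n))
```

This ignores the size of the gap between the two records. A logistic regression on the September difference would use that information, at the cost of more modeling assumptions.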

Yeah, but a team that plays .700 in September won’t win the World Series if they played .250 in May, June and July. They won’t have a chance to do so, because they won’t get into the playoffs. I doubt it has happened all that often, but hey, it’s baseball, man, you never know!

Tris