Math/statistics folks: please answer a question about Correlation

USCDiver · September 28, 2007, 2:51pm

Freddy the Pig:

In this part of the test, we’re not examining correlation. We’re testing whether two groups (WS winners and non-WS winners) have significantly different September performance.

The “null hypothesis” is that the two groups have the same September performance, and any observed difference is due to random fluctuation. The “alternative hypothesis”, beloved of Joe Morgan et al, is that WS winners demonstrate better performance in September, because their momentum carries over into the postseason.

The “burden of proof”, as the lawyers say, is on those who believe in the alternative hypothesis to provide convincing evidence in favor of it. Since we don’t expect Joe to grind the numbers himself, we’re doing it for him, acting as a devil’s advocate and seeing if the numbers provide enough evidence to reject the null hypothesis and adopt the alternative hypothesis.

In this case, however, the process grinds to a halt at the first step. WS winners have inferior September performance in our sample. There is no evidence whatsoever (at least, in this particular test) to support the alternative hypothesis. If WS winners showed a higher September percentage, we would proceed to tests of statistical significance such as the two-sample t-test. In this case, there is no reason to do so, and indeed no way to do so.

Ah, I see what you’re getting at. I’d still be interested in seeing the p-value in that comparison. If there is a significant p, my contention would be that teams that win the World Series are those that are able to rest their best pitchers during September, and so may have a lower win percentage late in the season.
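For what it's worth, here is a rough sketch of how such a p-value could be computed. It uses a permutation test rather than the two-sample t-test mentioned above, only because that needs nothing beyond the Python standard library. The function name and the September percentages are made up for illustration; real numbers would have to come from the actual team records.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle them and recompute the difference in means.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float round-off
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical September winning percentages, NOT real data.
ws_winners = [0.560, 0.520, 0.610, 0.480, 0.540]
other_playoff_teams = [0.590, 0.630, 0.550, 0.600, 0.570, 0.640, 0.520, 0.580]

p = permutation_p_value(ws_winners, other_playoff_teams)
print(f"two-sided p-value: {p:.3f}")
```

Because the test is two-sided, a small p here would flag a difference in either direction, including the "winners were worse in September" direction I'm curious about.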

ultrafilter · September 28, 2007, 2:58pm

Are you only looking at relationships in known data, or do you want a predictive model? In the former case, you don’t need all the data, but in the latter case I think you do. But if you want to predict how all the teams are going to do in the playoffs this year, you have to consider “not playing” as a possible outcome. And if you only train on the teams who did play, there’s no guarantee that you’ll be able to make reasonable predictions about teams who won’t.

I’m looking at the probit model on Wikipedia, and it looks to me like logistic regression might be more appropriate. Thoughts?

Trunk · September 28, 2007, 3:30pm

Probit allows for a GLM where your response variable is a binary outcome. As my response variable, I’m using “won the world series”.

A logistic regression is a GLM where, yes, there is a binary variable, but the response is actually a percentage and you have a known ‘n’ for your binomial. I think that’s key. If they ALWAYS played 7 games, regardless of outcome of the first 5 (or whatever), then it would probably be valid.

Trunk · September 28, 2007, 4:04pm

On further thought, I think that a regular logistic regression can be used here. I’m used to running models with a binary outcome, but I have repeated trials for each independent variable, so I can get a percentage. I thought you needed a different technique if you didn’t have repeated trials.
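To convince myself, here is a minimal sketch of a logistic regression fitted to a plain 0/1 outcome with one observation per team and no repeated trials. It is hand-rolled gradient ascent using only the standard library, not any particular package's routine, and the function name, the scaling of the predictor, and the data are all assumptions invented for the example.

```python
import math

def fit_logistic(xs, ys, lr=0.5, n_iter=5000):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by gradient ascent
    on the average log-likelihood. Each y is a single 0/1 outcome."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient with respect to the intercept
            g1 += (y - p) * x    # gradient with respect to the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical data, NOT real records:
# x = (September win pct - 0.550) * 10, y = 1 if the team won the World Series.
sept_scaled = [-0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 0.9, 1.0]
won_ws = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1]

b0, b1 = fit_logistic(sept_scaled, won_ws)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in sept_scaled]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The fit goes through fine with one binary outcome per row, which is the point: a known 'n' greater than 1 per observation isn't required.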

ultrafilter · October 1, 2007, 12:00am

I was talking to one of the students in the statistics department today, and I asked him about this question. He said that while the exact analysis is complicated (you’d need to have a precise definition of momentum and make some assumptions about how baseball games work), there’s really no evidence that regular season records have any bearing on the playoffs.

Triskadecamus · October 1, 2007, 2:14am

Forgive my presumption, but. . .

Wouldn’t it be significant (in a baseball sense, if not a mathematical sense) to examine the September win/loss percentage of all World Series winners, and compare them to all World Series losers over the same time period? It is, by necessity, a set of thirty pairs of teams, which don’t play each other in September. While it might not be the entire statistical picture, it seems to me to be more baseball-relevant than larger samples. We must have nearly a hundred years of that particular statistic, if we need a bigger sample.
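If somebody wants to try it, one simple way to handle winner/loser pairs is an exact sign test: count the years in which the winner had the better September and ask how surprising that count would be if it were a coin flip. This is only a sketch of one possible approach; the function name and the sample pairs are hypothetical, not real records.

```python
from math import comb

def sign_test_p_value(pairs):
    """Two-sided exact sign test on (winner_sept_pct, loser_sept_pct) pairs.
    Years in which the two percentages are equal are dropped."""
    wins = sum(1 for w, l in pairs if w > l)
    losses = sum(1 for w, l in pairs if w < l)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical (winner, loser) September percentages, NOT real data.
sample_pairs = [(0.60, 0.55), (0.52, 0.61), (0.58, 0.50), (0.49, 0.57), (0.63, 0.63)]
print(f"sign test p-value: {sign_test_p_value(sample_pairs):.3f}")
```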

Tris

Freddy_the_Pig · October 1, 2007, 4:16pm

It would be, perhaps, “significant”, but IMO wouldn’t get to the heart of the issue as well as the WS-winners-versus-all-other-playoff teams comparison. In today’s playoff format, with eight teams, even the WS loser is pretty successful in the postseason. By comparing the WS winner to the WS loser, you’re comparing the most successful team to the second-most-successful team, and even from a Joe Morgan standpoint (that is, from the standpoint of somebody who believes that September success carries into the post-season) you wouldn’t expect that wide a separation. So a failure to find one wouldn’t prove much.

Gangster_Octopus · October 1, 2007, 4:40pm

I think what is missing, and probably wouldn’t be that hard to do, would be to look at this more on a matchup basis. For example, does a team playing .700 in September have success against a team that played .450 in September (yet still made the playoffs)?

Triskadecamus · October 2, 2007, 4:34am

Yeah, but a team that plays .700 in September won’t win the World Series if they played .250 in May, June and July. They won’t have a chance to do so, because they won’t get into the playoffs. I doubt it has happened all that often, but hey, it’s baseball, man, you never know!

Tris
