Math/statistics folks: please answer a question about Correlation

I was good at mathematics in school. Basic calculations have always come easy for me, and in high school i had a pretty firm grasp on more complicated geometry, calculus, etc. But i’m a historian now, and haven’t done any formal math study or advanced calculations since i left school.

Since arriving in the US, i’ve become a big baseball fan, and also interested in the statistics of the game. Not just ERA and Batting Average and Slugging Percentage, but the more esoteric stuff done by the Sabermetrics crowd on websites like Baseball Prospectus. But i’ve never actually gotten into doing any math myself; i just read what other folks do, and absorb their (usually very good) explanations of what their calculations can tell us. I have no idea how to do a regression analysis, and while i understand why standard deviation is important, i’m still not sure how to read a SD and derive significance from it.

Anyway, i’ve decided to try and get a little better at some of this stuff, so i’ve got a Statistics for Dummies book and a few websites and i’m learning.

The thing i’m working on now—inspired by a post on a baseball blog i read—is trying to see whether there’s any correlation between a team’s September performance and its performance in the playoffs. I have gathered data about the September winning % and the playoff winning percentage for all playoff teams since 1976. That gives me 30 years of data (1994 missing due to strike), and a total of 172 playoff teams (8 teams per year in Wild Card era; 4 teams per year before that).

The thing is, though, that i’m still not clear on when different types of correlation are appropriate. To get correlation for this group of stats, i used the Pearson Product Moment Correlation. Based on the reading i’ve done, it seems appropriate, but i’m not completely confident. Is this an appropriate measure to use in this case? If not, why not? And what should i use instead?

So far, the correlation i’ve derived is between September Win % and Playoff Win %. That seems fairly straightforward, because i’m comparing similar numbers (Win%). The Pearson correlation i’ve arrived at for these numbers is 0.018, or basically no correlation, positive or negative.
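For anyone who wants to check the calculation outside Excel, here is a minimal sketch of the Pearson product-moment formula in plain Python. The two lists are made-up placeholder values, not the numbers from the spreadsheet, and the function name `pearson_r` is just an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Placeholder values for illustration only (NOT the real playoff data).
september_win_pct = [0.640, 0.520, 0.600, 0.480, 0.700, 0.560]
playoff_win_pct = [0.400, 0.636, 0.250, 0.571, 0.333, 0.688]

r = pearson_r(september_win_pct, playoff_win_pct)
print(f"r = {r:.3f}")
```

The result always lies between -1 and +1. Feeding in the real September and playoff columns should reproduce whatever Excel's built-in function reports.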

I also have another question about determining correlation between September wins and Playoff success, where “success” is defined as winning the World Series. How would i do this? Can i assign “success” a sort of binary value, where Winning World Series = 1 and Not Winning World Series = 0? This one has me confused.

Any help you can give my overtaxed stats brain would be appreciated.

Wouldn’t things like the format change (wild card), drug use, indoor stadiums, unforeseen injuries, and Pete Rose also have to be factored in? Seriously, I think there are some things that don’t lend themselves well to mathematical models; this is one of them.

Why?

All i’m seeking here is correlation, not to make any statement about causation. In the simplest possible verbal terms, i simply want to find out whether there is any historical correlation between the way teams play in September and how well they do in the playoffs. That is, as i understand it, something that can be measured empirically, even if we don’t know exactly what effects the things you talk about had on each individual team, or on MLB as a whole.

The reason behind my interest in this issue is that quite a few non-statistically-minded baseball commentators (like Joe Morgan) tend to argue that a team’s “momentum,” its late-season performance, is an important factor going into the playoffs. Well, a statement like that needs more than just a “feeling” behind it to have validity.

If we show, for example, that teams with excellent September records have a considerably higher-than-average historical tendency to win the World Series, then maybe the theory has something to it. If, on the other hand, teams that do very well in September typically get bundled out in the first round of the playoffs, we might conclude that September performance actually has a negative impact on postseason play.

And if we find very weak correlation (which is what the numbers i got said), we might simply conclude that there’s no way to predict postseason success based on September results.

The numbers have a relationship (be it strong or weak) irrespective of the external factors like steroids, injuries, new format, etc. Now, it might be that to get more accurate and useful numbers, we might need to normalize for some of these external factors, but i think the raw correlation can at least tell us something.

At least, that’s what my currently very rudimentary understanding of statistics leads me to believe. :slight_smile: If you can explain to me why i’m wrong, i’d be happy to hear it.

If you already understand this, you’re well on your way to being statistically literate. That’s good.

The Pearson product-moment correlation is the standard correlation coefficient (to the point where it’s usually just referred to as the correlation, at least in undergrad books). It’ll tell you to what degree the two variables you’re looking at are related, but that’s about all that it’s good for. If you want to model the relationship between a team’s winning percentage in September and their winning percentage in the playoffs in any more detail, you’re going to need linear regression or something similar.

This should be fine. In general you have to be careful about encoding categorical variables as numbers, but if there are only two and you encode them as 0 and 1, you should be fine. Just keep in mind that switching the encoding (e.g., making a World Series win a 0 and a loss a 1) flips the sign of the correlation, so read the direction of the result against the encoding you chose.
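To make the 0/1 idea concrete, here is a small sketch that correlates September winning percentage with a binary "won the World Series" flag, then swaps the encoding. The values are invented for illustration and `pearson_r` is a hypothetical helper, not anything from the spreadsheet.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented example values (NOT real team data).
september_win_pct = [0.640, 0.520, 0.600, 0.480, 0.700, 0.560, 0.580, 0.610]
won_world_series = [0, 0, 1, 0, 0, 0, 0, 0]   # 1 = won the World Series

# Same information, opposite encoding: 1 = did NOT win.
flipped = [1 - w for w in won_world_series]

r_original = pearson_r(september_win_pct, won_world_series)
r_flipped = pearson_r(september_win_pct, flipped)
print(f"win=1: r = {r_original:+.3f}   win=0: r = {r_flipped:+.3f}")
```

The two coefficients have the same size and opposite signs, so the encoding changes only the direction you read off, not the strength of the relationship.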

(You’re a grad student, right? You may want to see if you can find a computer with SAS installed. It’ll do a lot of the more computationally intensive stuff for you, so you can focus on understanding various statistical techniques and what they mean.)

Ooh… this sounds like a simple datamining / pattern recognition problem. IIRC, one of the guys with whom I’ve taken a couple classes did a similar term project. I’m unsure if he included that particular statistic, but he included a bunch of other things. If you’re going to pursue this, I would recommend WEKA; it’s a free tool that has dozens of algorithms and other tools built in.

  1. The correlation statistic you suggest seems perfectly appropriate for testing the relationship between September performance and playoff performance.

  2. For testing WS wins, use a two sample t-test with unpaired data (because there are more non-WS winners than WS-winners).
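For what the unpaired two-sample test in point 2 looks like in practice, here is a sketch of the test statistic in plain Python. It uses Welch's unequal-variance version, which is my assumption, since the post doesn't say which variant is meant. The sample values are placeholders, and `welch_t` is an illustrative name.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(sample_a), len(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    se2_a = variance(sample_a) / na
    se2_b = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df

# Placeholder September winning percentages (NOT the real data).
ws_winners = [0.600, 0.560, 0.640, 0.520]
non_winners = [0.620, 0.580, 0.660, 0.540, 0.600, 0.700, 0.500]

t_stat, df = welch_t(ws_winners, non_winners)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

To finish the test you would compare `t_stat` against a t-table at `df` degrees of freedom, or let a stats package produce the p-value. The groups don't need to be the same size, which is the point of the unpaired version.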

Cool, thanks. So, just a question then about what i can say.

If someone says to me “September momentum is important for teams going into the postseason,” would it be reasonable for me to answer, based on a Pearson correlation of 0.018 over the last 30 seasons, that “There is really no significant correlation between a team’s September performance and its postseason performance”?

Is that too much, or is it a reasonable conclusion to draw from this particular calculation, without further efforts at regression analysis etc.?
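One way to put a number on "no significant correlation" is to test r = 0.018 against zero using the 172 teams in the sample. The sketch below computes the usual t statistic for a correlation coefficient, plus an approximate two-sided p-value from the Fisher z-transformation. It assumes the 172 team-seasons can be treated as independent observations, which is a simplification.

```python
import math
from statistics import NormalDist

r = 0.018   # Pearson correlation reported in the thread
n = 172     # number of playoff teams in the sample

# t statistic for H0: true correlation is zero (df = n - 2).
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Fisher z-transformation: atanh(r) is roughly normal with SE 1/sqrt(n - 3).
z = math.atanh(r) * math.sqrt(n - 3)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"t = {t_stat:.3f} on {n - 2} df, approximate two-sided p = {p_value:.2f}")
```

The t statistic comes out around 0.23, nowhere near the cutoff of roughly 1.97 needed for significance at the 5% level with 170 degrees of freedom. On these assumptions the data give no reason to think the true correlation differs from zero.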

OK, i’ll give it a go and see what correlation i get between September performance and World Series victories.

Good advice. The computers on campus have all sorts of mathematical and statistical programs installed on them. When i get some time, i’ll check them out.

Thanks, i’ll check it out.

Excel actually runs quite a few statistical equations natively. In fact, after i spent 20 minutes last night plugging in the appropriate equations to do a Pearson correlation on my data, i then discovered that Excel has a Pearson calculator already built in. :rolleyes:

Thanks for the confirmation, and for the advice re. testing WS wins.

At first glance, that Two-Sample T-Test looks pretty complicated, but i’ll spend some time trying to work it out. I’ve found that, once you know what all the variables and constants stand for, the equations are often not as complicated as they appear.

Be sure to post the final analysis here!

I can’t think of a sport that is more amenable to having talking head pontification entirely rebutted on the basis of statistical analysis, and yet, the heads continue to nod.

Tris

I wonder if you’d see some relationship if you were to include all the teams who didn’t make the playoffs. Right now, you’re restricting yourself to a very small portion of all the data that’s out there, and it may be that it behaves differently from the population at large.

If I were analyzing this, I’d want the complete regular season record for every team, as well as the corresponding data for the playoffs. It’s a little tricky because you have to differentiate between winning, losing and not playing in the playoffs, but if you can get around that, it gives you a large dataset to explore. Of course, we’d probably pretty quickly get beyond what your copy of Statistics for Dummies covers…

Um, wouldn’t the percentage of wins of all teams generally have to be strongly correlated to being in the playoffs at all?

Tris

Yeah, i was thinking about that last night as i put together my data, but right now i wouldn’t even know where to start with all of that.

As a general question, though, how does one trace a relationship when some teams are not part of it? That is, if i’m looking for a correlation between September performance and playoff performance, then i can only make that correlation for teams that actually play some playoff games. What can the (non-)performance of teams that don’t make the playoffs tell me about the correlation for teams that do make the playoffs?

I don’t doubt that you’re right about all this; i just don’t know enough to understand why, or how.

Generally, that’s true. If it was simply the fact that the 8 teams (or 4 in the pre-Wild Card era) with the best record in MLB made the playoffs, then the correlation would basically be +1. But, as you know, the League and Divisional breakdown, combined with the Wild Card, means that teams with very good records can sometimes miss out on the playoffs, while teams with mediocre records (2006 Cardinals, anyone?) can make the postseason and win it all.

Still, i’m sure you’re right that the overall correlation is strong.

Well, i’m struggling with the whole Two-Sample t-test thing right now, and until i’ve made a better attempt to understand it i’ve pretty much reached the limit of my abilities.

But that doesn’t mean that other people can’t have a go. Here is the Excel file containing my raw data, and the Pearson calculations. It should all be fairly self-explanatory, and i’ve added a few comments to some of the cells for clarity. If anyone wants to play with the numbers, or point out any problems with what i’ve done, be my guest.

Yes, but this sort of correlation analysis is pretty rudimentary. With the win/loss for each game, there’s a lot more we can do.

Here’s the way I’m looking at it. We have some number of observations (call it n) on 162 binary variables (there are 162 games in a regular season, right?) for our input, and n observations on p variables for our output (where p is smaller than 162, but depends on how exactly we encode each team’s playoff record). You can apply a technique called principal component analysis to bring the number of input variables down from 162 to probably less than 10. These variables won’t account for all of the variation in the sample, but unless you’re really unlucky they’ll account for over 90%. That’s good enough to work with.

You can then do various tests to see which of those new variables are the strongest predictors of the outcome. If you’re lucky, you end up with one or two variables that explain most of the variation in the outcome. If not, there are other dimensionality reduction techniques you can use instead.

Wikipedia’s article on PCA is good if you know enough linear algebra to read it. If not, don’t worry too much about the specific details.

I get it. (Surprise, surprise!) The existence of one correlation doesn’t obviate the existence of other, perhaps even stronger correlations.

Damn statisticians.

Tris

How could data about teams which don’t make the playoffs be relevant to an analysis of whether September performance is a good predictor of playoff performance?

Absolutely.

Ponder this one: the 30 WS winners in your sample have a September winning percentage of .592 (514-354). The 142 non-WS winners have a September winning percentage of .608 (2430-1564). At this point the two-sample test becomes superfluous, unless someone wants to suggest the opposite hypothesis–that hot Septembers hurt you in the post-season (because you burn out, don’t you know).
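The arithmetic behind those two percentages is easy to verify. This short sketch uses the win-loss totals quoted above; the helper name `win_pct` is just illustrative.

```python
def win_pct(wins, losses):
    """Winning percentage from a win-loss record."""
    return wins / (wins + losses)

# Pooled September records quoted in the post above.
ws_winners = win_pct(514, 354)        # 30 World Series winners
non_winners = win_pct(2430, 1564)     # 142 other playoff teams

print(f"WS winners:  {ws_winners:.3f}")
print(f"Non-winners: {non_winners:.3f}")
print(f"Difference:  {ws_winners - non_winners:+.3f}")
```

Note that these are pooled percentages (total wins over total games), not the average of each team's individual September percentage, so they weight teams by games played.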

I think you’ve demonstrated very satisfactorily that September momentum doesn’t carry over into the postseason. I also think . . .

. . . that none of this will stop Joe Morgan from yakking about momentum.

Wow, that’s so simple, and yet so illustrative of the problem with the “momentum” hypothesis. Thanks.

Well, birds gotta fly, world gotta turn, and Joe’s gotta ramble about something.

By the way, the blog that inspired this thread, in case anyone’s interested, was Fire Joe Morgan, one of my favorite baseball websites. Not pretty, but smart and sometimes very funny.

You haven’t demonstrated that the difference between these two values is statistically significant. Therefore you can’t make any statements of correlation or not.

Well, forgive me if i’m missing something, but aren’t those very basic stats enough to at least falsify the claim that September form is an important predictor of World Series success?

If teams that have won the World Series have, on average, a worse September record than teams that made the playoffs but didn’t win the World Series, then at the very least we can refute the idea that September “momentum” is important for World Series success, can’t we?

In this part of the test, we’re not examining correlation. We’re testing whether two groups (WS winners and non-WS winners) have significantly different September performance.

The “null hypothesis” is that the two groups have the same September performance, and any observed difference is due to random fluctuation. The “alternative hypothesis”, beloved of Joe Morgan et al, is that WS winners demonstrate better performance in September, because their momentum carries over into the postseason.

The “burden of proof”, as the lawyers say, is on those who believe in the alternative hypothesis to provide convincing evidence in favor of it. Since we don’t expect Joe to grind the numbers himself, we’re doing it for him, acting as a devil’s advocate and seeing if the numbers provide enough evidence to reject the null hypothesis and adopt the alternative hypothesis.

In this case, however, the process grinds to a halt at the first step. WS winners have inferior September performance in our sample. There is no evidence whatsoever (at least, in this particular test) to support the alternative hypothesis. If WS winners showed a higher September percentage, we would proceed to tests of statistical significance such as the two-sample t-test. In this case, there is no reason to do so: when the observed difference points the wrong way, a one-sided test cannot reject the null in favor of the alternative hypothesis.
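For anyone who wants to see the numbers anyway, here is a rough sketch of a one-sided test on the pooled September records quoted earlier. It swaps in a two-proportion z-test rather than the t-test, and it treats every September game as an independent trial, which is a strong simplifying assumption. The function name is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z(wins_a, games_a, wins_b, games_b):
    """z statistic for H0: both groups share the same win probability."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    return (p_a - p_b) / se

# Group A: WS winners (514-354). Group B: other playoff teams (2430-1564).
z = two_proportion_z(514, 514 + 354, 2430, 2430 + 1564)

# One-sided alternative: WS winners play BETTER in September than the rest.
p_one_sided = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, one-sided p = {p_one_sided:.2f}")
```

Because the observed difference is negative, z is negative and the one-sided p-value is well above 0.5, so the "momentum" alternative gets no support from this sample.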

First of all, if you’ve never bought a copy of “Baseball Prospectus”, buy it. It will interest you immensely. Maybe wait till next spring, and get a copy of “BP 08” when it comes out.

wiki bp

“Pro Football Prospectus” will interest football fans who want to look at statistical breakdowns of NFL teams.

Second of all, the correlation coefficient should help measure the strength of the LINEAR relationship between September winning percentage and October winning percentage. What you’ve done sounds appropriate.

To get to the matter of how September WP affects whether a team wins the world series or not, I THINK you probably want to do something called a “Generalized Linear Model” with a probit link function. I haven’t worked with those in a while, so I’m not sure you meet all of the assumptions, but I think you do. That would give you an idea about the significance of the relationship between September winning percentage, and whether a team won the world series or not.

Generalized Linear Models is a grad level stats class all its own, however. But, that’s where you’d want to look.