I’m currently in the process of trying to establish the degree of correlation between two variables L and S, and I’ve been using the Pearson R coefficient. This has worked perfectly fine for L and T and L and N, (where T and N are other variables, which are different to S), and I’ve been getting sensible values.
However, when I try to establish the degree of correlation between L and S, something wierd happens, and I get a coefficient of 3.6! I’ve triple checked that I’ve done the calculation properly (I have done it properly), I’ve checked the formula I have for calculating the coefficient on data that is perfectly correlated (it works in that case), and I’ve done the calculation by hand to double check the spreadsheet (which matches what the spreadsheet says). I’ve also taken a look at the scatter plot of S versus L, and this is where this gets particularly interesting.
It turns out that there’s one datapoint that has a particularly high Sand a particularly large L, i.e., its a total outlier. So, I’ve recalculated the coefficient without the outlier, and I end up with a perfectly reasonable correlation coefficient of 0.77. Now, could the outlier cause the meaningless coefficient I obtained initially, or is there something more afoot?
Thanks for any help anyone can shed on this, my advisor’s as confused as I am.
The Pearson’s correlation is always between -1 and 1, so if you’re getting anything outside that range, either (1) you performed the calculation incorrectly, or (2) whatever you used to do the calculation has a bug. You’ve apparently eliminated (1), so what tool are you using to do the calculation? It could be a problem with the way the tool is handling the outlier value. Note that even outliers cannot throw the correlation outside the valid range, if the formula has been properly coded and potential overflow situations are handled correctly. It might help if you post the values for L and S and the code that’s doing the calculation.
I’m actually doing the calculation myself “by hand” in the spreadsheet. I get the spreadsheet to calculate Sum(L), Sum(S), Sum(LS), Sum(L[sup]2[/sup]) and Sum(S[sup]2[/sup]). Then I use the spreadsheet to calculate the coefficient. I realise now that this is probably a baaad idea, and I should just check to see if the spreadsheet will do it automatically.
I have just checked in fact, and the spreadsheet can actually calculate the Pearson R coefficients for me. Using the inbuilt tool (I’m using OpenOffice), it seems to give me a number that looks more correct. I think I may have been mis-coding or something.