I have only the shallowest and most tenuous grasp of mathematics and statistics, so I plead with anyone responding to explain things to me as if I were an idiot.
So, I’m taking a look at concepts such as regression analysis. One of the things that is mentioned is that regression assumes that the relationship being studied is linear.
What happens if the relationship isn’t linear? There is a suggestion that one may transform the data. (By the way, the vast majority of that Wiki article goes over my head.) This means that you can apply a function, commonly a logarithm, to the data to see if your scatterplot takes a more linear form.
(1) Why is this valid? Aren’t you just changing your data?
(2) If you transform one data set, do you need to transform all other data sets? For example, if I’m transforming using log, does the log get applied to both the independent and dependent variable?
(3) If I’m looking at sets of data from separate years and I transform the data in one year, does that mean I have to transform the data in all the other years too, using the same transformation function?
Yes, you are changing the data temporarily in order to find a pattern which is useful for drawing conclusions. As long as, when you present the data, the transformation is explained, and therefore people using the data can interpret it correctly, there’s no problem.
It’s not necessary, but often useful, to use the same transformation on both variables. Same thing with different data sets (like different years); some years might work better with a log transformation, some with (for example) a square root, and that’s fine if you’re looking at each individually. Comparing years that way, though, is difficult and prone to misinterpretation.
In other words, if you’re interested only in examining what’s happening within a year, then you don’t necessarily need to apply the same transformation to each year; you could even use different transformations.
However, if you want to compare year-to-year, you need to apply the same transformation.
Right?
I still don’t get why transformation would give you valid results.
You don’t need all of the actual values to find a correlation, just their order. For instance, if the biggest value of X corresponds to the biggest value of Y, and the second-biggest value of X corresponds to the second-biggest value of Y, and so on, then you have a very strong correlation, regardless of what those values actually are. So any transformation which preserves the order of all the points won’t change the correlation.
It’s hard to give a short answer to this question, but the basic idea is that the actual relationship between two variables can be nonlinear, but there might be a way to transform one or both of the variables so that you do have a linear relationship.
The classic example is a power law, where you have an equation of the form y = a·x^b. If you take logarithms of both x and y, you find that log(y) = b·log(x) + log(a). That is a linear relationship, and so you can fit a regression model to those variables that corresponds to estimating the power law relationship between the original variables. Does that make sense?
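To make that concrete, here’s a minimal sketch in R (all the numbers are invented purely for illustration):

# Invented power-law data: y = 2 * x^1.5, with multiplicative noise
set.seed(42)
x <- 1:50
y <- 2 * x^1.5 * exp(rnorm(50, sd = 0.1))
fit <- lm(log(y) ~ log(x))    # ordinary linear regression on the log scale
coef(fit)                     # intercept estimates log(2), slope estimates 1.5

The slope of the fitted line recovers the exponent b, and exponentiating the intercept recovers a.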
Here’s a counterexample in R:
n <- 100
# Both x^2 and exp(x) preserve the order of 1:n, yet their
# correlations with 1:n differ (roughly 0.97 versus 0.25):
r <- cor(1:n, (1:n)^2) / cor(1:n, exp(1:n))
After you run this, r will be right around 3.8, which couldn’t happen if what you said were true. Correlation measures the strength of the linear relationship between two variables; it may or may not capture whatever nonlinear relationships are present.
If the idea of changing your data is still worrisome, consider what happens if you have some data in feet and convert it to yards. You’ve just transformed the data, but you’ve preserved the correlations (even though you’ve changed every value by a factor of three).
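A quick check in R, with made-up numbers:

feet  <- c(3, 9, 15, 27)        # made-up measurements in feet
other <- c(1.2, 2.1, 2.9, 4.4)  # made-up second variable
yards <- feet / 3               # the same data, converted to yards
cor(feet, other)                # some value r...
cor(yards, other)               # ...the identical r: rescaling preserved it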
I’m very hesitant to say it’s OK to treat different years of the same data with different transformations. Unless you have some very good external reason (in other words, not based just on the data), preferably specified in advance before you’ve even collected the data, to think different years behave very differently, you really need to treat all years the same.
Now, it’s perfectly fine to do different transformations on different dependent variables (say, age and weight), or to do different transformations on the dependent and independent variable, but unless there’s a really good physical reason, you need to treat all sets of the same variable the same way.
Transformation can be very useful. Most statistical models and tests make certain assumptions about the data. In real life these assumptions tend to be, strictly speaking, false, but as long as they are close to true you should be OK. Often a transformed data set matches the assumptions more closely than an untransformed one. For example, in your regression problem, if there really is a log relationship between the variables and the regression model assumes a linear relationship, then you are better off transforming your data. You do not need to transform all components, although if they are two measures of the same thing it would be unusual not to.
That said, if you try 30 different transformations and only report the best one, you are cheating and should really use an alternative test that takes the multiple possible transformations into account. Using different transformations on subsets of the same data is highly unusual, and unless there is an extraordinary reason why you feel the subsets would be totally different in flavor, I would avoid doing so. A better way to proceed would be to use a regression model that includes multiple components, some linear and others various transformations of the data, and then fit this same model to all subsets of the data (see the sketch below), but that may require more statistics than you want to take on.
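Here’s a minimal sketch of that idea in R (the data and coefficients are invented purely for illustration):

set.seed(1)
x <- runif(60, 1, 50)                   # invented predictor
y <- 0.5 * x + 3 * log(x) + rnorm(60)   # invented response mixing linear and log parts
fit <- lm(y ~ x + log(x) + sqrt(x))     # one model with several candidate components
summary(fit)                            # fit this SAME formula to every subset (year)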
If you want to do away with transformations altogether, I suggest you look into non-parametric statistics, which make no assumptions about the data distribution or the form of the relationship but can be used to infer general monotonic trends (see, for example, the Spearman rank correlation).
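In R that’s a one-argument change to cor (again, invented data):

x <- 1:20
y <- exp(x / 4) + rnorm(20, sd = 0.5)   # invented, strongly nonlinear but monotonic
cor(x, y)                               # Pearson: well below 1 for this curved trend
cor(x, y, method = "spearman")          # Spearman: close to 1 for any monotonic trend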