I have only the shallowest and most tenuous grasp of mathematics and statistics, so I plead with anyone responding to explain things to me as if I were an idiot.
So, I’m taking a look at concepts such as regression analysis. One of the things that is mentioned is that regression assumes that the relationship being studied is linear.
What happens if the relationship isn’t linear? There is a suggestion that one may transform the data. (By the way, the vast majority of that Wiki article goes over my head.) This means that you can apply a function, commonly a logarithm, to the data to see if your scatterplot takes a more linear form.
(1) Why is this valid? Aren’t you just changing your data?
(2) If you transform one data set, do you need to transform all other data sets? For example, if I’m transforming using log, does the log get applied to both the independent and dependent variable?
(3) If I’m looking at sets of data from separate years and I transform the data in one year, does that mean I have to transform the data in all the other years too, using the same transformation function?
Yes, you are changing the data temporarily in order to find a pattern which is useful for drawing conclusions. As long as, when you present the data, the transformation is explained, and therefore people using the data can interpret it correctly, there’s no problem.
It’s not necessary, but often useful, to use the same transformation on both variables. Same thing with different data sets (like different years); some years might work better with a log transformation, some with (for example) a square root, and that’s fine if you’re looking at each individually. Comparing years that way, though, is difficult and prone to misinterpretation.
In other words, if you’re interested only in examining what’s happening within a year, then you don’t necessarily need to apply the same transformation to each year; you could even use different transformations.
However, if you want to compare year-to-year, you need to apply the same transformation.
Right?
I still don’t get why transformation would give you valid results.
You don’t need all of the actual values to find a correlation, just their order. For instance, if the biggest value of X corresponds to the biggest value of Y, and the second-biggest value of X corresponds to the second-biggest value of Y, and so on, then you have a very strong correlation, regardless of what those values actually are. So any transformation which preserves the order of all the points won’t change the correlation.
It’s hard to give a short answer to this question, but the basic idea is that the actual relationship between two variables can be nonlinear, but there might be a way to transform one or both of the variables so that you do have a linear relationship.
The classic example is a power law, where you have an equation of the form y = a·x^b. If you take logarithms of both x and y, you find that log(y) = b·log(x) + log(a). That is a linear relationship, and so you can fit a regression model to those variables that corresponds to estimating the power law relationship between the original variables. Does that make sense?
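To make that concrete, here’s a minimal sketch in R (all the numbers are invented purely for illustration):

# Invented power-law data: y = 2 * x^1.5, with multiplicative noise
set.seed(42)
x <- 1:50
y <- 2 * x^1.5 * exp(rnorm(50, sd = 0.1))
fit <- lm(log(y) ~ log(x))    # ordinary linear regression on the log scale
coef(fit)                     # intercept estimates log(2), slope estimates 1.5

The slope of the fitted line recovers the exponent b, and exponentiating the intercept recovers a.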
Here’s a counterexample in R:
n <- 100
# Both x^2 and exp(x) preserve the order of 1:n, yet their
# correlations with 1:n differ (roughly 0.97 versus 0.25):
r <- cor(1:n, (1:n)^2) / cor(1:n, exp(1:n))
After you run this, r will be right around 3.8, which couldn’t happen if what you said were true. Correlation measures the strength of the linear relationship between two variables; it may or may not capture whatever nonlinear relationships are present.
If the idea of changing your data is still worrisome, consider what happens if you have some data in feet and convert it to yards. You’ve just transformed the data, but you’ve preserved the correlations (even though you’ve changed every value by a factor of three).
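A quick check in R, with made-up numbers:

feet  <- c(3, 9, 15, 27)        # made-up measurements in feet
other <- c(1.2, 2.1, 2.9, 4.4)  # made-up second variable
yards <- feet / 3               # the same data, converted to yards
cor(feet, other)                # some value r...
cor(yards, other)               # ...the identical r: rescaling preserved it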
I’m very hesitant to say it’s OK to treat different years of the same data with different transformations. Unless you have some very good external reason (in other words, not based just on the data), preferably specified in advance before you’ve even collected the data, to think different years behave very differently, you really need to treat all years the same.
Now, it’s perfectly fine to do different transformations on different dependent variables (say, age and weight), or to do different transformations on the dependent and independent variable, but unless there’s a really good physical reason, you need to treat all sets of the same variable the same way.
Transformation can be very useful. Most statistical models and tests make certain assumptions about the data. In real life these assumptions tend to be, strictly speaking, false, but as long as they are close to true you should be OK. Often a transformed data set matches the assumptions more closely than an untransformed one. For example, in your regression problem, if there really is a log relationship between the variables and the regression model assumes a linear relationship, then you are better off transforming your data. You do not need to transform all components, although if they are two measures of the same thing it would be unusual not to.
That said, if you try 30 different transformations and only report the best one, you are cheating and should really use an alternative test that takes the multiple possible transformations into account. Using different transformations on subsets of the same data is highly unusual, and unless there is an extraordinary reason why you feel the subsets would be totally different in flavor, I would avoid doing so. A better way to proceed would be to use a regression model that includes multiple components, some linear and others various transformations of the data, and then fit this same model to all subsets of the data (see the sketch below), but that may require more statistics than you want to take on.
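Here’s a minimal sketch of that idea in R (the data and coefficients are invented purely for illustration):

set.seed(1)
x <- runif(60, 1, 50)                   # invented predictor
y <- 0.5 * x + 3 * log(x) + rnorm(60)   # invented response mixing linear and log parts
fit <- lm(y ~ x + log(x) + sqrt(x))     # one model with several candidate components
summary(fit)                            # fit this SAME formula to every subset (year)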
If you want to do away with transformations altogether, I suggest you look into non-parametric statistics, which make no assumptions about the data distribution or the form of the relationship but can be used to infer general monotonic trends (see, for example, the Spearman rank correlation).
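In R that’s a one-argument change to cor (again, invented data):

x <- 1:20
y <- exp(x / 4) + rnorm(20, sd = 0.5)   # invented, strongly nonlinear but monotonic
cor(x, y)                               # Pearson: well below 1 for this curved trend
cor(x, y, method = "spearman")          # Spearman: close to 1 for any monotonic trend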