# Calling all social scientists: multiple regression question

I need to know, is there any way to do regression analysis on non-normal data? I’m analyzing data for the agency I work for. I want to determine the factors which predict our yearly spending per client. However, spending per client is rather skewed to the right – we have several outliers who get substantially more than other clients.

IIRC, one shouldn’t use regression in this case…at least that’s the stats 101 answer. However, there are probably more advanced methods for dealing with this problem…anyone? I’m on the verge of regression aggression here.

Here’s a first cut:

1. Get a copy of Peter Kennedy’s A Guide to Econometrics. It’s very intelligible and has several pages on outliers which may or may not be helpful.

See especially Chapter 19, “Robust Estimation”, and section 19.2, “Outliers and Influential Observations”.

(Believe me, if you do any econometric work at all, you want Kennedy. It’s not expensive, BTW).

2. I’m not sure, but your data could be bounded from below. If so, this may be a problem of truncation. (Another useful catchword to use when flipping through the references.)

3. If there is a measurable effect that causes those observations to be outliers, you can include it as a regressor and your problem may be over, since the errors could be normally distributed even if the dependent variable is not.
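To make that last point concrete, here is a minimal sketch with simulated data. Everything in it is an assumption for illustration: the 5% share of "high-cost" clients, the dollar amounts, and the variable names are invented, not taken from the agency's data. It shows a dependent variable that is strongly right-skewed while the regression residuals are not, because a dummy regressor accounts for the outliers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical setup: about 5% of clients are in a high-cost program (dummy = 1).
high_cost = (rng.random(n) < 0.05).astype(float)
errors = rng.normal(0.0, 1.0, n)                # normally distributed errors
spending = 10.0 + 40.0 * high_cost + errors     # skewed only because of the dummy


def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


# OLS by least squares on [intercept, dummy].
X = np.column_stack([np.ones(n), high_cost])
beta = np.linalg.lstsq(X, spending, rcond=None)[0]
residuals = spending - X @ beta

print("skewness of spending :", round(skewness(spending), 2))
print("skewness of residuals:", round(skewness(residuals), 2))
print("estimated coefficients:", np.round(beta, 2))
```

If the residual skewness stays large after adding the regressors you think explain the outliers, this explanation probably does not apply to your data.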

Also see Chapter 7, “nonzero expected disturbance”. And note that OLS is unbiased if the expected value of the errors is zero (even if they are not normally distributed).

It is for inference purposes that the assumption of normality is made.
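The unbiasedness claim can be checked with a small Monte Carlo sketch. The true slope, sample size, and the choice of shifted exponential errors are all assumptions picked for illustration; the only property that matters is that the errors are right-skewed but have mean zero and are independent of the regressor.

```python
import numpy as np

rng = np.random.default_rng(1)
true_slope = 2.0
n, n_sims = 50, 5000

# Fixed regressor values, reused in every simulated sample.
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])

slopes = np.empty(n_sims)
for i in range(n_sims):
    # Exponential errors shifted to mean zero: strongly right-skewed, not normal.
    e = rng.exponential(3.0, n) - 3.0
    y = 1.0 + true_slope * x + e
    slopes[i] = np.linalg.lstsq(X, y, rcond=None)[0][1]

# The average estimate should sit very close to the true slope of 2.0.
print("mean of estimated slopes:", round(float(slopes.mean()), 3))
```

The simulation says nothing about whether the usual t-tests and confidence intervals are reliable in small samples; that is the part that leans on normality.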

Flowbark is right on.

Another option would be to transform the data to make it more normal. Be careful, though: this can often destroy interpretability. Also, it may be impossible to “de-transform” the data.

To do a quick “mess-timation”, you could take the log of the dependent variable (money). This is the usual course taken when working with salary data. The distribution of salaries, however, might not match the distribution you have with your spending data.
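A quick way to see what taking the log of the money variable does is to compare skewness before and after. This sketch uses simulated lognormal "spending" as a stand-in, which is an assumption: real spending data may not be lognormal, and the log only works if every value is strictly positive.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical right-skewed spending figures (lognormal by construction).
spending = rng.lognormal(mean=7.0, sigma=1.0, size=5000)


def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


# np.log requires strictly positive values; zeros would need separate handling.
log_spending = np.log(spending)

print("skewness before log:", round(skewness(spending), 2))
print("skewness after log :", round(skewness(log_spending), 2))
```

On your own data, a histogram of the logged variable is the real test of whether the transformation helped.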

Thanks for the feedback. I bought the Kennedy book yesterday but haven’t really dug into it yet. I think flowbark’s 3rd point may be applicable to my data.

About the log transformations, I recall this vaguely. I can’t remember: if I transform the DV, do I also have to transform the IVs (even dummy IVs)?

Arg…ignore the dummy IVs comment…must be losing my mind.

In general, you should have logs on both sides of your equation (except for the dummies: you don’t want to take a log of zero).

OTOH, for:

ln Y = a + bX + error

That’s called the semi-log functional form. (See, it’s not cheating if you can use jargon!)
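Here is a minimal sketch of fitting the semi-log form, using simulated data with made-up true values for a and b. In this form a one-unit increase in X multiplies Y by exp(b), which for small b is roughly a 100*b percent change.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Assumed "true" parameters, chosen only for illustration.
a_true, b_true = 5.0, 0.08
x = rng.uniform(0.0, 10.0, n)
y = np.exp(a_true + b_true * x + rng.normal(0.0, 0.3, n))

# Regress ln Y on an intercept and the untransformed X.
X = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

# Exact percent change in Y for a one-unit increase in X.
pct_change = (np.exp(b_hat) - 1.0) * 100.0

print("estimated b:", round(float(b_hat), 4))
print("percent change in Y per unit of X:", round(float(pct_change), 2))
```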

A haphazard mixture of logs and not-logs among the independent variables (except for the dummies) will look sort of strange though. Not recommended without good reason.
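For comparison, the "logs on both sides, dummies left alone" setup can be sketched like this. The variable names (`income`, `in_program`) and the true coefficients are hypothetical. With logs on both sides, the coefficient on the logged IV reads as an elasticity: a 1% change in that IV goes with roughly that many percent change in Y.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

# Hypothetical regressors: one positive continuous IV and one 0/1 dummy.
income = rng.lognormal(mean=10.0, sigma=0.5, size=n)
in_program = (rng.random(n) < 0.3).astype(float)

# Simulated ln Y with assumed true coefficients 0.6 (elasticity) and 0.25 (dummy).
log_y = 1.0 + 0.6 * np.log(income) + 0.25 * in_program + rng.normal(0.0, 0.2, n)

# Log the continuous IV, leave the dummy untransformed (log of zero is undefined).
X = np.column_stack([np.ones(n), np.log(income), in_program])
coef = np.linalg.lstsq(X, log_y, rcond=None)[0]
elasticity, dummy_effect = coef[1], coef[2]

print("estimated elasticity  :", round(float(elasticity), 3))
print("estimated dummy effect:", round(float(dummy_effect), 3))
```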