Calling all social scientists: multiple regression question

Ace_Face · May 9, 2002, 12:47am

I need to know, is there any way to do regression analysis on non-normal data? I’m analyzing data for the agency I work for. I want to determine the factors which predict our yearly spending per client. However, spending per client is rather skewed to the right – we have several outliers who get substantially more than other clients.

IIRC, one shouldn’t use regression in this case…at least that’s the stats 101 answer. However there are probably more advanced methods for dealing with this problem…anyone? I’m on the verge of regression aggression here.

Measure_for_Measure · May 9, 2002, 7:08am

Here’s a first cut:

Get a copy of Peter Kennedy’s A Guide to Econometrics. It’s very intelligible and has several pages on outliers which may or may not be helpful.

See especially Chapter 19: Robust Estimation, and section 19.2: Outliers and Influential Observations.

(Believe me, if you do any econometric work at all, you want Kennedy. It’s not expensive, BTW).

I’m not sure, but your data could be bounded from below. If so, this may be a problem of truncation. (Another useful catchword to use when flipping through the references.)
If there is a measurable effect that causes those observations to be outliers, your problem may be over once you include it as a regressor, since the errors could be normally distributed even if the dependent variable is not.
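A rough sketch of that last point, using made-up numbers rather than anyone's real data: suppose a hypothetical "high_need" flag explains the big spenders. Spending itself comes out heavily right-skewed, but the residuals left after accounting for the flag do not. The variable names, the 10% share, and the dollar amounts are all assumptions for illustration.

```python
import random
import statistics

def skewness(values):
    """Population skewness: mean of cubed standardized values."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

rng = random.Random(1)
n = 2000

# Hypothetical dummy: about 10% of clients are flagged as high need.
high_need = [1 if rng.random() < 0.10 else 0 for _ in range(n)]

# Assumed model: spending = 100 + 900 * high_need + normal error.
spending = [100 + 900 * d + rng.gauss(0, 20) for d in high_need]

# With a single dummy regressor, the OLS fitted values are the two group means.
group_mean = {
    g: statistics.fmean([s for s, d in zip(spending, high_need) if d == g])
    for g in (0, 1)
}
residuals = [s - group_mean[d] for s, d in zip(spending, high_need)]

skew_y = skewness(spending)       # strongly positive: the DV is skewed
skew_resid = skewness(residuals)  # near zero: the errors are not
print(round(skew_y, 2), round(skew_resid, 2))
```

If the residuals from your own model still look badly skewed after adding the obvious explanatory variables, this argument does not rescue you and the robust methods in Kennedy are the next stop.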

Measure_for_Measure · May 9, 2002, 7:12am

Also see Chapter 7, “nonzero expected disturbance”. And note that OLS is unbiased if the expected value of the errors is zero (even if they are not normally distributed).

It is for inference purposes that the assumption of normality is made.
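If you want to convince yourself of the unbiasedness claim, here is a small simulation sketch. It assumes a true model y = 2 + 3x + error, where the error is a right-skewed exponential draw shifted to have mean zero; all of those numbers are invented for the illustration. Averaged over many samples, the OLS slope should land very close to 3 even though the errors are nowhere near normal.

```python
import random
import statistics

def ols_slope_intercept(x, y):
    """Closed-form simple OLS for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def simulate_mean_slope(n_reps=2000, n_obs=50, true_a=2.0, true_b=3.0, seed=0):
    """Average the estimated slope over many simulated samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_reps):
        x = [rng.uniform(0, 10) for _ in range(n_obs)]
        # expovariate(1.0) has mean 1, so subtracting 1 gives a
        # right-skewed error with expected value zero.
        e = [rng.expovariate(1.0) - 1.0 for _ in range(n_obs)]
        y = [true_a + true_b * xi + ei for xi, ei in zip(x, e)]
        slopes.append(ols_slope_intercept(x, y)[1])
    return statistics.fmean(slopes)

mean_slope = simulate_mean_slope()
print(round(mean_slope, 3))
```

This only speaks to bias. The usual t-tests and confidence intervals are where non-normal errors can bite, especially in small samples.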

Spritle · May 9, 2002, 1:10pm

Flowbark is right on.

Another option would be to transform the data to make it more normal. Be careful, though: this can often destroy interpretability. Also, it may be impossible to “de-transform” the data.

To do a quick “mess-timation”, you could take the log of the dependent variable (money). This is the usual course taken when working with salary data. The distribution of salaries, however, might not match the distribution you have with your spending data.
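A quick sketch of what the log does, using simulated lognormal "spending" as a stand-in (an assumption; real spending data may be shaped differently). It also shows one face of the de-transform problem: exponentiating the mean of the logs gives the geometric mean, which sits below the ordinary average.

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: mean of cubed standardized values."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

rng = random.Random(42)

# Hypothetical spending per client, drawn from a lognormal distribution.
spending = [rng.lognormvariate(7.0, 1.0) for _ in range(2000)]
log_spending = [math.log(s) for s in spending]

skew_raw = skewness(spending)      # long right tail
skew_log = skewness(log_spending)  # roughly symmetric after the log

# Back-transforming the mean of the logs does NOT recover the mean spending.
arithmetic_mean = statistics.fmean(spending)
geometric_mean = math.exp(statistics.fmean(log_spending))

print(round(skew_raw, 2), round(skew_log, 2))
print(round(arithmetic_mean), round(geometric_mean))
```

So predictions from a logged model, once exponentiated, are closer to a typical (median-like) client than to the average client, which matters if the goal is a budget total.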

Ace_Face · May 13, 2002, 10:44pm

Thanks for the feedback. I bought the Kennedy book yesterday but haven’t really dug into it yet. I think flowbark’s 3rd point may be applicable to my data.

About the log transformations, I recall this vaguely. I can’t remember: if I transform the DV, do I also have to transform the IVs (even dummy IVs)?

Ace_Face · May 13, 2002, 10:50pm

Arg…ignore the dummy IVs comment…must be losing my mind.

Measure_for_Measure · May 15, 2002, 7:12am

In general, you should have logs on both sides of your equation (except for the dummies: you don’t want to take a log of zero).

OTOH, for:
ln Y = a + bX + error

That’s called the semi-log functional form. (See, it’s not cheating if you can use jargon!)

A haphazard mixture of logs and not-logs among the independent variables (except for the dummies) will look sort of strange though. Not recommended without good reason.
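For what it's worth, here is a minimal sketch of fitting the semi-log form and reading the coefficient. The data are simulated with assumed values a = 5 and b = 0.05, and the helper name fit_simple_ols is just something made up for the example. In this form, a one-unit increase in X multiplies Y by exp(b), which is roughly a 100*b percent change when b is small.

```python
import math
import random
import statistics

def fit_simple_ols(x, y):
    """Closed-form simple OLS for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(7)
true_a, true_b = 5.0, 0.05  # assumed values for the simulation

x = [rng.uniform(0, 10) for _ in range(2000)]
# Generate Y so that ln Y = a + bX + error holds with a normal error.
y = [math.exp(true_a + true_b * xi + rng.gauss(0, 0.2)) for xi in x]

# Regress ln Y on X, leaving X in its original units.
a_hat, b_hat = fit_simple_ols(x, [math.log(yi) for yi in y])

# Exact percent change in Y for a one-unit increase in X.
pct_change_per_unit = 100 * (math.exp(b_hat) - 1)

print(round(b_hat, 4), round(pct_change_per_unit, 2))
```

The same exp(b) - 1 reading applies to a dummy's coefficient, which is why leaving the dummies unlogged causes no trouble.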
