(Probably a beginner's) question on dummy variables in linear regression

Schnitte · June 21, 2021, 11:37am

Clarification at the beginning: This is not a homework assignment of any kind; it’s just me, as someone with a non-quantitative background, trying to familiarise myself with this topic out of personal interest.

I’m playing around a little in Excel to learn about linear regression. I understand the basic idea that regression aims to find the coefficients that define the best function to calculate predicted values for the dependent variable on the basis of the values of the independent variables, with “best” meaning that function that has the lowest overall error. But I wonder how this can be used if some of my variables are of a non-quantitative nature. As I understand it, it is, in principle, possible to define dummy variables whereby qualitative categories are assigned a numerical value, and that value is then treated, in the model, like any other variable.

To illustrate what I mean, here’s an example: Suppose I wanted to build a model that predicts the purchase price of a house on the basis of variables such as the number of rooms or the year the house was built. So far, this is all quantitative data, so no problem; I can put that data into an Excel spreadsheet and tell Excel to give me the coefficients for the linear regression model. Now suppose I also want to take into account a yes/no issue, e.g. whether the house has aircon (which doesn’t go without saying where I live) or not. I can now add another variable “aircon”, set it to 1 for houses that are airconditioned and 0 for those that aren’t, and expand my model accordingly, right? And the coefficient for this variable will be the average price difference between a house with airconditioning and one without, all else equal, right? This would be methodologically sound?

Now suppose I also want to take location into account. Following the same logic as above, I might say that the city I’m looking at has, say, five different neightbourhoods, at prices differ between them. I could now assign a number from 1 through 5 for each neighbourhood, add a variable “location” which is set, for each house, to the number of its neighbourhood, and expand the model accordingly? Somehow, I have a feeling that this would produce garbage. The numbers for the neightbourhoods don’t have any actual numerical meaning; being in neighbourhood 4 isn’t twice as good as being in neighbourhood 2, it’s simply that 4 and 2 happen to be the names I allocated to the neighbourhoods. Am I right?

Hellestal · June 21, 2021, 11:48am

Yes

But you could have four dummy variables for five locations. Putting in a 0 for all four variables means the fifth location. The other four are defined with a 1, meaning NOT the fifth location.

You need to be careful about colinearity when you’re doing this. If you put in 5 variables for 5 locations, you can get the result that the fifth location is a linear combination of the other four. This creates an error that is almost identical to dividing by zero. Sophisticated stat software will let you know about this, I would guess Excel would also.

Topic		Replies	Views
Casio Calculators & Linear Regression Factual Questions	2	1648	March 11, 2003
MS Excel Q: Errors in Linear Regression? Factual Questions	0	677	February 4, 2008
econometrics (modeling question) Factual Questions	12	923	September 4, 2008
Statisticians: would this look reasonably smart, stupid, or simply nonsensical? In My Humble Opinion	2	1100	January 2, 2012
Statistics (Functions) Factual Questions	18	1251	October 25, 2000

(Probably a beginner's) question on dummy variables in linear regression

Related topics