Clarification at the beginning: This is not a homework assignment of any kind; it’s just me, as someone with a non-quantitative background, trying to familiarise myself with this topic out of personal interest.
I’m playing around a little in Excel to learn about linear regression. I understand the basic idea that regression aims to find the coefficients that define the best function to calculate predicted values for the dependent variable on the basis of the values of the independent variables, with “best” meaning that function that has the lowest overall error. But I wonder how this can be used if some of my variables are of a non-quantitative nature. As I understand it, it is, in principle, possible to define dummy variables whereby qualitative categories are assigned a numerical value, and that value is then treated, in the model, like any other variable.
To illustrate what I mean, here’s an example: Suppose I wanted to build a model that predicts the purchase price of a house on the basis of variables such as the number of rooms or the year the house was built. So far, this is all quantitative data, so no problem; I can put that data into an Excel spreadsheet and tell Excel to give me the coefficients for the linear regression model. Now suppose I also want to take into account a yes/no issue, e.g. whether the house has aircon (which doesn’t go without saying where I live) or not. I can now add another variable “aircon”, set it to 1 for houses that are airconditioned and 0 for those that aren’t, and expand my model accordingly, right? And the coefficient for this variable will be the average price difference between a house with airconditioning and one without, all else equal, right? This would be methodologically sound?
Now suppose I also want to take location into account. Following the same logic as above, I might say that the city I’m looking at has, say, five different neightbourhoods, at prices differ between them. I could now assign a number from 1 through 5 for each neighbourhood, add a variable “location” which is set, for each house, to the number of its neighbourhood, and expand the model accordingly? Somehow, I have a feeling that this would produce garbage. The numbers for the neightbourhoods don’t have any actual numerical meaning; being in neighbourhood 4 isn’t twice as good as being in neighbourhood 2, it’s simply that 4 and 2 happen to be the names I allocated to the neighbourhoods. Am I right?