Same problem, two different versions of it: is it fair to leave all these variables in a model when one is derived from another?
Let’s say you’re looking at income and you’re interested in the impact of being a teen mother. So you have age, gender, and teen-mother (the last is a dummy variable). Since age is continuous and you don’t have, by definition, any 30-year-olds who are teen mothers, does having both in there make the results uninterpretable? What if the dependent variable was not income, but a binary (so logistic regression)?
Other model: the dependent variable is binary and the model includes 5 ordinal categories for income quintiles, but also a dummy variable for “below federal poverty threshold” (which is derived from income tables based on household composition: number of adults, number of kids, presence of seniors). Again: is including both a problem, since one is derived from the other and since there’s no “meaning” (i.e., no real person) for someone being in the top quintile and also below poverty?
A lot of it depends on what you’re trying to do. In the first case, if you’re interested in current teen mothers, why are you even considering people above 30 in the first place? If you’re not, and you’re looking at, say, the lifetime impact of teen motherhood, then the flag does make sense.
In the second case, I’ve seen that sort of derived flag used to build decision trees, and sometimes it can give you a better or simpler model than just working with the raw data. For other types of models, it may or may not make sense. Again, what are you trying to do?
One way to attack it is to make age discrete by putting it into categories like 5-year blocks:
18-22; 23-27; 28-32; 33-37; etc.
Depending on how many teen-age moms you have at each age, the poverty level data could make a wicked awesome double-line chart comparing the percentage of teen mothers below the level to women who were not teen mothers by age.
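Here is a rough sketch of how the numbers behind that kind of chart could be put together. The records, the field order, and the `age_block` helper are all made up for illustration; real data would come from your own survey file.

```python
from collections import defaultdict

def age_block(age, start=18, width=5):
    """Return a label like '18-22' for the 5-year block containing age."""
    lo = start + ((age - start) // width) * width
    return f"{lo}-{lo + width - 1}"

# Hypothetical records: (age, was_teen_mother, below_poverty)
records = [
    (19, 1, 1), (21, 1, 1), (22, 0, 0), (20, 0, 1),
    (24, 1, 0), (26, 1, 1), (25, 0, 0), (27, 0, 0),
    (30, 1, 0), (29, 0, 0), (31, 0, 0), (32, 1, 1),
]

# counts[(block, teen_mother)] = [number below poverty, total in cell]
counts = defaultdict(lambda: [0, 0])
for age, teen_mother, below in records:
    cell = counts[(age_block(age), teen_mother)]
    cell[0] += below
    cell[1] += 1

# Percentage below the poverty line in each (age block, group) cell
pct_below = {
    key: 100.0 * below / total for key, (below, total) in counts.items()
}

for (block, teen_mother), pct in sorted(pct_below.items()):
    group = "teen mothers" if teen_mother else "other women"
    print(f"{block}  {group:12s}  {pct:5.1f}% below poverty line")
```

Each age block then gives one point on each of the two lines. Cells with only a handful of people will be noisy, so it is worth checking the totals before plotting.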
I guess what I’m trying to do is identify a subpopulation but keep the total population, too. So not just looking at teen moms, but including it as a predictor, along with other things, in a cross-sectional data set of the general population. So let’s say the dep variable is “probability you are below the poverty line.” The older you are, the less likely, and teen mothers are more likely. So keep “teen mother” as a dummy and age as continuous? (This is oversimplified, but the question is not the model, but rather whether there’s a multicollinearity problem with keeping age and an age-derived predictor both in the same model.)
There is a multicollinearity problem, but multicollinearity is usually not insurmountable. Speaking in general terms, it increases standard errors, which makes it less likely that your result will be statistically significant, but it won’t bias the results. My understanding is that you would be better off using current age and age at first childbirth, though. Turning a continuous variable into a dichotomous variable is usually a bad idea. You lose all the variability in the differences between a first childbirth at 14, 18, 21, 27, and 39.
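To put a rough number on "increases standard errors": with just two predictors, the variance inflation factor is 1 / (1 - r^2), where r is the correlation between them, and the standard error grows by about the square root of that compared with uncorrelated predictors. A minimal sketch, assuming a made-up list of ages and a simplified "under 20" dummy derived from it (a real teen-mother flag would also depend on motherhood):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ages; the dummy is derived from age (1 if under 20).
ages = [16, 17, 18, 19, 22, 25, 31, 38, 44, 52]
teen = [1 if a < 20 else 0 for a in ages]

r = pearson_r(ages, teen)
vif = 1.0 / (1.0 - r ** 2)        # variance inflation factor, two-predictor case
se_multiplier = math.sqrt(vif)    # approximate inflation of the standard error

print(f"correlation r = {r:.3f}")
print(f"VIF           = {vif:.2f}")
print(f"SE multiplier = {se_multiplier:.2f}")
```

A VIF near 1 means the derived flag costs you little precision; how large is "too large" is a judgment call that depends on your sample size.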
I’m still a little confused. Are you only interested in whether current teen mothers are more likely to be below the poverty line than the general population, or whether anyone who was ever a teen mother is more likely to be below the poverty line than the general population?
Given that you’re concerned about multicollinearity, I’d guess the first case, but in that case shouldn’t you be considering whether teen mothers are more likely than other teenagers to be below the poverty line? I’d expect teens in general to be much more likely to be below the poverty line than adults who’ve been out working, unless you’re considering their parents’ income.
Also, multicollinearity in a data set is mostly an issue for linear models. If you use some other technique, like decision trees, then you can worry about it a lot less. Information gain and Gini impurity both give you a way to assess the impact of a predictor on the response without reference to a particular model, so they might be superior here.
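For anyone who hasn't met these measures, here is a small sketch of how a binary predictor can be scored against a binary response using the Gini criterion used in tree splitting (usually called Gini impurity) and entropy-based information gain. The two lists are invented data, and the function names are just for this example.

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def split_gain(flag, y, impurity):
    """Drop in impurity of y after splitting on a 0/1 predictor."""
    left = [yi for f, yi in zip(flag, y) if f == 0]
    right = [yi for f, yi in zip(flag, y) if f == 1]
    n = len(y)
    weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / n
    return impurity(y) - weighted

# Hypothetical data: response and a derived dummy predictor
below_poverty = [1, 1, 1, 0, 0, 0, 0, 1]
teen_mom      = [1, 1, 1, 0, 0, 0, 0, 0]

print("Gini decrease:   ", round(split_gain(teen_mom, below_poverty, gini), 4))
print("Information gain:", round(split_gain(teen_mom, below_poverty, entropy), 4))
```

The same function can be run on each candidate predictor in turn, which gives a model-free ranking of how much each one separates the response.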
Thanks. I think the coding would look like this:
y =
Age
Mom (1=currently a mom with child at home who is <18 y.o.)
Teen mom (1=Mom and you are under 20)
Single mom (1= etc.)
Single Teen mom (=Teen mom*Single Mom)
etc. (obvious other predictors are hours working, education, etc.)
So, it’s not so much “how old were you when your kid was born” which gets at past teen moms, but “are you a teen and a mom now” that I’m interested in.
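In code, that coding scheme might look something like the sketch below. The function name and its three inputs are my own placeholders, not fields from any particular data set.

```python
def code_row(age, has_child_under_18_at_home, is_single):
    """Build the dummy variables for one respondent.

    All three inputs are hypothetical field names used for illustration.
    """
    mom = int(has_child_under_18_at_home)
    teen_mom = int(mom == 1 and age < 20)          # Mom and under 20 right now
    single_mom = int(mom == 1 and bool(is_single))
    single_teen_mom = teen_mom * single_mom        # interaction of the two dummies
    return {
        "age": age,
        "mom": mom,
        "teen_mom": teen_mom,
        "single_mom": single_mom,
        "single_teen_mom": single_teen_mom,
    }

# A few made-up respondents
for person in [(18, True, True), (19, True, False), (30, True, True), (17, False, True)]:
    print(code_row(*person))
```

Note that `teen_mom` can never be 1 for anyone aged 20 or over, which is exactly the built-in dependence on age that the question is about.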
Make sense?
(Ultrafilter: I’m one of those people that runs and hides from data mining. I need to learn it someday, but for now I have the old-fashioned “what the heck is that” attitude. Sorry.)
I take it you are looking at some sample of women who are variously moms/not moms, married/not married, teens/not teens? It almost looks more like you are talking about a 3-way interaction, because teen and mom are two variables, and then you are adding in single.
Mom * Age < 20 * Unmarried
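If it is treated as a 3-way interaction, a common convention is to include the lower-order terms as well, which makes seven terms in total. A quick sketch that lists them and evaluates each for one made-up respondent (the factor names are placeholders):

```python
from itertools import combinations

factors = ["mom", "under_20", "unmarried"]

def interaction_terms(names):
    """All products of 1, 2, ..., len(names) factors, lowest order first."""
    terms = []
    for order in range(1, len(names) + 1):
        terms.extend(combinations(names, order))
    return terms

def term_value(term, row):
    """Value of a product term for one respondent's 0/1 indicators."""
    value = 1
    for name in term:
        value *= row[name]
    return value

# Hypothetical respondent: a married teen mom
row = {"mom": 1, "under_20": 1, "unmarried": 0}

for term in interaction_terms(factors):
    print(" * ".join(term), "=", term_value(term, row))
```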
I still think you are not doing your model a favor by dichotomizing age. You could do what’s called a spline, where you have two different linear relationships, one up until 19 and one after 20.
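A linear spline with one knot only needs one extra column in the model: max(age - knot, 0). Here is a minimal sketch, assuming a knot at 20 and coefficients that are invented purely to show how the slope changes; in practice they would be estimated by your regression.

```python
def spline_basis(age, knot=20):
    """Columns for a linear spline in age: [1, age, max(age - knot, 0)]."""
    return [1.0, float(age), float(max(age - knot, 0))]

def predict(age, coefs, knot=20):
    """Linear predictor for a given age under the spline coding."""
    return sum(c * x for c, x in zip(coefs, spline_basis(age, knot)))

# Hypothetical coefficients: intercept, slope below the knot, change in slope at the knot
coefs = [0.0, 2.0, -1.5]

slope_before = predict(19, coefs) - predict(18, coefs)   # slope below the knot
slope_after = predict(31, coefs) - predict(30, coefs)    # slope above the knot

print("slope before knot:", slope_before)
print("slope after knot: ", slope_after)
```

The coefficient on the extra column is the change in slope at the knot, so testing whether it differs from zero is a direct test of whether the age relationship is different for teens, without throwing away the continuous age information.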