econometrics (modeling question)

drhess · August 23, 2008, 3:33pm

Same problem, two different versions of it: is it fair to leave all these variables in a model when one is derived from another?

Let’s say you’re looking at income and you’re interested in the impact of being a teen mother. So you have age, gender, and teen-mother (the last is a dummy variable). Since age is continuous and you don’t have, by definition, any 30-year-olds who are teen mothers, does having both in there make the results uninterpretable? What if the dependent variable was not income, but binary (so logistic regression)?
Other model: the dependent variable is binary and the model includes 5 ordinal categories for income quintiles, but also a dummy variable for “below federal poverty threshold” (which is built from income tables based on household composition: number of adults, number of kids, presence of seniors). Again: is including both a problem, since one is derived from the other and since there’s no “meaning” (i.e., no real person) for being in the top quintile and also below poverty?

Thanks.

LurkMeister · August 23, 2008, 4:32pm

A link to the relevant Staff Report would be appreciated. Unless this was intended to be posted somewhere else.

drhess · August 23, 2008, 4:53pm

wrong forum. meant for general forum. sorry.

bibliophage · August 23, 2008, 8:03pm

Off to GQ

ultrafilter · August 23, 2008, 8:16pm

A lot of it depends on what you’re trying to do. In the first case, if you’re interested in current teen mothers, why are you even considering people above 30 in the first place? If you’re not, and you’re looking at, say, the lifetime impact of teen motherhood, then the flag does make sense.

In the second case, I’ve seen that sort of derived flag used to build decision trees, and sometimes it can give you a better or simpler model than just working with the raw data. For other types of models, it may or may not make sense. Again, what are you trying to do?

Saint_Cad · August 23, 2008, 8:25pm

I agree, what are you trying to do?

One way to attack it is to make age discrete by putting it into categories like 5-year blocks:
18-22; 23-27; 28-32; 33-37; etc.
Depending on how many teen-age moms you have at each age, the poverty level data could make a wicked awesome double-line chart comparing the percentage of teen mothers below the level to women who were not teen mothers by age.

drhess · August 23, 2008, 8:26pm

I guess what I’m trying to do is identify a subpopulation but keep the total population, too. So not just looking at teen moms, but including that as a predictor, along with other things, in a cross-sectional data set of the general population. So let’s say the dependent variable is “probability you are below the poverty line.” The older you are, the less likely, and teen mothers are more likely. So keep “teen mother” as a dummy and age as continuous? (This is oversimplified, but the question is not the model, but rather whether there’s a multicollinearity problem with keeping age and an age-derived predictor both in the same model.)

Harriet_the_Spry · August 23, 2008, 8:35pm

There is a multicollinearity problem, but multicollinearity is usually not insurmountable. Speaking in general terms, it increases standard errors, which makes it less likely that your result will be statistically significant, but it won’t bias the results. My understanding is that you would be better off using current age and age at first childbirth, though. Turning a continuous variable into a dichotomous variable is usually a bad idea. You lose all the variability in the differences between a first childbirth at 14, 18, 21, 27, and 39.
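
To put a rough number on the "increases standard errors" point, here is a minimal Python sketch. It assumes a linear model with only two predictors, in which case each one's variance inflation factor (VIF) is 1 / (1 - r^2), where r is the correlation between them. The ages and teen-mother flags are made up for illustration, not taken from any real survey, and the function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

# Made-up sample: age in years and a current-teen-mother dummy.
ages = [16, 17, 18, 19, 19, 22, 25, 31, 38, 44, 52, 60]
teen_mom = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

r = pearson_r(ages, teen_mom)
vif = vif_two_predictors(ages, teen_mom)
se_inflation = math.sqrt(vif)  # factor by which the standard errors grow

print(f"r = {r:.3f}, VIF = {vif:.3f}, SE inflation = {se_inflation:.3f}")
```

In ordinary least squares the standard error of each coefficient grows by about the square root of the VIF compared with uncorrelated predictors, while the coefficient estimates themselves stay unbiased.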

ultrafilter · August 23, 2008, 8:42pm

I’m still a little confused. Are you only interested in whether current teen mothers are more likely to be below the poverty line than the general population, or whether anyone who was ever a teen mother is more likely to be below the poverty line than the general population?

Given that you’re concerned about multicollinearity, I’d guess the first case, but in that case shouldn’t you be considering whether teen mothers are more likely than other teenagers to be below the poverty line? I’d expect teens in general to be much more likely to be below the poverty line than adults who’ve been out working, unless you’re considering their parents’ income.

ultrafilter · August 23, 2008, 8:46pm

Also, the multicollinearity of a data set is mostly an issue for linear models. If you use some other technique, like decision trees, then you can worry about it a lot less. Information gain and the Gini coefficient both give you a way to assess the impact of a predictor on the response without reference to a particular model, so they might be superior here.
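
As a rough illustration of those two measures, the Python sketch below scores a single binary flag against a binary response. It assumes the tree-style definitions (tree software usually calls the Gini measure "Gini impurity"), and the data, variable names, and function names are all made up for the example.

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1-p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def split_gain(flag, y, impurity):
    """Drop in impurity from splitting y on a 0/1 flag."""
    left = [yi for f, yi in zip(flag, y) if f == 1]
    right = [yi for f, yi in zip(flag, y) if f == 0]
    n = len(y)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(y) - weighted

# Made-up data: teen-mother flag and below-poverty-line response.
teen_mom = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
below_poverty = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

gini_gain = split_gain(teen_mom, below_poverty, gini)
info_gain = split_gain(teen_mom, below_poverty, entropy)
print(f"Gini gain = {gini_gain:.4f}, information gain = {info_gain:.4f} bits")
```

A larger gain means the flag separates the two response classes better; a flag that carries no information gives a gain of zero.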

drhess · August 23, 2008, 9:07pm

Thanks. I think the coding would look like this:
y =
Age
Mom (1=currently a mom with child at home who is <18 y.o.)
Teen mom (1=Mom and you are under 20)
Single mom (1= etc.)
Single Teen mom (=Teen mom*Single Mom)
etc. (obvious other predictors are hours working, education, etc.)

So, it’s not so much “how old were you when your kid was born” which gets at past teen moms, but “are you a teen and a mom now” that I’m interested in.

Make sense?

(Ultrafilter: I’m one of those people that runs and hides from data mining. I need to learn it someday, but for now I have the old-fashioned “what the heck is that” attitude. Sorry.)
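
For concreteness, the coding listed above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name, argument names, and dictionary keys are hypothetical, and "teen" is taken to mean under 20, as in the list.

```python
def code_row(age, has_child_under_18_at_home, is_single):
    """Build the dummy variables for one respondent."""
    mom = 1 if has_child_under_18_at_home else 0
    teen_mom = 1 if (mom == 1 and age < 20) else 0
    single_mom = 1 if (mom == 1 and is_single) else 0
    # Interaction term: 1 only when both component dummies are 1.
    single_teen_mom = teen_mom * single_mom
    return {
        "age": age,
        "mom": mom,
        "teen_mom": teen_mom,
        "single_mom": single_mom,
        "single_teen_mom": single_teen_mom,
    }

# Made-up respondents: (age, child under 18 at home, single)
respondents = [(17, True, True), (19, True, False), (30, True, True), (45, False, False)]
rows = [code_row(*r) for r in respondents]
for row in rows:
    print(row)
```

By construction teen_mom can only be 1 when age is under 20, which is the source of the overlap with the continuous age variable.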

Harriet_the_Spry · August 24, 2008, 1:03am

I take it you are looking at some sample of women who are variously moms/not moms, married/not married, teens/not teens? It almost looks more like you are talking about a 3-way interaction, because teen and mom are two variables, and then you are adding in single.

Mom * Age < 20 * Unmarried

I still think you are not doing your model a favor by dichotomizing age. You could do what’s called a spline, where you have two different linear relationships, one up through age 19 and one from age 20 on.
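
The spline idea can be sketched in Python as a piecewise-linear basis. This is a minimal sketch that assumes a single knot at age 20; the function names and the coefficients are made up for illustration and are not estimates from any data.

```python
def linear_spline_basis(age, knot=20.0):
    """Columns for a linear spline: intercept, age, and the hinge term max(0, age - knot)."""
    return [1.0, float(age), max(0.0, age - knot)]

def predict(age, coefs, knot=20.0):
    """Linear predictor: the slope is coefs[1] below the knot and coefs[1] + coefs[2] above it."""
    return sum(b * x for b, x in zip(coefs, linear_spline_basis(age, knot)))

# Made-up coefficients: intercept, slope before the knot, change in slope after the knot.
coefs = [0.9, -0.02, 0.015]

for age in (16, 19, 20, 25, 40):
    print(age, round(predict(age, coefs), 4))
```

Because the hinge term is zero at the knot, the two line segments join at age 20 instead of jumping, and age stays continuous rather than being collapsed into a teen/not-teen flag.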

drhess · September 4, 2008, 4:59pm

Thanks. I think my brain has “splined” itself just thinking about this. You are right about the sample and the number of interaction effects.
