Statistics question--blatant request for homework help

I am running a multiple regression analysis of a dataset, and for the overall model I am getting an extremely low p-value (Pr>F). This seems to indicate strongly that the regression model is significant for the dataset. However, the R-square is also very low at less than one tenth of one percent. What would account for this? Is it because the data follows a nonlinear pattern, or what?

Thanks for your help,

Zarathustra

Gosh, where to begin.

I love MLR. It’s a great starting point. However, one must remember a few things.
1) MLR tries to account for variance in the Y[sub]i[/sub] observations.
2) Lots of “non-explanatory” variables can covary with the dependent variable.

You are finding a small p-value, indicating that your model is significantly better than no model at all. Not too difficult. The things to consider here are properties of your model.
a) the number of variables in the model - usually, heaping gobs of independent variables into a model tends to help explain the variance (some significantly and some not) - see #2 above.
b) as you include more variables in a model, the variables are less likely to be completely orthogonal. Multicollinearity is not your friend in MLR.
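If it helps to see point b) in action, here’s a minimal simulated sketch (Python with numpy/statsmodels; the data and numbers are entirely made up). Two nearly collinear predictors blow up each other’s standard errors, even though the underlying model is simple:

[code]
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is almost a copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)    # only x1 truly matters

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.bse)      # standard errors: inflated for both x1 and x2

# Variance inflation factor: how badly collinearity inflates each
# coefficient's variance (a VIF much bigger than 10 spells trouble)
for i in (1, 2):
    print(variance_inflation_factor(X, i))
[/code]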

The low R[sup]2[/sup] means that not a lot of variance was accounted for with your model. Sure, your model is better than none, but it still doesn’t account for much.

Diagnosing your problem here is kinda tough without knowing the model you are regressing.

Usually, when doing MLR, a theoretical model is built and tested. Then an alternative model is tested, and one looks for the increase in R[sup]2[/sup]. This enables one to compare models (and the addition of one variable). Some stat packages (SPSS) give conditional or partial correlation coefficients as a general guide (along with a predicted increase in R[sup]2[/sup]), but I find that these are only a good guide as to what variables you might wish to add to the next model. It invites ad hoc additions, almost to the level of “trolling for trends”.
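For what it’s worth, here’s a rough sketch of that model-comparison workflow (Python/statsmodels standing in for SPSS; all data simulated): fit the theoretical model, fit the alternative, and test the increase in R[sup]2[/sup] with an incremental F-test.

[code]
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 0.8 * df["x1"] + 0.3 * df["x2"] + rng.normal(size=n)

base = smf.ols("y ~ x1", data=df).fit()       # theoretical model
full = smf.ols("y ~ x1 + x2", data=df).fit()  # alternative model

print(full.rsquared - base.rsquared)  # the increase in R^2
print(anova_lm(base, full))           # F-test on adding x2
[/code]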

What does it all mean? Well, you may have a problem with multicollinearity with your predictors, or you may have an incredible amount of variance in your observations, or you may be using sucky variables in the model (e.g., using average annual rainfall and mean daily temperature to predict postal rates).

Does this help?

If it did or not, I got to use some of the coolest words ever! I don’t think I’ve used “multicollinearity” since grad school.

Oops! I forgot to mention sample size. If you have an enormous sample size, you will find significance in the weirdest places. It’s easier to see with ANOVA (if you are taking a regression course, I assume you’ve dealt with ANOVA and ANCOVA at least a little bit). Increasing sample size increases power. Remember effect size? MLR is part of the “General Linear Model” family, which encompasses other variance-explanation models (like some uses of ANOVA). Large samples (like those found in the GSS data sets so often used by students) will often show significant betas even when they are not very explanatory.
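In fact, here’s a toy simulation of exactly Zarathustra’s symptom (Python; every number invented): an enormous n and a real but minuscule effect give a tiny p for the overall model alongside a nearly-zero R[sup]2[/sup].

[code]
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000                        # enormous sample
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)  # real but minuscule effect

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.f_pvalue)  # tiny p: the model beats "no model"
print(fit.rsquared)  # ~0.0004: it still explains almost nothing
[/code]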

Spritle, work Heteroskedasticity in there, and I’ll be swooning.

I had something to say about low R-squared, but it’s probably not all that relevant. Carry on.

must… hold… back… can’t… keep…

MAHALANOBIS DISTANCE!!!

[sub]It’s fun to say words like “skew”.[/sub]

[sub]Little bump for Zara.[/sub]

No need to bump for my sake–I’ve already turned in the project. If you’re interested, though, feel free to keep the thread alive. Now that I have time, I’m going to try to just answer the question the old fashioned way (looking blearily at last semester’s stats textbook).

Thank god somebody brought up multiple regression analysis! I’ve been working with it non-stop for the past several months despite an overwhelming ignorance of statistics.

(I should note that the book “The Cartoon Guide to Statistics” has really helped.)

I’m modeling some geochemical gobbledygook that would only bore you to tears if I tried to explain it, if not render you comatose, but my models are turning out pretty well, at least in terms of r[sup]2[/sup], which is at least 0.85 for all of them (I don’t even bother presenting models with a lower value). Then I learned about p-values, and how one needs to keep 'em low (<0.05 if using a 95% C.I., for instance). So, my first question for Spritle, if you’re still watching, is:

“Get a low p-value”… does that mean just for the whole model, or for each independent variable? I use the Analyse-It add-on for Excel (http://www.analyse-it.com) which gives me p’s both for each variable and for the entire model. Sometimes, I get a model p <0.0001, but one of my variables might have a p around 0.3000. I should note that n > 11 for all, usually around 18, and that I try to keep it down to about 2-4 variables in the model (just things that obviously covary).

My second question concerns the F-test, which my stats program also provides. I’m told that “large F is good” but nowhere (not even in my Cartoon Guide to Statistics, which did a good job of explaining p to me) can I find just what F should be… how large is large? What’s large enough?

And I’ve had no stats… just a year of calculus and one semester of linear algebra.

Penn and Teller-ite, :slight_smile:

I’m gonna print this out and spend time tonight setting up a reasonable explanation for you.

BTW, I own and like the Cartoon Guide to Statistics. It’s not quite as “self explanatory” as CG to Physics, but it’s nice. I like the fact that it looks at probability in terms of gambling.

Also, don’t be too surprised about “some geochemical gobbledygook that would only bore you to tears if I tried to explain it, if not render you comatose”; my first undergraduate degree is in Chemistry!

Zara, hope you got my response in time and that it helped.

bump…

just to make it easier for Spritle to find this post today.

Sorry it took so long; here goes:

Your questions seem to center around the R[sup]2[/sup], p-values and F values, and what they show about “correctness” of a model. I’ll try to tackle each in turn.

R[sup]2[/sup] - This is the proportion of variance in the outcome variable that is explained by the regression equation model. It’s good if it’s a big number (close to 1). As you alluded to in your post, having a reasonable model that only accounts for a teeny bit of the variance is not that super.

p-value and F – This is where things get interesting. Statistics is necessary because we are unable to measure everyone in a specific population and must only sample that population. Now, we have to wonder if the results from our sample are truly representative of that population. This is the crux of Significance Testing and Confidence intervals.

<skip part about sum of squared residuals and F ratio calculation.>

In regression, we draw a hypothetical line to “predict” the outcome variable from a host of predictors. The line does not fit perfectly, so we have to measure how well it fits. The line is drawn to have the best fit. The fit is best when the “sum of squares” is as small as possible. (let’s leave it at that for now)
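(If you like seeing it in code: a minimal sketch of “best fit = smallest sum of squares” using numpy’s least-squares solver, with made-up data.)

[code]
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(size=50)

# Find the intercept and slope that minimize the sum of
# squared residuals -- that's all "least squares" means
A = np.column_stack([np.ones_like(x), x])
coef, ssr, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # roughly [2.0, 1.5]
print(ssr)   # the minimized sum of squared residuals
[/code]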

Suppose we took many many many samples from the population and ran a regression equation for each. Now let’s suppose that we took the measure of best fit to calculate an “F ratio”. If we calculate F ratios (from those sums of squares) and plot them for each sample, we get a distribution of those values that looks like (if coding works) this:


c |    *
  |   * *
o |  *   *
  |  *    *
u |  *      *
  | *         *
n | *            *
  |*                  *
t |*                        *
  |*_______________________________*____
  0                                    ~50
           F-ratio

It’s called the “F-distribution”, since statisticians are so darned creative.
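You can simulate this yourself: draw many samples under the null hypothesis (predictors unrelated to the outcome), compute each sample’s F, and the pile-up matches the textbook F distribution. A rough sketch (Python; the sample size and predictor count are arbitrary):

[code]
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, k = 30, 3                   # sample size, number of predictors
fs = []
for _ in range(5000):
    X = sm.add_constant(rng.normal(size=(n, k)))
    y = rng.normal(size=n)     # null: y unrelated to the predictors
    fs.append(sm.OLS(y, X).fit().fvalue)

# The simulated Fs should match the F(k, n-k-1) distribution
print(np.mean(fs))
print(stats.f.mean(k, n - k - 1))
[/code]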

Since we take 1 sample (or very few) and find one set of values, we are finding one point in the distribution. The question now becomes:

“Is this value an expected value or is it a rarity?”

The important part is if the value is a rare event. The question can be seen as:

“Is my value so far out that it must be from a different distribution, or is it just an extreme value from this one?”

We want to see if our value is representative of the population or due to chance. We often set the “chance value” (alpha) to .05 (for the 95% confidence interval – 95% of the time our value will be within this interval). This means that our value should be within 95% of the distribution. The “critical” F-ratio value cuts the distribution into 95% and 5%. It is shown below:


c |    *
  |   * *
o |  *   *
  |  *    *
u |  *      *            |
  | *         *          |
n | *            *       |
  |*                  *  |
t |*                     |  *
  |*_____________________|__________*____
  0                      |             ~50
           F-ratio       F[sub]crit[/sub]

The actual value of F[sub]crit[/sub] depends on the degrees of freedom (df) in the calculation of the ratio (I won’t explain df here, unless you want me to).
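(If you’d rather not dig through the tables in the back of a stats book: scipy can look up F[sub]crit[/sub] for any df, and turn an observed F into a tail probability. The df and F values below are made up.)

[code]
from scipy import stats

df_model, df_resid = 3, 26  # hypothetical degrees of freedom
alpha = 0.05

f_crit = stats.f.ppf(1 - alpha, df_model, df_resid)
print(f_crit)               # reject the null if the observed F > f_crit

# Equivalently, convert an observed F into a p-value
f_obs = 4.2                 # made-up observed F
print(stats.f.sf(f_obs, df_model, df_resid))  # upper-tail p-value
[/code]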

If we find a value of F for our sample that is greater than F[sub]crit[/sub], our F value lies in the extreme tail of the distribution (there is less than a 5% chance of seeing an F that large if the null hypothesis were true). Because we are testing a null hypothesis, which we want to reject, we want to find an F value larger than the critical value.

The answer to your second question about “how large is large? What’s large enough?” with respect to the F value depends on the degrees of freedom. As a general rule, 10 or better is encouraging.

Since the F[sub]crit[/sub] changes with each set of df and each distribution, it’s kind of tough to check each time. Most software packages do it for you and express the result as a p-value.

p-value – the proportion of the distribution beyond (more extreme than) your test value. Since you have alpha = .05, we want our F ratio to have less than 5% of the distribution to its right (< 5% “error”). This is why p-values less than .05 are good: if p is less than .05, less than 5% of the distribution is more extreme than your value, i.e., your F lies beyond F[sub]crit[/sub].

p-values are good for testing models. The p-values for individual predictors show the “worth” of that predictor in the model; you can consider removing a predictor if the p-value > .05 (if theory permits).
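Here’s a small sketch of where a package reports both kinds of p - the overall-model p and the per-predictor p’s (statsmodels standing in for Analyse-It; data simulated, with one deliberately useless predictor):

[code]
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 18                     # about the sample sizes mentioned above
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)    # a deliberately useless predictor
y = 1 + 0.9 * x1 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.f_pvalue)  # overall-model p (the Pr>F)
print(fit.pvalues)   # one p per coefficient: x2's should be large
[/code]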

If you need more, let me know.

Okay, it took a couple of readings but let me try to summarize and if you spot any misunderstandings, let me know!

  1. The bottom line is the p-value; if you’re trying to show the worthiness of your model, it’s all you need. Calculating an F is just a step to get to that value.

  2. If the p-value for an individual predictor (independent variable) is > 0.05 (assuming C.I. = 95%), then you can consider throwing it out… “if theory permits”. This is something I run into: I use as predictors variables that actually covary with the dependent variable (even if, individually, the R[sup]2[/sup] is only ~0.20), but sometimes one of those will come up with p >> 0.05. Since it actually does linearly covary, AND because including that variable usually gives a significant boost in R[sup]2[/sup], I like to keep it. I’m hoping what you said is “try not to keep that predictor, but if you’ve got to keep it because it fits theory, keep it anyway”.

I followed your discussion with my own data and handy-dandy TI-85. For one of my iffier (but more important) models, I’ve only got n = 11 and a total DF = 10 (the model sez that 7 are “about” regression and 3 are “due to” regression). There is also a sum of squares (SSq) for both the “due to” and “about” parts, and each has an MSq, where MSq = SSq/DF, and voila…

F = MSq(“due to”)/MSq(“about”).
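Here’s that same “due to”/“about” arithmetic done by hand in code (numpy/scipy rather than the TI-85; the data below are made up, not my real geochemistry):

[code]
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, k = 11, 3   # n = 11 observations, three predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
ss_due_to = np.sum((yhat - y.mean()) ** 2)  # "due to" regression, DF = 3
ss_about = np.sum((y - yhat) ** 2)          # "about" regression, DF = 7

F = (ss_due_to / k) / (ss_about / (n - k - 1))
p = stats.f.sf(F, k, n - k - 1)
print(F, p)
[/code]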

You say “10 or better is encouraging” - to clarify, is that for F or for DF? Or are we happy if F > DF? The model I refer to above with DF = 10 has F = 18.5 (p = 0.0010 and r^2 = 0.89, just FYI).

But I think I see what you’re saying: you could have data with a large spread (which I frequently do) that still gives a good r^2, but the model may not really be that good because of the large scatter (and thus a high SSq), which manifests as a low F.

Thanks, Spritle; If you need any analytical geochemistry done (rocks and soil only, please!), just let me know!

Pretty much! (Just don’t let this secret out or we statisticians will be out of work :wink: )

Sort of. Don’t remove a variable with a high p-value just because it’s high. Similarly, don’t keep a stupid variable just because it has a low p-value. (Watch out for multicollinearity here also; the first of two collinear variables will be significant while the second will not, regardless of the “value”.)

I meant 10 as a value for F (roughly, based on your df range). The value of F compared to df is not important.

BINGO!!

:slight_smile:

Well, thanks dude! I think I’ve got enough of the basics down now to have more confidence in my models (instead of always just getting by on a good R[sup]2[/sup]). The question came to me because I would frequently create three or four possible models (H[sub]a[/sub]'s, I guess) with R-squares the same or within 0.01 of each other, and I was hoping to find a new and better way to interpret these models - which you’ve given me. Again, thanks to Spritle for his efforts and lessons.

But thanks also to Zarathustra for bringing up MLR in the first place! Sorry to have usurped your thread.