Correlation and percent error

I don’t think there is a natural relationship between the correlation coefficient and expected percent error, but maybe there is. Specifically, using R^2: is there a general rule (it does not have to be mathematically precise) where, if R^2 = n, I should expect the error between a value estimated from the line of best fit and the actual measured value to be about X%?

Here is what I am trying to do. I am teaching high school students about correlation and how it indicates how closely reality matches theory (the LOBF). My thought was that once students create a scatterplot from sample data, they can compare measured data against the LOBF. The theory being: if the correlation is high, their data points will be close to the estimated values as measured by percent difference.

For example, let’s compare city population to median home price. There are 10 sample cities, and from these the students create a scatterplot; say the line of best fit is price = 0.5*population + 100000. Now they look up New Orleans, with a population of 364,136. The median home price would be estimated to be $282,068. If the median home price is really $250,000, that is a difference of 12.8%. The question then is: if R^2 = 0.6, is an error of 12.8% reasonable?
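A quick sketch of that arithmetic in plain Python (the numbers are the ones from the example above):

```python
# Line of best fit from the example: price = 0.5 * population + 100000
def predicted_price(population):
    return 0.5 * population + 100_000

population = 364_136      # New Orleans, from the example
actual_price = 250_000    # actual median home price, from the example

estimate = predicted_price(population)
percent_error = abs(estimate - actual_price) / actual_price * 100

print(f"estimate = ${estimate:,.0f}")           # $282,068
print(f"percent error = {percent_error:.1f}%")  # 12.8%
```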

If there is a completely different way to get the students to see that lower correlation = bigger differences between real data and the LOBF values, I’m all ears. And again, it does not have to be mathematically rigorous, just conceptual.

LOBF = line of best fit.

I have never heard best fit line called line of best fit or its acronym.

If I had a detector of radar, I could exceed the limit of speed with impunity!

One of many uses of the phrase.

IANA statistician. Not by a long shot. But IMO you’re barking up a dead tree.

The idea that the correlation coefficient is a measure of the quality of correlation is pretty fundamental. You might try teaching that point directly by coming up with fully random data that is known to be random (rolls of a pair of different-colored fair dice), and demonstrating that, yes, a CC value can be computed. But it’s useless for predicting anything.
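A minimal sketch of that dice demo, assuming NumPy (the seed and the 100-roll sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Two independent fair dice, red and blue, rolled 100 times each.
red = rng.integers(1, 7, size=100)
blue = rng.integers(1, 7, size=100)

# A CC can always be computed -- but for independent dice it hovers
# near zero, and knowing red tells you nothing about blue.
r = np.corrcoef(red, blue)[0, 1]
print(f"r = {r:.3f}")
```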

To demonstrate the opposite pole, take almost perfectly correlated data and compute a CC from that. WAG idea: height of a subset of class members versus the wrist-to-elbow measurement of the same students. That pair will have a very high CC, and the resulting fit will in fact accurately predict the same measurements for any class member, whether or not they were included in the defining sample.
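A sketch of that demo with simulated measurements (the roughly 6.5:1 height-to-forearm ratio and the noise level here are made-up illustrative values, not real anthropometry):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Simulated class: wrist-to-elbow lengths in cm, with height roughly
# proportional plus a little individual variation.
forearm = rng.uniform(24, 29, size=20)
height = 6.5 * forearm + rng.normal(0, 3, size=20)

r = np.corrcoef(forearm, height)[0, 1]
print(f"r = {r:.3f}")  # close to 1: forearm length predicts height well

# Fit the LOBF and predict a student who was not in the sample.
slope, intercept = np.polyfit(forearm, height, deg=1)
print(f"predicted height for a 26 cm forearm: {slope * 26 + intercept:.0f} cm")
```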

From there I’d tackle some plausible real-world data with an intermediate CC, mostly to show that real-world data is not going to have a CC of either near zero or near ±1. Subject to the caveat that if you do find near zero, it’s a good bet the hypothesis that led you to compare those two data sets in search of a connection is rubbish.

IMO, trying to assign or compute another number representing a plausible range of goodness of fit, somehow derived from the CC, is a confusion and probably a fool’s errand.

I wouldn’t say that correlation is a measure of percent error, because percent error depends on the magnitude of the observation. What correlation really captures is a comparison of two different kinds of variability: the variability of the underlying signal versus the variability of the noise.
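To make that concrete before the example below: in the toy model where each reading is the same underlying signal plus independent noise, the theoretical correlation works out to signal variance / (signal variance + noise variance). A sketch (the signal and noise spreads are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def corr_of_two_noisy_readings(signal_sd, noise_sd, n=100_000):
    """Correlation between two independent noisy readings of one signal."""
    signal = rng.normal(0, signal_sd, size=n)
    reading1 = signal + rng.normal(0, noise_sd, size=n)
    reading2 = signal + rng.normal(0, noise_sd, size=n)
    return np.corrcoef(reading1, reading2)[0, 1]

# Same +/- 1 degree thermometer noise in both cases; only the
# spread of the thing being measured changes.
print(corr_of_two_noisy_readings(signal_sd=0.5, noise_sd=1.0))  # ~0.2
print(corr_of_two_noisy_readings(signal_sd=8.0, noise_sd=1.0))  # ~0.98
```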

As an example, suppose I take a thermometer that is generally accurate to +/- one degree and take two different measurements of body temperature. If I plot those pairs, they will have rather poor correlation, since most people’s body temperatures are nearly identical (e.g., the pairs may be {(98, 99), (97, 98), (98, 97)}). These have lousy correlation because the error is of similar magnitude to the signal. But suppose instead I measure the outside temperature at different times of day and night. Then you may get pairs like {(45, 46), (51, 52), (63, 62), (55, 56)}, and you will have a much higher correlation, because what you are measuring is much more variable than the error.
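Running the numbers on exactly those two sets of pairs (NumPy again):

```python
import numpy as np

# The pairs from the example above: (first reading, second reading).
body = np.array([(98, 99), (97, 98), (98, 97)])
outdoor = np.array([(45, 46), (51, 52), (63, 62), (55, 56)])

r_body = np.corrcoef(body[:, 0], body[:, 1])[0, 1]
r_outdoor = np.corrcoef(outdoor[:, 0], outdoor[:, 1])[0, 1]

print(f"body temperature pairs:    r = {r_body:.3f}")     # 0.000: noise swamps signal
print(f"outdoor temperature pairs: r = {r_outdoor:.3f}")  # 0.997: signal swamps noise
```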

That’s what I remembered from stats class. I was hoping I remembered wrong.