I apologize at the outset for my phrasing (and ignorance), but I think my question will become clear.
My Question: When determining the “best” fit for a relation between two variables, does one have to assume a particular functional relationship, or is there a way of determining the best type of function?
i.e. Does one have to say, "OK, given that there is a polynomial function that relates y to x, what are the best values for the coefficients?" or "Assuming there's a sinusoidal relation…", or "Assuming there's an exponential …"?
Is there a way (? calculus of variations) that determines the best type of function to relate two sets of variables, and not just a way of determining the best coefficients for an assumed functional relationship?
I feel strange trying to teach math to Karl Gauss, especially when my memory is a bit hazy. I vaguely remember that "best fit" algorithms rely on assuming the type of function. Freshmen always assume a linear relation when they use "least squares", but more advanced students can use non-linear relations as well. I believe the general technique for finding a best fit is called regression, and the workhorse is the method of least squares, which Gauss himself is credited with. You might try reading up on that, or just get your great grandfather to explain it to you!
I don’t know if this is the best way, since it’s more an alternative than an answer. But I do seem to vaguely recall that power series (or something similar) could describe almost any function, so that may be one way to relate data; i.e. choose a relation that covers most if not all possibilities.
Now that I think about it, it may be that even a simplified optimizing method would only work well by using one relation.
I'm guessing you reject the trial & error[sup]*[/sup] method as too crude, though it could be done quickly on a computer, and if what you're after is the 'best possible fit' it might even be simpler, provided you already have the basic curve-fitting algorithms coded up.
panama jack
[sup]*[/sup] by this I mean calculating the coefficients for every possible function, and comparing the 'fit accuracy' metrics that are produced (I think called 'r-value' in some cases).
I see. So, the choice to use a low-order approximation is based on the fact that you'd need a huge data set to use a high-order power series approximation, and probably also on one's hypothesis (a linear or time-varying relation, etc.).
Is it guaranteed that a power series with a higher highest power always gives a better (or equal) fit than one with a lower highest power? i.e., if one series is a sum of terms a[sub]k[/sub]x[sup]k[/sup] up to k = m and another a sum up to k = n, with m > n, can the second series never give a better fit than the first?
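For what it's worth, for least-squares fitting the answer is yes whenever the smaller set of powers is contained in the larger one: the extra coefficients can always be set to zero, so the minimized residual can never get worse. (Whether that makes it a better model is another question, which comes up below.) A rough Python sketch with made-up data:

[code]
# Least-squares residuals can only shrink (or stay equal) as the highest power
# goes up, because the lower-degree fit is a special case of the higher-degree
# one (just set the extra coefficients to zero). Hypothetical noisy quadratic:
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.3, x.size)

for degree in (1, 2, 3, 6):
    coeffs = np.polyfit(x, y, degree)                 # least-squares fit
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)    # sum of squared residuals
    print(degree, rss)                                # never increases with degree
[/code]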
I’m a very big fan of least square fitting and the like. But my iron-clad rule is to use the best fitting function to start out with.
You can't really get exactly the function you want by tacking on a lot of polynomial terms. These don't deal well with cusps, for instance. And just try to fit 1/x using only polynomials of the form a[sub]n[/sub]x[sup]n[/sup]. And just for fun, look up the Gibbs phenomenon (usually encountered with Fourier series). Even with an infinite number of terms, you can't exactly represent a function with an abrupt discontinuity in terms of sines and cosines – that "flange" doesn't go away.
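To make the 1/x point concrete, here is a small Python sketch; the interval, the degrees, and the equally spaced sample points are arbitrary choices for illustration:

[code]
# Fitting 1/x with plain polynomials a_n * x^n: the hardest region is near the
# end where 1/x blows up, and the worst-case error comes down only slowly as
# the degree climbs.
import numpy as np

x = np.linspace(0.05, 2.0, 400)
y = 1.0 / x

for degree in (3, 6, 9):
    coeffs = np.polyfit(x, y, degree)
    worst = np.max(np.abs(np.polyval(coeffs, x) - y))
    print(degree, worst)    # compare against y.max(), which is 20
[/code]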
As for a program that will automatically find the "best fit", I'm sure you can make something that will behave like that – but there are an infinite number of functions. How do you choose the "best" fit? How do you even DEFINE the "best fit"?
I guess the "best fit" could be defined as the curve for which the sum of the distances of the data points from the curve (or something similar) is minimized. This definition, I suspect, would NOT lead to a unique curve if the curve was based on a finite data set.
You also need to factor in the number of degrees of freedom of your fit. Given n points, for instance, it’s always possible to fit them all exactly with a polynomial of order n-1, but this will not, in general, be a good fit: It’ll tend to go all over the place in between the points, rather than a smooth interpolation. In general, the number of degrees of freedom of your fit (i.e., the number of numbers that you have to plug in) should be significantly less than the number of points you have. For instance, to fit a line, your function is y=mx+b, and you need to specify m and b, so you really ought to have at least 5 or 6 points.
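A quick Python illustration of that degrees-of-freedom point, with invented, roughly linear data:

[code]
# Eleven roughly linear points: a degree-10 polynomial can hit every one of
# them exactly, but between the points it is free to swing far outside the
# spread of the data, while the 2-parameter line cannot. Data are made up.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 11)
y = 0.5 * x + 1.0 + rng.normal(0.0, 0.5, x.size)

exact = np.polyfit(x, y, x.size - 1)   # degree 10: passes through every point
line = np.polyfit(x, y, 1)             # two numbers to plug in: m and b

x_fine = np.linspace(0.0, 10.0, 500)
for name, fit in (("exact", exact), ("line", line)):
    curve = np.polyval(fit, x_fine)
    print(name, curve.min(), curve.max())   # range of each curve between points
[/code]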
More often, the definition used is the curve that minimizes the sum of the squares of the distances, which does (usually) give a unique answer, given some function type: This is the “least squares” method that CalMeacham mentioned. Least absolute value does have its advantages, though, depending on the nature of the data you’re trying to fit.
In least square fitting the object is to minimize the sum of the squares of the distances of the data points from the curve. It’s equivalent to running ideal springs from the points to the curve, restricting them to only run vertically, and letting things “even out”.
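For concreteness, here is a bare-bones least-squares line fit in Python, with made-up numbers; the quantity being minimized is exactly that sum of squared vertical distances:

[code]
# A bare-bones least-squares line fit: choose m and b to minimize the sum of
# squared vertical distances from the points to the line. Data are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

A = np.column_stack([x, np.ones_like(x)])            # design matrix for y = m*x + b
(m, b), rss, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
print(m, b, rss)   # slope, intercept, and the minimized sum of squared residuals
[/code]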
But the reason I asked about the definition of “best fit” is that it isn’t always that simple. Sometimes you want to give some points more emphasis than others, so you use a variable “weighting factor”. If you’re plotting log data on one axis and linear on the other you will have to use variable weighting if the uncertainties in raw (un-logged) data are equal. And so on. There are a lot of possible scenarios. I find that I have to judge them case by case.
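A hedged sketch of the log-data weighting case in Python; the data, the uncertainty, and the decaying exponential are all invented, and the weighting follows the usual rule that log(y) has uncertainty roughly sigma/y when y has uncertainty sigma:

[code]
# If every raw y has the same uncertainty sigma, then log(y) has uncertainty
# roughly sigma/y, so a straight-line fit to log(y) should weight each point
# by y/sigma (numpy's polyfit takes weights of 1/uncertainty). Invented data.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 25)
sigma = 5.0                                    # same uncertainty on every raw y
y = 100.0 * np.exp(-0.8 * x) + rng.normal(0.0, sigma, x.size)
y = np.clip(y, 1.0, None)                      # crude: keep the log defined

unweighted = np.polyfit(x, np.log(y), 1)       # scattered small-y points get too much say
weighted = np.polyfit(x, np.log(y), 1, w=y / sigma)
print(unweighted)
print(weighted)    # the underlying decay rate is 0.8 (slope -0.8 in log space)
[/code]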
It’s an interesting problem. Look it up in “Numerical Recipes” by Press et al., or in Philip Bevington’s “Data Reduction and Error Analysis for the Physical Sciences” or any of the other books on the topic.
I should explain why the issue is of interest to me.
In Medicine, and medical research in particular, there are lots of observed (cor)relations. Using observational data, the investigator often demonstrates a statistically significant linear or log-linear correlation. When I look at the curve, I can imagine other curves that might also have fit, that might have done the trick, if only they'd been given the chance. And these different curves imply a different underlying physiology, and carry different insights as to "what's really happening".
(Indeed, I’m sure this is a problem throughout science)
So, in a sense, my question still stands: Is there an objective way of choosing the assumed functional relationship between sets of variables? How do we know that we should use a third-degree polynomial, a linear relationship, etc.? Chronos alluded to this when he said that, "In general, the number of degrees of freedom of your fit (i.e., the number of numbers that you have to plug in) should be significantly less than the number of points you have." This seems a bit like the tail wagging the dog.
Perhaps this issue is addressed in the references that you suggested, Cal. However, despite my apparent heritage, I am sure that I won’t understand them. I’m looking for a non-rigorous, arm-waving, type of answer, that might not even exist.
Man, it’s been about 20 years since I thought about this stuff, so I may be out of date. Generally, no, there is no objective standard that says, aha! these data points are best fit by an exponential curve rather than a polynomial.
The trade-off, if I understand you correctly, is fit vs smoothness. If you want the curve to exactly fit the data points, then you lose smoothness (n-th degree differentiability). If you want the curve to be as smooth as possible, then you give up exactness of fit.
Given a curve that approximates a set of data points, there are tests that can measure the relative "goodness" of fit vs smoothness. However, given a set of points, there is no way to determine in advance that a piece-wise polynomial will fit the curve better or worse than a log function, say.
People tend towards simpler functions, such as straight lines, log/log straight lines, 2nd degree polynomials, because they’re easier to work with.
Get a book on numerical analysis, rather than simply a statistics text, for detailed discussion.
Of course the benefit of using linear (or log-linear, etc.) forms is that they are easy to compute. You can also try non-linear regression, which I think is what you are proposing; it works by what is essentially iterative estimation. Almost always you are trying to minimize some function of the errors. So imagine that the unknowns parameterizing the model create a "landscape" of errors, like a topographic map. You start with some seed, which puts you on some hillside of that map. Then you find out which way the "hill" slopes downward and pick new parameters in that direction. You keep doing this until you find a "local" minimum. Now, like any topographic map, there may be several minima, and you are hoping you stumble onto the Marianas Trench. Luckily, most common non-linear functional forms, like logistic functions, have only one obvious minimum, and you are likely to find it. This is very simplified and probably very confusing.
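For anyone who wants to see the "landscape" idea in code, here is a minimal Python sketch; the logistic model, the fake data, and the seed are purely illustrative, and a general-purpose minimizer stands in for the walk-downhill procedure described above:

[code]
# The "error landscape" picture in miniature: define the sum of squared errors
# as a function of the parameters, start from a seed, and let an iterative
# minimizer walk downhill until the slope flattens out (a local minimum).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(-4.0, 4.0, 60)
y = 1.0 / (1.0 + np.exp(-1.5 * (x - 0.5))) + rng.normal(0.0, 0.05, x.size)

def sse(params):
    a, b = params                      # slope and midpoint of the logistic
    model = 1.0 / (1.0 + np.exp(-a * (x - b)))
    return np.sum((model - y) ** 2)    # height of the landscape at (a, b)

seed = np.array([0.5, 0.0])            # where we start on the map
result = minimize(sse, seed)           # follows the downhill direction iteratively
print(result.x)                        # should land near the true (1.5, 0.5)
[/code]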
OK, now that I have made no sense, there are programs out there that will basically do this for you. Neural networks are very popular, for example, although I do not know how easy it is to get hold of a neural-network modelling program.
I doubt there is an entirely objective way to determine which curve to use to fit a set of points. If two different curves agree within the error bounds with the data you are trying to fit, over the range of that data, I’m not sure how you could clearly choose between them without some outside information. If your data looks parabolic, e.g., how do you know it isn’t really a cosine if you go farther out?
In time-series analysis, there are methods for determining the number of degrees of freedom (i.e. model order), given a model type (autoregressive, ARMA, etc.). I don't see why you couldn't come up with a criterion for deciding the order in a power series approximation. You just decide on a metric, and find the minimum. Of course, the right metric might depend on the data.
In real life, either you have an idea of the relationship from knowledge of the physical processes going on, or there isn't a simple relationship between the variables, and you try to pick a function that keeps the degrees of freedom low while also keeping down the metric you measure fitness with (mean square error, absolute error, etc.).
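Here is one way the order-picking criterion mentioned above could look in Python, using an AIC-style penalty as the metric; the data and the choice of penalty are illustrative assumptions, not a recommendation:

[code]
# "Decide on a metric and find the minimum" for a power-series order: fit each
# candidate degree, score it with a criterion that charges for extra parameters
# (an AIC-style score here), and keep the lowest score. Data are invented and
# actually come from a cubic.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-2.0, 2.0, 40)
y = 1.0 - 2.0 * x + 0.7 * x**3 + rng.normal(0.0, 0.3, x.size)

scores = {}
for degree in range(1, 9):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    k = degree + 1                                          # fitted coefficients
    scores[degree] = x.size * np.log(rss / x.size) + 2 * k  # fit term + complexity penalty
print(min(scores, key=scores.get))                          # the degree the criterion picks
[/code]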
I believe this does not answer the OP but it is related and might interest readers of the thread.
The way I understand it, the OP is “how can I know which mathematical function will best approximate a given set of data?” In other words, we are trying to analyze the behavior of the variable.
But if you just have a set of data and want a usable function to represent it, I have found Chebyshev polynomials are a very practical way of doing this.
You can take any set of data and get a function that will approximate it with as much precision as you like. The resulting function is very easy to compute.
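A minimal Python sketch of that, using numpy's built-in Chebyshev support; the tabulated function, the interval, and the degree are arbitrary examples:

[code]
# Chebyshev approximation with numpy's built-in support: fit a Chebyshev series
# to tabulated data, then evaluate it cheaply anywhere on the interval.
import numpy as np
from numpy.polynomial import Chebyshev

x = np.linspace(0.0, np.pi, 50)
y = np.sin(x) * np.exp(-0.3 * x)           # stand-in for "any set of data"

approx = Chebyshev.fit(x, y, deg=8)        # least-squares fit in the Chebyshev basis
print(np.max(np.abs(approx(x) - y)))       # worst error on the tabulated points
[/code]

For smooth data like this the error drops quickly as you raise deg, and evaluating the fitted series is cheap, which is the practical appeal.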
I wanted to find out a bit about Chebyshev Polynomials and went to my usual web resource for things mathematical, Eric Weisstein’s World of Mathematics. But, alas, it’s off-line. A copyright problem. If you’re interested, visit the URL and lend your support.
It goes beyond that. If you fit a curve to data points, you're taking a step toward explaining and understanding the physics/chemistry/whatever behind it. The more terms and coefficients you have to use, the harder it will be to explain and the less likely it is to be "true" (Occam's Razor and all that). If you already have some idea of the physics, then you already know the shape of the curve anyway.
There are any number of curve-fitting programs for your PC that will spit back a correlation coefficient (representing the average “closeness” of your data points to the curve) for various assumed curve types, along with your numerical coefficients. If you have the patience, the simplest scientific calculator will do that, too.
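A home-made version of what those programs report, sketched in Python with invented data and an arbitrary menu of curve types:

[code]
# Fit several assumed curve types to the same data and print a goodness number
# (R^2 here) for each, the way a curve-fitting package would.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)
x = np.linspace(1.0, 10.0, 30)
y = 3.0 * x**1.5 + rng.normal(0.0, 2.0, x.size)

candidates = {
    "straight line": lambda x, a, b: a * x + b,
    "quadratic":     lambda x, a, b: a * x**2 + b,
    "power law":     lambda x, a, b: a * x**b,
}
ss_tot = np.sum((y - y.mean()) ** 2)
for name, f in candidates.items():
    params, _ = curve_fit(f, x, y, p0=[1.0, 1.0])
    ss_res = np.sum((y - f(x, *params)) ** 2)
    print(name, 1.0 - ss_res / ss_tot)      # R^2: closer to 1 means a tighter fit
[/code]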
Best yet is to calculate your experimental error and put max-error circles around the points, then try to fit the simplest curve that goes through the circles. If you still can’t do that, then re-examine your error calculations, or rerun the point.
As I noted above, the best answer is the function closest to the theoretical form. The trick is knowing that form, of course. It's not always easy.
There's a story that Max Planck used his famous function to fit the blackbody radiation curve. He had, so the story goes, obtained it empirically. He was told that it fit the data better than it had a right to, so he used the formula to "reverse engineer" the underlying physics.
I don't know if the story's true, but it's at least plausible.
This situation happens all the time. If you have several possible functions and ask which one fits best, you can do a simple chi-square comparison. First, find the best-fit parameters for each type of function. For each of these functions you get a chi-squared value, which is the sum, over the data points, of the squared difference between the best-fit function and the data, each divided by the squared measurement uncertainty of that point. You then calculate the reduced chi-square, which is the chi-squared divided by the number of degrees of freedom. The number of degrees of freedom is the number of data points minus the number of parameters used in the fit.
The resulting reduced chi-square is an objective measure of how well you were able to fit the function. Ideally it would be around 1. If it’s much larger (say, more than 4) you can be pretty sure that the function you chose is not a good model for your data.
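A short Python sketch of that recipe, assuming you know the measurement uncertainty on each point; the two candidate models, the data, and the sigma are all invented:

[code]
# The reduced chi-square recipe, assuming the measurement uncertainty sigma is
# known for every point. Both candidate models, the data, and sigma are invented.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(6)
x = np.linspace(0.0, 5.0, 40)
sigma = 0.1
y = 2.0 * np.exp(-0.7 * x) + rng.normal(0.0, sigma, x.size)

models = {
    "straight line": lambda x, a, b: a * x + b,
    "exponential":   lambda x, a, b: a * np.exp(-b * x),
}
for name, f in models.items():
    params, _ = curve_fit(f, x, y, p0=[1.0, 0.5], sigma=np.full(x.size, sigma))
    chi2 = np.sum(((y - f(x, *params)) / sigma) ** 2)
    dof = x.size - len(params)          # data points minus fitted parameters
    print(name, chi2 / dof)             # near 1 for a good model; well above 1 otherwise
[/code]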
I strongly second the recommendation for “Numerical Recipes.” It’s very practical and not hard to understand for experimentalists like you (and me). Also, “Data Reduction and Error Analysis for the Physical Sciences” by P. Bevington and D. Robinson is a very good book on this subject.