Curve Extrapolation: Is this a valid method?

Two-part question on extrapolating data beyond a table of given values:

  1. If you have a table of five data points, for example, that make a smooth curve, I was taught you can extrapolate the table for any “x” value by making a column of the differences between adjacent “y” values on said table. If that doesn’t lead to a constant, you repeat this, creating another column of the differences between the first set of differences. And, you can repeat again until you reach a constant, or run out of data (because each column of differences will have one more blank spot than the previous). You can now find any distant point by working the differences backward, thereby summing up the differences to find the next point(s). (I’ve sketched this in code at the end of this post.)

Is this valid? And, does this method have a name? I am sure it is some kind of numerical method of using iterations to converge upon a solution.

  2. Also, I’d argue that by the last column, with only one difference to find, you can use that value as a constant, whether it is or not. My reasoning is that it would be no more or less accurate than plotting the points and, with a French curve, estimating the extension of the curve beyond the known data points.

What says the SD on this?
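
To make part 1 concrete, here is a rough Python sketch of the procedure I mean (the y = x^2 + 1 table is just a made-up example, and the names are mine):

    # Rough sketch of the difference-table method described above.
    # Assumes the x values are equally spaced; the sample data (y = x^2 + 1)
    # is made up purely for illustration.

    def extrapolate_next(ys):
        """Predict the next y by building columns of differences and then
        summing the last entry of each column back up."""
        columns = [list(ys)]
        while len(columns[-1]) > 1:
            prev = columns[-1]
            columns.append([b - a for a, b in zip(prev, prev[1:])])
        # Treat the final (single-entry) column as constant, as in part 2,
        # and work the differences backward to get the next value.
        return sum(col[-1] for col in columns)

    ys = [1, 2, 5, 10, 17]            # y = x^2 + 1 at x = 0, 1, 2, 3, 4
    print(extrapolate_next(ys))       # 26, which is indeed 5^2 + 1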

Briefly, your approach is equivalent to fitting a polynomial through your data points. If the underlying function is truly a polynomial of low enough order (four, if you have five points), and your data has no error in X or in Y, then yeah, this will work. Otherwise, if the data is slowly varying within the region covered by your data, you can get a reasonable approximation within that region, but arbitrarily large errors outside that region. If the data is not slowly varying within the sample region, then even within that region you can have large errors.
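
Here is a quick numpy illustration of that last point (the choice of exp(x) as the underlying function, and the sample points, are just for the example):

    import numpy as np

    # Five samples of a non-polynomial function (exp), chosen only for illustration.
    x = np.arange(5.0)                    # 0, 1, 2, 3, 4
    y = np.exp(x)

    coeffs = np.polyfit(x, y, 4)          # the unique degree-4 polynomial through 5 points

    print(np.polyval(coeffs, 2.0), np.exp(2.0))   # exact at a data point
    print(np.polyval(coeffs, 3.5), np.exp(3.5))   # within a percent or so inside the region
    print(np.polyval(coeffs, 8.0), np.exp(8.0))   # off by a factor of a few outside it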

Entire books have been written on this subject.

OK, so then using a French curve to extend a smooth curve (representing a known range of values) is equally valid or invalid, true?

I’ve never actually used a French curve, but yeah, you’d have essentially the same issues interpolating or extrapolating.

Any method that produces sufficiently accurate predictions outside of the training set (the data you have) is valid.

As ZenBeam notes, the method described in the OP is best suited for polynomials. If you have any amount of experience with derivatives, this explanation may help. When dealing with a discrete-time set of data, looking at the differences between values is analogous to looking at the derivative of a continuous-time function passing through the same points. Think of it this way: when you calculate a derivative in continuous time, you shrink delta X to an infinitesimal value and figure out the corresponding delta Y. In this case, the smallest delta X possible is the distance between adjacent points. So when you take the differences between points, you are looking at the first derivative with respect to X. If these values are all equal, then you have a linear function. If not, take the differences again to see the second derivative with respect to X. Keep going until you run out of differences to calculate or find a good match.
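
A small numpy illustration of the analogy, with the division by the step size h made explicit (when the spacing is 1, as in the OP's setup, the raw differences and the derivative estimates coincide; the cubic test function and the step size are arbitrary choices):

    import numpy as np

    # Repeated differences as discrete derivatives (illustration only;
    # the cubic test function and the step size h are arbitrary).
    h = 0.5
    x = np.arange(0.0, 5.0, h)
    y = x**3

    d1 = np.diff(y) / h            # approximates the first derivative, 3x^2
    d2 = np.diff(y, n=2) / h**2    # approximates the second derivative, 6x
    d3 = np.diff(y, n=3) / h**3    # third derivative of a cubic: constant 6

    print(d3)                      # all entries are 6, so the third differences are constant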

If the function you’re looking at is not a polynomial, you may be able to match your data fairly closely, but values outside of this range will likely be inaccurate.

To amplify what ultrafilter said, if you have those five points and nothing else to go on (see below), then no extrapolation can be justified. If you have some model to work with, then you can extrapolate using the model (which is presumably supported by other data you don’t have.) If you have other data with which to test your extrapolation, then you can use anything that works sufficiently well for those other data (essentially creating your own model of how this sort of data behaves.)

As an example of data you know nothing else about, consider these values measured with 1% precision in y (x is known perfectly) –



 x    y
0.0  0.00
1.0  1.00
2.0  2.03
3.0  2.97
4.0  3.98


With no other knowledge, you can’t say that x=5.0 will yield something near y=5.0, even though it “looks” like it will. Here are the next two points:



5.0  7.44
6.0  1.06x10^5


There are many real systems that behave this way (that is, exhibiting a threshold across which positive feedback takes over.)

There may be a sensible way to extrapolate your data, but there’s no way we can say what that is without more information.
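
For the table above, here is a rough numpy sketch of what the “obvious” extrapolation would have given (the straight-line fit is my choice; the comparison values are the ones quoted above):

    import numpy as np

    # The five measured points from the table above.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.00, 1.00, 2.03, 2.97, 3.98])

    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    print(slope * 5.0 + intercept)           # about 5.0, the "obvious" guess
    # The actual measurements quoted above were y(5.0) = 7.44 and y(6.0) = 1.06e5.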

CurveExpert is free software that you can use for curve-fitting and extrapolation.

Assuming that they’re not leaving some major features off the list, whoever wrote this doesn’t seem to understand the issue of overfitting, which is fundamental to any sort of modeling and extrapolation. It looks like very nicely put together software, but not something that’s useful for handling real problems.

Why do you think it suffers from overfitting? The example fit shown on the linked page doesn’t look to be overfit, even though the points don’t exactly fit the model. Also, from the CurveFinder screenshot, you’re able to specify which functions to use for the fit, and for polynomials the maximum degree to use.

I’m not saying that the software does suffer from overfitting; rather, that it seems to be designed to encourage overfitting. Imagine someone who sits down with a data set and uses the built-in facility to find the best fitting curve. It’ll probably look pretty good on those data points, and if the user doesn’t understand why that’s an issue, the software doesn’t seem to point that out.

If I’m wrong and the software does have an option like selecting the best curve by some form of cross-validation, then it’s very nice software.
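
If it helps, here is a rough sketch of the kind of check I mean, using leave-one-out cross-validation to pick a polynomial degree in numpy (the noisy quadratic data is made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 4.0, 10)
    y = x**2 + rng.normal(scale=0.5, size=x.size)    # noisy quadratic, made up

    def loo_error(degree):
        """Mean squared leave-one-out prediction error for a polynomial fit."""
        errors = []
        for i in range(x.size):
            keep = np.arange(x.size) != i
            coeffs = np.polyfit(x[keep], y[keep], degree)
            errors.append((np.polyval(coeffs, x[i]) - y[i]) ** 2)
        return np.mean(errors)

    for degree in range(1, 7):
        # Low degrees underfit; high degrees fit the retained points closely
        # but predict the held-out point worse. A degree near 2 should score well.
        print(degree, loo_error(degree))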

I’ve been playing around with this, it’s pretty nice.

Jinx, if you add a third column for standard deviation, and select Tools->Weighting Scheme->By uncertainty, you can choose which points it uses for the fit. For example, in the data below, the first two columns are the data (x and sin(x)). When you load it, check the box labeled “Last column is Std. Deviation Data”. The small numbers in the third column mean the data will be used as fit points, and the large numbers mean the data won’t be used (actually, the points are weighted by 1/S, where S is the number in the third column). So below, only the points with x = 2, 3, 4, 5, 6, 7, 8 are used for the fit. Now you can play around with different fits and see how well they interpolate and extrapolate the data. For example, fitting an Nth-order polynomial of degree 6, the sine is interpolated pretty well, but extrapolating beyond the ends of the fit gives a large error.


         0         0  99999.00
    0.5000    0.4794  99999.00
    1.0000    0.8415  99999.00
    1.5000    0.9975  99999.00
    2.0000    0.9093    0.0001
    2.5000    0.5985  99999.00
    3.0000    0.1411    0.0001
    3.5000   -0.3508  99999.00
    4.0000   -0.7568    0.0001
    4.5000   -0.9775  99999.00
    5.0000   -0.9589    0.0001
    5.5000   -0.7055  99999.00
    6.0000   -0.2794    0.0001
    6.5000    0.2151  99999.00
    7.0000    0.6570    0.0001
    7.5000    0.9380  99999.00
    8.0000    0.9894    0.0001
    8.5000    0.7985  99999.00
    9.0000    0.4121  99999.00
    9.5000   -0.0752  99999.00
   10.0000   -0.5440  99999.00
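
If you’d rather check the same idea without CurveExpert, here is a rough numpy analogue of that weighted fit (not the program’s actual interface): fit a degree-6 polynomial through just the seven low-sigma points and compare interpolation with extrapolation.

    import numpy as np

    # numpy analogue of the weighted fit described above (not CurveExpert itself):
    # use only the points that were given the small sigma, i.e. x = 2 through 8.
    fit_x = np.arange(2.0, 9.0, 1.0)               # 2, 3, 4, 5, 6, 7, 8
    coeffs = np.polyfit(fit_x, np.sin(fit_x), 6)   # degree-6 polynomial, exact through 7 points

    print(np.polyval(coeffs, 5.5), np.sin(5.5))    # interpolation: agrees closely
    print(np.polyval(coeffs, 10.0), np.sin(10.0))  # extrapolation: large error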


I’m a pure mathematician, not applied, so I’ll just point this out more plainly than it has been said so far:

The method in the OP gives a polynomial that passes exactly through all the points. That is generally not what you want in curve fitting. What you want is some class of curves (exponential, polynomial with a known fixed degree, logarithmic, etc.) given by a theoretical model that has a small number of parameters to be adjusted in the “fitting”. Then you want the curve not to pass through all the points in the data set, but to pass near them all.

Remember that every real-world application has error. If you have data that “should” behave quadratically, then the slightest error in one data point will usually turn the result of the OP’s method from a degree-2 polynomial into a degree-4 one. This is what “overfitting” is about, IIRC.
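
A quick numpy sketch of that effect (the quadratic data and the single perturbed point are made up):

    import numpy as np

    # Data that "should" be quadratic, with a slight error in one point (made up).
    x = np.arange(5.0)
    y = x**2
    y[2] += 0.1                          # the "slightest error" in one data point

    exact = np.polyfit(x, y, 4)          # degree-4 polynomial through all five points
    near = np.polyfit(x, y, 2)           # degree-2 fit that only passes near them

    print(np.polyval(exact, 10.0))       # roughly 195, far from the underlying value of 100
    print(np.polyval(near, 10.0))        # roughly 99, close to 100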