Refining "percentage error" calculation

A and B produce an estimated value of some quantity. Each is estimating a separate value.

A’s estimate (EA): 35
B’s estimate (EB): 2980

It turns out the actual values were:

A’s actual value (AA): 42
B’s actual value (AB): 2541

Who is the better estimator?

Using the classic percentage error calculation: P = “|actual - estimated| ÷ actual”

A’s %err (PA): |42 - 35| ÷ 42 = 16.7%
B’s %err (PB): |2541 - 2980| ÷ 2541 = 17.3%
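
For concreteness, here’s how that baseline looks as a quick Python sketch (the function name is just my own label):

[code]
def pct_error(actual, estimated):
    """Classic percentage error: |actual - estimated| / actual."""
    return abs(actual - estimated) / actual

print(pct_error(42, 35))      # A: 0.1667 -> 16.7%
print(pct_error(2541, 2980))  # B: 0.1728 -> 17.3%
[/code]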

The goal is to provide some objective clue* as to who is the better estimator. By this measure, A and B are equally skilled at estimation - each is off by about 17%. The smaller the P, the “better” the estimator, so it could even be said A is slightly better. This isn’t a fair comparison, though, if we know B’s job was tougher in some other way.

There are three refinements I want to add to the calculation above. Each should reduce P by some amount, and each reduction will likely contain some “F-factor” to allow tweaking and tailoring.

  1. Scale (P’s first refinement: PR[sub]1[/sub])
    Let’s assume the difficulty in estimating scales with the value’s magnitude - the bigger the value, the harder it is to estimate. In that case, B did way better with EB.

How can I modify the formula to reflect this? Something like PR[sub]1[/sub] = P * (1 - S(A)), where S(A) = some function of the actual value A that starts at zero and horizontally asymptotes to 1 as A increases without bound.

It’s been too long since I’ve done this kind of math. Using Wolfram Alpha I landed on a formula for a horizontal asymptote with a “magic number” that can be adjusted to make the formula fairer by moving the “knee of the curve” left or right (which can be set based on discussion, historical analysis, trial/error).

First attempt: S(A) = A / sqrt(A[sup]2[/sup] + F), where F moves the “knee” left or right. (With F = 100 million, A’s error would get a near-zero reduction and B’s a 25% reduction, i.e. PR[sub]1[/sub]B = 13%.)
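
As a sanity check, here’s a minimal Python sketch of that first attempt (the function names are mine, and F = 100 million is just the sample value from above):

[code]
import math

def scale_discount(actual, F=100_000_000):
    """S(A) = A / sqrt(A^2 + F): near zero for small A, approaching 1
    as A increases without bound. F slides the 'knee' left or right."""
    return actual / math.sqrt(actual**2 + F)

def pr1(p, actual, F=100_000_000):
    """First refinement: discount the raw error P by the scale factor."""
    return p * (1 - scale_discount(actual, F))

print(scale_discount(42))    # ~0.004 -> A's error is barely reduced
print(scale_discount(2541))  # ~0.246 -> B's error is cut by ~25%
print(pr1(0.173, 2541))      # ~0.130 -> PR1_B = 13%
[/code]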

Is there a better approach?
  2. 0-n complexifying factors (P’s second refinement: PR[sub]2[/sub])
    There could be any number of independent factors that make one estimation tougher than another. Let’s assess each of these factors on a scale (1-5 or 1-10). Some factors may apply to any estimate; some may not apply at all.

So say A’s value had one complexifying factor and B’s had two, each assessed on a scale of 1-5. Then we’re looking at something like this:

PR[sub]2[/sub]A = PA * (1 - C[sub]1[/sub])
PR[sub]2[/sub]B = PB * (1 - C[sub]1[/sub]) * (1 - C[sub]2[/sub])

Is it just a simple matter of defining C[sub]x[/sub] = assessment / scale? Is that fair? I guess it should be assessment / (scale + F), where F > 0, so that a maximum assessment can’t drive C to 1 and unconditionally reduce PR to 0%.
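
For illustration, a Python sketch of that reading (the F = 1 and the sample assessments are hypothetical values I made up):

[code]
def complexity_discount(assessment, scale=5, F=1):
    """C = assessment / (scale + F); any F > 0 keeps C below 1,
    so the error is never discounted all the way to 0%."""
    return assessment / (scale + F)

def pr2(p, assessments, scale=5, F=1):
    """Second refinement: apply one (1 - C) discount per factor."""
    for a in assessments:
        p *= 1 - complexity_discount(a, scale, F)
    return p

print(pr2(0.167, [3]))     # A: one factor assessed 3 of 5 -> ~0.084
print(pr2(0.173, [3, 4]))  # B: two factors assessed 3 and 4 -> ~0.029
[/code]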

  3. It’s better to overestimate (P’s third refinement: PR[sub]3[/sub])
    I’m not even sure where to begin on this, but the fact of the matter is that, all other things being equal, the estimate larger than the actual is the better estimate. Ordinarily I would say it’s not ideal to reduce P to reflect this - better to drop the absolute value and keep the sign to show which is which - but a number of PRs will be averaged to assess a final “score,” if you will.

So perhaps: PR[sub]3[/sub] = P * (1 - if(estimate > actual, F, 0))
Where F is some value between 0 and 1 (say .1 to start)
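
In Python, that might look like the following sketch. Note the condition compares the raw values rather than the sign of P, since with the difference defined as actual - estimated, an overestimate comes out negative:

[code]
def pr3(actual, estimated, F=0.1):
    """Third refinement: shave F off the error when the estimate is high,
    since overestimating is preferred to underestimating."""
    p = abs(actual - estimated) / actual
    if estimated > actual:   # overestimate -> apply the discount
        p *= 1 - F
    return p

print(pr3(42, 35))      # A underestimated: 0.1667, no discount
print(pr3(2541, 2980))  # B overestimated: 0.1728 * 0.9 = ~0.155
[/code]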

Maybe a better measure is a separate metric based on the count of overs vs. the count of unders.

The GQs are: Is the math sound? How can it be improved? Is the approach sound?


* I use the word “clue” to acknowledge that no formulaic model can definitively declare who is the better estimator, but it does make a worthwhile contribution to an overall comparison.

Rather than saying the math is sound, I’ll state the math is not unsound.

You are trying to come up with an objective quantification for what is essentially a subjective problem - which measure of “error” best serves the purpose.

I’ll note that percentage error is only one kind of error measurement; you yourself have proposed a number of others. It may be a misunderstanding on your part, but there’s rarely a single, canonical measurement accepted by everybody. It’s why a basic statistics course takes an entire semester (and still doesn’t cover everything).

The “best” error calculation is going to be the one that has most of the features you desire and few of the features you don’t. There’s no error measure that is objectively better for all problems.

Clearly, you have a notion of which measurement is ‘better’. Now the trick is only to quantify your gut feelings into something you can consistently use. You use the word ‘fair’ but fair is in the eye of the beholder. Until you can quantify all the factors you consider relevant, it’s going to be tough to offer substantive suggestions for any error measurement that will satisfy your problem.

The “natural” solution, if the difficulty in estimating increases with scale, is to use the logarithm in some way. If you divide the percentage error by the (natural) logarithm of the actual answer, you get

for A: (|35 - 42| / 42) / Log(42) = .0446
for B: (|2980 - 2541| / 2541) / Log(2541) = .0220
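
A quick Python check of those figures:

[code]
import math

def pct_error_over_log(actual, estimated):
    """Percentage error divided by ln(actual)."""
    return (abs(actual - estimated) / actual) / math.log(actual)

print(pct_error_over_log(42, 35))      # A: ~0.0446
print(pct_error_over_log(2541, 2980))  # B: ~0.0220
[/code]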

On the other hand you could take the log of everything, and get this result

for A: (Log 35 - Log 42) / Log(42) = -.0488
for B: (Log 2980 - Log 2541) / Log(2541) = .0203
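
The same check for the log-of-everything variant; the sign falls out for free, negative for underestimates and positive for overestimates:

[code]
import math

def log_error(actual, estimated):
    """(ln(estimated) - ln(actual)) / ln(actual)."""
    return (math.log(estimated) - math.log(actual)) / math.log(actual)

print(log_error(42, 35))      # A: ~-0.0488 (under)
print(log_error(2541, 2980))  # B: ~+0.0203 (over)
[/code]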

Either way, you see that B is better, which is the correct answer if you expect errors in estimates to be proportional to the magnitude of the underlying phenomenon. But I agree with AntiBob - the choice of metric for “goodness of estimate” is a subjective one.

Thank you both for your comments.

I agree with the comments on subjectivity. At least, I think so - based on what I’m understanding you to say:

What I mean is: If someone told A, “B is better than you,” A would naturally ask “Why do you say that? How do you know?” Without metrics the conversation degrades to anecdotes and depends on memory and personality. If a formula (with all of the inputs labelled) is used, a more practical conversation can take place, and both A and B are better protected against unjustified opinions. They each have higher confidence in the assessment. There is definitely controversy about the approach (what can a number really tell us?) but many (myself included) are convinced an imperfect number is a better conversation starter than mere recollection and hearsay.

So what are the things important to us in estimating? The three points I mentioned, arrived at by bouncing ideas off each other. Assuming we nailed the factors (subjectively arrived at), we have something we can objectively measure. Right?

I should add - the idea isn’t to pit A against B; I just used two individuals as an example of the comparison. We’re really striving to define objective standards that individuals can a) negotiate and b) compare themselves against. The idea is: “15% is a reasonable expectation. Last year A averaged 20% and B averaged 6%. What were A’s obstacles, and what can we learn from B?” Where did 15% come from? Our gut, based on playing with the numbers in a consensus-building exercise.

One of the fears that drove me to ask these questions: what if we produce a formula with an unfair bias - some systematic mismeasurement? Say “A” always gets the lower-valued quantities and so never benefits from the “forgiveness” built into the formula, while B, who routinely gets the higher ones, always does. In other words, did we shape the horizontal asymptote “fairly” in the case of the “scale” refinement? (I haven’t yet digested the implications of using Log() - I probably won’t get a chance until later tomorrow.)

I guess, said another way: not having taken a semester’s worth of statistics, am I playing with fire?