I haven’t been in a statistics class in almost 15 years…
Hypothetical scenario:
I want to provide an incentive for the fast performance of a task and discourage slow performance.
As individuals complete the task, I have been maintaining a simple running average of the completion times and using that average to determine where a given individual falls on the incentive/disincentive spectrum.
However, it quickly became apparent that people who take an extremely long time to perform the task pull the average up, making it progressively easier for slower and slower times to qualify for the incentive.
So I want to use a different formula, one that eliminates only the high outliers from the calculation used to determine who is eligible for the incentive or disincentive, while keeping all of the fast times in the calculation.
Should I be using standard deviation? Interquartile range? Is what I’m looking for considered a median, or is another term more applicable?
Why wouldn’t you compare people’s times to a median rather than an average? That would eliminate the leveraging effect of extremely long times without discarding them entirely from your pool.
I would agree that the median is the best measure to use here, unless you have some theoretical mathematical model to fit to. The mean and standard deviation are really only useful measures for Gaussian distributions; the only reason they’re so commonly used is that a lot of things are approximately Gaussian. The median, though (and the interquartile range, which is the natural “width” parameter to use with the median) is applicable to all distributions.
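To make the leveraging effect concrete, here's a small sketch (the times are made up, not the poster's data) showing how a couple of extreme values drag the mean up while leaving the median untouched:

```python
# Illustrative sketch with hypothetical completion times in minutes;
# the last two values stand in for interrupted tasks that ran very long.
from statistics import mean, median

times = [4, 5, 5, 6, 6, 7, 7, 8, 45, 90]

print(mean(times))    # 18.3 -- dragged well above every "normal" time
print(median(times))  # 6.5  -- unaffected by how extreme the outliers are
```

Note that the median only cares that the outliers are on the high side, not how high; the two slow times could be 45 and 90 or 450 and 900 and the median would still be 6.5.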
I agree with zut that the median is probably what you want. However, if you really want a mean and you think the times are (more or less) exponentially distributed then a geometric mean (nth root of the product of the n numbers) is probably better to use than the arithmetic mean.
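A quick sketch of that suggestion, using the same hypothetical times as above: for right-skewed data the geometric mean sits much closer to the bulk of the values than the arithmetic mean does.

```python
# Geometric mean vs. arithmetic mean on right-skewed hypothetical times.
# The geometric mean is the nth root of the product of the n values,
# which damps the influence of a few very large numbers.
from statistics import geometric_mean, mean

times = [4, 5, 5, 6, 6, 7, 7, 8, 45, 90]

print(mean(times))            # 18.3
print(geometric_mean(times))  # about 9.5
```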
I’m willing to consider any changes. I inherited the logic as I described it above, and am now trying to revise it to more accurately handle the high outliers I’m seeing in the data.
The high numbers likely indicate interrupted completion of the task, and ideally I only want to be counting uninterrupted tasks, but the only data I get is the total time of completion. I do not have real-time visibility of the precise instants when the task is interrupted or resumed.
If you do a histogram of the data, does it look like it’s pretty evenly spread out, or are there multiple clumps? If it’s the latter, you can figure out where the dividing line between the clumps is (just eyeball it), and throw out everything above that. Otherwise, you should use the median.
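For the eyeballing step, even a crude text histogram is enough to reveal whether the data falls into clumps. A sketch, again on made-up times:

```python
# Crude text histogram: bucket hypothetical times into 10-minute bins.
# A visible gap between clusters suggests where to draw the cutoff line.
from collections import Counter

times = [4, 5, 5, 6, 6, 7, 7, 8, 45, 90]

bins = Counter((t // 10) * 10 for t in times)
for lo in sorted(bins):
    print(f"{lo:3d}-{lo + 9:<3d} {'#' * bins[lo]}")
```

Here the fast times clump in one bin and the two (presumably interrupted) times sit far above an obvious gap, so a cutoff anywhere in the gap would discard only the outliers.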
If it happens to be a Gaussian, yes. But a Lorentzian distribution, on the other hand, doesn’t even have a well-defined mean, and its standard deviation is infinite. Yet, like all distributions, its median and interquartile width are still perfectly well-behaved.
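You can see this numerically: the sample mean of Lorentzian (Cauchy) draws never settles down no matter how many samples you take, while the sample median lands right at the center. A quick sketch:

```python
# Draw standard Cauchy (Lorentzian) samples via the inverse CDF,
# tan(pi * (U - 1/2)) for uniform U, and check that the sample median
# sits near the true center (0) even though the mean is undefined.
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible
samples = [math.tan(math.pi * (random.random() - 0.5))
           for _ in range(100_000)]

samples.sort()
sample_median = samples[len(samples) // 2]
print(sample_median)  # close to 0, the center of the distribution
```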
Right, right, I’m just agreeing with you that the median and interquartile range are often more useful measures, even in the Gaussian case where they happen to reduce to the mean and (a constant times) the standard deviation. That is, even with a Gaussian distribution, often the only reason the mean is useful for some application is precisely because it happens to equal the median.