I haven’t been in a statistics class in almost 15 years…
Hypothetical scenario:
I want to provide an incentive for the fast performance of a task and discourage slow performance.
As individuals complete the task, I have been maintaining a simple running average of the completion times and using that average to determine where a given individual falls on the incentive/disincentive spectrum.
However, it quickly became apparent that people who take an extremely long time to perform the task pull the average up, making it progressively easier for slower and slower times to qualify for the incentive.
So I want to use a different formula, one that eliminates only the high outliers from the calculation used to determine who is eligible for the incentive or disincentive, while keeping all of the fast times in the calculation.
Should I be using standard deviation? Interquartile range? Is what I’m looking for considered a median, or is another term more applicable?
Why wouldn’t you compare people’s times to a median rather than an average? That would eliminate the leveraging effect of extremely long times without discarding them entirely from your pool.
I would agree that the median is the best measure to use here, unless you have some theoretical mathematical model to fit to. The mean and standard deviation are really only useful measures for Gaussian distributions; the only reason they’re so commonly used is that a lot of things are approximately Gaussian. The median, though (and the interquartile range, which is the natural “width” parameter to use with the median) is applicable to all distributions.
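To make the leveraging effect concrete, here's a small sketch (the times are made up, not the poster's data) showing how a couple of extreme values drag the mean up while leaving the median untouched:

```python
# Illustrative sketch with hypothetical completion times in minutes;
# the last two values stand in for interrupted tasks that ran very long.
from statistics import mean, median

times = [4, 5, 5, 6, 6, 7, 7, 8, 45, 90]

print(mean(times))    # 18.3 -- dragged well above every "normal" time
print(median(times))  # 6.5  -- unaffected by how extreme the outliers are
```

Note that the median only cares that the outliers are on the high side, not how high; the two slow times could be 45 and 90 or 450 and 900 and the median would still be 6.5.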
I agree with zut that the median is probably what you want. However, if you really want a mean and you think the times are (more or less) exponentially distributed then a geometric mean (nth root of the product of the n numbers) is probably better to use than the arithmetic mean.
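A quick sketch of that suggestion, using the same hypothetical times as above: for right-skewed data the geometric mean sits much closer to the bulk of the values than the arithmetic mean does.

```python
# Geometric mean vs. arithmetic mean on right-skewed hypothetical times.
# The geometric mean is the nth root of the product of the n values,
# which damps the influence of a few very large numbers.
from statistics import geometric_mean, mean

times = [4, 5, 5, 6, 6, 7, 7, 8, 45, 90]

print(mean(times))            # 18.3
print(geometric_mean(times))  # about 9.5
```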
I’m willing to consider any changes. I inherited the logic as I described it above, and am now trying to revise it to more accurately handle the high outliers I’m seeing in the data.
The high numbers likely indicate interrupted completion of the task, and ideally I only want to be counting uninterrupted tasks, but the only data I get is the total time of completion. I do not have real-time visibility of the precise instants when the task is interrupted or resumed.
If you do a histogram of the data, does it look like it’s pretty evenly spread out, or are there multiple clumps? If it’s the latter, you can figure out where the dividing line between the clumps is (just eyeball it), and throw out everything above that. Otherwise, you should use the median.
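For the eyeballing step, even a crude text histogram is enough to reveal whether the data falls into clumps. A sketch, again on made-up times:

```python
# Crude text histogram: bucket hypothetical times into 10-minute bins.
# A visible gap between clusters suggests where to draw the cutoff line.
from collections import Counter

times = [4, 5, 5, 6, 6, 7, 7, 8, 45, 90]

bins = Counter((t // 10) * 10 for t in times)
for lo in sorted(bins):
    print(f"{lo:3d}-{lo + 9:<3d} {'#' * bins[lo]}")
```

Here the fast times clump in one bin and the two (presumably interrupted) times sit far above an obvious gap, so a cutoff anywhere in the gap would discard only the outliers.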
If it happens to be a Gaussian, yes. But a Lorentzian distribution, on the other hand, doesn’t even have a well-defined mean, and its standard deviation is infinite. Yet, like all distributions, its median and interquartile width are still perfectly well-behaved.
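You can see this numerically: the sample mean of Lorentzian (Cauchy) draws never settles down no matter how many samples you take, while the sample median lands right at the center. A quick sketch:

```python
# Draw standard Cauchy (Lorentzian) samples via the inverse CDF,
# tan(pi * (U - 1/2)) for uniform U, and check that the sample median
# sits near the true center (0) even though the mean is undefined.
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible
samples = [math.tan(math.pi * (random.random() - 0.5))
           for _ in range(100_000)]

samples.sort()
sample_median = samples[len(samples) // 2]
print(sample_median)  # close to 0, the center of the distribution
```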
Right, right, I’m just agreeing with you that the median and interquartile range are often more useful measures, even in the Gaussian case where they happen to reduce to the mean and (a constant times) the standard deviation. That is, even with a Gaussian distribution, often the only reason the mean is useful for some application is precisely because it happens to equal the median.