I’d like a proper and accepted way to kick out an outlier from a list of reviews. I have 4 reviewers, judging 5 pieces of art. I would like a statistical way to claim the number 5 (in orange) is an outlier (because reviewer A was wrong on art 1), but not have it so discriminating that the number 15 (in green) is kicked out.
This is a made-up example. Essentially, I need some justification to be able to kick out numbers that are clearly outliers… Do I use the standard deviation (kicking out anything 2 SD from the mean)? Or is there another, better statistical device?
Using standard deviations from a curve fit is a good way to eliminate fliers. That works well when the data points are derived from some process that could have errors. But your data points are opinions: an outlying score simply reflects the fact that that reviewer didn’t like a particular work of art, for whatever reason. Right?
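For what the curve-fit version looks like in practice, here is a rough Python sketch. Everything in it (the quadratic process, the injected bad point, the 2-SD cutoff) is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical measurements from a process that roughly follows a quadratic.
x = np.linspace(0, 10, 40)
y = 0.5 * x**2 - 2.0 * x + 3.0 + rng.normal(0, 1.0, size=x.size)
y[25] += 15.0                     # inject one bad measurement

coeffs = np.polyfit(x, y, deg=2)  # fit the curve
residuals = y - np.polyval(coeffs, x)

# Flag points whose residual sits more than 2 SDs from the mean residual.
flagged = np.abs(residuals - residuals.mean()) > 2 * residuals.std(ddof=1)
print("flagged indices:", np.where(flagged)[0])   # should pick up index 25
```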
What are you doing with the numbers? Something like a Borda count?
I think that unless you can identify a bias that gives you a reason to throw out data (the Russian judge’s scores), you may have to analyze the data with a nonparametric test.
But would you use 2σ or 3σ? And I’d be hesitant to throw out any data that is not clearly an error if it is for a professional paper, as outliers do occur in real life.
I’d like to not get hung up on whether the number is ‘correct’ or whether there is bias.
I’m just looking for an appropriate method to justify removing a number.
We can say it’s a number generator that is designed to spit out numbers within some range… and this one time there was a bug and the number should be thrown out.
Or think of it as clearly an error, but I want to justify throwing it out on statistical grounds rather than for any other reason or relation to anything else.
There’s no “right” answer, but I’d have established a priori rules to exclude data. Selecting X SDs or Y% of the extremes to exclude is valid. You’d just want to avoid selecting data after the fact because it looks wrong to you. Since you didn’t establish this ahead of time it’s iffy, but if this is “pilot” data and you’ll potentially be collecting more, it might give you a good starting point for deciding how extreme a value has to be before you exclude it.
Since this involves a known number of reviewers, you could also decide that if one reviewer disagrees with the rest beyond some threshold, you drop that reviewer from the average; but you wouldn’t want to reject a reviewer just because they’re systematically more nitpicky than the others.
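To make the “X SDs” rule concrete, here is a minimal Python sketch; the scores and the 2-SD cutoff are hypothetical, not anything from this thread. It also shows the small-sample catch: with only four scores, no point can sit more than 1.5 sample SDs from the mean, so a 2-SD rule can never fire until you pool more data.

```python
import numpy as np

def flag_beyond_k_sd(scores, k=2.0):
    """Flag scores more than k sample standard deviations from the mean."""
    scores = np.asarray(scores, dtype=float)
    return np.abs(scores - scores.mean()) > k * scores.std(ddof=1)

# Hypothetical scores for one piece from four reviewers (one looks "wrong").
four_scores = [5, 35, 34, 35]
print(flag_beyond_k_sd(four_scores))   # [False False False False]
# Nothing gets flagged: with n = 4, the largest possible deviation is
# (n - 1) / sqrt(n) = 1.5 sample SDs, so a 2-SD rule can never trigger.

# Pool more hypothetical scores and the rule has room to work.
many_scores = [5, 35, 34, 35, 33, 36, 32, 34, 35, 33]
print(flag_beyond_k_sd(many_scores))   # only the 5 is flagged
```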
Except if a number is correct, you shouldn’t remove it just because it is inconvenient to your analysis. If there is a bias, the question becomes whether to throw out that one score or all of the scores from that reviewer, based on what caused the bias.
The simplest answer to this I’ve seen is “throw out the highest and lowest scores”. This ensures that a single exceptionally uncharacteristic score is eliminated, and it also handles a judge who consistently scores too high or too low. (Isn’t this how some sports, like figure skating, calculate their scores?) It removes the possibility that a single judge skews the results in favour of, or against, one contestant. Note this method is more reliable with several more judges; say, 6 or 7.
Consider the results for each piece: first the average with all four scores, then the average with the top and bottom eliminated:
art1: 27.25, 32.5
art2: 25.25, 25
art3: 31, 30
art4: 19.25, 19
art5: 29, 31.5
Assuming my calculator skills are still good, this is the result. Note the only piece where the needle moves significantly is the one with the extreme outlier score. And by applying the same standard formula to everyone, you are not discriminating against any one judge or favouring any contestant.
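As a sketch of just the mechanics, here is the drop-the-high-and-low average in Python. The scores are hypothetical stand-ins (the original table isn’t reproduced here), so the printed numbers aren’t the ones above:

```python
import numpy as np

def high_low_dropped_mean(scores):
    """Drop the single highest and single lowest score, then average the rest."""
    s = sorted(scores)
    return float(np.mean(s[1:-1]))

def dropped_judges(scores_by_judge):
    """Return which judges' scores (lowest, highest) get tossed for one piece."""
    ranked = sorted(scores_by_judge.items(), key=lambda kv: kv[1])
    return ranked[0][0], ranked[-1][0]

# Hypothetical example: judges A-D score one piece, and A's score is way off.
piece = {"A": 5, "B": 33, "C": 31, "D": 34}

print("mean of all four:       ", np.mean(list(piece.values())))          # 25.75
print("mean with high/low gone:", high_low_dropped_mean(piece.values()))  # 32.0
print("judges whose scores were tossed:", dropped_judges(piece))          # ('A', 'D')
```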
Again, I don’t want to get hung up on whether or not the number is correct. Let’s think of it as an exercise in me learning how to throw out a particular score… just the mechanics of it.
The other point with this method is that it restrains the “Russian judges”. There’s no point in giving a contestant a score too far out of line with what it “should” be, if all you accomplish is providing the score that gets tossed.
Software might also have tools for this; for example, Excel has =TRIMMEAN(), which computes a mean after cutting off the most extreme X% of the data points in either direction. It’s not based on the SD but on the number of data points, so it would work better with larger data sets than you’re potentially working with.
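The same idea is available outside Excel too; for instance, SciPy’s trim_mean. One caveat if you compare them: SciPy’s proportiontocut is the fraction removed from each tail, while Excel’s TRIMMEAN percent is the total fraction removed, split across both tails. A hypothetical example:

```python
from scipy import stats

# Hypothetical pooled scores with one low flyer.
scores = [5, 35, 34, 35, 33, 36, 32, 34, 35, 33]

# Cut the most extreme 10% from each end before averaging; on 10 points that
# drops the single lowest and single highest score (roughly what Excel's
# =TRIMMEAN(range, 0.2) does on the same data).
print(stats.trim_mean(scores, proportiontocut=0.10))
```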
Scorers eliminated under the hi/lo method, piece by piece:
art1: A, C
art2: A, D
art3: B, C
art4: B, C
art5: C, D
So no particular judge is singled out either.
Of course, when this scoring method is used in a competition, the process is explicitly stated ahead of time. Picking the scoring method to “adjust” the results after the fact seems unfair.
Definitely true for something like an art contest as in the original example.
But in other contexts, such as getting the best measurement in the face of flaky data, post-hoc reasoning might be the only tolerable way to do it. You still must disclose your process, of course.
This is true as a general principle, although if you can demonstrate that the data point is either statistically way outside of an expected distribution, or likely due to a measurement or instrumentation error, you can make a valid argument for excluding it. For instance, in signal measurement for testing on shaker tables you’ll often see strong peaks at 60 Hz intervals (in North America), which is a very clear indication of ‘line noise’ due to improper grounding. But this does mean having a physically sensible (and ideally, repeatable) source of data contamination, and if you are just sampling data and have a ‘flyer’ that you don’t like, you can’t just throw it out because it is inconvenient.
In the case of the o.p.’s example of art reviewers, there is no expectation that their valuations will be normally distributed (random error or variation distributed about a central mean), and in fact you would expect bias both in how each reviewer scales their scores and in how each reviewer interprets and rates unique works of art. I would expect a very multi-modal distribution for this case along both axes, which actually tells you more about the reviewers than about the statistics of the artworks. Not everything is amenable to statistical analysis, and certainly not to frequentist statistics employing Fisherian reduction; this is a case where a Bayesian approach to predicting how particular reviewers will rate each type of artwork, or how an artwork will be generally rated, is the obviously superior approach, and is also why Bayesian analysis dominates real-world data science.
The OP gave a made-up example. What’s the sample size that we should consider? How close is the made-up example to the real one?
You could, for example, throw out the highest and lowest scores of every judge. Or, better yet, the highest and lowest score for every contestant. That would be an unbiased measure of the underlying score.
Either way, if the actual example involves judges, I’d look into the trimmed mean. That’s if we’re discussing one variable (e.g., the score in a contest).
If this is a regression analysis with many variables and you are concerned about unusually large errors, look into robust regression or (better, IMHO) median regression.
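As a hedged sketch of what median regression looks like in practice, here is statsmodels’ QuantReg at the 0.5 quantile on simulated data with a few deliberately large errors; none of this is from the thread, it’s just to show the mechanics:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: y = 2x + 1 plus noise, with some big errors at one end.
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)
y[x > 9] += 50                     # unusually large errors

X = sm.add_constant(x)             # design matrix: intercept + slope

ols = sm.OLS(y, X).fit()                     # ordinary least squares
median_fit = sm.QuantReg(y, X).fit(q=0.5)    # median (0.5-quantile) regression
# Robust regression alternative: sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope:   ", round(ols.params[1], 2))        # typically dragged up by the big errors
print("median slope:", round(median_fit.params[1], 2)) # typically stays near the true 2.0
```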
First, there are statistical methods for addressing outliers, the simplest I know of being Winsorizing (clipping the extremes so they match the next most extreme values). But as many have pointed out, that’s dangerous when you don’t know whether the outlier is actually representative or not. (On the other hand, if I understood the original post, I get the sense you’re more interested in a defensible approach than in complete validity.)
Second, if you’re looking for another statistically defensible approach, you could just use the median instead of the mean. It works better with larger samples in my opinion, but it’s an option.
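A minimal sketch of both ideas in Python, on hypothetical scores (note that SciPy’s winsorize clips a chosen fraction at each end rather than a fixed count):

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical scores with one low flyer.
scores = np.array([5, 35, 34, 35, 33, 36, 32, 34, 35, 33], dtype=float)

# Winsorize: clip the lowest and highest 10% to the nearest remaining values
# (here the 5 becomes a 32 and the 36 becomes a 35).
winsorized = mstats.winsorize(scores, limits=[0.10, 0.10])

print("plain mean:     ", scores.mean())        # dragged down by the 5
print("winsorized mean:", winsorized.mean())
print("median:         ", np.median(scores))    # unaffected by how extreme the 5 is
```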