I’d like a proper and accepted way to kick out an outlier from a list of reviews. I have 4 reviewers, judging 5 pieces of art. I would like a statistical way to claim the number 5 (in orange) is an outlier (because reviewer A was wrong on art 1), but not have it so discriminating that the number 15 (in green) is kicked out.
This is a made-up example. Essentially, I need some justification to be able to kick out numbers that are clearly outliers… Do I use the standard deviation (kicking out anything 2 SD from the mean)? Or is there another, better statistical device?
Using standard deviations from a curve fit is a good way to eliminate fliers. That works well when the data points are derived from some process that could have errors. But your data points are opinions: an outlying score simply reflects the fact that that reviewer didn’t like a particular work of art, for whatever reason. Right?
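For what the curve-fit version looks like in practice, here is a rough Python sketch. Everything in it (the quadratic process, the injected bad point, the 2-SD cutoff) is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical measurements from a process that roughly follows a quadratic.
x = np.linspace(0, 10, 40)
y = 0.5 * x**2 - 2.0 * x + 3.0 + rng.normal(0, 1.0, size=x.size)
y[25] += 15.0                     # inject one bad measurement

coeffs = np.polyfit(x, y, deg=2)  # fit the curve
residuals = y - np.polyval(coeffs, x)

# Flag points whose residual sits more than 2 SDs from the mean residual.
flagged = np.abs(residuals - residuals.mean()) > 2 * residuals.std(ddof=1)
print("flagged indices:", np.where(flagged)[0])   # should pick up index 25
```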
What are you doing with the numbers? Something like a Borda count?
I think that unless you can identify a bias that gives you a reason to throw out data (the Russian judge’s scores), you may have to analyze the data with a nonparametric test.
But would you use 2σ or 3σ? And I’d be hesitant to throw out any data that is not clearly an error if it is for a professional paper, as outliers do occur in real life.
I’d like to not get hung up on whether the number is ‘correct’ or whether there is bias.
I’m just looking for an appropriate method to justify removing a number.
We can say it’s a number generator that is designed to spit out numbers within some range… and this one time there was a bug and the number should be thrown out.
Or think of it as clearly an error, but I want to justify throwing it out on statistical grounds rather than for any other reason or relation to anything else.
There’s no “right” answer, but I’d have established a priori rules to exclude data. Selecting X SDs or Y% of the extremes to exclude is valid. You’d just want to avoid selecting data after the fact because it looks wrong to you. Since you didn’t establish this ahead of time it’s iffy, but if this is “pilot” data and you’ll potentially be collecting more, it might give you a good starting point for deciding how extreme a value has to be before you exclude it.
Since this involves a known number of reviewers, you could also decide that if one reviewer disagrees with the rest beyond some threshold, you drop that reviewer from the average; but you wouldn’t want to reject a reviewer just because they’re systematically more nitpicky than the others.
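To make the “X SDs” rule concrete, here is a minimal Python sketch; the scores and the 2-SD cutoff are hypothetical, not anything from this thread. It also shows the small-sample catch: with only four scores, no point can sit more than 1.5 sample SDs from the mean, so a 2-SD rule can never fire until you pool more data.

```python
import numpy as np

def flag_beyond_k_sd(scores, k=2.0):
    """Flag scores more than k sample standard deviations from the mean."""
    scores = np.asarray(scores, dtype=float)
    return np.abs(scores - scores.mean()) > k * scores.std(ddof=1)

# Hypothetical scores for one piece from four reviewers (one looks "wrong").
four_scores = [5, 35, 34, 35]
print(flag_beyond_k_sd(four_scores))   # [False False False False]
# Nothing gets flagged: with n = 4, the largest possible deviation is
# (n - 1) / sqrt(n) = 1.5 sample SDs, so a 2-SD rule can never trigger.

# Pool more hypothetical scores and the rule has room to work.
many_scores = [5, 35, 34, 35, 33, 36, 32, 34, 35, 33]
print(flag_beyond_k_sd(many_scores))   # only the 5 is flagged
```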
Except if a number is correct, you shouldn’t remove it just because it is inconvenient to your analysis. If there is a bias, the question becomes whether to throw out that one score or all of the scores from that reviewer, based on what caused the bias.
The simplest answer to this I’ve seen is “throw out the highest and lowest scores”. This ensures that a single exceptionally uncharacteristic score is eliminated, and it also handles a judge who consistently scores too high or too low. (Isn’t this how some sports, like figure skating, calculate their scores?) It removes the possibility that a single judge skews the results in favour of, or against, one contestant. Note this method is more reliable with several more judges; say, 6 or 7.
Consider the results for each piece: first the average with all four scores, then the average with the top and bottom eliminated:
art1: 27.25, 32.5
art2: 25.25, 25
art3: 31, 30
art4: 19.25, 19
art5: 29, 31.5
Assuming my calculator skills are still good, this is the result. Note the only piece where the needle moves significantly is the one with the extreme outlier score. And by applying the same standard formula to everyone, you are not discriminating against any one judge or favouring any contestant.
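As a sketch of just the mechanics, here is the drop-the-high-and-low average in Python. The scores are hypothetical stand-ins (the original table isn’t reproduced here), so the printed numbers aren’t the ones above:

```python
import numpy as np

def high_low_dropped_mean(scores):
    """Drop the single highest and single lowest score, then average the rest."""
    s = sorted(scores)
    return float(np.mean(s[1:-1]))

def dropped_judges(scores_by_judge):
    """Return which judges' scores (lowest, highest) get tossed for one piece."""
    ranked = sorted(scores_by_judge.items(), key=lambda kv: kv[1])
    return ranked[0][0], ranked[-1][0]

# Hypothetical example: judges A-D score one piece, and A's score is way off.
piece = {"A": 5, "B": 33, "C": 31, "D": 34}

print("mean of all four:       ", np.mean(list(piece.values())))          # 25.75
print("mean with high/low gone:", high_low_dropped_mean(piece.values()))  # 32.0
print("judges whose scores were tossed:", dropped_judges(piece))          # ('A', 'D')
```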
Again, I don’t want to get hung up on whether or not the number is correct. Let’s think of it as an exercise in me learning how to throw out a particular score… just the mechanics of it.
The other point with this method is that it restrains the “Russian judges”. There’s no point in giving a contestant a score too far out of line with what it “should” be, if all you accomplish is providing the score that gets tossed.
Software might also have tools for this; for example, Excel has =TRIMMEAN(), which computes a mean after cutting off the most extreme X% of the data points in either direction. It’s not based on the SD but on the number of data points, so it would work better with larger data sets than you’re potentially working with.
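The same idea is available outside Excel too; for instance, SciPy’s trim_mean. One caveat if you compare them: SciPy’s proportiontocut is the fraction removed from each tail, while Excel’s TRIMMEAN percent is the total fraction removed, split across both tails. A hypothetical example:

```python
from scipy import stats

# Hypothetical pooled scores with one low flyer.
scores = [5, 35, 34, 35, 33, 36, 32, 34, 35, 33]

# Cut the most extreme 10% from each end before averaging; on 10 points that
# drops the single lowest and single highest score (roughly what Excel's
# =TRIMMEAN(range, 0.2) does on the same data).
print(stats.trim_mean(scores, proportiontocut=0.10))
```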
Scorers eliminated under the hi/lo method, piece by piece:
art1: A, C
art2: A, D
art3: B, C
art4: B, C
art5: C, D
So no particular judge is singled out either.
Of course, when this scoring method is used in a competition, the process is explicitly stated ahead of time. Picking the scoring method to “adjust” the results after the fact seems unfair.
Definitely true for something like an art contest as in the original example.
But in other contexts, such as getting the best measurement in the face of flaky data, post-hoc reasoning might be the only tolerable way to do it. You still must disclose your process, of course.
This is true as a general principle, although if you can demonstrate that the data point is either statistically way outside of an expected distribution, or likely due to a measurement or instrumentation error, you can make a valid argument for excluding it. For instance, in signal measurement for testing on shaker tables you’ll often see strong peaks at 60 Hz intervals (in North America), which is a very clear indication of ‘line noise’ due to improper grounding. But this does mean having a physically sensible (and ideally, repeatable) source of data contamination, and if you are just sampling data and have a ‘flyer’ that you don’t like, you can’t just throw it out because it is inconvenient.
In the case of the o.p.’s example of art reviewers, there is no expectation that their valuations will be normally distributed (random error or variation distributed about a central mean), and in fact you would expect bias both in how each reviewer scales their scores and in how each reviewer interprets and rates unique works of art. I would expect a very multi-modal distribution for this case along both axes, which actually tells you more about the reviewers than about the statistics of the artworks. Not everything is amenable to statistical analysis, and certainly not to frequentist statistics employing Fisherian reduction; this is a case where a Bayesian approach to predicting how particular reviewers will rate each type of artwork, or how an artwork will be generally rated, is the obviously superior approach, and is also why Bayesian analysis dominates real-world data science.
The OP gave a made-up example. What’s the sample size that we should consider? How close is the made-up example to the real one?
You could, for example, throw out the highest and lowest scores of every judge. Or, better yet, the highest and lowest score for every contestant. That would be an unbiased measure of the underlying score.
Either way, if the actual example involves judges, I’d look into the trimmed mean. That’s if we’re discussing one variable (e.g., the score in a contest).
If this is a regression analysis with many variables and you are concerned about unusually large errors, look into robust regression or (better, IMHO) median regression.
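As a hedged sketch of what median regression looks like in practice, here is statsmodels’ QuantReg at the 0.5 quantile on simulated data with a few deliberately large errors; none of this is from the thread, it’s just to show the mechanics:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: y = 2x + 1 plus noise, with some big errors at one end.
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)
y[x > 9] += 50                     # unusually large errors

X = sm.add_constant(x)             # design matrix: intercept + slope

ols = sm.OLS(y, X).fit()                     # ordinary least squares
median_fit = sm.QuantReg(y, X).fit(q=0.5)    # median (0.5-quantile) regression
# Robust regression alternative: sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope:   ", round(ols.params[1], 2))        # typically dragged up by the big errors
print("median slope:", round(median_fit.params[1], 2)) # typically stays near the true 2.0
```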
First, there are statistical methods for addressing outliers, the simplest I know of being Winsorizing (clipping the extremes so they match the next most extreme values). But as many have pointed out, that’s dangerous when you don’t know whether the outlier is actually representative or not. (On the other hand, if I understood the original post, I get the sense you’re more interested in a defensible approach than in complete validity.)
Second, if you’re looking for another statistically defensible approach, you could just use the median instead of the mean. It works better with larger samples in my opinion, but it’s an option.
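A minimal sketch of both ideas in Python, on hypothetical scores (note that SciPy’s winsorize clips a chosen fraction at each end rather than a fixed count):

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical scores with one low flyer.
scores = np.array([5, 35, 34, 35, 33, 36, 32, 34, 35, 33], dtype=float)

# Winsorize: clip the lowest and highest 10% to the nearest remaining values
# (here the 5 becomes a 32 and the 36 becomes a 35).
winsorized = mstats.winsorize(scores, limits=[0.10, 0.10])

print("plain mean:     ", scores.mean())        # dragged down by the 5
print("winsorized mean:", winsorized.mean())
print("median:         ", np.median(scores))    # unaffected by how extreme the 5 is
```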