I’m not a statistician. When I was an undergrad, I worked in a neurophysiology lab recording action potentials from dark-adapted/trained Hermissenda crassicornis type B photoreceptors. There were three of us doing this in the lab, and the PhD we worked for put the data together, analyzed it, and published papers about our findings. He was blinded as to treatment group, etc.
One day the boss came to me about the data he was working with. He showed me a graph with all our data following a nice, straight line except for three points that were way off. It was really bothering him.
I hemmed and hawed and finally told him I knew what was going on. The three points that were off were the three done by the woman in the lab whose technique was sloppy. I had no axe to grind, but it bothered me seeing her not doing what she was supposed to be doing.
He called her into his office and explained that his funding had been cut and he could no longer afford to pay her. She left and now that she was no longer a part of the lab he asked me to remove any data she’d supplied. Voilà!
Removing the lowest or highest scores of each judge could mean one entry gets no score?
First, the main reason not to make post-hoc changes in the example of a competition is that changing the scoring method is moving the goalposts. The rules, including the scoring method, must be laid out ahead of the competition and applied exactly as promised. Otherwise, the implication is that, for whatever reason, the results are not as promised. If everyone retains their standing, then why bother? If it changes the standings, then someone is hurt by losing a standing they would have had.
It’s perfectly OK in something like a biology or physics experiment to throw out data that may have come from an error (or some imprecision). However, that should be done with caution. It may be valid data.
For an art competition, it’s OK to verify with the scorer - “did you make a typo?” But if the judge insists that is a valid score, then it suggests there is a problem with either the judge or the artwork - but the score is valid, because you asked a presumably qualified judge to make a subjective judgement. Unless you set the scoring method ahead of time to drop the high and low scores, or some such rule, the score stands.
(Maybe they don’t like abstract art - “My 5-year-old could do better…”)
It’s an obviously SUBJECTIVE assessment. Just like Amazon/eBay reviews. Without knowing the reviewers personally, no one can evaluate the quality of their “ratings”. Likewise, the selection criteria for those reviewers are unknown (how many were cronies?).
So, just leave the folks reviewing the reviews to come to their own decisions.
So, you are looking for a way to eliminate the effects of bias (or error) without having to identify the bias. There is just no rigorous way to do this other than defining the expectation of the result and then rejecting anything that falls outside of that expectation…which, of course, invalidates the entire point of performing a statistical analysis.
There is another obvious bias in reviews on Amazon.com and other online stores: the people most likely to post a review are either those who were really impressed by the product (or had an ulterior motive in posting a positive review) or those who feel slighted by it (often over some aspect of the delivery that has nothing to do with the innate quality of the product). People who are reasonably satisfied are far less likely to spend time posting a review about a product that is just okay, so you end up with more reviews at the extremes than near the middle. Thus the raw number of stars, even weighted by the total number of reviews, can be highly misleading when there are a bunch of both highly positive and very critical reviews. In essence, unless you actually read through the text of the reviews and look for patterns of criticism or praise, the ratings are useless.
For this sort of data that would be my advice as well. Reviewers can be all over the map in how they assign a numerical value to their preference. Some people might look at a high-end art exhibition and think all of it is better than the average art they see for sale, and so have a distribution centered around 7 or 8 with almost no values of 5 or less. Others might want to rate the art relative to other high-end art and put it on a bell curve centered at 5, giving 9s or 10s only to exceptional pieces. Still others might decide the distribution should be flat, so that 10% of the art gets a 1 and 10% gets a 10.
If you just average all of those scores you will underweight or overweight certain reviewers based on how they interpret the scoring system. So to put everything on an even field it makes sense to force the distributions of the reviewers to be the same, either by using a rank statistic, or maybe quantile normalization, if you have large amounts of data and want to be particularly fancy.
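A minimal sketch of both ideas in Python, with a made-up reviewers-by-artworks score matrix (the numbers and the 10-point scale are assumptions for illustration only):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical scores: rows are reviewers, columns are artworks (10-point scale).
scores = np.array([
    [7.0, 8.0, 9.0, 8.5],   # reviewer who scores everything high
    [3.0, 5.0, 7.0, 6.0],   # reviewer centred around 5
    [1.0, 4.0, 10.0, 7.0],  # reviewer using the full range
])

# Rank statistic: replace each reviewer's scores with ranks, then average
# the ranks per artwork, so reviewer-specific scales drop out.
ranks = np.apply_along_axis(rankdata, 1, scores)
print(ranks.mean(axis=0))

# Quantile normalization: force every reviewer onto the same empirical
# distribution (the average of the sorted scores across reviewers).
reference = np.sort(scores, axis=1).mean(axis=0)
positions = np.argsort(np.argsort(scores, axis=1), axis=1)
print(reference[positions].mean(axis=0))
```

With only a handful of judges the rank version is the defensible one; quantile normalization really needs a lot of scores per reviewer, as noted above.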
More generally, I would stand somewhat against those who say that all determination of how to handle outliers must be defined a priori, and that otherwise you have to take all data equally. In some very rigorous situations, decisions will be made solely on the result of a statistical test and issues of potential bias will lead to controversy - for example, determining the winner of a contest or evaluating the results of a clinical trial. But even in those cases it might make sense to throw out certain observations if there was a good reason to. Say we discovered that the artist who painted Art1 had slept with reviewer A’s wife.
But in many cases where the stakes aren’t so high, it is far more important that the reported results accurately reflect the data than that a set of specific rules was followed. You often won’t know a priori what problems the data might have, and if you blindly crunch the statistics without taking the time to look at your data you may end up making erroneous claims. If the inclusion or exclusion of one individual sample out of many radically changes your conclusion, then you are probably better off eliminating that sample, or at least altering your methodology so that that sample is not influential. As an example, it would be incorrect to conclude that, contrary to an otherwise negative trend, the homicide rate in New York City tripled in 2001. Of course you do need to report which outliers were removed, offer an explanation for why you removed them, and ideally figure out what made them anomalous.
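To make the “one influential sample” point concrete, here is a toy Python sketch with made-up yearly counts (not the actual New York figures): a single anomalous year flips the sign of a fitted trend.

```python
import numpy as np

# Made-up yearly counts with a steady downward trend, plus one extreme year.
years = np.arange(1995, 2005)
counts = np.array([1200, 1150, 1100, 1020, 980, 940, 900, 2800, 860, 830])
outlier = 7  # index of the anomalous year

def slope(x, y):
    """Slope of an ordinary least-squares line fit."""
    return np.polyfit(x, y, 1)[0]

print("slope with the outlier:   ", slope(years, counts))               # positive
print("slope without the outlier:", slope(np.delete(years, outlier),
                                           np.delete(counts, outlier)))  # negative
```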
The engineers’ concerns were based on suspicions and incomplete analysis. They only looked at low-temperature O-ring data, which included 4 failures. By ignoring data about flights at higher temperatures, the calculated probability of failure was lower than it should have been. This analysis was not only incorrect but dangerous, and it contributed to the disaster. Data from all recorded flights should have been taken into account. There had been O-ring failures at both high and low temperatures, but the strength of the association between temperature and failure had not been measured. The engineers could not simply ignore the flights at higher temperatures. An in-depth statistical analysis of the performance of the O-rings at various temperatures is required to truly understand the failure of Challenger. Intuitions and biased analyses, especially by NASA management, are not sufficient to determine the probability of the O-rings failing.
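For what a “use all the flights” analysis looks like, here is a hedged Python sketch using a statsmodels logistic regression. The temperatures and distress flags below are made-up stand-ins, not the real Challenger dataset; the point is only that you regress incident/no-incident on temperature across every launch, not just the ones that had problems.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative per-launch data (NOT the real dataset): joint temperature in
# deg F and whether any O-ring distress was seen on that flight.
temp = np.array([53., 57., 63., 66., 67., 68., 70., 70., 72., 75., 76., 79.])
distress = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# Logistic regression of distress on temperature, using ALL flights.
X = sm.add_constant(temp)
fit = sm.Logit(distress, X).fit(disp=False)

# Predicted probability of distress at a cold forecast temperature.
print(fit.predict(np.array([[1.0, 31.0]])))
```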
I seem to disagree with the consensus here. There’s a random variable with an unknown distribution. The researcher wants a measure of central tendency that is resistant to outliers. Nothing wrong with that. Indeed the outlier-resistant median is the 2nd most popular measure of central tendency, with the first being the outlier-sensitive mean.
There are plenty of situations where you don’t have a handle on the underlying distribution. Robustness is the keyword in those cases. I’d try a trimmed mean, or winsorization if you want to be fancier. There is a wider range of methods discussed in the wiki article. There are also book-length treatments.
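A quick Python sketch of those options, on a made-up set of judge scores with one suspiciously low value:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

# Hypothetical scores for one entry from five judges.
scores = np.array([7.5, 8.0, 8.2, 7.8, 2.0])

print(np.mean(scores))                # plain mean, dragged down by the 2.0
print(np.median(scores))              # median, ignores the extreme entirely
print(stats.trim_mean(scores, 0.2))   # trimmed mean: drop top and bottom 20%, then average
print(winsorize(scores, limits=[0.2, 0.2]).mean())  # winsorized: clamp extremes, then average
```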
I understand lowess is generally fit with (locally weighted) least squares, and a squared-error loss over-weights outliers. So I’d try a different approach.
Q for OP: Will there be 4 judges in the OP’s actual problem? Or more? If there are exactly 4, you can apply the median (which, as I noted upthread, is then equivalent to the trimmed mean and to winsorizing). If there are 5 or more, then I’d personally prefer the trimmed mean for simplicity. But if simplicity doesn’t matter, winsorize away.
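A quick check of that four-judge equivalence with made-up numbers:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

# With exactly four scores, the median, the 25%-trimmed mean, and the
# winsorized mean all reduce to the average of the two middle values.
four = np.array([3.0, 6.0, 7.0, 8.0])

print(np.median(four))                              # 6.5
print(stats.trim_mean(four, 0.25))                  # 6.5
print(winsorize(four, limits=[0.25, 0.25]).mean())  # 6.5
```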
We are looking for some correct score for an artistic endeavour. We want an estimate of that score. We expect differences of opinion, and differences from the desired score. Are those differences of opinion random? Do they have a definable distribution? The problem seems to be that the differences are not very random and are highly correlated with the style of artistic endeavour being judged and the individual judges’ biases.
The task becomes one of moderating opinions. With so few samples, there isn’t ever going to be a robust statistical mechanism. This is as much a social problem as it is a statistical one.
So you adopt social mechanisms. Such as telling the judges that the highest and lowest scores will be culled. Whether this makes them more or less likely to give out biased high or low marks is another question. It does tell them such behaviour is likely fruitless.
Another tactic is to tell them that scores more than a given gap from the median will be discarded. That means a highly opinionated judge who wishes to have an impact on the score can only do so by first guessing a reasonable score - and, without knowledge of his fellow judges, that guess is his best estimate of the correct score. He can then provide a score at the low or high end of the guessed acceptable range, so it will still be counted. If he guesses right he will pull the score in the direction he wants, but if he doesn’t, his opinion won’t count. Indeed, he knows that if he just provides an outlier he will most likely cause the final score to move in the opposite direction to his opinion.
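A minimal sketch of that rule in Python; the max_gap threshold and the scores are made up, and you would of course announce the rule to the judges in advance:

```python
import numpy as np

def moderated_score(scores, max_gap=2.0):
    """Drop any score further than max_gap from the median of all the
    scores for this entry, then average whatever remains."""
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores)
    kept = scores[np.abs(scores - median) <= max_gap]
    return kept.mean()

# Hypothetical example: one judge tries to tank an entry with a 1.
print(moderated_score([7.0, 8.0, 7.5, 1.0]))  # the 1.0 is discarded -> 7.5
```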
Essentially you start from the beginning assuming biased or unreasonable scores. Then you just provide an even-handed, if coarse, rule to attempt to ameliorate the problem - the problem being the judges’ behaviour.
Of course there is also information to be gleaned by not ignoring outliers for reviews.
In particular, it is sometimes of interest that a particular work’s scores are far from normally distributed - works that quite a few people love and quite a few people hate, but that fairly few are neutral about.
There is probably a PhD or two to be found mining the distributions and various correlations. There probably have been already. It rather underlines the OP’s problem.
Maybe there should be a prize for the most polarizing entry.
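If someone did want that prize, a crude Python sketch: rank entries by the spread of their ratings (made-up numbers; a proper bimodality measure would be a finer-grained choice than a plain standard deviation):

```python
import numpy as np

# Hypothetical 1-10 ratings for three entries from the same pool of reviewers.
ratings = {
    "Entry1": [5, 6, 5, 6, 5, 6, 5, 6],    # everyone is lukewarm
    "Entry2": [1, 10, 2, 9, 1, 10, 2, 9],  # love-it-or-hate-it
    "Entry3": [7, 8, 7, 9, 8, 7, 8, 8],    # broadly liked
}

# Similar means, very different spreads: the most "polarizing" entry is
# the one with the largest standard deviation.
for name, r in sorted(ratings.items(), key=lambda kv: np.std(kv[1]), reverse=True):
    print(f"{name}: mean={np.mean(r):.2f}, std={np.std(r):.2f}")
```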
Consider 100 reviewers of a movie at a site like Rotten Tomatoes, each awarding 1-5 stars. There will be a measure of central tendency. Usually that’s the mean: the total stars divided by the number of reviewers. Sometimes the median is reported, which is the number of stars where half the reviewers awarded more and half awarded fewer. Finally there is the mode, the star count given most often. We discussed more obscure measures of central tendency upthread.
There are also measures of dispersion: how much do the reviewers disagree with one another? Variance and its square root (the standard deviation) are the most popular, but I think mean absolute deviation and median absolute deviation are more intuitive and deserve more attention.
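A small Python sketch of all of those on simulated 1-5 star ratings (the random ratings are just a stand-in for 100 real reviewers):

```python
import numpy as np
from scipy import stats

# Simulated 1-5 star ratings from 100 reviewers.
rng = np.random.default_rng(0)
stars = rng.integers(1, 6, size=100)

# Central tendency
print("mean:  ", stars.mean())
print("median:", np.median(stars))
print("mode:  ", np.bincount(stars).argmax())

# Dispersion
print("variance:            ", stars.var())
print("standard deviation:  ", stars.std())
print("mean abs deviation:  ", np.mean(np.abs(stars - stars.mean())))
print("median abs deviation:", stats.median_abs_deviation(stars))
```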
I am skipping skew.
The last is the thickness of the tails or the susceptibility to outliers, given the standard deviation. The technical term for that is kurtosis. The Normal distribution tends to have thinner tails than are commonly observed in financial markets. Which is why a number of models went up in flames during financial crises over the past few decades. When you hear claims that derivative prices displayed a one in ten thousand year pattern, that’s because the financial institution naively applied a normal distribution, rather than one based upon longer run data. IBG, YBG: “I’ll be gone, you’ll be gone.”
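A quick Python illustration of the fat-tail point, comparing a normal sample with a Student’s t sample (df=3), a common stand-in for fat-tailed returns; the choice of distribution and sample size here is just for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
fat_tailed = rng.standard_t(df=3, size=100_000)

# Excess kurtosis of a normal is ~0; the t(3) sample is far above it.
print("normal excess kurtosis:", stats.kurtosis(normal_sample))
print("t(3) excess kurtosis:  ", stats.kurtosis(fat_tailed))

# How often does each sample land beyond 4 of its own standard deviations?
for name, x in [("normal", normal_sample), ("t(3)", fat_tailed)]:
    print(name, "beyond 4 sigma:", np.mean(np.abs(x) > 4 * x.std()))
```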
Which models have assumed a Gaussian? Black-Scholes has been around since the 70s, and that follows lognormality; and LTCM assumed even fatter tails than that (and still died in a fiery financial crash).