Statistical way to properly kick out an outlier

And the geothmetic meandian, or any similar metric, usually ends up looking a lot like the median, as long as the median is one of the measures you throw into the soup.

In the past, I’ve used Peirce’s Criterion to remove small numbers of outliers.

I had to implement it from the original paper due to lack of library support at the time. It was remarkably readable, though. Refreshing compared to most modern papers I read.

I like it because it doesn’t require any fudge factors. It’s based on this idea:

The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations.
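
If it helps anyone, here’s roughly the shape of what I implemented: a minimal sketch following Gould’s 1855 iterative formulation of Peirce’s criterion, for one doubtful observation. This isn’t my original code, and the sample numbers are made up for illustration:

```python
import numpy as np
from scipy.special import erfc

def peirce_threshold(N, n=1, m=1):
    """Squared deviation threshold (in units of sample variance) for
    rejecting n doubtful observations out of N, with m fitted quantities,
    per Gould's (1855) iterative form of Peirce's criterion."""
    if N <= 1:
        return 0.0
    # Nth root of Gould's equation B:
    Q = (n ** (n / N) * (N - n) ** ((N - n) / N)) / N
    r_new, r_old = 1.0, 0.0
    while abs(r_new - r_old) > N * 2.0e-16:
        # Gould's equation A': lambda from the current R
        lam = (Q ** N / r_new ** n) ** (1.0 / (N - n))
        # Gould's equation C:
        x2 = 1.0 + (N - m - n) / n * (1.0 - lam ** 2)
        if x2 < 0:  # no admissible rejection
            return 0.0
        # Gould's equation D: update R
        r_old, r_new = r_new, np.exp((x2 - 1.0) / 2.0) * erfc(np.sqrt(x2 / 2.0))
    return x2

data = np.array([9.8, 10.1, 10.0, 10.2, 9.9, 14.9])
x2 = peirce_threshold(len(data), n=1)
keep = np.abs(data - data.mean()) <= np.sqrt(x2) * data.std(ddof=1)
print(data[keep])  # the 14.9 is rejected; nothing else is
```

No fudge factor anywhere: the threshold falls out of N, n, and m alone. (To hunt for more than one outlier you re-run with n = 2, 3, … until nothing more is rejected.)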

Not buying it. Not for the stated goal of throwing out a particular score.

This is the crux of it.

No statistician I, but lots of the bits described in this thread sound not only like trying to massage the data, but like wanting to be sure the massage results in a happy ending.

I agree, which is why the first paragraph of my post was

Your concern can be assuaged if the methodology is put in place before the results are in. Inflation is a pretty good example. During the 1970s Fed officials would homebrew price indices which eliminated outlying sectors as they appeared: the exercises successfully allowed them to fool themselves. But it’s also true that non-representative sectors (like energy) can be poor measures of underlying inflation. Applying robust indices like the trimmed mean or a median index is one approach. Removing food and energy is the most familiar. The first two take care to trim both sides - both the sectors that have high inflation and the sectors that have low or negative inflation. All of them are valid when applied to eras years after they were proposed.
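
To make the trimming concrete, here’s a toy sketch. The inflation numbers are invented, and scipy’s `trim_mean` is just one convenient implementation:

```python
import numpy as np
from scipy import stats

# Invented sector inflation rates (percent), with one hot sector:
rates = np.array([1.8, 2.0, 2.1, 2.2, 2.3, 2.5, 9.4])

print(rates.mean())                  # plain mean: dragged up by the 9.4
print(stats.trim_mean(rates, 0.15))  # trims ~15% from EACH tail, high and low
print(np.median(rates))              # the fully trimmed limit
```

Note that the trimming is symmetric by construction: the coolest sector goes out along with the hottest one.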

Post hoc methods can also be legit. Does anybody think economic measures during COVID are representative of a normally functioning economy? Certainly not: they are representative of an economy in a medical coma. It’s valid to drop such observations from a dataset or a chart, at least if the subject of the inquiry is long-run trends. (Cyclic studies are another matter.)

Put another way, the median is a valid measure of central tendency. It is also an extreme trimmed mean: the average of the 1 or 2 observations remaining in the dataset after the edges are trimmed all the way to the center.

If it makes any difference, I don’t want to do post hoc analysis. I want to put in place a methodology NOW, so that IF such a scenario occurs in the future, I have a reasonable method established that would allow me to kick out the “5” while keeping the “15.”

Thing is, it shouldn’t “allow” you to do anything; a method either kicks the number out or it doesn’t.

Sounds like you want to have such a method determined a priori that doesn’t bias the data but is still statistically valid for whatever you are trying to measure, without divulging what you are trying to measure.

A priori there should not be any such method that fits all your criteria. There just can’t be. We can’t know what is truly an “outlier” without some knowledge of how we expect the measurements to be distributed.

The thing is, unless you have additional information - such as some underlying knowledge of what you are measuring or that certain values are likely to be in error - you can’t just look at a random list of samples and automatically know a measurement is truly an outlier rather than a valid but rare (or even not so rare) occurrence.

Even the method of tossing out both the high and low numbers can be wrong, though it usually works out OK. But that’s only usually. There’s no guarantee it doesn’t influence the underlying statistics every time you do it, especially if you blindly apply it no matter what you are measuring.

Basically, without realizing it, you are asking how to fudge the statistics without fudging the statistics. Perhaps with knowledge of what was being measured, some kind of procedure could be determined to validly toss out the occasional extreme value, but there’s no such method that can be determined beforehand with no knowledge of what is being measured and how those measurements are being made.

Hope I get this right, from memory and conversations with a statistician friend.

There is an iterative curve-fitting method called “LOWESS” that fits the data, downweights the points that land far from the fit, then refits with those reduced weights and repeats, so outlying points progressively lose their influence.
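
If anyone wants to poke at it, statsmodels has an implementation. A small sketch, with made-up data; `it=3` is the number of robustifying passes:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.2, size=x.size)
y[50] += 5.0  # plant one wild outlier

# Each robustifying iteration (it=3) downweights points with large
# residuals, so the planted outlier barely tugs on the fitted curve.
smoothed = lowess(y, x, frac=0.3, it=3)  # returns columns [x, fitted]
```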

Also, all these things should be tempered by anything you know or expect about the distribution. Is it Gaussian, or what? For example, it could be a Cauchy distribution. A Cauchy distributed population of values will have a relatively tight cluster of values plus one value relatively far away from the cluster. If you toss that one extreme value, and zoom in on that tight cluster, you will find that it now consists of a relatively tight cluster (much smaller than the first) plus one value relatively far away; it’s still a Cauchy distribution, just a tighter one. An example of a system that generates Cauchy distributions would be a laser pointer on a turntable near an infinitely long straight wall – you spin the pointer and let it come to a stop, then measure the location of its spot on the wall.
Interestingly, taking the average of all the values drawn from a Cauchy distribution is a lousy way to estimate its center: the mean of n draws has exactly the same distribution as a single draw, so you’d do just as well to pick any one value at random. Weird! Though you’d do far better still to evaluate the median.
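
A quick simulation (seed and sample sizes chosen arbitrarily) shows what that means in practice:

```python
import numpy as np

rng = np.random.default_rng(42)
draws = rng.standard_cauchy(1_000_000)  # standard Cauchy, centered at 0

for n in (100, 10_000, 1_000_000):
    # The running mean never settles down; the median converges toward 0.
    print(n, draws[:n].mean(), np.median(draws[:n]))
```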

Like I say, I’m saying this from memory and I’m not a statistician. But if it’s interesting, it might be a lead to pursue. Of course anybody who knows these things better than I can come along and straighten me out if they like!

Echoing what others are saying, if you want to do this, you need to know the expected form of the distribution.

The difficulty is that to reject an extreme outlier (the “5”), but not reject a marginal datum (the “15”), you need to distinguish between the two. And that can only be done if you have some reasonable model of the scores.

For example, if you expect a Gaussian distribution, you might decide that anything more than 3 standard deviations from the mean should be rejected. Or maybe it should be 4 or 2.5 standard deviations. All of this requires knowledge and analysis of your scores (and before you have any scores, if you don’t want to be post hoc).
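
As a sketch of what that k-sigma rule looks like (scores invented, echoing the thread’s “15”), with a catch worth seeing: in a small sample the outlier inflates the very standard deviation it’s judged against, so the cutoff you pick really matters:

```python
import numpy as np

def sigma_clip(scores, k=3.0):
    """Split values into (kept, rejected) by a k-sigma rule."""
    scores = np.asarray(scores, dtype=float)
    z = np.abs(scores - scores.mean()) / scores.std(ddof=1)
    return scores[z <= k], scores[z > k]

scores = [88, 90, 91, 93, 94, 15]
print(sigma_clip(scores, k=3.0))  # the 15 survives: it inflated the std it's judged by
print(sigma_clip(scores, k=2.0))  # a tighter cutoff rejects it
```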

Without more specific knowledge of your data, we can’t really give you more specific advice.

Which is why a non-parametric test may be the best approach, and would eliminate the issue of throwing out outliers.
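
For instance (numbers invented), a rank-based test like Mann-Whitney U doesn’t care how extreme the extreme value is, only where it ranks:

```python
import numpy as np
from scipy import stats

a1 = np.array([88, 90, 91, 93, 94, 15.0])
a2 = np.array([88, 90, 91, 93, 94, -1e9])  # same outlier, made absurdly extreme
b = np.array([87, 89, 90, 92, 95.0])

# Identical results: only the ranks enter the statistic.
print(stats.mannwhitneyu(a1, b))
print(stats.mannwhitneyu(a2, b))
```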

Of course, by removing the point from the sample, you will be affecting the parameters of the distribution, including the variance (the standard deviation is the square root of the variance). If the point is truly an outlier it shouldn’t impact the parameters of your estimate significantly, but you would need a criterion for how to evaluate that. If you actually have an expected distribution of comparable size you can back into this by applying Student’s t-test (or Welch’s t-test if your expected distribution is a significantly larger sample) and applying some specified level of confidence.
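
A sketch of that comparison, with both samples invented; `equal_var=False` is how scipy spells Welch’s version:

```python
import numpy as np
from scipy import stats

observed = np.array([88, 90, 91, 93, 94, 15.0])
expected = np.array([87, 89, 90, 90, 92, 93, 94, 95.0])  # reference sample

# Welch's t-test with and without the suspect point:
print(stats.ttest_ind(observed, expected, equal_var=False))
print(stats.ttest_ind(observed[:-1], expected, equal_var=False))  # drop the 15
# Compare the p-values against a confidence level chosen in advance.
```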

Of course, this assumes that the expected distribution is Gaussian. Most real world samples that are not a result of random variation (or error) about a central mean or measurement are not actually Gaussian, and trying to force a non-Gaussian phenomenon into a Gaussian distribution is a frequent cause of statistical malfeasance and resultant bad predictions.

Well, except non-parametric methods require large amounts of data and carry a higher degree of subjectivity about what is truly an ‘outlier’ versus a legitimate measurement that just happens to land at an unexpected extremum of the phenomenon being sampled. The purpose of using a non-parametric approach is to avoid imposing a rigid presumption of what the distribution will be, so throwing out ‘unexpected’ results without an epistemic justification is even more suspect. This obviously gets into the philosophy of statistical measurement and prediction and is probably beyond the scope of what the OP is looking at, but it is important for illustrating that there is no simple or universal method of identifying outliers in a sample.

Stranger

LOESS. We talked about it, but never used it…

Did You Know? | National Centers for Environmental Information (NCEI).

I agree with this. When I’ve used nonparametric fits, I’ve essentially taken them as given, and assumed things that look like outliers were just extremes still within the distribution. I wouldn’t know how to infer otherwise; that’s rather the point of it being non-parametric.

My point exactly. Using a non-parametric test would eliminate “needing” to throw out data.

It sounds like you want to manipulate the data IF you have an outlier you don’t like. This introduces your own bias into the analysis, which invalidates the statistically valid analysis you are attempting to achieve. If you want to put in place a methodology that fairly treats all evaluations, you have to apply that method regardless of the evaluations, not only when there is an outlier that you don’t like.

That is what I want to do. I don’t want to throw out a number because I don’t like it. I want a means to throw out a number because it is wildly out of step with the others. And I want to be able to use that measure for any number that meets this criterion.

I can’t just throw out a number because the reviewer is a jerk and doesn’t like atheists, so he scores from his bias rather than from the quality of the work. But I do want a means to throw out a number legitimately, without having to say… I don’t like you because you are a jerk.

Jackknifing

Bootstrapping

Handjobbing
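
Kidding aside, the first two are real resampling tools. A rough sketch of each (scores invented): the jackknife shows how much each point, on its own, moves the mean; the bootstrap shows how unstable the mean is overall:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = np.array([88, 90, 91, 93, 94, 15.0])

# Jackknife: leave each point out in turn. A point whose removal shifts
# the mean dramatically is influential (not automatically invalid).
loo_means = np.array([np.delete(scores, i).mean() for i in range(scores.size)])
print(loo_means - scores.mean())

# Bootstrap: resample with replacement and look at the spread of the mean.
boot_means = rng.choice(scores, size=(10_000, scores.size)).mean(axis=1)
print(np.percentile(boot_means, [2.5, 97.5]))
```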

This is the point people have been trying to make. How do you know with certainty it is wildly out of step? This means you already have an idea of what the number “should” be.

If you know a particular source of data is not good - you don’t use that source. Right now, you seem to want it both ways, i.e. wanting something you can claim is objective while knowingly including subjective judgments (both yours and the biased reviewer).

Something has to give, whether that’s the statistical validity of what you are doing, somebody’s feelings, or something else. Once you decide to include garbage numbers for whatever reason, you’re stuck with them. Better not to have them in the first place.

Then your premise of a statistically valid analysis is incorrect.