# Standard statistical procedure for this data cleaning need?

I have a huge time series of response variables from a physical experiment. In this experiment we are trying to evaluate dozens of different conditions: we change things to each new condition, wait a while to let things settle out, record a few time intervals under the new condition, and then move on to yet another condition. When processing data afterwards, I’m identifying what look like nice stretches to represent each of the different conditions.

But maybe there’s a standard method in statistics for accomplishing this, and if I knew what it was called I might find out that it is available in my environment.

What I’m really going through is picking a time interval when I think things are stable, and then testing that pick by looking at graphs of the response variables in that interval. In each case I’d like to see what looks to my eye like Gaussian noise. What I don’t want to see is points at either end of the interval that look like outliers relative to all the other points in the interval. This is most stark for the last points in the interval, because too wide an interval will catch points that are running rapidly off toward a new condition, meaning I accidentally included points recorded just after making a change. It’s more subtle at the beginning of an interval, because there the question is whether I waited long enough before opening the interval.

But in either case I’m doing something like asking what a t-test would say about whether this endpoint seems a likely member of the population represented by all the other points already included in the interval.

I might try to automate what I’ve been doing the hard way by hand. I’d probably iteratively consider building an interval from some point obviously fairly stable, first by stepping later and later in time until I hit an obvious outlier, and then similarly extending the interval earlier and earlier in time.
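For what it’s worth, here is a minimal Python sketch of that hand procedure, under my own assumptions: the prediction-interval form of the t statistic for testing a single new observation, an arbitrary significance level, and a seed interval you pick as "obviously fairly stable". All names and thresholds here are placeholders, not anything from the thread.

```python
import numpy as np
from scipy import stats

def endpoint_belongs(interval, candidate, alpha=0.01):
    """Test whether `candidate` plausibly comes from the same population
    as the points already in `interval` (prediction-interval t-test)."""
    n = len(interval)
    mean = np.mean(interval)
    sd = np.std(interval, ddof=1)
    # t statistic for one new observation against a sample of size n
    t = (candidate - mean) / (sd * np.sqrt(1.0 + 1.0 / n))
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)
    return p > alpha

def grow_interval(y, seed_start, seed_end, alpha=0.01):
    """Extend [seed_start, seed_end) later in time, then earlier,
    stopping when a candidate endpoint fails the membership test."""
    lo, hi = seed_start, seed_end
    while hi < len(y) and endpoint_belongs(y[lo:hi], y[hi], alpha):
        hi += 1
    while lo > 0 and endpoint_belongs(y[lo:hi], y[lo - 1], alpha):
        lo -= 1
    return lo, hi
```

On a series that sits near zero and then jumps to a new level, `grow_interval` started from a small stable seed will run the right edge up to the jump and the left edge back to the start of the stable stretch.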

But did somebody already describe a standard method for doing this job?

Thank you!!

It’s been a long time since I did this. I used LabVIEW for data acquisition and then MATLAB for processing.

Stable periods: a good approach is to do a fast Fourier transform of your time series and look at the frequency content of the noise in your data. This way you can identify the noise and then use a band-pass filter to reduce it (if needed).

Once the noise is eliminated, I averaged the d/dt of the time series data over several readings. Based on a threshold, the time-averaged d/dt going to 0 is a good indicator of stable periods.
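A rough numpy sketch of that second step (the moving-average window and the near-zero tolerance are placeholders to be tuned to the data; band-pass filtering of identified noise frequencies would come first):

```python
import numpy as np

def stable_mask(y, t, window=5, tol=0.05):
    """Flag samples where the time-averaged derivative is near zero.

    `window` and `tol` are placeholders; pick them based on the
    noise level seen after any filtering.
    """
    dydt = np.gradient(y, t)
    # moving average of the derivative over `window` readings
    kernel = np.ones(window) / window
    avg = np.convolve(dydt, kernel, mode="same")
    return np.abs(avg) < tol
```

Applied to a flat signal with one step change, the mask is true on the plateaus and false around the step, where the averaged derivative spikes.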

Good luck

If you have MATLAB available, then a low-pass filter followed by the del2 Laplacian operator would get you the areas where the gradient of the data is zero. You could also try a Sobel operator, which is more of an edge-finding tool, but where there are no edges there isn’t much of a gradient or change in the data either.
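For anyone without MATLAB, scipy.ndimage has rough one-dimensional analogues of those operators; a sketch (the Gaussian width and the tolerance are my assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, sobel

def flat_regions(y, sigma=2.0, tol=0.05):
    """Low-pass (Gaussian) filter, then a Sobel edge detector;
    where the Sobel response is near zero there is little gradient,
    i.e. a candidate flat region."""
    smooth = gaussian_filter1d(y, sigma)
    edges = sobel(smooth)
    return np.abs(edges) < tol
```

On a unit step, the returned mask is true well away from the step and false right at it, where the smoothed edge produces a strong Sobel response.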

While adaptive techniques based on the data are useful, you might consider doing a little analysis first. Does your system have a “time constant” that characterizes how long it takes to reach equilibrium? If it does, simply discard data until several time constants have passed.

For an adaptive technique, I’d slice the data into intervals of the same length. Compute statistics of each slice (mean, variance, skew, kurtosis, etc.). Then do a test to see how similar each slice is to the previous and next slice.
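A Python sketch of that slicing comparison, using Welch’s t-test between adjacent slices (the slice length is a placeholder, and a real version might compare variances and higher moments too):

```python
import numpy as np
from scipy import stats

def slice_similarity(y, slice_len):
    """Split `y` into equal-length slices and t-test each slice
    against the next; small p-values flag a change between slices."""
    n_slices = len(y) // slice_len
    slices = [y[i * slice_len:(i + 1) * slice_len] for i in range(n_slices)]
    pvals = []
    for a, b in zip(slices, slices[1:]):
        # Welch's t-test: does not assume equal variances
        _, p = stats.ttest_ind(a, b, equal_var=False)
        pvals.append(p)
    return pvals
```

Two slices drawn from the same level give a large p-value; a slice straddling a condition change against its neighbor gives a tiny one.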

You mean that adjacent points are related, until you come to a change of conditions, where there is a breakdown of correlation? Sounds like it might be a case for an auto-correlation test.

If the points in the interval really are related, you should see good autocorrelation over the interval, then poor autocorrelation over intervals including the transition.

Autocorrelation is a sliding-length function, which should give you the first transition. Then you need to do a moving start, because otherwise you’ll be trying to correlate the first interval with the second interval and the third interval etc – which will mess up the indication of subsequent transitions.
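A sketch of a sliding-window lag-1 autocorrelation in Python (the window length is an assumption, and whether a transition shows up as a spike or a drop in the statistic depends on how white the within-condition noise is, so the threshold direction needs checking against real data):

```python
import numpy as np

def rolling_lag1_autocorr(y, window):
    """Lag-1 autocorrelation in each sliding window of `y`.
    Windows that straddle a level shift behave very differently
    from windows of pure within-condition noise."""
    out = np.full(len(y), np.nan)
    for i in range(len(y) - window + 1):
        w = y[i:i + window]
        a, b = w[:-1], w[1:]
        denom = np.std(a) * np.std(b)
        if denom > 0:
            out[i + window // 2] = np.mean((a - a.mean()) * (b - b.mean())) / denom
    return out
```

For example, on a signal whose within-condition "noise" alternates up and down, windows inside one condition show strong negative lag-1 autocorrelation, while a window straddling a large level shift shows strong positive autocorrelation: the statistic flips sign at the transition.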

Of course, I don’t know your data. You may have to do an initial transformation to get data that shows the correlation you are looking for.

Wouldn’t it make sense to exclude points from after you made a status change from being counted in the previous status’s data? I understand that you might want to pick up a few extra points before the status change kicks in, but those points are by definition contaminated.

This may be a way-low-tech suggestion, but if you are manually effecting the changes to the conditions, have you thought about manually controlling the data collection? As @Pleonast asked, if there is an assumed or observed time constant for the system to settle after each adjustment, and you can control the data recording, I would create a new data set for each trial for each set of inputs. If you are confident that your measuring equipment isn’t drifting over time this should eliminate the need for filtering the data which may have unexpected side effects.

This is strictly outside my area of expertise. So feel free to laugh and point after reading my thoughts.

I cringe a little (a lot?) at the idea of “I’m taking data then eyeballing which to discard and which to count”. That seems to just be begging for invalid conclusions.

You’re apparently not doing anything like a double blind trial. But still the subjective nature of your up-front editing feels to unqualified me like an uncontrolled and potentially gigantic thumb on the scale of your conclusions.

I’m not sure precisely what your data looks like, but I’ve used Peirce’s Criterion in the past for outlier rejection.

It’s good to see that there’s a whole Wikipedia article on it now. When I implemented it in my own code, I had to go to the original 1852 paper.

I found that it works best when there are a very small number of outliers, say one or two glitch points in the time series. But it may be ok for larger numbers as long as you have a large enough sample set.
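For reference, the usual iterative procedure for the Peirce threshold (Gould’s method) can be transcribed to Python roughly as below; this is my own transcription and worth checking against the original paper or published tables before trusting it:

```python
import numpy as np
from scipy.special import erfc

def peirce_threshold_sq(N, n=1, m=1):
    """Squared threshold deviation ratio (x/sigma)^2 from Peirce's
    criterion via Gould's iterative procedure: N observations,
    n doubtful points, m model unknowns."""
    N, n, m = float(N), float(n), float(m)
    if N <= 1:
        return 0.0
    # Nth root of Gould's equation B
    Q = (n ** (n / N) * (N - n) ** ((N - n) / N)) / N
    r_new, r_old = 1.0, 0.0
    x2 = 0.0
    while abs(r_new - r_old) > N * 2e-16:
        # lambda from Gould's equation A'
        lam = ((Q ** N) / (r_new ** n)) ** (1.0 / (N - n))
        # x-squared from Gould's equation C
        x2 = 1.0 + (N - m - n) / n * (1.0 - lam ** 2)
        if x2 < 0:
            x2 = 0.0
            break
        r_old = r_new
        # update R from Gould's equation D
        r_new = np.exp((x2 - 1.0) / 2.0) * erfc(np.sqrt(x2 / 2.0))
    return x2
```

A point would then be rejected when its absolute deviation from the sample mean exceeds `sqrt(peirce_threshold_sq(N)) * s`, with `s` the sample standard deviation; rejected points are removed and the test repeated with a larger `n`. For N = 10 the threshold ratio comes out near 1.88 standard deviations, consistent with the published tables.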

Thank you everybody for the suggestions!

I want to get unsmoothed data for the nice stretches. I need to be able to do simple univariate statistics on the measurements inside each nice stretch.

The time-constant idea is already kind of baked in. There are multiple layers of time constants in the system, and my raw points are each parametric results from nonlinear fits of exponential decay models. One way of stating my quest is that I want to achieve a high ratio of random noise to decay residual in the nice stretches. My data are strongly heteroskedastic in the sense that different nice stretches will have different time constants, so a decay model is only one of many options for modeling what a nice stretch should look like; other candidates include Peirce’s criterion (which does not take advantage of the ordered nature of the data) and analyses like LOESS or LOWESS (which do). But because some nice stretches are brief, I think it’d be better to build a simpler test, like Student’s t-test, into any iterative nice-stretch search.

I like the slicing idea, but the data are too sparse. Some of the nice stretches have as few as about six points in them.

Looking at autocorrelation is in the general category of looking at a candidate nice stretch and deciding whether the leading or trailing end point probably belongs or probably doesn’t belong. I think it’s a perfectly good idea, and in fact I think my subjective eyeballing is trying to do that. And I might write the code to do it. What I was looking for is whether others have already worked out the algorithm, added thoughtful extras that wouldn’t have occurred to me, and named it something I could look up and perhaps find available in my environment (I’m using SAS for this project).

Often I can use my physical measurement method with control, so that the time of a condition change is known within the data set, or even chosen on the basis of evolving data. But in some applications of my physical measurement method I don’t have independent access to condition change information. In effect I have to detect it in post processing.

Hey LSLGuy! Yes, the subjective nature of my current method is most… unfortunate. Now, I’m somewhat blinded by the complexity of the project. There are dozens of nice stretches to be identified, and keeping track of which one corresponded with which conditions is somewhat beyond my memory; I have to do it in other code, and it’s a full factorial experiment 23232 plus some semi-randomly timed augmented points (because they come for free, having to do with daily and weekly schedules), so I quickly get lost. I should say also that this four-month experiment involves physical experiments and CFD simulations that are paired; I’m getting the physical experimental data first and following those with the simulations. What I can’t help hoping for (what with being human and all) is agreement between these. But when I analyze nice stretches, I’m producing the physical-experiment component, and then the simulation is going to match better or worse without my ability to influence it. So I’m still insulated in that sense, and blind to the CFD result (because it’s still in the future at the time).

I think what we have here are a lot of useful ideas on how to write code that would automate the search – but I still haven’t heard that there’s a name for this search which I might happily find listed amongst the myriad SAS procedures available to me, many still unknown.

I should say something else about my eyeballing method. To be clear, I’m defining “nice stretches” to include all points between a start and end time, and no points outside those times. For most stretches, when I’m done naming the beginning and ending times, I can say that there are n points in a row whose values are scattered in a Gaussian-looking way, but if I added a point at the late end, it would lie way outside the point cloud. I think it’s easy to say that later point should not be included, and I bet in many cases it would be ten or a hundred or even a thousand standard deviations outside the cloud, which might have six to fifty points in it.

It is harder to decide about points at the beginning of an interval, because signals are generally still changing slowly there. However, if adding a candidate earlier point would make it the highest or lowest point in the population, I’m inclined to exclude it.

Most stretches can be defined this way.
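Those two endpoint rules could be codified roughly like so in Python (the z threshold for the trailing test is a placeholder standing in for "way outside the point cloud"; the min/max rule is exactly as stated above):

```python
import numpy as np

def keep_trailing(cloud, candidate, z_max=4.0):
    """Keep a trailing candidate unless it sits far outside the cloud,
    measured in sample standard deviations."""
    z = abs(candidate - np.mean(cloud)) / np.std(cloud, ddof=1)
    return z < z_max

def keep_leading(cloud, candidate):
    """Keep a leading candidate only if it would NOT become
    the new minimum or maximum of the population."""
    return min(cloud) < candidate < max(cloud)
```

So a trailing point hundreds of standard deviations out is dropped, while a leading point strictly inside the existing range is kept.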

Unfortunately some stretches are changing all the while, in which case I try to apply the same thinking to the time trend, or to what I’d expect the residuals to be after fitting a linear function of time. I’m not actually calculating this, just judging by eye. I’m also thinking future experiments should have longer wait times when I see this. Maybe I can come back later and model how long the wait time should be as a function of the independent physical variables.
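That linear-trend variant, sketched with a hypothetical numpy helper: fit and remove the trend by least squares, then apply the same endpoint tests to the residuals instead of the raw values.

```python
import numpy as np

def detrended_residuals(t, y):
    """Fit y = a*t + b by least squares and return the residuals,
    so the endpoint/outlier tests can be applied to them."""
    a, b = np.polyfit(t, y, 1)
    return y - (a * t + b)
```

For a stretch that drifts linearly with small noise on top, the residuals have essentially zero mean and stay on the scale of the noise, so the same "is this endpoint an outlier" reasoning carries over.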

I should also say there are a dozen response variables with varying degrees of mutual correlation, and all of the plots of these have to look right, so if any of the variables have an outlier at either end I shorten the interval. I go through a few iterations where I generate all the graphs of the nice stretches without including points outside the stretches, and keep knocking points off the ends when these graphs show the first or last point is an outlier.

I’m not knocking out points that look like outliers unless they are at the end of an interval.