Statistics question. Can results be confirmed by splitting the data?

I was told this once years ago and now have a hard time convincing anyone of it. Perhaps I’m remembering it wrong?

The idea is that if you have conducted a test and arrived at a finding you want to corroborate, you can just split the data into random subsets and see whether the results still hold.

Say you want to test the life expectancy of fruit flies that eat a certain radioactive cantaloupe. You find that 10% died within 8 hours and 90% died within 48 hours. Now, if you rerun the tape and divide the flies into four groups, eeny, meeny, miny, moe, then if the finding is a “hard law,” each group will have yielded the same data. Thus, it’s as though you’d run five tests.

Does that make sense? Is it really the same as running five tests?

There would need to be a minimum amount of data, but that’s the basic idea of sampling, isn’t it? Take a small percentage of the whole, and trust that it corresponds to the whole. So the inverse should hold true, right?

I would think that mathematically, yes, given that your subsets are appropriately large, that would work. However, in actual practice, any flaws in the data-gathering process would be shared by every subset. So depending on how vulnerable your sampling technique is to factors you haven’t accounted for, analyzing your subsets would be variably meaningful.

IANAStatistician, though.

OP seems to be talking about some form of resampling method, which allows you to estimate statistical significance or validate a statistical model.

You can’t get additional information, as though you’d conducted additional tests, in this fashion (how could you?). You may be thinking of some kind of statistical resampling technique. These are used to assess the amount of statistical uncertainty, not to reduce it. In the example you describe, you’d typically just use standard statistical methods; resampling is used for more complicated cases.
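To make the “assess, not reduce, uncertainty” point concrete, here’s a rough sketch of one such resampling technique, the bootstrap. The survival times are made-up numbers, not the OP’s cantaloupe data:

```python
import random

random.seed(0)

# Hypothetical survival times (hours) for 20 fruit flies -- invented numbers.
times = [8, 8, 30, 41, 25, 47, 33, 19, 44, 28,
         36, 12, 40, 22, 46, 31, 27, 38, 15, 43]

# Bootstrap: resample WITH replacement many times and look at the spread of
# the statistic (here, the mean survival time). This quantifies how uncertain
# the estimate already is; it does not manufacture new information.
means = []
for _ in range(1000):
    resample = [random.choice(times) for _ in times]
    means.append(sum(resample) / len(resample))

means.sort()
lo, hi = means[25], means[975]  # rough 95% interval
print(f"mean = {sum(times) / len(times):.1f}, "
      f"bootstrap 95% interval ~ ({lo:.1f}, {hi:.1f})")
```

The interval it prints is a statement about uncertainty in the one sample you have; running the loop longer tightens the estimate of that interval, not the interval itself.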

I’m afraid that I think you’re remembering it wrong.

Good statistical methods will give you the same result whether you imagine the data as one big experiment or five (identical) smaller experiments. If you divide the data into five groups, you get more results, but each result is a lot less reliable, and those exactly cancel out.

And of course, since they’re using the exact same data, the five small groups will always (as a whole) give the exact same answer as the test on one big group, so it’s basically double-checking the math, but doesn’t tell you anything at all about whether the conclusion has any relation to the real world.
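The “five small groups give back the big-group answer” point can be checked with a toy calculation (invented numbers, equal-sized groups assumed):

```python
# Invented death times for 20 flies; split them eeny-meeny into 4 groups.
times = [8.0, 7.5, 30.0, 41.0, 25.0, 47.0, 33.0, 19.0, 44.0, 28.0,
         36.0, 12.0, 40.0, 22.0, 46.0, 31.0, 27.0, 38.0, 15.0, 43.0]

groups = [times[i::4] for i in range(4)]  # round-robin split, 5 flies each

# Because the groups partition the data into equal sizes, the overall mean
# is exactly the average of the group means -- no new information appears.
overall = sum(times) / len(times)
group_means = [sum(g) / len(g) for g in groups]
combined = sum(group_means) / len(group_means)

print(overall, combined)  # identical up to float rounding
```

Which is the double-checking-the-math point: the recombined subgroups can’t disagree with the pooled test.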

Now, there might be situations where, based on what you already know about the real world, you might want to break down your data in various ways; for instance seeing if the conclusion holds for children considered separately from adults, or something, but that’s completely based on the particulars of what you’re looking at.

I do know some stats. First of all, the original sample needs to be a random sample of fruit flies, and your subsamples must also pass the randomness test. For example, if the subsamples were selected based on some criterion that is somehow correlated with life expectancy, then it wouldn’t work. Also, after you’ve split the original data into smaller samples, each sample must be large enough to justify whatever statistical test you are using. But, frankly, getting the same results from the original data as one big sample or as multiple subsamples does not give you a stronger result. What would be more interesting is to show the variation in the results when you slice up the data.
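That last point, looking at the variation across slices rather than treating them as confirmations, might look something like this (fabricated data):

```python
import random

random.seed(1)

# Fabricated sample of 200 survival times (hours), drawn from a normal
# distribution just for illustration.
times = [random.gauss(30, 10) for _ in range(200)]

# Slice into 5 random subsamples and report each slice's mean:
random.shuffle(times)
slices = [times[i::5] for i in range(5)]
for i, s in enumerate(slices):
    print(f"slice {i}: mean = {sum(s) / len(s):.1f}  (n={len(s)})")

# The spread among these slice means is the interesting part -- it shows
# how much the estimate wobbles, not a "stronger" confirmation.
```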

If you take a random sample of observations, and find an association between two variables in that sample, it is possible that the association is partly due to random variation (i.e., you just happened to draw a sample in which those two variables, by chance, were more highly associated than in the overall population). If this is true, then if you take a number of other random samples, the association in those samples should, on average, be lower. By the same token, if you do not find a lower degree of association in the other subsamples, you can assume that the association you found initially was not inflated by random variation.

Splitting a sample into a number of subsamples is formally equivalent to the process described above. You do the splitting prior to any testing of associations; then you use one subsample to look for the association, and use the other subsamples to confirm it. This procedure is more typically used when you are building multivariate models, which are more prone to capitalizing on random variation.

The drawback to this is that the subsamples, having fewer observations than the overall sample, provide you with less power to detect real associations.

Basically, this procedure allows you to rule out the possibility that your results are affected by random variation. It does not allow you to rule out the possibility that they are affected by systematic bias.
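Here’s a sketch of that split-first procedure on fabricated data with a genuinely negative dose–survival association baked in, using a hand-rolled Pearson correlation:

```python
import random

random.seed(2)

# Fabricated data: x = cantaloupe dose, y = survival time (hours),
# with a real negative association plus noise.
data = []
for _ in range(300):
    x = random.uniform(0, 10)
    y = -2 * x + 60 + random.gauss(0, 3)
    data.append((x, y))

def corr(pairs):
    """Pearson correlation, computed by hand."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5

# Split FIRST, before looking at anything:
random.shuffle(data)
discovery, confirmation = data[:150], data[150:]

print("discovery    r =", round(corr(discovery), 2))
print("confirmation r =", round(corr(confirmation), 2))
# If the discovery-half association were mostly luck, the confirmation half
# would show a much weaker one; a real effect survives the split.
```

Note this tells you nothing about systematic bias: if the thermometer was broken in the original experiment, it was equally broken in both halves.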

There’s also the related notion of cross-validation for predictive modeling. That’s clearly not what the OP had in mind, but it’s close enough that I think it’s worth mentioning.
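For completeness, a minimal hand-rolled k-fold cross-validation sketch, again on fabricated dose/survival data, fitting a least-squares line on each training portion and scoring it on the held-out fold:

```python
import random

random.seed(3)

# Fabricated (dose, survival) pairs with a linear relationship plus noise.
data = []
for _ in range(100):
    x = random.uniform(0, 10)
    data.append((x, -2 * x + 60 + random.gauss(0, 4)))

def fit_line(pairs):
    """Least-squares slope and intercept."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    return slope, my - slope * mx

# 5-fold cross-validation: each fold takes a turn as the held-out test set.
random.shuffle(data)
folds = [data[i::5] for i in range(5)]
errors = []
for i in range(5):
    test = folds[i]
    train = [p for j, f in enumerate(folds) if j != i for p in f]
    a, b = fit_line(train)
    errors.append(sum((y - (a * x + b)) ** 2 for x, y in test) / len(test))

print("per-fold mean squared error:", [round(e, 1) for e in errors])
# Averaging these out-of-sample errors estimates predictive accuracy
# without ever scoring the model on data it was fit to.
```

Unlike the OP’s scheme, the point here is honest estimation of out-of-sample performance, not extra confirmations of one result.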

Oh, well put. This is what I was trying to get at above, and might have said something like, if I knew what I was talking about.

I count 4 tests. What you originally considered as your first test turned out to be a dependent composite of those 4, not an independent 5th test.