Checking Parity of Data Sets?

I have a complete data set Q. From Q, we calculate a set of derived data but, before doing so, we drop about 4% of the records that we have probabilistically determined to be spam.

We have developed a new set of tools that performs these calculations, and which also drops about 5% of the data but uses a different method of spam detection. In general, we expect about a 1% difference in what is kept versus dropped: the old method threw out slightly different things than the new one does. Much will overlap, but some will not.

We want to ensure that our new technology is producing the same results as the old tools but, because the data being dropped is not quite the same, we know that there will be some variance between the two result sets. More importantly, we know that values that are less common in the results will vary by a greater relative amount.

For example, if the old result found 2 penguin owners in a population of a million people, then having that go up to 3 or down to 1 is a giant jump relative to the original. The underlying data might have changed by only 1%, but our penguinOwnerCount has changed by 50%.
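Just to make that arithmetic explicit (toy numbers from the example above, nothing more):

```python
# Toy numbers from the penguin-owner example: a rare value swings wildly
# in relative terms even though the underlying data barely changed.
old_count = 2
new_count = 3
population = 1_000_000

share_of_population = old_count / population
relative_change = abs(new_count - old_count) / old_count

print(f"share of population: {share_of_population:.6%}")  # 0.000200%
print(f"relative change:     {relative_change:.0%}")       # 50%
```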

Is there a formula that says: given a 1% change in the dataset, and given the size of a particular subset as a proportion of the whole, we should expect the new subset size to fall within bounds M and N?
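To show the kind of answer I am hoping for, here is a rough sketch of how I imagine such a bound might work. It assumes (and I have not verified this) that each record the old pipeline counted in a subset is independently classified differently by the new spam filter with probability p of about 1%, so the number of flipped records is Binomial(k, p); the function name `expected_bounds` and the use of scipy are purely illustrative, not an existing method of ours.

```python
# A rough sketch, not a verified method: assume each of the old_count records
# in a subset is independently re-classified (kept vs. dropped) differently by
# the new spam filter with probability p. The number of flipped records is then
# Binomial(old_count, p), and M / N are taken from a confidence interval on it.
from scipy.stats import binom

def expected_bounds(old_count, p=0.01, confidence=0.95):
    # Largest number of flipped records we'd expect at this confidence level.
    _, hi = binom.interval(confidence, old_count, p)
    hi = int(hi)
    return old_count - hi, old_count + hi

# For a tiny subset the binomial barely moves; for a large subset the bounds
# are wide in absolute terms but tight relative to old_count.
print(expected_bounds(2))
print(expected_bounds(50_000))
```

This clearly ignores records that get pulled into a subset from outside it, which is exactly where my confidence in this framing runs out.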

Perhaps one of the statistical location tests listed on this page?