Checking Parity of Data Sets?

I have a complete data set Q. From Q, we calculate a set of derived data but, before doing so, we drop about 4% of the records that we have probabilistically determined to be spam.

We have developed a new set of tools that performs these calculations, and which also drops about 5% of the data but uses a different method of spam detection. In general, we expect about a 1% difference in what is kept versus dropped: the old method threw out slightly different things than the new one does. Much will overlap, but some will not.

We want to ensure that our new technology is producing the same results as the old tools but, because the data being dropped is not quite the same, we know that there will be some variance between the two result sets. More importantly, we know that values that are less common in the results will vary by a greater relative amount.

For example, if the old result found 2 penguin owners in a population of a million people, then having that go up to 3 or down to 1 is a giant jump relative to the original. The underlying data might have changed by only 1%, but our penguinOwnerCount has changed by 50%.
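Just to make that arithmetic explicit (toy numbers from the example above, nothing more):

```python
# Toy numbers from the penguin-owner example: a rare value swings wildly
# in relative terms even though the underlying data barely changed.
old_count = 2
new_count = 3
population = 1_000_000

share_of_population = old_count / population
relative_change = abs(new_count - old_count) / old_count

print(f"share of population: {share_of_population:.6%}")  # 0.000200%
print(f"relative change:     {relative_change:.0%}")       # 50%
```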

Is there a formula that says: given a 1% change in the dataset, and given the size of a particular subset as a proportion of the whole, we should expect the new subset size to fall within bounds M and N?
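To show the kind of answer I am hoping for, here is a rough sketch of how I imagine such a bound might work. It assumes (and I have not verified this) that each record the old pipeline counted in a subset is independently classified differently by the new spam filter with probability p of about 1%, so the number of flipped records is Binomial(k, p); the function name `expected_bounds` and the use of scipy are purely illustrative, not an existing method of ours.

```python
# A rough sketch, not a verified method: assume each of the old_count records
# in a subset is independently re-classified (kept vs. dropped) differently by
# the new spam filter with probability p. The number of flipped records is then
# Binomial(old_count, p), and M / N are taken from a confidence interval on it.
from scipy.stats import binom

def expected_bounds(old_count, p=0.01, confidence=0.95):
    # Largest number of flipped records we'd expect at this confidence level.
    _, hi = binom.interval(confidence, old_count, p)
    hi = int(hi)
    return old_count - hi, old_count + hi

# For a tiny subset the binomial barely moves; for a large subset the bounds
# are wide in absolute terms but tight relative to old_count.
print(expected_bounds(2))
print(expected_bounds(50_000))
```

This clearly ignores records that get pulled into a subset from outside it, which is exactly where my confidence in this framing runs out.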

Perhaps one of the statistical location tests listed on this page?