Is it possible to determine if a large set of numbers has been faked?
My thinking is that most people, if asked to choose a number at random, seem to select odd numbers for some reason. Also, I’ve read that most people underestimate the number of repetitions in a random sequence.
Can this information be used to assess a document’s authenticity? Are there any other clues to look for?
It would depend on the properties of the data set.
Is it large? What are the numbers ‘supposed’ to be? For example, if you ask me to check whether the doors on this floor are open (1) or closed (0), the fact that most doors are hydraulically shut, and that the sample size (30 or so doors) and range (0 or 1) are both tiny, means that I can just sit in my chair and come up with ‘reasonable’ fake data.
If you asked me to simulate 10,000 random numbers between 0 and 1000, then I’m confident my numbers wouldn’t be especially ‘random’.
Yes, if it’s faked badly. For numbers that can span several orders of magnitude (e.g. financial records), the first digit is more likely to be a small number: amounts beginning with 1 (1x, 1xx, 1xxx, etc.) are far more common than amounts beginning with 9 (9x, 99x, 999x, etc.). This is Benford’s law; here is a bit more info on it.
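A rough sketch of how you’d run that check, with made-up data: tally the leading digits and compare them to the Benford expectation with a chi-squared statistic (the function names and test data here are my own invention, just for illustration).

```python
import math
from collections import Counter

def first_digit(x):
    """Leading nonzero digit of a positive number."""
    return int(str(x).lstrip("0.")[0])

def benford_chi2(numbers):
    """Chi-squared statistic comparing observed leading-digit counts
    with the Benford probabilities log10(1 + 1/d)."""
    counts = Counter(first_digit(x) for x in numbers if x > 0)
    total = sum(counts.values())
    return sum((counts.get(d, 0) - total * math.log10(1 + 1 / d)) ** 2
               / (total * math.log10(1 + 1 / d))
               for d in range(1, 10))

# Uniformly distributed amounts make every leading digit equally common,
# which badly violates Benford, so the statistic comes out huge:
fake = list(range(100, 1000))
# Exponential growth (powers of 2) follows Benford closely:
growth = [2 ** k for k in range(1, 200)]
print(benford_chi2(fake), benford_chi2(growth))
```

A large statistic doesn’t prove fraud, of course; it just says the numbers don’t look like the kind of data Benford applies to, which is a reason to dig deeper.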
For some types of experiments and surveys, you can calculate the expected variation in results. If the reported data doesn’t conform to expectations, it’s a good reason to suspect some type of fraud, or at least a cleaning-up of the data. Mendel’s data on pea plants shows less variation than expected, for example, which indicates some kind of bias (e.g. throwing out data sets which don’t fit the model, or some wishful thinking creeping in when trying to decide whether a particular pea is wrinkled or not).
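To make the “less variation than expected” idea concrete, here’s a sketch with simulated Mendel-style data (the counts, helper name, and threshold are all hypothetical): replicate counts of dominant-trait peas should scatter around the 3:1 ratio with binomial variance n·p·(1−p), and data that hug the ratio too tightly give a variance ratio far below 1.

```python
import random

def variance_ratio(counts, n, p=0.75):
    """Observed variance of dominant-trait counts divided by the
    binomial expectation n*p*(1-p). Near 1.0 is normal; far below
    1.0 suggests trimming or wishful scoring."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return var / (n * p * (1 - p))

random.seed(1)
# 50 honest replicates of 100 peas each, true 3:1 ratio:
honest = [sum(random.random() < 0.75 for _ in range(100)) for _ in range(50)]
# 50 "replicates" that are all suspiciously close to exactly 75:
too_good = [75 + random.choice([-1, 0, 1]) for _ in range(50)]

print(variance_ratio(honest, 100))    # near 1.0
print(variance_ratio(too_good, 100))  # far below 1.0 -> red flag
```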
There are other ways. C.E.M. Hansel, in his book ESP and Parapsychology, demonstrated how faked data were detected in a series of test results because one number appeared anomalously often in every fifth position. Plenty of statistical methods can be used to detect anomalies; look in any book on statistics.
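A minimal sketch of that kind of positional check (the function and test sequence are my own, not from Hansel): in a genuinely random sequence no value should prefer particular positions, so compare a value’s frequency at every fifth position with its overall frequency.

```python
def fifth_position_excess(seq, value):
    """How much more often `value` appears at indices 0, 5, 10, ...
    than its overall rate predicts (a ratio; ~1.0 is unremarkable)."""
    at_fifth = [x for i, x in enumerate(seq) if i % 5 == 0]
    overall_rate = seq.count(value) / len(seq)
    fifth_rate = at_fifth.count(value) / len(at_fifth)
    return fifth_rate / overall_rate

# A forger who slips a 7 into every fifth slot, with filler elsewhere:
faked = [7 if i % 5 == 0 else (i * 37) % 10 for i in range(1000)]
print(fifth_position_excess(faked, 7))   # well above 1.0
```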
I’ve been interested in the Benford probabilities for ages, but note that they’re only good for the first digit, and the quantity has to be one whose first digit can be anything. You wouldn’t use Benford probabilities to determine if a list of refractive indices were faked, for instance, since almost all of them (in the visible) start with “1”. But you would expect populations of cities or lengths of rivers to have a Benford distribution of the first digit. (The probability of the first digit being n, by the way, is log((n + 1)/n), where we’re using common logs (base 10).)
Second, third, fourth etc. digits have 1/10 probability of each of the digits showing up.
I don’t think that this is correct. For the same reason that 1 as the first digit is more common than 2, 11 as the first two digits is more common than 12. 21 is more common than 22, etc. Some math is here.
Yes, the effect decreases as you move to less significant digits, but it never goes away completely. Though for most practical purposes it’s probably too small to be useful after the first few digits.
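A quick way to see both points: Benford’s law generalizes so that the probability of a number’s leading block of digits being d is log10(1 + 1/d), which makes “11…” more likely than “12…”, and the second-digit distribution close to, but not exactly, uniform. (The function name here is just for illustration.)

```python
import math

def leading_block_prob(d):
    """Benford probability that a number's leading digits form
    the block d (e.g. d=11 for numbers starting '11...')."""
    return math.log10(1 + 1 / d)

print(leading_block_prob(11))  # ~0.0378
print(leading_block_prob(12))  # ~0.0348
print(leading_block_prob(99))  # ~0.0044

# Marginal distribution of the *second* digit: close to 1/10, not exact.
second = {d2: sum(leading_block_prob(10 * d1 + d2) for d1 in range(1, 10))
          for d2 in range(10)}
print(second[0], second[9])    # ~0.1197 vs ~0.0850
```

As a sanity check, the two-digit block probabilities telescope: summing log10(1 + 1/d) over d = 10..99 gives exactly log10(100/10) = 1.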
IIRC, it is now believed that either Mendel or one of his assistant gardeners faked the data in his famous pea-plant experiments. The data were so close to theoretical expectations that obtaining such good results honestly would have been wildly improbable.
Depending on the data set and the expectation, there are a number of ways fraud could be discovered.