Is it possible to determine if a large set of numbers has been faked?
My thinking is that most people, if asked to choose a number at random, seem to select odd numbers for some reason. Also, I’ve read that most people underestimate the number of repetitions in a random sequence.
Can this information be used to assess a document’s authenticity? Are there any other clues to look for?
It would depend on the properties of the data set.
Is it large? What are the numbers ‘supposed’ to be? For example, if you ask me to check whether the doors on this floor are open (1) or closed (0), the fact that most doors are hydraulically shut, and that the sample size (30 or so doors) and range (0 or 1) are both tiny, means that I can just sit in my chair and come up with ‘reasonable’ fake data.
If you asked me to simulate 10,000 random numbers between 0 and 1000, then I’m confident my numbers wouldn’t be especially ‘random’.
Yes, if it’s faked badly. For numbers that can span several orders of magnitude (e.g. financial records), the first digit is more likely to be a small number: amounts beginning with 1 (1x, 1xx, 1xxx, etc.) are far more common than amounts beginning with 9 (9x, 99x, 999x, etc.). This is Benford’s law; here is a bit more info on it.
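A rough sketch of how you’d run that check, with made-up data: tally the leading digits and compare them to the Benford expectation with a chi-squared statistic (the function names and test data here are my own invention, just for illustration).

```python
import math
from collections import Counter

def first_digit(x):
    """Leading nonzero digit of a positive number."""
    return int(str(x).lstrip("0.")[0])

def benford_chi2(numbers):
    """Chi-squared statistic comparing observed leading-digit counts
    with the Benford probabilities log10(1 + 1/d)."""
    counts = Counter(first_digit(x) for x in numbers if x > 0)
    total = sum(counts.values())
    return sum((counts.get(d, 0) - total * math.log10(1 + 1 / d)) ** 2
               / (total * math.log10(1 + 1 / d))
               for d in range(1, 10))

# Uniformly distributed amounts make every leading digit equally common,
# which badly violates Benford, so the statistic comes out huge:
fake = list(range(100, 1000))
# Exponential growth (powers of 2) follows Benford closely:
growth = [2 ** k for k in range(1, 200)]
print(benford_chi2(fake), benford_chi2(growth))
```

A large statistic doesn’t prove fraud, of course; it just says the numbers don’t look like the kind of data Benford applies to, which is a reason to dig deeper.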
For some types of experiments and surveys, you can calculate the expected variation in results. If the reported data doesn’t conform to expectations, it’s a good reason to suspect some type of fraud, or at least a cleaning-up of the data. Mendel’s data on pea plants shows less variation than expected, for example, which indicates some kind of bias (e.g. throwing out data sets which don’t fit the model, or some wishful thinking creeping in when trying to decide whether a particular pea is wrinkled or not).
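To make the “less variation than expected” idea concrete, here’s a sketch with simulated Mendel-style data (the counts, helper name, and threshold are all hypothetical): replicate counts of dominant-trait peas should scatter around the 3:1 ratio with binomial variance n·p·(1−p), and data that hug the ratio too tightly give a variance ratio far below 1.

```python
import random

def variance_ratio(counts, n, p=0.75):
    """Observed variance of dominant-trait counts divided by the
    binomial expectation n*p*(1-p). Near 1.0 is normal; far below
    1.0 suggests trimming or wishful scoring."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return var / (n * p * (1 - p))

random.seed(1)
# 50 honest replicates of 100 peas each, true 3:1 ratio:
honest = [sum(random.random() < 0.75 for _ in range(100)) for _ in range(50)]
# 50 "replicates" that are all suspiciously close to exactly 75:
too_good = [75 + random.choice([-1, 0, 1]) for _ in range(50)]

print(variance_ratio(honest, 100))    # near 1.0
print(variance_ratio(too_good, 100))  # far below 1.0 -> red flag
```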
There are other ways. C.E.M. Hansel, in his book ESP and Parapsychology, demonstrated how faked data were detected in a series of test results because one number appeared anomalously often in every fifth position. Plenty of statistical methods can be used to detect anomalies; look in any book on statistics.
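A minimal sketch of that kind of positional check (the function and test sequence are my own, not from Hansel): in a genuinely random sequence no value should prefer particular positions, so compare a value’s frequency at every fifth position with its overall frequency.

```python
def fifth_position_excess(seq, value):
    """How much more often `value` appears at indices 0, 5, 10, ...
    than its overall rate predicts (a ratio; ~1.0 is unremarkable)."""
    at_fifth = [x for i, x in enumerate(seq) if i % 5 == 0]
    overall_rate = seq.count(value) / len(seq)
    fifth_rate = at_fifth.count(value) / len(at_fifth)
    return fifth_rate / overall_rate

# A forger who slips a 7 into every fifth slot, with filler elsewhere:
faked = [7 if i % 5 == 0 else (i * 37) % 10 for i in range(1000)]
print(fifth_position_excess(faked, 7))   # well above 1.0
```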
I’ve been interested in the Benford probabilities for ages, but note that they’re only good for the first digit, and the quantity has to be one whose first digit can be anything. You wouldn’t use Benford probabilities to determine if a list of refractive indices were faked, for instance, since almost all of them (in the visible) start with “1”. But you would expect populations of cities or lengths of rivers to have a Benford distribution of the first digit. (The probability of the first digit being n, by the way, is log((n + 1)/n), where we’re using common logs (base 10).)
Second, third, fourth etc. digits have 1/10 probability of each of the digits showing up.
I don’t think that this is correct. For the same reason that 1 as the first digit is more common than 2, 11 as the first two digits is more common than 12. 21 is more common than 22, etc. Some math is here.
Yes, the effect decreases as you move to less significant digits, but it never goes away completely. Though for most practical purposes it’s probably too small to be useful after the first few digits.
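A quick way to see both points: Benford’s law generalizes so that the probability of a number’s leading block of digits being d is log10(1 + 1/d), which makes “11…” more likely than “12…”, and the second-digit distribution close to, but not exactly, uniform. (The function name here is just for illustration.)

```python
import math

def leading_block_prob(d):
    """Benford probability that a number's leading digits form
    the block d (e.g. d=11 for numbers starting '11...')."""
    return math.log10(1 + 1 / d)

print(leading_block_prob(11))  # ~0.0378
print(leading_block_prob(12))  # ~0.0348
print(leading_block_prob(99))  # ~0.0044

# Marginal distribution of the *second* digit: close to 1/10, not exact.
second = {d2: sum(leading_block_prob(10 * d1 + d2) for d1 in range(1, 10))
          for d2 in range(10)}
print(second[0], second[9])    # ~0.1197 vs ~0.0850
```

As a sanity check, the two-digit block probabilities telescope: summing log10(1 + 1/d) over d = 10..99 gives exactly log10(100/10) = 1.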
IIRC, it is now believed that either Mendel or one of his assistant gardeners faked the data in his famous pea-plant experiments. The data were so close to theoretical expectations that obtaining such good results honestly would have been wildly improbable.
Depending on the data set and the expectation, there are a number of ways fraud could be discovered.