Somebody explain Benford's Law

While StumblingUpon (at 3AM the other night :wink: ) I came across this site. It explains Benford’s Law …the idea that the leading digit in a data series should follow a predictable pattern.

The Wiki is a bit short on its limitations, which is really what I’m after.

Seems to me if I were to calculate my average MPG per fill-up in my car, a two would be the leading digit. If I drive ALL highway I can hit 30+ MPG, yet driving like a bat out of hell I’d still at least get 20. I’d NEVER have an MPG with a leading digit of one.

Wiki says, *“A set of real-life data may not obey the law, depending on the extent to which the distribution of numbers it contains are skewed by the category of data.”* But what data ISN’T going to be skewed? How is this law applicable? The Wiki article hints that it depends on geometric progressions? Is this the case?

Well your example of average MPG is probably right out because you’re dealing with a very small data set averaged over a long period of time that is expected to be within a certain range.

I would guess that Benford’s law would apply OK for a data set of MPG averaged in 100 millisecond steps over an hour of city driving (assuming it’s not skewed to 0 by fuel injectors completely shutting off when coasting).

As I understand it, Benford’s law is mostly applicable to totals. Anything which involves adding up a number of samples whose digits are truly random will result in a total which follows Benford’s law.

It’s really a way of describing how the carry behavior of the add operation works in positionally-based number representation systems.

In almost all statistical scenarios, totals are involved. Certainly in any business accounting, almost every number of interest is an aggregate of transactions.

And of course anything involving multiplication is just repeated addition, so there are many physical processes whose measurements are second order effects &amp; hence Benford-susceptible.

And so it’s easy for people to assume Benford’s is applicable to all numbers in all situations. But it’s not really. A running total of your miles per tank would be Benford, even though the individual measurements would not.
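A quick numerical illustration of the multiplicative case (a Python sketch; the process simulated here is made up for illustration, not any particular physical measurement): multiply together many random factors and tally the leading digits of the products, then compare against the Benford prediction log10(1 + 1/d).

```python
import math
import random
from collections import Counter

def leading_digit(x):
    """First significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

random.seed(42)

# Each sample is a product of 50 random factors -- a crude stand-in
# for a multiplicative (second-order) process. The individual factors
# are uniform, i.e. nowhere near Benford on their own.
samples = []
for _ in range(20000):
    product = 1.0
    for _ in range(50):
        product *= random.uniform(0.5, 2.0)
    samples.append(product)

counts = Counter(leading_digit(x) for x in samples)
for d in range(1, 10):
    benford = math.log10(1 + 1 / d)
    print(f"{d}: observed {counts[d] / len(samples):.3f}, Benford {benford:.3f}")
```

The observed frequencies come out close to the Benford values, with “1” leading about 30% of the time.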

There are various places where this is treated at length. There was a great article in Scientific American (R.A. Raimi, “The Peculiar Distribution of First Digits”, Sci. Am. 221, 109-119, Dec. 1969), and it’s supposedly in Don Knuth’s classic “The Art of Computer Programming”. There was a pretty good treatment in the American Journal of Physics (J. Burke and E. Kincanon, AJP 59, 952 (1991)). For a quick-and-dirty treatment, you can always go to Wikipedia:

Basically, Benford’s law exists because you can break one big boulder into lots of little rocks.
If you take any distribution of numbers that can be truly random and make a histogram of the first digits – using populations of cities, or lengths of rivers, or whatever – you find that “1” appears as the first digit overwhelmingly – about 30% of the time, while the larger digits show up much less than the 11% of the time you naively expect them to.

As the Scientific American article points out, you can convince yourself that this is likely by graphing the fraction of the time each digit occurs as a function of the total length of the count. As you go through the first nine numbers, 1 through 9 are equally represented. But as soon as you hit “10” you get a stretch where “1” gets overrepresented. By the time you get to 19, “1” has been the first digit eleven times out of 19, or over half the time. Over the next ten numbers the fraction of "1"s gets smaller, since “20” through “29” start with “2”, but you can see that the fractions of the time that high numbers like “8” and “9” get to be the first digit still remain low. By the time you get to the "90"s, they finally catch up, but what then? Now you get to “100”, and “1” is the first digit for the next 100 numbers! And then the same cycle repeats: “9” doesn’t get to be the first digit until you get to the 900s, but then “1” gets to be the first digit for the next THOUSAND numbers. You can see that the larger digits just never get a chance to catch up.
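The count-upward argument is easy to verify by brute force (a small Python sketch; nothing here beyond what the text already says):

```python
def leading_one_fraction(n):
    """Fraction of the integers 1..n whose decimal form starts with 1."""
    ones = sum(1 for k in range(1, n + 1) if str(k)[0] == "1")
    return ones / n

# The fraction of leading 1s oscillates as you count upward:
# it sits at 1/9 just before each power of 10, then shoots past
# one half as the 10s, 100s, 1000s... roll in.
for n in [9, 19, 99, 199, 999, 1999, 9999]:
    print(n, round(leading_one_fraction(n), 3))
```

At n = 19 the fraction is 11/19 (the number 1 itself plus 10 through 19), drops back to 1/9 at 99, and spikes again at 199, and so on.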

Again, as the Sci Am article points out, the fact that the distribution doesn’t depend on which units the lengths of rivers are measured in tells you what the distribution must be. The only distribution for which that’s true is the one in which the probability of a digit’s being first is equal to its relative length on a slide rule, or P(n) = log (n+1)/log (n), so 1 shows up as the first digit log[sub]10[/sub]2 = 0.30103, or 30.1%, of the time. (As a corollary, you get the probability of the digit’s being first in other bases by using logs in that base system. “1” shows up as the first non-zero digit in binary 100% of the time – Duh. More interesting is that 1 shows up as the first digit in base 3 about 2/3 of the time.)

All of this is developed with more mathematical rigor in the AJP article.
The procedure won’t work with non-random tables, like refractive indices, or dielectric constants. But it’s a hoot to see the Benford probabilities emerge from your hand-made histograms of city populations or river lengths taken from any random atlas.

There’s nothing magic about base-10 representation of numbers: something similar would happen in every other base, except for binary (where the rule would be that every number starts with 1).

In base-n arithmetic, the probability of a random number starting with 1 would be (log 2)/(log n).

And, generally, in base-n arithmetic, the probability of a random number starting with digit d would be:

(log (d+1) - log d)/(log n).

No, it would be log[sub]n[/sub] (I mistakenly put the denominator as a separate log function in my post above). In general, the Benford probability for a number m in base n is

log[sub]n[/sub]((m + 1)/m)

I think your differently expressed formulas always give the same result as mine. Note that in mine I say “log” without giving the base of the logarithm, because in my formulas it doesn’t matter: you can use logarithms to any base, as long as it’s the same base for all logs in the formula.
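For the record, a few lines of Python confirm that the two ways of writing the formula agree (the function names here are made up for illustration):

```python
import math

def benford_diff(d, base):
    """Leading-digit probability via (log(d+1) - log d)/log(base)."""
    return (math.log(d + 1) - math.log(d)) / math.log(base)

def benford_ratio(d, base):
    """The same probability written as log_base((d+1)/d)."""
    return math.log((d + 1) / d, base)

# The two forms agree for every digit in every base checked,
# since (log a - log b)/log n = log_n(a/b).
for base in (3, 10, 16):
    for d in range(1, base):
        assert abs(benford_diff(d, base) - benford_ratio(d, base)) < 1e-12

print(benford_diff(1, 10))  # roughly 0.30103
```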

I came across Benford’s with StumbleUpon also, not very long ago. The above quoted bit is how I currently understand it. Use a different representation, get different values – there’s a relation between the percentages and the radix (base? I’m not sure that the terms are equivalent) of the counting system.

When counting in binary, a pathological case, every number (except 0 itself) leads with “1”, so it’d be very close to 100%. Though I don’t have the statistics on hand, I’d expect that counting in bases greater than 10, e.g., hex, would further exaggerate the prominence of the leading “1”.

And, on preview, I see that others have already made a similar point. And in much more detail and accuracy. D’oh!

In binary, the “1” would definitely show up 100% of the time as the first digit – once you eliminate the zero, there ain’t nothin’ else left!

As you go to bases larger than decimal, like base 12 or hexadecimal, the probability of 1 being the first digit decreases from 30.1%, as the above formulas indicate, but “1” will always be the most prominent digit, no matter what the base.
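A quick Python check of that claim (the list of bases is arbitrary):

```python
import math

# P(d) = log_n((d+1)/d): leading-digit probabilities in base n.
for base in [3, 8, 10, 12, 16]:
    probs = [math.log((d + 1) / d, base) for d in range(1, base)]
    assert probs[0] == max(probs)  # "1" is always the most common leading digit
    print(base, round(probs[0], 4))
```

The probability of a leading “1” shrinks as the base grows (about 63% in base 3, 30.1% in base 10, 25% in base 16), but in every base it stays ahead of all the other digits.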

Yeah…but because of the value 0 itself, you can never actually reach 100%. Pedantic, I know, but that’s what I was thinking when I said “very close to”.

It’s interesting…as I typed “further exaggerate the prominence”, I was thinking of the ratio between the probability percentage and the uniform distribution percentage. That is, even though the probability of a leading “1” decreases as the base increases, the ratio between that probability percentage and the uniform distribution percentage would increase as the base increases.

Now that I think about it a bit more, that’s intuitively correct. Benford’s is a log value, while the uniform distribution is simply a reciprocal. That is, as n increases, 1/n approaches 0, and log[sub]n[/sub]2 / (1/n) would approach infinity, thus my thought about it being “further exaggerated”.
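That limit is easy to see numerically (a Python sketch following the post’s 1/n convention for the uniform guess):

```python
import math

# Ratio of the Benford probability of a leading "1" in base n,
# log_n(2), to the uniform guess 1/n.  Algebraically this is
# n * ln 2 / ln n, which grows without bound.
prev = 0.0
for n in [10, 100, 1000, 10**6]:
    ratio = math.log(2, n) / (1 / n)
    print(n, round(ratio, 1))
    assert ratio > prev  # strictly growing across these bases
    prev = ratio
```

In base 10 the ratio is about 3; by base 100 it is about 15, and it keeps climbing from there.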

But then, that’s all just a silly digression that popped into my head.