Please ID this mathematical/statistical phenomenon: small numerals more common

Why do you say that? Benford’s law is quite literally the statement that numbers have logarithms modulo log(10) which are uniformly distributed.
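
To spell out the equivalence: x has leading digit d exactly when the base-10 logarithm of x, taken mod 1, lands in [log10(d), log10(d+1)); if those residues are uniformly distributed, that interval has probability log10(d+1) − log10(d) = log10(1 + 1/d), which for d = 1 is the familiar 30.1%.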

Now, as it turns out, data which is normally distributed (and thus not at all uniformly distributed) becomes very nearly uniformly distributed when you take its residue modulo a value which is not very high in comparison to its standard deviation. (In other words, the fractional component of normally distributed data is approximately uniformly distributed so long as the variance isn’t too low*).
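
Here’s a quick numerical sketch of that (numpy; the mean, sigmas, and bin count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

for sigma in [0.1, 0.5, 2.0]:
    x = rng.normal(loc=3.7, scale=sigma, size=100_000)
    frac = x % 1.0  # residue modulo 1
    # compare the histogram of the fractional part against the flat density 1
    hist, _ = np.histogram(frac, bins=10, range=(0, 1), density=True)
    print(sigma, round(abs(hist - 1.0).max(), 3))
```

Once the modulus (1 here) is small next to the standard deviation, the deviation from uniform collapses toward sampling noise.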

So log-uniform data satisfies Benford’s law exactly (though the condition needed for satisfying Benford’s law is not quite full log-uniformity; there’s a modulus involved). And log-normal data does not satisfy Benford’s law exactly, but Benford’s law will provide a good approximation if the standard deviation of the logarithm is significantly larger than log(10).
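
A sketch of that approximation in action (numpy again; sigma below is the standard deviation of the natural log, so log(10) ≈ 2.3 is the yardstick):

```python
import numpy as np

rng = np.random.default_rng(0)
digits = np.arange(1, 10)
benford = np.log10(1 + 1 / digits)  # Benford's predicted first-digit frequencies

for sigma in [0.3, 3.0]:  # std dev of ln(x), to be compared with log(10) ~ 2.3
    x = rng.lognormal(mean=0.0, sigma=sigma, size=200_000)
    lead = (x / 10.0 ** np.floor(np.log10(x))).astype(int)  # first digit of each sample
    freq = np.bincount(lead, minlength=10)[1:10] / len(x)
    print(sigma, round(abs(freq - benford).max(), 4))
```

The narrow distribution misses badly; the wide one matches Benford to within sampling error.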

[*: An excellent account of this fact via Fourier theory (and its applicability to explaining Benford’s Law) can be found in Chapter 34 of The Scientist and Engineer’s Guide to Digital Signal Processing, available here: http://www.dmae.upm.es/Webperson

In short, a random variable’s residue modulo some period is described by a periodic probability density function, whose Fourier series is given by the values of the random variable’s characteristic function at the multiples of the period. Thus, it will be uniformly distributed just in case the original variable’s characteristic function is zero at the nonzero multiples of the period.

In particular, if a random variable is normally distributed, then its characteristic function is Gaussian centered at 0, with dispersion parameter inversely proportional to the standard deviation (remember, we’re looking at the characteristic function, not the density function). Thus, so long as the standard deviation is reasonably large relative to the sampling period, the characteristic function decays so quickly that its values at the nonzero multiples of the sampling period will be nearly zero, and so the random variable’s residue modulo the sampling period will be nearly uniformly distributed.]
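
To put rough numbers on that decay (a small sketch; the period and sigmas are arbitrary choices): a normal variable’s characteristic function has magnitude exp(−σ²t²/2), so at the nonzero multiples of 2π/T:

```python
import numpy as np

T = 1.0  # the period we're reducing modulo
k = np.arange(1, 4)  # first few nonzero multiples

for sigma in [0.2, 0.5, 1.0]:
    # |characteristic function| at t = 2*pi*k/T, for a normal variable
    mag = np.exp(-0.5 * (sigma * 2 * np.pi * k / T) ** 2)
    print(sigma, mag)
```

Already at σ = T, the k = 1 coefficient is exp(−2π²) ≈ 3 × 10⁻⁹, so the wrapped density is uniform to within a few parts per billion.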

I crack up whenever I pass one of those. (Toronto) I picture them always having to explain it: “No, we’re the fifth Third Bank. Do you see? It’s like there are quite a few Third Banks, and we’re the fifth one to open. Let me put it another way…”

Or rather, I should say, if the standard deviation of the logarithm is large relative to log(10) [i.e., the standard deviation of the base 10 logarithm is large]. My previous wording implied that it specifically mattered whether this ratio was larger than 1. The larger the ratio, the better the approximation, but, as always, there’s no specific cut-off point…

But on the other hand, with a uniform distribution (of any sort, log or otherwise), you have to specify what range you’re uniform over (and if the answer is “over an infinite range”, then you should be prepared to explain the nonstandard mathematical framework you’re using to make that legal). And you could certainly contrive distributions which are log-uniform over some range but which do not follow Benford’s law.

Like I said, it’s the logarithms modulo log(10) which need to be uniformly distributed [over the finite range from 0 to log(10)]. No nonstandardness needed.

But I see your point that log-uniformity over an arbitrary finite range isn’t sufficient; if a log-uniform distribution over some range is to satisfy Benford’s law, that range needs to have length a multiple of log(10). [Or possibly an infinite range, using only a finitely additive concept of distribution.]

Still, Benford’s law is quite literally a claim that logarithms modulo log(10) are uniformly distributed.

ignore this post

I keep writing this thing, and then not being sure, and backing off… perhaps someday

Well, whatever. I’ll write it again, and then you can help me puzzle through whether it’s flawed:

And if data actually satisfied Benford’s law not just in base ten but in arbitrary bases (as would be expected if it were actually an intrinsic natural law and not somehow a cultural artifact), then its logarithm would have to be uniform modulo log(b) for arbitrarily large b, and thus uniform simpliciter over an infinite range, mathematical warts and all.

I’ll buy it.

Whoops, the full link was meant to be to here.

I didn’t say you should expect it. It just happens to be true. I once read a paper giving a strong argument why it should be so. I think you are a mathematician, so I will summarize the argument. If you plot, against n, the number of numbers up to n that begin with 1 (or 2, or 3, …), you obviously get a sawtooth, and the ratio to n does not approach a limit. So you do the obvious thing: you apply Cesàro summation to smooth it. It still doesn’t approach a limit, but it gives a distribution much closer to Benford’s law. So do it again: apply Cesàro summation to the second sequence. Much smoother, really good enough for Benford’s law, but it still doesn’t converge. Do it again. And again. Take the limit of the iterated Cesàro sums as the number of summations goes to infinity. That limit is precisely Benford’s law.
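
Here’s a quick numerical sketch of that procedure, if I’ve reconstructed it correctly (N and the number of smoothing rounds are arbitrary):

```python
import math

N = 10**6

def leading_digit(n: int) -> int:
    # strip digits until one remains
    while n >= 10:
        n //= 10
    return n

def cesaro(seq):
    # running averages: out[n] = (seq[0] + ... + seq[n]) / (n + 1)
    out, total = [], 0.0
    for i, s in enumerate(seq, start=1):
        total += s
        out.append(total / i)
    return out

# indicator that n has leading digit 1, for n = 1 .. N
a = [1.0 if leading_digit(n) == 1 else 0.0 for n in range(1, N + 1)]

p = cesaro(a)  # the sawtooth: proportion of numbers up to n with leading digit 1
print(0, p[-1])
for k in range(1, 5):
    p = cesaro(p)  # smooth the previous sequence again
    print(k, p[-1])
print("Benford:", math.log10(2))  # 0.30103..., the target for leading digit 1
```

Each pass flattens the sawtooth further, and the printed values settle toward log10(2).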

So it ultimately comes down to the fact that if you write a random list of numbers with no upper bound (or limit on the number of digits), then you likely get a Benford-type distribution, and the more numbers you write down, the likelier it is. At any rate, as an empirical law it works. Greece was cooking the books, and the distribution was unlikely.

Well since you’re practically begging us to find a flaw… :slight_smile:

I don’t have a problem with what you wrote logically, but I wonder if it’s useful. Much real data will satisfy Benford’s law in base 10, but most will not satisfy it in an arbitrarily large base. The height of trees measured in feet should satisfy Benford’s law, for example, but if you choose a large enough base, say base 1,000,000,000, it won’t. But that doesn’t mean tree heights are cultural artifacts.
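
A sketch of that point (numpy; the lognormal stand-in for the data, spanning a few orders of magnitude, is my invention):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in data: sigma = 2.5 is the std dev of ln(x), a spread of a few orders of magnitude
x = rng.lognormal(mean=3.0, sigma=2.5, size=200_000)

for b in [10.0, 1e9]:
    # Benford in base b <=> log_b(x) mod 1 is uniform on [0, 1)
    frac = (np.log(x) / np.log(b)) % 1.0
    hist, _ = np.histogram(frac, bins=10, range=(0, 1), density=True)
    print(b, round(abs(hist - 1.0).max(), 3))  # 0 would be a perfect fit
```

The same data passes the base-10 test but flunks base 10⁹: its logs in that base are bunched into a small fraction of one period.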

The base b implicitly sets a minimum spread for the data of at least a factor of b, and preferably several factors of b.

But most of those aren’t normal distributions.

Phone number area codes always had the middle digit as 0 or 1 (until recently), and the bigger cities had lower numbers, which dialed quicker (NYC=212, LA=213, Chi=312). So area codes tend to have more small digits. Then the exchanges (the middle 3 digits) were also originally given out starting with lower digits (but not 0 or 1). So the first areas to get phone service (usually the downtown business districts) tend to use lower digits in their numbers. Plus the many 1-800 numbers. Also, people/businesses can often choose the numbers they get, and they tend to go for lower digits.

And house addresses are given out semi-sequentially (usually jumps of 4 or 6 between each house on the same side of the street). But they usually start over at each block, and there are usually 10-12 houses on each side, so the last 2 digits usually don’t go past 50 or 60. So more of the smaller digits.

And even in accounting: 0 and 5 are more common than other digits in prices, money transfers, etc. Plus many items are priced at $x.99, which, with a few cents of sales tax added, becomes something ending in small digits.

So several of the examples given are NOT normally distributed. In fact, I’d guess that most numbers used in “everyday numerical data” are not normally distributed, but have some intended pattern behind them.

I specifically said phone numbers would not satisfy Benford’s law, so I’m not sure what you’re arguing about here.

I had thought about the jumps of (typically around here) six in addresses, but decided it wasn’t relevant. Benford’s law applies to the most-significant digit(s), not to the least-significant digit.

I could have made clear that I was thinking of street addresses over the entire country. Within a single address numbering system, it might fail, but over many systems, with different maximum address numbers, I expect it will hold. Presumably there’s a distribution of what that maximum is. There will be some largest value which may skew things a bit, but then for smaller maximum values, I’d expect a smooth distribution, with different cities and towns covering a large range. But the skipping of four or six numbers between addresses, or using even on one side of the street and odd on the other won’t matter.