The Zipf Mystery!

I don’t think this is mundane or pointless, but I must share it: Zipf’s Law.

Here is a slightly dry explanation from Wikipedia:

Here is a much more interesting video that talks about it and related phenomena like the Pareto Principle (the 80/20 rule): Vsauce.


One hundred and fifty views and no one else thinks this is as amazing and cool as I did? Check out the video if you don’t believe me!


Well I thought so. I guess about 270 other Dopers already knew about it and weren’t impressed. :smiley:

Before I view the video, tell me if it takes into account the most-used words in vernacular English, which are “youno,” “like,” “umm,” “uhhh,” “Imean,” and “well.” If not, it’s not all that valid a study.

Ummm no.

Written, not spoken, English.

Very cool. Thanks.

I’ve always thought Zipf’s Law was pretty neat, along with the somewhat related Benford’s Law. Stuff like this is fascinating.

It’s highly interesting but I already knew about it. I love that video and I showed it to quite a few people when I discovered it.

Is that kind of distribution found in places other than language? Like, with the weeds in my yard, does the most common weed appear about twice as often as the 2nd place weed, 3x as often as the 3rd place weed, etc.? That kind of distribution seems very well ordered and I would think it would appear elsewhere.

What about individual letters in a language (or maybe phonemes)? Does the most common one appear 2x as often as the 2nd most common, 3x as often as the 3rd most common, etc.?

Here is an article about how Zipf’s Law works for the sizes of a country’s cities, as well.

I don’t know about weeds, but the video says that the principle has been found to occur in many other areas.

Seems odd that the article doesn’t posit that Zipf’s law might be because of the 3/4s power law: that the decreased need for resources per person actually creates a feedback loop. (They do mention the immigration issue, which is also a feedback loop.)

Similarly, people use the words they hear. So there’s a feedback loop mechanism in there, too.

Now, of course, that doesn’t explain the exact frequencies, but a feedback loop does suggest why the increase would not be linear.

The Yule or Yule–Simon distribution naturally shows up as the result of, among other things, various stochastic processes that involve preferential attachment or similar “feedback” mechanisms. For example (as per Simon), if the probability that the next word written is a word that has already appeared a certain number of times is proportional to the total number of occurrences of all the words that have appeared that many times, and the probability that it is a new word is some constant, then one obtains this type of distribution. Similarly, if the growth of the population of a city (or type of weed…) is roughly proportional to the size of that city, then the city populations will conform to a Yule-type distribution. And Yule’s original example was the number of species found in biological genera, nothing to do with linguistics at all.
The applicability of Zipf’s Law to the frequencies of letters and phonemes (as opposed to words and phrases) might be explainable as the result of maximizing the amount of information transmitted by each symbol.
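
For anyone who wants to watch the feedback mechanism at work, here is a rough sketch of the Simon-style process described above. The function name `simulate_simon`, the new-word probability of 0.1, the text length and the seed are all my own illustrative choices, not anything from Simon’s paper. Repeating a word picked uniformly from the text so far is what makes a word’s chance of reuse proportional to how often it has already appeared.

```python
import random
from collections import Counter

def simulate_simon(n_words, p_new=0.1, seed=0):
    """Simon-style process (illustrative sketch, parameters are assumptions).

    At each step a brand-new word is added with probability p_new;
    otherwise an existing word is repeated, drawn uniformly from the
    text so far, so frequent words are proportionally more likely to recur.
    """
    rng = random.Random(seed)
    text = [0]            # the text starts with a single word, labelled 0
    next_label = 1
    for _ in range(n_words - 1):
        if rng.random() < p_new:
            text.append(next_label)        # introduce a new word
            next_label += 1
        else:
            text.append(rng.choice(text))  # repeat an earlier word
    return Counter(text)

counts = simulate_simon(50_000)
ranked = [count for _, count in counts.most_common()]

# Print count against rank; a heavy-tailed distribution shows a few very
# common words and a long tail of words used only once or twice.
for rank in (1, 2, 4, 8, 16, 32):
    print(rank, ranked[rank - 1])
```

Plotting `ranked` on log-log axes should give something close to a straight line over much of its range, though the exact slope depends on `p_new`.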

The video (yes, it’s 20 minutes long, but it’s really pretty interesting) goes into some of these questions.

Wow, that is one terrible, terrible article.

So, naturally, being an interested, scientifically inclined person, I tested this on my own country.

Sydney 4.6 million
Melbourne 4.2 million

Ok, so Australia is weird, and maybe we don’t count as one of “every country in the world”. Let’s try our nearest neighbor.

New Zealand:
Auckland: 1.5 million
Wellington: 400,000

Well, but that was a bit unfair, cherry-picking two of the most recently established countries on the planet. Maybe I should go for another, more long-lasting nearby country.

Jakarta 9.6 million
Surabaya 2.8 million

Um … somewhere in Europe maybe? Randomly pick my most recent holiday destination


This is not inspiring confidence!!

(The language thing is cool though. It’s only the extension to geography that I’m just a leeeetle bit suspicious of…)
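
The spot check above can be put in a few lines. This is only a sketch using the population figures quoted in this post (in millions); the dictionary layout and the helper name `first_to_second_ratio` are made up for illustration, and the "Indonesia" label is supplied because Jakarta and Surabaya are Indonesian cities.

```python
# Population figures in millions, exactly as quoted in the post above.
top_two = {
    "Australia": (4.6, 4.2),     # Sydney, Melbourne
    "New Zealand": (1.5, 0.4),   # Auckland, Wellington
    "Indonesia": (9.6, 2.8),     # Jakarta, Surabaya
}

def first_to_second_ratio(largest, second):
    """Ratio of the largest city to the second largest.

    The strict 1, 1/2, 1/3, ... reading of the rule predicts exactly 2.
    """
    return largest / second

ratios = {
    country: first_to_second_ratio(*populations)
    for country, populations in top_two.items()
}

for country, ratio in ratios.items():
    print(f"{country}: largest/second = {ratio:.2f} (strict rule predicts 2.00)")
```

Three countries and two cities each is obviously not a real test, but it shows how far the top-of-the-list ratios can sit on either side of 2.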

Zipf actually observed that the number of cities with population greater than x is approximately proportional to 1/x, which makes more sense and seems at least very roughly OK for both Australia and Indonesia (feel free to check other countries, as well as entire continents).
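
To connect this to the 1, 1/2, 1/3 version: if the count-above-x statement held exactly, the rank-size rule would follow from it. The lines below are an idealized sketch, not a claim about real data. The symbols N, C, r and x_r are my own notation: N(≥x) is the number of cities with population at least x, C is a constant, and x_r is the population of the r-th largest city.

```latex
% Assumption: the number of cities of size at least x is exactly C/x (an idealization).
N(\ge x) \approx \frac{C}{x}
\quad\Longrightarrow\quad
r = N(\ge x_r) \approx \frac{C}{x_r}
\quad\Longrightarrow\quad
x_r \approx \frac{C}{r},
\qquad\text{so}\qquad
\frac{x_1}{x_2} \approx 2, \quad \frac{x_1}{x_3} \approx 3 .
```

With only a handful of cities at the very top of the list, the approximation has the least data behind it exactly where the 2x and 3x ratios are read off, so it can be rough there even when the overall trend holds.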

That seems to be equivalent to a claim that city populations follow a power-law (Pareto) distribution rather than an exponential one: the number of cities with population greater than x falls off like x^(-k) for some value of k, with Zipf’s observation being k = 1. Which seems like a perfectly reasonable hypothesis, and one I’m happy to believe.

‘The parameter k has the same value for any country in the world’ is a somewhat stronger claim, and I’d be interested in seeing the evidence for it.

‘All countries in the world have their second and third cities roughly half and a third the size of their biggest city’ is a really really strong claim, and appears to be total rubbish.

Is Zipf’s pattern also true on a per-person basis? Although the overall language may have a certain word ranking, any given individual may have a different word ranking in their own personal writings. So if my writings or your writings were analyzed, would they exhibit the same pattern even though the individual words in the rankings might be different? What about authors? If the words in Stephen King’s books were ranked, would it follow the pattern? What about an unusual writer like E. E. Cummings? Or is Zipf’s pattern only something seen when the language as a whole is analyzed?
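
One way to try this on a single person’s writing is sketched below. It assumes you have the text as a plain string; the function name `rank_frequency_table` and the simple word-splitting regex are illustrative choices of mine, not a standard method. For each of the top words it shows the actual count next to the count an idealized Zipf curve would predict (top count divided by rank).

```python
import re
from collections import Counter

def rank_frequency_table(text, top=10):
    """Return (rank, word, count, zipf_prediction) rows for the top words.

    zipf_prediction is the most common word's count divided by the rank,
    which is what an idealized 1/rank law would give.
    """
    # Lowercase and keep runs of letters, allowing straight apostrophes inside words.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    most_common = Counter(words).most_common(top)
    if not most_common:
        return []
    top_count = most_common[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(most_common, start=1)
    ]

# Tiny made-up sample just to show the output shape; a real check needs
# many thousands of words from one author.
sample = "the the the the cat cat sat"
table = rank_frequency_table(sample)
for row in table:
    print(row)
```

Running the same function over one author’s collected text and over a large mixed corpus would let you compare how closely each follows the predicted column.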

Ah … How can I put this delicately … :smiley: