The Zipf Mystery!

commasense · October 17, 2018, 9:50pm

I don’t think this is mundane or pointless, but I must share it: Zipf’s Law.

Here is a slightly dry explanation from Wikipedia:

Wikipedia:

Zipf’s law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.[1]
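As a quick arithmetic check on the Brown Corpus figures quoted above, here is a small Python sketch that compares the three quoted counts with what a strict 1/rank law would predict from the top word. Only the three counts come from the quote; the variable names are illustrative.

```python
# Counts quoted above from the Brown Corpus (ranks 1, 2, 3).
counts = {"the": 69971, "of": 36411, "and": 28852}

top = counts["the"]
rows = []
for rank, (word, observed) in enumerate(counts.items(), start=1):
    predicted = top / rank   # strict Zipf: frequency proportional to 1/rank
    ratio = top / observed   # how many times more often "the" occurs
    rows.append((rank, word, observed, predicted, ratio))
    print(f"{rank} {word:>3} observed={observed} "
          f"predicted={predicted:.0f} top/observed={ratio:.2f}")
```

The ratio for "of" comes out close to 2, while the ratio for "and" is nearer 2.4 than 3, so the quoted numbers follow the law only approximately.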

Here is a much more interesting video that talks about it and related phenomena like the Pareto Principle (the 80/20 rule): Vsauce.

Enjoy!

commasense · October 18, 2018, 10:14am

One hundred and fifty views and no one else thinks this is as amazing and cool as I did? Check out the video if you don’t believe me!

Malleus_Incus_Stapes · October 18, 2018, 7:01pm

Cool.

commasense · October 18, 2018, 7:28pm

Well I thought so. I guess about 270 other Dopers already knew about it and weren’t impressed.

Musicat · October 18, 2018, 7:38pm

Before I view the video, tell me if it takes into account the most-used words in vernacular English, which are “youno,” “like,” “umm,” “uhhh,” “Imean,” and “well.” If not, it’s not all that valid a study.

commasense · October 18, 2018, 8:00pm

Ummm no.

Written, not spoken, English.

Attack_from_the_3rd_dimension · October 18, 2018, 8:37pm

Very cool. Thanks.

pulykamell · October 18, 2018, 9:26pm

I’ve always thought Zipf’s Law was pretty neat, along with the somewhat related Benford’s Law. Stuff like this is fascinating.

swampspruce · October 18, 2018, 9:28pm

It’s highly interesting but I already knew about it. I love that video and I showed it to quite a few people when I discovered it.

filmore · October 18, 2018, 9:36pm

Is that kind of distribution found in places other than language? Like, with the weeds in my yard, does the most common weed appear more than twice as often as the 2nd place weed, 3x as often as the 3rd place weed, etc.? That kind of distribution seems very well ordered and I would think it would appear elsewhere.

What about with individual letters in a language (or maybe phonemes)? Does the most common one appear 2x the 2nd most common, 3x the 3rd most common, etc.?

SingleMalt · October 18, 2018, 9:37pm

Here is an article about how Zipf’s Law works for the sizes of a country’s cities, as well.

commasense · October 19, 2018, 3:34am

I don’t know about weeds, but the video says that the principle has been found to occur in many other areas.

BigT · October 19, 2018, 4:06am

Seems odd that the article doesn’t posit that Zipf’s law might be because of the 3/4s power law: that the decreased need for resources per person actually creates a feedback loop. (They do mention the immigration issue, which is also a feedback loop.)

Similarly, people use the words they hear. So there’s a feedback loop mechanism in there, too.

Now, of course, that doesn’t explain the exact frequencies, but a feedback loop does suggest why the increase would not be linear.

DPRK · October 19, 2018, 10:04pm

The Yule or Yule–Simon distribution naturally shows up as the result of, among other things, various stochastic processes that involve preferential attachment or similar “feedback” mechanisms. For example (as per Simon), if the probability that the next word written is a word that has appeared a certain number of times is proportional to the total number of occurrences of all words that have appeared that many times, and the probability that it is a new word is some constant, then one obtains this type of distribution. Similarly, if the growth of the population of a city (or type of weed…) is roughly proportional to the size of that city, then the city populations will conform to a Yule-type distribution. And Yule’s original example was the number of species found in biological genera, nothing to do with linguistics at all.

The applicability of Zipf’s Law to the frequencies of letters and phonemes (as opposed to words and phrases) might be explainable as the result of maximizing the amount of information transmitted by each symbol.
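The preferential-attachment idea described above can be tried out with a short simulation. This is a minimal Python sketch in the spirit of Simon's model, not a faithful reproduction of it: the function name `simon_process`, the value `p_new=0.1`, and the stream length are all assumptions picked for illustration.

```python
import random
from collections import Counter

def simon_process(n_tokens, p_new=0.1, seed=0):
    """Generate a stream of word ids. With probability p_new the next token
    is a brand-new word; otherwise it repeats a uniformly chosen earlier
    token, so each word is reused in proportion to its count so far."""
    rng = random.Random(seed)
    tokens = [0]      # start with a single word, id 0
    next_id = 1
    for _ in range(n_tokens - 1):
        if rng.random() < p_new:
            tokens.append(next_id)
            next_id += 1
        else:
            tokens.append(rng.choice(tokens))
    return tokens

tokens = simon_process(50_000)
freqs = sorted(Counter(tokens).values(), reverse=True)

# Print counts at doubling ranks to eyeball how fast frequency falls off.
for rank in (1, 2, 4, 8, 16, 32):
    print(rank, freqs[rank - 1])
```

Picking a uniformly random earlier token is what supplies the feedback: a word that already has many occurrences is more likely to be picked again.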

commasense · October 21, 2018, 4:57pm

The video (yes, it’s 20 minutes long, but it’s really pretty interesting) goes into some of these questions.

Aspidistra · October 21, 2018, 8:47pm

Wow, that is one terrible, terrible article.

So, naturally being an interested scientifically inclined person, I test this on my own country.

Sydney 4.6 million
Melbourne 4.2 million
Fail

Ok, so Australia is weird, and maybe we don’t count as one of "every country in the world." Let’s try our nearest neighbor.

New Zealand:
Auckland: 1.5 million
Wellington: 400,000
Fail

Well, but that was a bit unfair, cherry picking two of the most recently established countries on the planet. Maybe I should go for another more long-lasting nearby country.

Indonesia:
Jakarta 9.6 million
Surabaya 2.8 million
Fail

Um … somewhere in Europe maybe? Randomly pick my most recent holiday destination

Iceland:
Reykjavik: 120,000
Kopavogur: 30,000
Fail

This is not inspiring confidence!!

(The language thing is cool though. It’s only the extension to geography that I’m just a leeeetle bit suspicious of…)
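The four spot checks above amount to comparing the second-to-first city ratio with the 0.5 that a strict rank-size rule would predict. A small Python sketch of that arithmetic follows; it uses only the populations quoted in the post, and the dictionary layout is just for illustration.

```python
# Populations as quoted in the post above: (largest city, second largest).
cities = {
    "Australia": (4_600_000, 4_200_000),
    "New Zealand": (1_500_000, 400_000),
    "Indonesia": (9_600_000, 2_800_000),
    "Iceland": (120_000, 30_000),
}

ratios = {country: second / first for country, (first, second) in cities.items()}
for country, ratio in ratios.items():
    print(f"{country}: second/first = {ratio:.2f} (strict rank-size rule: 0.50)")
```

None of the four lands near 0.5: Australia is about 0.91, and the other three fall between 0.25 and 0.30.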

DPRK · October 21, 2018, 9:05pm

Zipf actually observed that the number of cities with population greater than x is approximately proportional to 1/x, which makes more sense and seems at least very roughly OK for both Australia and Indonesia (feel free to check other countries, as well as entire continents)
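To see what "the number of cities with population greater than x is proportional to 1/x" looks like, here is a Python sketch on synthetic data. The city sizes are an idealized assumption (the rank-r city has population C / r), not real census figures, and `cities_larger_than` is a hypothetical helper.

```python
# Synthetic, idealized city sizes: the rank-r city has population C / r.
C = 1_000_000
populations = [C / r for r in range(1, 1001)]

def cities_larger_than(x, sizes):
    """Count how many cities have a population strictly greater than x."""
    return sum(1 for s in sizes if s > x)

# If N(>x) is proportional to 1/x, the product x * N(>x) should stay roughly flat.
for x in (200_000, 50_000, 10_000, 2_500):
    n = cities_larger_than(x, populations)
    print(f"x={x:>7}  N(>x)={n:>3}  x*N={x * n}")
```

On this idealized data the product x * N stays near C and drifts closer to it as x gets smaller. Real city lists can be dropped into `populations` to run the same check.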

Aspidistra · October 21, 2018, 9:43pm

That seems to be equivalent to a claim that population distributions follow a generally exponential distribution: f(x) = exp(-kx) for some value of k. Which seems like a perfectly reasonable hypothesis, and one I’m happy to believe.

‘The parameter k has the same value for any country in the world’ is a somewhat stronger claim, and I’d be interested in seeing the evidence for it.

‘All countries in the world have their second and third cities roughly half and a third the size of their biggest city’ is a really really strong claim, and appears to be total rubbish.

filmore · November 8, 2018, 6:53pm

Is Zipf’s pattern also true on a per-person basis? Although the overall language may have a certain word ranking, any given individual may have a different word ranking in their own personal writings. So if my writings or your writings were analyzed, would they exhibit the same pattern even though the individual words in the rankings might be different? What about authors? If the words in Stephen King’s books were ranked, would it follow the pattern? What about an unusual writer like E. E. Cummings? Or is Zipf’s pattern only something seen when the language as a whole is analyzed?
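One way to test the per-author question is to rank the words in a single writer's text and look at the slope of log(frequency) against log(rank). Below is a minimal Python sketch under stated assumptions: the tokenizer is a crude regex, the sample string is a placeholder rather than real data, and `rank_frequency` and `zipf_slope` are hypothetical helper names.

```python
import math
import re
from collections import Counter

def rank_frequency(text):
    """Return word counts sorted from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)

def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).
    The simple form of Zipf's law predicts a value near -1."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Tiny placeholder text; substitute one author's collected writing.
sample = "the cat and the dog and the bird saw the cat"
freqs = rank_frequency(sample)
print(freqs, zipf_slope(freqs))
```

A sample this small says nothing about any real author; the slope only becomes meaningful on a large body of text from one writer.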

Jasmine · November 8, 2018, 6:59pm

Ah … How can I put this delicately …
