Before I view the video, tell me if it takes into account the most-used words in vernacular English, which are “youno,” “like,” “umm,” “uhhh,” “Imean,” and “well.” If not, it’s not all that valid a study.
Is that kind of distribution found in places other than language? Like, with the weeds in my yard, does the most common weed appear about twice as often as the 2nd place weed, 3x as often as the 3rd place weed, etc.? That kind of distribution seems very well ordered and I would think it would appear elsewhere.
What about individual letters in a language (or maybe phonemes)? Does the most common one appear 2x as often as the 2nd most common, 3x as often as the 3rd most common, etc.?
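For anyone who wants to try this on letters (or weeds, or anything countable), here is a minimal sketch. The function name `rank_ratios` and the sample sentence are made up for illustration; a real test would need a much larger sample. It prints, for each of the top few symbols, how many times more often the most common symbol appears. Under an ideal Zipf pattern that ratio would be close to the rank.

```python
from collections import Counter

def rank_ratios(symbols, top=5):
    """Return (symbol, count, top_count / count) for the `top` most common symbols.

    Under an ideal Zipf distribution the ratio at rank r would be close to r.
    """
    counts = Counter(symbols).most_common(top)
    if not counts:
        return []
    top_count = counts[0][1]
    return [(sym, n, top_count / n) for sym, n in counts]

# Toy input, invented for illustration; substitute any real text.
sample = "the quick brown fox jumps over the lazy dog and the cat"
letters = [ch for ch in sample.lower() if ch.isalpha()]
for rank, (sym, n, ratio) in enumerate(rank_ratios(letters), start=1):
    print(rank, sym, n, round(ratio, 2))
```

The same function works on a list of words, phonemes, or weed species, since it only counts items.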
Seems odd that the article doesn’t posit that Zipf’s law might be because of the 3/4s power law: that the decreased need for resources per person actually creates a feedback loop. (They do mention the immigration issue, which is also a feedback loop.)
Similarly, people use the words they hear. So there’s a feedback loop mechanism in there, too.
Now, of course, that doesn’t explain the exact frequencies, but a feedback loop does suggest why the increase would not be linear.
The Yule or Yule–Simon distribution naturally shows up as the result of, among other things, various stochastic processes that involve preferential attachment or similar “feedback” mechanisms. For example (as per Simon), if the probability that the next word written is a word that has already appeared a certain number of times is proportional to the total number of occurrences of all the words that have appeared that many times, and the probability that it is a new word is some constant, then one obtains this type of distribution. Similarly, if the growth of the population of a city (or type of weed…) is roughly proportional to the size of that city, then the city populations will conform to a Yule-type distribution. And Yule’s original example was the number of species found in biological genera, nothing to do with linguistics at all.
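The word-generation process described above is easy to simulate. The following is only a sketch under stated assumptions: the function name `simulate_simon`, the value `p_new=0.1`, and the use of integer labels as “words” are all choices made for illustration. Copying a uniformly chosen earlier position in the text means a word that has already appeared k times is k times as likely to be picked as a word that has appeared once, which is the feedback mechanism in question.

```python
import random
from collections import Counter

def simulate_simon(n_words, p_new=0.1, seed=0):
    """Generate a sequence of word labels by a Simon-style process.

    With probability p_new the next word is brand new; otherwise it is a
    copy of a uniformly chosen earlier position in the text.
    """
    rng = random.Random(seed)
    text = [0]        # words are integer labels; start with a single word
    next_label = 1
    while len(text) < n_words:
        if rng.random() < p_new:
            text.append(next_label)   # introduce a new word
            next_label += 1
        else:
            text.append(rng.choice(text))  # repeat an earlier word
    return text

counts = Counter(simulate_simon(50_000))
# Counts of the ten most frequent words, most frequent first.
print([n for _, n in counts.most_common(10)])
```

The output is heavily skewed toward a few early words. How closely it tracks an exact 1, 1/2, 1/3 pattern depends on `p_new` and on the random seed.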
The applicability of Zipf’s Law to the frequencies of letters and phonemes (as opposed to words and phrases) might be explainable as the result of maximizing the amount of information transmitted by each symbol.
So, naturally, being an interested, scientifically inclined person, I test this on my own country.
Sydney 4.6 million
Melbourne 4.2 million Fail
OK, so Australia is weird, and maybe we don’t count as one of “every country in the world.” Let’s try our nearest neighbor.
New Zealand:
Auckland: 1.5 million
Wellington: 400,000 Fail
Well, but that was a bit unfair, cherry-picking two of the most recently established countries on the planet. Maybe I should go for another, more long-lasting nearby country.
Indonesia:
Jakarta 9.6 million
Surabaya 2.8 million Fail
Um … somewhere in Europe, maybe? Randomly pick my most recent holiday destination.
Iceland:
Reykjavik: 120,000
Kopavogur: 30,000 Fail
This is not inspiring confidence!!
(The language thing is cool though. It’s only the extension to geography that I’m just a leeeetle bit suspicious of…)
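To put numbers on those “Fail” verdicts, here is a small sketch that computes the second-city to first-city ratio from the figures quoted above. The populations are copied from this post, not from any census source, and the helper name `second_to_first` is made up for illustration. The strict rank-size reading would predict a ratio of 0.50.

```python
# Populations as quoted in the post above: (largest city, second city).
cities = {
    "Australia": (4_600_000, 4_200_000),
    "New Zealand": (1_500_000, 400_000),
    "Indonesia": (9_600_000, 2_800_000),
    "Iceland": (120_000, 30_000),
}

def second_to_first(largest, second):
    """Ratio of second-largest to largest city; strict rank-size predicts 0.5."""
    return second / largest

for country, (largest, second) in cities.items():
    ratio = second_to_first(largest, second)
    print(f"{country}: {ratio:.2f} (strict rank-size rule predicts 0.50)")
```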
Zipf actually observed that the number of cities with population greater than x is approximately proportional to 1/x, which makes more sense and seems at least very roughly OK for both Australia and Indonesia (feel free to check other countries, as well as entire continents).
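A rough way to check the “number of cities above x” version is to count cities over several thresholds and see whether count times threshold stays about constant. The sketch below uses synthetic, idealised populations (an assumption for illustration, not real data) in which the city of rank r has population `largest / r`. The helper name `count_above` is also made up.

```python
def count_above(populations, x):
    """Number of cities with population strictly greater than x."""
    return sum(1 for p in populations if p > x)

# Synthetic, idealised data (an assumption, not census figures):
# the city of rank r has population largest / r.
largest = 1_000_000
populations = [largest / r for r in range(1, 1001)]

for x in (400_000, 90_000, 9_000, 1_500):
    n = count_above(populations, x)
    # If the count is proportional to 1/x, n * x should stay near `largest`.
    print(x, n, n * x)
```

Even with perfectly idealised data, the product is furthest off at the highest threshold, where only a couple of cities are counted. That suggests the top two or three cities of a country are a noisy test of the pattern. With real city lists the same loop applies unchanged.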
That seems to be equivalent to a claim that city populations follow a power-law (Pareto-type) distribution: the number of cities with population greater than x goes as x^(-k) for some value of k (k = 1 in Zipf’s version). Which seems like a perfectly reasonable hypothesis, and one I’m happy to believe.
‘The parameter k has the same value for any country in the world’ is a somewhat stronger claim, and I’d be interested in seeing the evidence for it.
‘All countries in the world have their second and third cities roughly half and a third the size of their biggest city’ is a really really strong claim, and appears to be total rubbish.
Is Zipf’s pattern also true on a per-person basis? Although the overall language may have a certain word ranking, any given individual may have a different word ranking in their own personal writings. So if my writings or your writings were analyzed, would they exhibit the same pattern even though the individual words in the rankings might be different? What about authors? If the words in Stephen King’s books were ranked, would they follow the pattern? What about an unusual writer like E. E. Cummings? Or is Zipf’s pattern only something seen when the language as a whole is analyzed?
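One crude way to test a single author is to rank the words in their text and fit a straight line to log(frequency) against log(rank). A slope near -1 would be consistent with Zipf’s pattern. The sketch below is a hypothetical helper, not a proper statistical test: the name `zipf_slope` and the simple tokenising regex are assumptions, and an ordinary least-squares fit on a log-log plot is known to be a rough estimate at best.

```python
import math
import re
from collections import Counter

def zipf_slope(text):
    """Least-squares slope of log(frequency) against log(rank) for words in text."""
    # Very simple tokeniser: runs of lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    freqs = [n for _, n in Counter(words).most_common()]
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

Running it separately on texts by different writers, and on a mixed corpus, would give a first impression of whether the pattern holds per person or only in aggregate.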