Also norange, although I think the change happened in French before the word was borrowed.
I found an online list of Spanish words here. It has about 175,000 words. Spanish has a much more phonetic spelling system than English. I ran my program on it, this time sorting on the third column, since I think you’re more interested in the frequency distribution of initial letters than in which letters are disproportionately initials. The results are below. I don’t know what character encoding is used for the six non-ASCII letters. It’s not UTF-8; each character is one byte. So I just printed the non-ASCIIs as the hex value of the byte.
letter overall initial overall/initial
a 12.59% 11.88% 1.0594
d 3.48% 11.62% 0.2997
c 4.05% 11.61% 0.3492
e 9.26% 11.22% 0.8255
r 8.11% 8.27% 0.9808
p 1.95% 7.74% 0.2523
i 7.08% 5.80% 1.2209
s 9.14% 5.07% 1.8039
t 3.90% 4.57% 0.8518
m 3.00% 3.74% 0.8017
f 0.86% 2.87% 0.3018
b 2.30% 2.80% 0.8220
v 0.93% 2.56% 0.3628
l 2.87% 2.07% 1.3854
o 6.02% 1.88% 3.2066
h 0.49% 1.77% 0.2785
g 1.18% 1.59% 0.7453
n 5.47% 0.93% 5.9104
u 2.73% 0.62% 4.4199
j 0.73% 0.60% 1.2245
q 0.23% 0.32% 0.7199
z 0.60% 0.25% 2.3476
y 0.11% 0.10% 1.0900
e1 1.01% 0.05% 21.7322
e9 0.66% 0.02% 29.5652
f3 0.29% 0.02% 15.7532
k 0.00% 0.01% 0.1659
x 0.20% 0.01% 13.4032
fa 0.02% 0.01% 1.7810
w 0.00% 0.00% 0.1537
ed 1.04% 0.00% 259.9015
f1 0.11% 0.00% 189.6382
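For anyone who wants to reproduce a table like this, here is a minimal sketch. It is not markn's actual program. It assumes the second column is each letter's share of all letters in the list, the third is its share of word-initial letters, and the fourth is column 2 divided by column 3 (which matches the printed numbers up to rounding). The function name `letter_table` and the sample words are made up for illustration.

```python
from collections import Counter

def letter_table(words):
    """Return rows (letter, overall share, initial share, overall/initial),
    sorted by initial share, largest first."""
    words = [w.strip().lower() for w in words if w.strip()]
    overall = Counter(ch for w in words for ch in w)   # every character in every word
    initial = Counter(w[0] for w in words)             # first character of each word
    total_letters = sum(overall.values())
    total_words = len(words)
    rows = []
    for letter, count in overall.items():
        overall_share = count / total_letters
        initial_share = initial[letter] / total_words
        # A letter that never starts a word has no finite ratio.
        ratio = overall_share / initial_share if initial_share else float("inf")
        rows.append((letter, overall_share, initial_share, ratio))
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows

# Tiny made-up word list, only to show the output shape.
for letter, overall_share, initial_share, ratio in letter_table(["casa", "cosa", "asa", "saco"]):
    print(f"{letter} {overall_share:.2%} {initial_share:.2%} {ratio:.4f}")
```

Note that this counts every character, so hyphens or apostrophes in a word list would show up as rows too unless they are filtered out first.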
Thanks markn. Those 6 other letters are in Latin-1 (ISO 8859-1) and you probably could have just printed them directly and gotten the expected characters.
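A quick sketch of what that decoding looks like, assuming the file really is Latin-1 as stated (Windows-1252 would give the same result for these six byte values). The hex values are the ones printed in the table above, in the order they appear there.

```python
# Byte values printed as hex in the table, in table order.
raw = bytes([0xE1, 0xE9, 0xF3, 0xFA, 0xED, 0xF1])

# Latin-1 maps each byte 0x00-0xFF to the Unicode code point with the same number.
decoded = raw.decode("latin-1")
print(decoded)  # áéóúíñ

# The same six letters take two bytes each in UTF-8, which is why a
# one-byte-per-character file cannot be UTF-8.
print(len(decoded.encode("utf-8")))  # 12
```

When reading the whole word list, passing `encoding="latin-1"` to `open()` would do the same conversion for every line.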
As for the results, they are also neither Zipf nor Pareto. This is not unexpected, at least not by me.
Of course Zipf’s Law won’t hold for alien races. It doesn’t hold for anything. It can’t, because the harmonic series diverges. Though if you limit the scope, there are finite domains where it’s a half-decent approximation.
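To make the divergence point concrete, here is a small sketch. It takes "Zipf" to mean the classic form where the share of rank k is proportional to 1/k; the function names are made up for illustration.

```python
import math

def harmonic(n):
    """Partial sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def zipf_shares(n):
    """Ideal Zipf shares for n ranked items: share(k) = (1/k) / H_n."""
    h = harmonic(n)
    return [1.0 / (k * h) for k in range(1, n + 1)]

# The normalizing constant keeps growing, roughly like ln(n), so no constant
# makes the 1/k weights sum to 1 over infinitely many ranks.
for n in (10, 1_000, 100_000):
    print(n, round(harmonic(n), 3), round(math.log(n), 3))

# With a finite number of ranks the shares are well defined.
shares = zipf_shares(32)  # 32 = number of rows in the Spanish table above
print([f"{s:.1%}" for s in shares[:4]])
```

By construction the ideal first share is twice the second and three times the third. For comparison, the top four initial-letter shares in the Spanish table are 11.88%, 11.62%, 11.61%, and 11.22%, which is nowhere near that kind of drop.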
How is X not at the absolute top of the list of “infrequent initials”? It’s not all that uncommon in the interiors of words, but almost impossible at the start.
Oh, and is that frequency in use, or in the dictionary?
When I was in grade school, I figured out that there are three names for stupid people that start with N: nitwit, numbskull, and nincompoop.
This is frequency in my dictionary, the one that comes with MacOS. In this dictionary, there are 293 words starting with X, and 6348 total instances of X (in 6323 words), so initial X’s are 4.6% of the total X’s. By comparison, there are 6098 words starting with N, and 143,656 instances of N (in 108,769 words), so initial N’s are 4.2% of the total N’s. This isn’t exactly what I calculated earlier, but it shows that there is a higher percentage of words with multiple N’s than multiple X’s, so that contributes.
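The three counts per letter can be sketched like this. The helper `letter_stats` is hypothetical, and the arithmetic at the bottom just rechecks the figures quoted in the post.

```python
def letter_stats(words, letter):
    """Return (words starting with letter, total instances, words containing it)."""
    letter = letter.lower()
    words = [w.lower() for w in words]
    initial = sum(1 for w in words if w.startswith(letter))
    instances = sum(w.count(letter) for w in words)
    containing = sum(1 for w in words if letter in w)
    return initial, instances, containing

# Share of each letter's occurrences that are word-initial, from the quoted counts.
print(f"{293 / 6348:.1%}")       # 4.6%
print(f"{6098 / 143656:.1%}")    # 4.2%

# Average instances per word that contains the letter.
print(f"{6348 / 6323:.3f}")      # 1.004
print(f"{143656 / 108769:.3f}")  # 1.321
```

The last two lines show the multiple-letter effect: a word containing N has about 1.3 N's on average, while a word containing X almost never has a second one. To run `letter_stats` on a real list, pass it the lines of a word file; on macOS that would typically be `/usr/share/dict/words`, though whether that is the exact file used here is an assumption.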
This is why: we were taught not to use the N-words…
I think X is not at the top because it’s not that common in general. It’s very rare at the beginning of words, but also rare elsewhere. Y is a much more common letter, especially at the end of words.
Good thing there are no infinite alphabets then.
I just discovered that Wikipedia has a table of first letter frequencies:
If you sort the dictionary list, it does look a bit more like a Zipf distribution than what markn posted above. Notably, P is in third place with a significantly lower fraction and C moved up to second with a somewhat greater fraction. However, it’s still not that great a match for the classic Zipf distribution. The distribution for first letters from texts is much closer to Zipf.
Thanks for linking to that interesting video.
If you think about it, that’s what really matters. The dictionary just includes words once each, as long as their usage in conversation or writing is greater than some smallish, nonzero frequency.
Nematodes more than make up for that in frequency terms…
In short, if all the matter in the universe except the nematodes were swept away, our world would still be dimly recognizable, and if, as disembodied spirits, we could then investigate it, we should find its mountains, hills, vales, rivers, lakes, and oceans represented by a film of nematodes. The location of towns would be decipherable since, for every massing of human beings, there would be a corresponding massing of certain nematodes. Trees would still stand in ghostly rows representing our streets and highways. The location of the various plants and animals would still be decipherable, and, had we sufficient knowledge, in many cases even their species could be determined by an examination of their erstwhile nematode parasites.
-Nathan Cobb
@Filbert. You have some very weird friends.
Very cool animals, super colorful.
Nurse shark - all I got.
Snout, snoot, snort, sniff, sneeze, snot…
Well I’ll be doggoned.
I’ve been authorized by Spiro Agnew to strenuously disagree with the premise of the OP.
Strong words from a guy whose name is an anagram of “grow a penis.”
Looking at the second post, one wonders why Wheel of Fortune contestants rarely seem to choose the letters P, C, M and B in the bonus round (where they are given NESTLR, which they didn’t use to be).