ETAONI SHRDLU

BVCC · March 23, 2001, 8:39pm

Dear Cecil,

In reporting the result of his letter-counting project, your correspondent Skip Newhall says, “You will notice that the order of n and i is reversed from that in etaoin, though the counts differ by less than 0.2 percent, a statistically insignificant difference.” As you no doubt noticed, he’s wrong. With a sample this large, virtually any difference is bound to be statistically significant. In fact, the P-value is something less than 0.0000001. That is (assuming his sampling design is sound), we can be reasonably certain that the observed difference in the two letters represents an actual difference in the English language. If n and i in fact occur at the same rate in English, we would find n ahead of i by as much as it was in Skip’s sample less than 1 time in 10 million. If i is in fact more common than n, Skip’s sample is correspondingly less likely.

BVCC
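
For anyone who wants to see why sample size matters so much to the point above, here is a rough sketch of the arithmetic. The counts are hypothetical (the thread does not give Skip's actual totals), and the test treats each letter as an independent draw, which real text is not. It asks: if n and i were equally likely, how often would n lead i by at least this much?

```python
import math

def one_sided_p_value(count_a: int, count_b: int) -> float:
    """P(a leads b by at least this much) if both letters are equally likely.

    Uses the normal approximation to a binomial with p = 0.5, conditioning
    on the combined count of the two letters.
    """
    total = count_a + count_b
    # Under the null hypothesis, (count_a - count_b) has variance `total`.
    z = (count_a - count_b) / math.sqrt(total)
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical counts: in both cases n leads i by 0.2 percent.
small_sample = one_sided_p_value(10_020, 10_000)
large_sample = one_sided_p_value(20_040_000, 20_000_000)

print(f"small sample: p = {small_sample:.3f}")
print(f"large sample: p = {large_sample:.2e}")
```

The same 0.2 percent lead is unremarkable with about ten thousand of each letter and falls well below 1 in 10 million with about twenty million of each.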

yabob · March 23, 2001, 9:50pm

In fact, assuming the sampling design is sound is a BIG question mark, IMO. I’ll bet that letter distribution would vary drastically with source - comparing classic literature versus popular magazine copy loaded with modern-day trade names, for instance (lots more x’s). Or current computer trade rags - lots of j’s from java and acronyms derived from java-this and java-that. How you are going to weight all those sources to arrive at “average” English usage is questionable.

In fact, the frequencies from the “Brown Corpus”, often used as a sample for things like this, come out as:

etaoin srhldcu

according to this source:

http://lists.village.virginia.edu/lists_archive/Humanist/v05/0181.html

Arnold_Winkelried · March 23, 2001, 9:56pm

Welcome to the SDMB, and thank you for posting your comment.

Please include a link to Cecil’s column if it’s on the straight dope web site. To include a link, it can be as simple as including the web page location in your post (make sure there is a space before and after the text of the URL).

Cecil’s original column can be found on-line at this link:
What’s the origin of the mysterious phrase “etaoin shrdlu”?

The follow-up can be found here: Another note regarding ETAOIN SHRDLU (near the bottom of the page)

moderator, «Comments on Cecil’s Columns»

tracer · March 24, 2001, 8:51pm

Arnold Winkelried wrote:

I’m surprised Cecil didn’t notice something important about the, ahem, “English-language” works selected for survey by Skip Newhall. They included:

- Crime and Punishment (Fyodor Dostoevsky)
- The Iliad, The Odyssey (Homer)
- Peer Gynt (Henrik Ibsen)
- Faust (Johann Wolfgang von Goethe)
- History of the Peloponnesian War (Thucydides)
- The Forged Coupon (Leo Tolstoy)
- Several writings of Karl Marx
- Don Quixote (Miguel de Cervantes)
- Works by Plato and Virgil

… none of which were originally written in English!

Arnold_Winkelried · March 25, 2001, 7:10pm

I noticed that also, tracer, but one could presumably assume that the translators had a sufficient command of the English language that the translation can be considered to be “normal” English writing. Of course, there would be some foreign patronyms in the books that could somewhat skew the results, but in a long book such as “Crime and Punishment” the ratio of foreign patronyms to English words would probably be tiny.

The main problem I saw with the sampling is that, from the sources mentioned, it seems pretty clear that this person wrote a computer program for the letter count, and used sources available on the internet: classic novels which can be found for example at the Gutenberg project, and then newspaper and magazine articles. But I would guess that modern novels are under-represented in the sample since those could not be found on the internet.
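
A letter-count program of the kind guessed at above takes only a few lines. This is a minimal sketch and not Skip Newhall's actual program, which the thread does not show. The function name and the sample sentence are made up for illustration, and only the 26 unaccented letters are counted.

```python
from collections import Counter

def letter_ranking(text: str) -> str:
    """Return the letters found in text, most frequent first."""
    # Fold to lower case and ignore anything that is not a-z.
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ranked)

sample = "the quick brown fox jumps over the lazy dog"
print(letter_ranking(sample))
```

Run over a large enough pile of text, the first dozen characters of the result are what gets compared against "etaoin shrdlu". Which pile of text you feed it is, of course, the whole argument.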

tracer · March 26, 2001, 11:20pm

I’ve written a few novel-length stories and posted them on my website, if Skip Newhall wants to use them as representative literary samples.

raoulortega · March 27, 2001, 7:29pm

Not only can you gather statistics on pairs, but you can, through tree structures, gather stats on much longer character strings. Twenty-five years ago I wrote a program in Fortran on a PDP-10 that went to 7 or 8 character strings. (It ran out of memory, which if I remember correctly was 512K, with 10Meg hard drives).

Then you write a second program which takes these probability tables as input, and generates random text. Unlike the “million monkeys at a million typewriters”, this program can actually generate snippets of recognizable text.

You have to include space as a valid character, by the way, in order to get words to begin and terminate properly. You can also include punctuation, too.
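
The two-program idea described above can be sketched in one short script. This is not the original Fortran. It swaps the tree structure for a plain dictionary keyed by the preceding characters, and the names `build_table` and `generate` are invented for this example. Spaces and punctuation are kept as ordinary characters, as suggested, and `order` is assumed to be at least 1.

```python
import random
from collections import Counter, defaultdict

def build_table(text: str, order: int) -> dict:
    """Map each `order`-character context to counts of the next character."""
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        table[context][text[i + order]] += 1
    return table

def generate(table: dict, order: int, length: int, seed: int = 0) -> str:
    """Produce up to `length` characters by sampling from the table."""
    rng = random.Random(seed)
    # Start from a randomly chosen context that occurred in the source.
    out = rng.choice(sorted(table))
    while len(out) < length:
        followers = table.get(out[-order:])
        if not followers:
            break  # this context only appeared at the very end of the source
        chars = list(followers)
        weights = [followers[c] for c in chars]
        # Pick the next character in proportion to how often it followed.
        out += rng.choices(chars, weights=weights)[0]
    return out[:length]

source = "the cat sat on the mat and the rat sat on the cat "
table = build_table(source, order=3)
print(generate(table, order=3, length=60))
```

With a source this tiny the output mostly echoes the input. Fed a whole novel and a longer context, the same approach starts producing the recognizable snippets mentioned above.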

ndorward · March 29, 2001, 5:48am

I take it that raoulortega was attempting a version of Rubenking’s BREKDOWN algorithm for the creation of new texts out of a preexistent corpus. The curious may find an article by John Tranter at:

http://www.austlit.com/johntranter/prose/nonfiction/brekdown.html

which discusses the algorithm & provides two rather entertaining samples of its work–new poems created in the style of Matthew Arnold & John Ashbery. --N

GalaciDaliciducleic_Acid · April 9, 2001, 2:40am

In one of those interesting little coincidences, I’m just reading an early short story by the irrepressible Thomas Pynchon, who is known for giving his characters fabulous names like Tyrone Slothrop, Oedipa Maas, Clayton “Bloody” Chiclitz, and Blodgett Waxwing. This one features a character named Etienne Cherdlu, who is, moreover, given to signing his name “80N”.

Just thought it was cute.
