In reporting the result of his letter-counting project, your correspondent Skip Newhall says, “You will notice that the order of n and i is reversed from that in etaoin, though the counts differ by less than 0.2 percent, a statistically insignificant difference.” As you no doubt noticed, he’s wrong. With a sample this large, virtually any difference is bound to be statistically significant. In fact, the P-value is something less than 0.0000001. That is, assuming his sampling design is sound, we can be reasonably certain that the observed difference in the two letters represents an actual difference in the English language. If n and i in fact occur at the same rate in English, we would find n ahead of i by as much as it was in Skip’s sample less than 1 time in 10 million. If i is in fact more common than n, Skip’s sample is correspondingly less likely.
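For anyone who wants to try the arithmetic, here is a rough sketch of one way to run such a test. The counts below are made up for illustration (Skip's actual totals are not quoted here), and the normal approximation to the binomial is my choice, not necessarily how the P-value above was computed. The function name `equal_rate_test` is hypothetical.

```python
import math

def equal_rate_test(count_a: int, count_b: int) -> tuple[float, float]:
    """Test H0: letters a and b are equally frequent.

    Conditional on the total count_a + count_b, H0 says count_a is
    Binomial(total, 0.5). Returns (z, one-sided p-value) using the
    normal approximation, which is reasonable for large counts.
    """
    total = count_a + count_b
    if total == 0:
        raise ValueError("need at least one observation")
    # Under H0: mean = total/2, sd = sqrt(total)/2, so z = (a - b)/sqrt(total).
    z = (count_a - count_b) / math.sqrt(total)
    # Upper tail of the standard normal: P(Z >= z).
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p

# Hypothetical counts only: the same 0.2 percent gap at two sample sizes.
for n_count, i_count in [(10_020, 10_000), (100_200_000, 100_000_000)]:
    z, p = equal_rate_test(n_count, i_count)
    print(f"n={n_count:,} i={i_count:,} z={z:.2f} p={p:.3g}")
```

The point of running it at two sizes is that the same percentage gap can be unremarkable in a small sample and overwhelming in a very large one, since z grows with the square root of the total count.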
In fact, assuming the sampling design is sound is a BIG question mark, IMO. I’ll bet that letter distribution would vary drastically with source - comparing classic literature versus popular magazine copy loaded with modern-day trade names, for instance (lots more x’s). Or current computer trade rags - lots of j’s from java and acronyms derived from java-this and java-that. How you are going to weight all those sources to arrive at “average” English usage is questionable.
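To make the weighting problem concrete, here is a small sketch. The two snippets are invented stand-ins for whole corpora, and the helper names `letter_freqs` and `weighted_mix` are hypothetical. It only shows that the "average" frequency of a letter moves with whatever weights you pick for the sources.

```python
from collections import Counter

def letter_freqs(text: str) -> dict[str, float]:
    """Relative frequency of each letter a-z, ignoring case and non-letters."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

def weighted_mix(freqs_by_source, weights):
    """Combine per-source frequencies using weights that sum to 1."""
    mixed = Counter()
    for name, freqs in freqs_by_source.items():
        for ch, f in freqs.items():
            mixed[ch] += weights[name] * f
    return dict(mixed)

# Two made-up snippets standing in for whole corpora.
sources = {
    "classic": "It was the best of times, it was the worst of times.",
    "tech": "Java objects, JavaScript JSON, and the Java JIT compiler.",
}
freqs = {name: letter_freqs(text) for name, text in sources.items()}
for w in (0.9, 0.5):
    mix = weighted_mix(freqs, {"classic": w, "tech": 1 - w})
    print(f"classic weight {w}: j = {mix.get('j', 0):.4f}")
```

Nothing in the code tells you which weights are right, which is exactly the objection.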
In fact, the frequencies from the “Brown Corpus”, often used as a sample for things like this, come out as:
Welcome to the SDMB, and thank you for posting your comment.
Please include a link to Cecil’s column if it’s on the Straight Dope web site. To include a link, it can be as simple as including the web page location in your post (make sure there is a space before and after the text of the URL).
I’m surprised Cecil didn’t notice something important about the, ahem, “English-language” works selected for survey by Skip Newhall. They included:[ul][li]Crime and Punishment (Fyodor Dostoevsky)[/li][li]The Iliad, The Odyssey (Homer)[/li][li]Peer Gynt (Henrik Ibsen)[/li][li]Faust (Johann Wolfgang von Goethe)[/li][li]History of the Peloponnesian War (Thucydides)[/li][li]The Forged Coupon (Leo Tolstoy)[/li][li]Several writings of Karl Marx[/li][li]Don Quixote (Miguel de Cervantes)[/li][li]Works by Plato and Virgil[/li][/ul]… none of which were originally written in English!
I noticed that also, tracer, but one could presumably assume that the translators had a sufficient command of the English language that the translation can be considered “normal” English writing. Of course, there would be some foreign patronyms in the books that could somewhat skew the results, but in a long book such as “Crime and Punishment” the ratio of foreign patronyms to English words would probably be tiny.
The main problem I saw with the sampling is that, from the sources mentioned, it seems pretty clear that this person wrote a computer program for the letter count and used sources available on the internet: classic novels, which can be found, for example, at Project Gutenberg, and then newspaper and magazine articles. But I would guess that modern novels are under-represented in the sample, since those could not be found on the internet.
Not only can you gather statistics on pairs, but you can, through tree structures, gather stats on much longer character strings. Twenty-five years ago I wrote a program in Fortran on a PDP-10 that went to 7 or 8 character strings. (It ran out of memory, which if I remember correctly was 512K, with 10Meg hard drives).
Then you write a second program which takes these probability tables as input, and generates random text. Unlike the “million monkeys at a million typewriters”, this program can actually generate snippets of recognizable text.
You have to include space as a valid character, by the way, in order to get words to begin and terminate properly. You can include punctuation, too.
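For the curious, the two-step idea above (build probability tables, then generate from them) can be sketched in a few lines of Python. This is only an illustration, not the original Fortran program: it uses a plain dictionary keyed by context rather than a tree, the toy corpus is invented, and the names `build_table` and `generate` are hypothetical. Spaces and punctuation are treated as ordinary characters, as suggested.

```python
import random
from collections import Counter, defaultdict

def build_table(text: str, order: int) -> dict[str, Counter]:
    """Map each `order`-character context to counts of the next character."""
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        table[context][text[i + order]] += 1
    return dict(table)

def generate(table, order, length, rng):
    """Emit text by sampling each next character in proportion to its count."""
    out = rng.choice(sorted(table))  # start from a context seen in the corpus
    while len(out) < length:
        followers = table.get(out[-order:])
        if not followers:  # dead end: context was never followed by anything
            break
        chars = sorted(followers)
        weights = [followers[c] for c in chars]
        out += rng.choices(chars, weights=weights)[0]
    return out[:length]

# Toy corpus; a real run would read in a book-length text.
corpus = "the cat sat on the mat. the rat sat on the cat. "
table = build_table(corpus, order=3)
print(generate(table, order=3, length=60, rng=random.Random(0)))
```

Raising `order` makes the output look more like the source text, at the cost of a much larger table, which is presumably where a 512K machine would run out of room.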
I take it that raoulortega was attempting a version of Rubenking’s BREKDOWN algorithm for the creation of new texts out of a preexistent corpus. The curious may find an article by John Tranter at:
which discusses the algorithm & provides two rather entertaining samples of its work–new poems created in the style of Matthew Arnold & John Ashbery. --N
In one of those interesting little coincidences, I’m just reading an early short story by the irrepressible Thomas Pynchon, who is known for giving his characters fabulous names like Tyrone Slothrop, Oedipa Maas, Clayton “Bloody” Chiclitz, and Blodgett Waxwing. This one features a character named Etienne Cherdlu, who is, moreover, given to signing his name “80N”.