Has anyone looked for Shakespeare in the equivalents of monkeys on typewriters? I am led to understand that “to be or not to be…” will very likely appear in the digits of pi or e and so on.
It seems the kind of nerdy thing someone would do. Any luck?
I checked the first few hundred digits of pi and I couldn’t even find the letter ‘t’ in there. It was just a bunch of numbers.
Seriously, though, it will depend on what character encoding you use, i.e. how you arbitrarily assign a digit or group of digits to a given letter. Even the standard ASCII character set is somewhat arbitrarily defined, so there’s nothing stopping you from finding a sufficiently distributed set of pairs of digits and declaring that ‘14’ translates to ‘t’, ‘15’ translates to ‘o’, ‘92’ is ‘b’, ‘65’ is ‘e’, etc. With that encoding, the first two words of that phrase are found right at the outset.
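To make that concrete, here is a minimal sketch of the made-up encoding. The codebook holds only the four pairs named above; the function name and the "?" placeholder for unassigned pairs are just for illustration.

```python
# Hypothetical codebook: only the four digit pairs named in the post.
CODEBOOK = {"14": "t", "15": "o", "92": "b", "65": "e"}

def decode_pairs(digits, codebook):
    """Split a digit string into consecutive pairs and look each one up.

    Pairs with no assigned letter become '?'; a trailing odd digit is dropped.
    """
    pairs = [digits[i:i + 2] for i in range(0, len(digits) - 1, 2)]
    return "".join(codebook.get(pair, "?") for pair in pairs)

# First eight decimals of pi (after the leading 3).
print(decode_pairs("14159265", CODEBOOK))  # tobe
```

Since the codebook was chosen after looking at the digits, the hit is guaranteed, which is the whole point about arbitrary encodings.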
You will be searching a long time. How long? Well, long enough to find “meaningful” stretches of anything in a series of random numbers. The monkeys typing Shakespeare is really a rumination on the nature of infinity, i.e. it’s really, really big. How big? As big as it needs to be and a little more.
Here’s a little more info that demonstrates how your preferred encoding affects things. If you’ve settled on using ASCII, these are two possible representations of that text:
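A minimal sketch of how such digit strings can be generated, assuming the phrase is stripped to its 13 letters (no spaces or punctuation) and each letter is written as its decimal ASCII code with no padding or separators. The function name is only for illustration.

```python
from collections import Counter

def ascii_digits(text):
    """Concatenate the decimal ASCII code of every character in text."""
    return "".join(str(ord(ch)) for ch in text)

phrase = "tobeornottobe"  # assumption: letters only, no spaces

upper = ascii_digits(phrase.upper())
lower = ascii_digits(phrase.lower())

print(upper)  # 84796669798278798484796669
print(lower)  # 1161119810111111411011111611611198101

# How often each digit turns up in the two versions.
print(Counter(upper).most_common())
print(Counter(lower).most_common())
```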
You could also use mixed case, include spaces and punctuation, etc. to come up with many more possible strings of digits. However, due to the nature of the ASCII character set, all of the letters are going to fall in the range of 65-90 (uppercase) or 97-122 (lowercase). Therefore, as these samples demonstrate, any uppercase strings will contain a preponderance of sixes, sevens and eights, and lowercase strings will be cluttered with ones. The higher frequency of these digits reduces the likelihood of finding the strings in a pseudo-random sequence (but you can never be sure).
The examples above are expressed in base 10, but you could substitute another base to skew the distribution of digits to your liking.
Again, any numeric representation of text is arbitrary. There’s nothing special about ASCII; it just happened to fit the needs of the computer and teletype designers who created it more than half a century ago. There are plenty of other encoding schemes you could use as well. It would be every bit as valid for you to create your own encoding (as in my previous example) that fits the available digits. It worked for the Bible Code.
Once you’ve determined what strings of digits will satisfy your criteria, here’s a search engine that will attempt to locate them in the first 200 million digits of pi. For what it’s worth, I searched for my phone number (10 digits) and it wasn’t there.
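For a small-scale version of the same lookup, here is a rough sketch that generates digits locally and searches them. It uses Gibbons' unbounded spigot algorithm, which is my own choice and not necessarily what that search engine does; the function names are made up for illustration, and it is only practical for a few thousand digits, nowhere near 200 million.

```python
def pi_digit_stream():
    """Yield the decimal digits of pi one at a time (3, 1, 4, 1, 5, ...).

    Gibbons' unbounded spigot algorithm, using exact integer arithmetic.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def pi_decimals(count):
    """Return the first `count` digits after the decimal point as a string."""
    stream = pi_digit_stream()
    next(stream)  # skip the leading 3
    return "".join(str(next(stream)) for _ in range(count))

def find_in_pi(target, count=1000):
    """Return the 1-based decimal position where target starts, or None."""
    index = pi_decimals(count).find(target)
    return index + 1 if index >= 0 else None

print(pi_decimals(20))     # 14159265358979323846
print(find_in_pi("9265"))  # 5
```

A 10-digit string such as a phone number would almost certainly return None here, since only the first 1000 decimals are searched by default.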
People get the wrong idea about this point. It’s not like, “get enough monkeys together and they’ll type out Shakespeare. It’s an amazing parlor trick. Try it.”
The point is to get people thinking about how big infinity is. “Imagine how many keystrokes it would take to type out Shakespeare by randomly punching digits. That’s how big infinity is.”
Now that this thread has segued into a discussion of infinity, lemme throw in a little personal editorial comment:
I think math students at, say, the Algebra I level, should get a little more instruction on thinking about infinity. The usual instruction (at least when I was in school) is simply along the lines of:
“Infinity is NOT a number. Repeat a large (but finite) number of times: INFINITY IS NOT A NUMBER!!!”
So we basically got any thinking about infinity forcibly drummed out of our heads, except for the notion that “sets” can be of infinite “size”, but even that is vague for a lot of students because we got infinity drummed out of our heads.
I always felt there should be a more comprehensive discussion of “infinity” at the introductory Algebra level. Assuming, of course, that you have teachers who know enough to teach it right.
ETA: I looked for “2 be or not 2 be” in pi. I found some 2’s.
No it isn’t; it ended on 6 October 2011. And anyway, that “experiment” wasn’t really along the lines of the original proposal; instead of inspecting a purely random stream of data for one of Shakespeare’s plays in its entirety, it generates random data nine letters at a time, and if that sequence matches something from a play, it is kept. The process is repeated until all nine-letter sequences occurring in Shakespeare are accounted for. This is a much, much easier and faster way of recreating the works of Shakespeare.
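To see why that shortcut is so much easier, here is a toy sketch of the "keep it if it matches" scheme as described above. It is not the actual program from that project: it assumes letters only, lowercase, and uses 2-letter chunks instead of 9 so that it finishes instantly. The function names are made up.

```python
import random
import string

def ngrams(text, n):
    """Return the set of distinct n-letter sequences in text (letters only)."""
    letters = "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return {letters[i:i + n] for i in range(len(letters) - n + 1)}

def monkey_cover(text, n, rng):
    """Draw random n-letter strings until every n-gram of text has turned up.

    Returns the number of random draws that were needed.
    """
    remaining = ngrams(text, n)
    attempts = 0
    while remaining:
        guess = "".join(rng.choice(string.ascii_lowercase) for _ in range(n))
        attempts += 1
        remaining.discard(guess)  # a matching chunk is ticked off; misses are ignored
    return attempts

rng = random.Random(0)  # fixed seed so the run is repeatable
print(len(ngrams("To be, or not to be", 2)))    # 9 distinct 2-letter chunks
print(monkey_cover("To be, or not to be", 2, rng))
```

Each chunk only has to show up once, in any order, which is a far weaker requirement than typing the whole text in sequence.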
It’s not exactly what the OP is looking for, but have a look at W.R. Bennett’s interesting article “How Artificial is Intelligence?” in American Scientist, 65 (6) pp. 694-702 Nov-Dec 1977. He used a random number generator along with probability matrices for individual letters, then with pairs of letters, then sets of three letters, and so on. With “fourth order” virtual monkeys he was producing strings of words. If the probabilities were drawn from unfamiliar foreign languages, you could easily be fooled into thinking the gibberish was a real example of that language.
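The general idea can be sketched in a few lines. This is not Bennett's program, just an illustration of a letter-level model where each new letter is drawn according to what followed the preceding few letters in some sample text. I am assuming "fourth order" means conditioning on the three preceding letters; the sample text and all names here are my own.

```python
import random
from collections import defaultdict

def build_model(text, context):
    """Map every run of `context` letters to the letters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - context):
        model[text[i:i + context]].append(text[i + context])
    return model

def babble(model, context, length, rng):
    """Generate `length` characters by repeatedly sampling a follower letter."""
    out = list(rng.choice(list(model)))
    while len(out) < length:
        followers = model.get("".join(out[-context:]))
        if followers:
            out.append(rng.choice(followers))
        else:
            # Dead end (context only seen at the very end of the sample): restart.
            out.extend(rng.choice(list(model)))
    return "".join(out)[:length]

SAMPLE = ("to be or not to be that is the question whether tis nobler in the "
          "mind to suffer the slings and arrows of outrageous fortune")

rng = random.Random(0)  # fixed seed so the run is repeatable
model = build_model(SAMPLE, 3)
print(babble(model, 3, 60, rng))
```

With a sample this short the output mostly parrots the source; a longer sample gives more varied, word-like gibberish.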
Monkeys on typewriters wouldn’t be typing ASCII numbers, but letters, so the assignments ASCII uses aren’t a factor, and there wouldn’t be any more 9s or 1s for that reason.
To store the monkey typing as computer values, you could make up any number scheme you want, but if you are testing the distribution for randomness, you don’t use the numbers, you use what they represent. To do otherwise is like taking a highly-compressed JPG image and counting how many blue dots there are. Many blue dots will be due to compression artifacts, not the original source.
Likewise, counting how many ASCII 9’s there are isn’t testing the distribution of A’s and B’s.
[QUOTE= Geoffrey K. Pullum]
The number of 9-letter sequences over the alphabetic characters a to z is 5,429,503,678,976 (and as that figure of 5.5 trillion is being mentioned in the press stories, it looks like he’s ignoring spaces, punctuation, case, fonts, paragraph breaks, etc., but what the heck, let’s pretend Shakespeare’s work is a bunch of strings over {a b c d e f g h i j k l m n o p q r s t u v w x y z}). There are a few scientific curlicues in the way Anderson does things, but basically he just takes random 9-grams and does a fixed-string search over the Shakespearean corpus to see if he has 9 more letters he can mark off as done.
[/QUOTE]
Every two digits are used to create an ASCII value. That means the possible ASCII values are 0-99.
However, ASCII codes 00 to 31 are non-printing. For these values I used the IBM PC extended ASCII set (which is the default on Windows PCs). I saw this as preferable to printing nothing, or spaces.
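A rough sketch of that two-digit scheme, with one substitution: instead of the IBM PC glyphs for codes 00 to 31, this version just prints a "." placeholder, since those glyphs are not part of standard ASCII. The function name is made up for illustration.

```python
def pairs_to_ascii(digits, placeholder="."):
    """Read a digit string two digits at a time and map each pair to ASCII.

    Codes 0-31 are non-printing, so they are shown as `placeholder` here
    (a stand-in for the IBM PC glyphs). A trailing odd digit is dropped.
    """
    out = []
    for i in range(0, len(digits) - 1, 2):
        code = int(digits[i:i + 2])
        out.append(chr(code) if code >= 32 else placeholder)
    return "".join(out)

# First ten decimals of pi: pairs 14, 15, 92, 65, 35.
print(pairs_to_ascii("1415926535"))  # ..\A#
```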
Yeah, I briefly read the original article upthread, and what Jesse Anderson is doing can only very, very charitably be described as “a million monkeys at a million typewriters banging out the works of Shakespeare.” It’s not at all in the spirit of the thought experiment (or whatever you want to call it.)
I’m not a coding expert, but how about every two digits represents a letter? 01-26 is A through Z, then 27-52 is A through Z, then 53-78 is A-Z, and 79-00 is A-V. That means you have a slightly lesser chance of seeing W-Z, but I would think it would make it easier to find words than using all of ASCII.
Can anyone try this, and see how soon we get “tobeornottobe”, or some other interesting phrase?
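Here is a minimal sketch of that scheme for anyone who wants to try it. It assumes "00" counts as 100 (so 79-00 maps to A-V as described) and that pairing starts with the first digit of whatever string you feed in; the function names are made up.

```python
def pair_to_letter(pair):
    """Map a two-digit string to a letter: 01-26, 27-52, 53-78 are A-Z; 79-00 is A-V."""
    value = int(pair)
    if value == 0:
        value = 100  # assumption: "00" closes the last block, landing on V
    return chr(ord("A") + (value - 1) % 26)

def digits_to_letters(digits):
    """Decode a digit string two digits at a time; a trailing odd digit is dropped."""
    return "".join(pair_to_letter(digits[i:i + 2])
                   for i in range(0, len(digits) - 1, 2))

def find_phrase(digits, phrase):
    """Return the index (in letters) where phrase first appears, or -1 if absent."""
    return digits_to_letters(digits).find(phrase.upper())

# First ten decimals of pi: pairs 14, 15, 92, 65, 35.
print(digits_to_letters("1415926535"))          # NONMI
print(find_phrase("1415926535", "tobeornottobe"))  # -1
```

A 13-letter phrase has roughly a 1 in 26^13 chance of starting at any given position, so do not expect a hit without an enormous number of digits.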