Straight Dope Message Board
#1
02-26-2020, 03:58 PM
 Guest Join Date: Aug 2003 Posts: 23,466

## How would you perform this calculation regarding all potential book combinations

Let's say you wanted to write every possible 200 page book in existence.

That's about 50,000 words, maybe 300,000 characters. Excluding capitalization and using only letters, numbers, and common punctuation, I'm guessing the alphabet is roughly 60 characters.

So what's the calculation to determine how many books you'd have?

Is it 60^300,000 or something else?
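For scale, the number of decimal digits of 60^300,000 can be computed without evaluating the full power, since a positive integer n has floor(log10(n)) + 1 digits. A quick sketch in Python (the 60-character alphabet and 300,000-character length are the OP's guesses):

```python
# Number of decimal digits of 60**300000, using the identity
# digits(n) = floor(log10(n)) + 1 for a positive integer n.
import math

digits = math.floor(300_000 * math.log10(60)) + 1
print(digits)  # 533446
```

So the raw character count alone is a number more than half a million digits long.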
__________________
Sometimes I doubt your commitment to sparkle motion
#2
02-26-2020, 04:14 PM
 Member Join Date: Feb 2003 Location: San Diego, CA Posts: 1,394
60^300,000 would give you all the possible character combinations, but to get "number of books", you would need to exclude all the "unreadable" combinations of letters. 300,000 "a"s in a row may fill 200 pages, but would not be a "book".

Next you somehow need to filter all those combinations down to ones that form the 50,000 words, which means placements of spaces and punctuation that make "sentences":
callmeishmael
callme ishmael
call meishmael
call me ishmael
So you would need to filter by some standard dictionary (and have logic enough to identify names like "Ishmael").

Then you need to filter those down to sentences that make some semblance of sense.

The conversion from raw character combinations to words and sentences would be a difficult and complicated process. You would also end up with books containing the exact same words in the exact same order, and being perfectly "understandable", but having very different meaning just based on the punctuation.
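The dictionary filter described above can be sketched directly: given an unspaced run of letters, enumerate every way to split it into known words. A minimal sketch in Python, with a tiny hypothetical word list standing in for a real dictionary:

```python
# Toy version of the dictionary filter: yield every split of an
# unspaced letter run into words from a (hypothetical) word list.
WORDS = {"call", "me", "ishmael", "is", "a", "male"}

def segmentations(s):
    """Yield every way to split s into words from WORDS."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in WORDS:
            for rest in segmentations(s[i:]):
                yield [head] + rest

print(list(segmentations("callmeishmael")))  # [['call', 'me', 'ishmael']]
```

Of the four spacings listed above, only one survives this filter, which is exactly the point: almost all raw character strings parse into no dictionary words at all.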
#3
02-26-2020, 04:40 PM
 Guest Join Date: Oct 2016 Posts: 13,330
FWIW, what you are describing is known as the Library of Babel.
#4
02-27-2020, 04:33 AM
 Guest Join Date: Jun 2001 Location: Nashville, TN Posts: 541
Quote:
 Originally Posted by cormac262 60^300,000 would give you all the possible character combinations, but to get "number of books", you would need to exclude all the "unreadable" combinations of letters. 300,000 "a"s in a row may fill 200 pages, but would not be a "book". ...
Randall Munroe discussed something similar in one of his What If? columns. Around 1950, Claude Shannon determined that English carries about 1.1 bits of information per letter. So you could estimate that the 300,000 characters would produce about 2^(300,000 * 1.1) meaningfully different books in English. That number has 99,340 digits.
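That digit count is easy to verify: 2^(300,000 * 1.1) = 2^330,000, and the number of decimal digits of 2^n is floor(n * log10(2)) + 1. For example, in Python:

```python
# Number of decimal digits of 2**330000 (Shannon's ~1.1 bits/letter
# estimate applied to 300,000 characters).
import math

digits = math.floor(330_000 * math.log10(2)) + 1
print(digits)  # 99340
```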
#5
02-27-2020, 06:10 AM
 Guest Join Date: Dec 2009 Location: the Land of Smiles Posts: 21,132
Quote:
 Originally Posted by Kimble Randall Munroe discussed something similar to this in this What If? column. Back in 1950, Claude Shannon determined that English transmitted about 1.1 bits of information per letter. ...
Nitpick: Well over 99.999999% of those books would still have misspelled words or bad grammar. Even among the books which pass that test over 99.99999999% would have fake facts, poorly developed plots, or would otherwise get an F in any English Composition class.

A compact way to represent the Library of Babel (with a 60-character alphabet) would be to just write: "the base-60 expansion of arctan(1)." Admittedly it would be an effort to find exactly where the U.S. Constitution is written in those digits (especially if you insist that commas be misplaced just as in the original), but you'd have a similar search problem using the more conventional Library of Babel.

Representing all books concisely reminds me of the old-timers who only knew a thousand different jokes.
SPOILER:
To save time they memorized and numbered all the jokes. "Number 431." "Har de har har har. That's a real whiz-banger of a joke, Billie! Ha ha ha! ... Hahh!"

Newcomer shows up and tries to join in the fun. "Number 522." Dead silence.

"Whassa matter? Isn't #522 a funny joke?"
"Oh, #522 is a funny enough joke. You just don't tell it very well."

Did you like this joke? If so, just say "Number 814" next time to get a good laugh.
#6
02-28-2020, 03:26 AM
 Guest Join Date: Apr 2012 Location: Straya Posts: 1,274
Previous experiments in typing multiple letter combinations to create readable books have not been very successful.
#7
02-28-2020, 09:32 AM
 Charter Member Moderator Join Date: Jan 2000 Location: The Land of Cleves Posts: 87,418
Actually, it's not known that pi (or pi/4) is normal. There are numbers which are known to be normal, but they generally amount to "List all of the books, in order".
#8
02-28-2020, 12:50 PM
 Guest Join Date: Feb 2009 Posts: 15,485
Some article I saw once discussed making fake text by analyzing thousands of source documents to create an "odds table" of letters: what are the odds that, for example, "e" follows "l", or "a" follows "k"? They then extended it to an odds table conditioned on the preceding two letters. If you include spaces and some punctuation, you can make random quasi-English-looking words and sentences that would make Lewis Carroll proud. Perhaps you could filter out any "novel" where the occurrence of spaces was extreme: not enough separate "words", or words that run too long.

The trouble with randomness is that words are not random, and their association is not random. You could extend the letter logic to the word level instead: take the 10,000 most commonly used words and build a table of the odds that, say, "red" follows "the". But with each successive refinement toward comprehensibility, you remove a degree of randomness and shrink the output. And you risk missing the odd novel that quotes a foreign language, uses onomatopoeia to describe something, or has totally made-up proper names or novel words ("hobbit"?).

So you are best off saying "any random collection of characters".
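The "odds table" described above is a character-level Markov chain. A minimal sketch in Python, conditioning on the two preceding letters (the sample text here is just an illustrative stand-in for a real corpus):

```python
# Sketch of a letter "odds table": count which character follows each
# two-character context, then sample from those counts to generate text.
import random
from collections import Counter, defaultdict

def build_table(text, order=2):
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        table[text[i:i + order]][text[i + order]] += 1
    return table

def generate(table, length, order=2, seed=0):
    rng = random.Random(seed)
    out = list(rng.choice(list(table)))  # start from a random context
    for _ in range(length):
        counts = table.get("".join(out[-order:]))
        if not counts:  # dead-end context: restart from a random one
            counts = table[rng.choice(list(table))]
        chars, weights = zip(*counts.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

sample = "the quick brown fox jumps over the lazy dog " * 3
print(generate(build_table(sample), 40))
```

With a large corpus and a longer context, the output starts to look like the quasi-English in the next post.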
#9
02-28-2020, 01:00 PM
 Guest Join Date: Dec 2009 Location: the Land of Smiles Posts: 21,132
Quote:
 Originally Posted by md2000 Some article I saw once discussed making fake text by analyzing thousands of source documents to create an "odds table" of letters - what are the odds that for example, "e" follows "l" or "a" follows "k"? ...
Using an odds table based on the three preceding letters (i.e., tetragraph statistics), I just now generated random text with the same statistics as Darwin's Origin of Species. Here's an excerpt:
Quote:
 which gradaptility of gras anothe of the led that to the degreason gardly have eminatural severa are to charact of species be legs of hight fine same; anot a climily be procend some moder afterst noticultant adaption, years has instably different somes heave detely sal; and the have surrelaterier, but fathe pland we could perictly under jaws of that of migrangement, have facesservate having has in this degreat of then would from are the spack at the more not bees, which two in a size of so-call songe of growth. As F; justries, but that not the in is in and to play, one of huntructionall the see my visinglistitudescent as having and like improportainstimall headinature. Nor fresember to becommongard, that should by surrese of ineve grough well-gland or repland less will regious new liever slight, the enomachecked by to the one; fore for been of the Glace of divilition to and inhabits confinitions of the can allief fere on the production, to cerous species of the in Daws not at is lended; as largumined: but wellighbouth the lants of name generium parts; an acted in due two belose or in thosediated, large naterbalapable ord the one specked on more save, and In that select: which it damplex reasonsiders; contince tincreason of time are ferespecies showere could cur, I am could in a greathe pare in the up a case of butely
(The entropy was about 2 bits per character.)
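The "about 2 bits per character" figure is the conditional entropy of the tetragraph table: H = -Σ_ctx p(ctx) Σ_c p(c|ctx) log2 p(c|ctx). A sketch of that computation in Python, run here on a short stand-in text rather than the full Origin of Species:

```python
# Conditional entropy (bits per character) of an order-3 letter model:
# weight each 3-char context by its frequency, and sum the entropy of
# the next-character distribution within that context.
import math
from collections import Counter, defaultdict

def conditional_entropy(text, order=3):
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        table[text[i:i + order]][text[i + order]] += 1
    total = sum(sum(c.values()) for c in table.values())
    h = 0.0
    for counts in table.values():
        n = sum(counts.values())
        for k in counts.values():
            p = k / n
            h -= (n / total) * p * math.log2(p)
    return h

text = "the rain in spain stays mainly in the plain " * 10
print(f"{conditional_entropy(text):.2f} bits per character")
```

A repetitive stand-in text gives a much lower figure than real English prose; on a book-length corpus the value lands near the 2 bits per character quoted above.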
