A googlewhack is a two-word query that, when submitted to www.google.com, returns exactly one result. Both words must be real words (and listed on www.dictionary.com); quotation marks are not allowed; and if the single result returned is a word list, it does not count. More information about this curious hobby can be found at www.googlewhack.com, a site which also accepts contributors' googlewhack discoveries. As I write, I notice the latest find to be simoniac moose.
Now I was wondering: with Google currently indexing 4 point something billion web pages, is the total number of possible googlewhacks out there increasing or decreasing with time? More interestingly (?!), can anyone see a way of estimating the number of indexed (English-language) web pages at which the number of possible googlewhacks is a maximum?
I should explain: I went to see Dave Gorman's Googlewhack Adventure at the theatre. Of course, as soon as I got home I had a go - and after an hour of trying I stumbled on "Invidious Courgette".
… and of course that got me wondering whether the World Wide Web was becoming richer or poorer with respect to googlewhacks. Hence the post!
Of course, next time Google updates its database we can tick off two googlewhacks, "simoniac moose" and "invidious courgette", as they will now appear on these pages as well.
I cannot believe I got the blue screen of death (in XP!) just before I finished this post! Starting over…
Here’s my take… There are several numbers involved in the scaling law of interest:
V = the size of the vocabulary of the language, in number of words. The FAQ at dictionary.com puts the number of words in English (including scientific terms) at two million, give or take. I'll use V = 2×10^6.
L = the length of a typical web page, in number of words. The top story at cnn.com right now has 788 words, so I'll use L = 1,000.
N = the number of web pages.
A rule of thumb known as Zipf's law tells us that the r-th most common word in a language occurs with frequency k/r^a, with a near 1. I'll take a = 1. For V = 2×10^6, k = 0.066 (obtained by requiring the sum of probabilities to be 1).
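For anyone who wants to check that normalization, a quick sum in Python does it (just summing 1/r up to V; purely illustrative):

[code]
# Zipf normalization check: with p(r) = k / r for r = 1..V, requiring the
# probabilities to sum to 1 gives k = 1 / H_V (the V-th harmonic number).
V = 2_000_000  # vocabulary size, per dictionary.com's ~2 million words

H_V = sum(1.0 / r for r in range(1, V + 1))
k = 1.0 / H_V
print(f"H_V = {H_V:.2f}, k = {k:.4f}")  # roughly 15.09 and 0.066
[/code]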
Now let's take two specific uncommon words – words whose ranks r are near V. The probability that those two words occur together on a web page can be approximated by

P_combo ≈ (kL/V)^2,

since each such word shows up on a given page of L words with probability roughly kL/V. The doublet is a googlewhack when exactly one of the N pages contains it, which happens with probability roughly N·P_combo·(1 − P_combo)^(N−1), and that is largest when N ≈ 1/P_combo.
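Putting numbers in (a rough sketch only; N = 4×10^9 is the ballpark index size from the OP, and the Poisson form stands in for the binomial):

[code]
# Rough numbers for a representative rare doublet (two words of rank ~ V).
import math

V, L, k = 2_000_000, 1_000, 0.066
N_now = 4e9                   # ballpark size of Google's index, from the OP

p_word = k * L / V            # chance a rank-V word appears on a given page
P_combo = p_word ** 2         # chance both rare words appear on the same page

def p_whack(N, P=P_combo):
    # Probability that exactly one of N pages contains the doublet
    # (binomial with tiny P, so use the Poisson form N*P*exp(-N*P)).
    return N * P * math.exp(-N * P)

N_best = 1.0 / P_combo        # the N that maximizes N*P*exp(-N*P)
print(f"P_combo        = {P_combo:.2e}")
print(f"P(whack) today = {p_whack(N_now):.3f}")
print(f"whack-friendliest index size ~ {N_best:.1e} pages")
[/code]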
Aha! How about if we take into account a distribution for L? I can't be bothered to do the maths, but my gut tells me that for large N there's going to be a long tail of web pages with only a few words on them, making googlewhacks quite likely. But there's probably still going to be some maximum…
Oh, there’s plenty of handwaving to nitpick, if you wanted. As you point out, L is not a fixed number, and the tails will play a role. Actually, in trying out a few googlewhacks of my own, I found that a successful whack tended to return a very long, disjointed document.
Something else to play with would be to perform the sum over all possible doublets (rather than just taking a representative one) and then maximize. You'd have to first complicate my expression for P_combo, though, since my approximation won't work for all doublets. As soon as you write anything of the sort, though, it becomes impossible to calculate anything analytically (a crude numerical stab is sketched below).
If there were a “typical set” of doublets (in the statistical mechanics sense), maybe one could concoct a more robust calculation, but I don’t think such a thing will work here. Perhaps if I find time later I’ll play with this idea.
Also, there’s the issue of a dynamic language. There’s really no reason to fix the number of words available, and one could fold in the frequency at which new words are added to the lexicon.
But ignoring all of that, it’s still a neat question.
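For what it's worth, here's a crude numerical version of that doublet sum: log-spaced rank bins stand in for the full V² pairs, min(1, Lk/r) stands in for the per-page probability, and the Poisson "exactly one page" form does the rest. Everything in it is an approximation:

[code]
# Crude version of the "sum over all doublets" idea: bin the 2 million ranks
# into log-spaced bins, treat every word in a bin as having the bin's midpoint
# rank, and sum N*P*exp(-N*P) ("exactly one page") over all bin pairs.
import math

V, L, k = 2_000_000, 1_000, 0.066
n_bins = 300

edges = sorted(set(int(round(math.exp(math.log(V) * i / n_bins)))
                   for i in range(n_bins + 1)))
edges[-1] += 1                      # make the last bin include rank V itself

bins = []                           # (number of ranks in bin, per-page probability)
for lo, hi in zip(edges, edges[1:]):
    r_mid = math.sqrt(lo * hi)
    bins.append((hi - lo, min(1.0, L * k / r_mid)))

def expected_whacks(N):
    """Expected number of googlewhack doublets when the index holds N pages."""
    total = 0.0
    for c1, p1 in bins:
        for c2, p2 in bins:
            P = p1 * p2             # both words of the doublet on the same page
            total += c1 * c2 * N * P * math.exp(-N * P)
    return total / 2                # unordered pairs

for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {N:.0e}: ~{expected_whacks(N):.2e} candidate whacks")
[/code]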
True, but if you report your googlewhack on some other site that Google does index, it’ll make it no longer a googlewhack. You’d think that most googlewhackers would know not to do this, but if even a few do, the number of googlewhacks will decrease. OTOH, more are discovered all the time (how frequently? I don’t know…)
The Whack Stack on www.googlewhack.com has had 358,000 googlewhacks posted in the last two-and-a-bit years… that is over 400 whacks per day. To that one would have to factor in unreported whacks. Fortunately googlewhack.com itself is not indexed by Google.
A very nice bit of maths/stats, Pasta. I am sure a lot of googlewhackers will be able to sleep easier now, knowing that they are probably living in a golden age for whacking. Maybe in 20 years' time a googlewhack will be a much rarer find indeed.
When I first considered the problem, it occurred to me that the frequencies of the individual words that constitute a googlewhack would be important. For example, Google lists the following frequencies:
I was thinking a little more about this problem :o
I don't think that Zipf's law can hold for the population we are considering, in the region we are considering - i.e. fairly uncommon words.
Using Pasta’s numbers of
V = 2 million
k = 0.066
L = 1000
Then the chance of the most uncommon word appearing in any given web page would be L·k/V = 1000 × 0.066 / 2,000,000 = 0.000033.
If we assume that, of the 4 billion pages that Google indexes, half are English-language, then the number of web pages we might expect to contain the most uncommon word in English would be 2,000,000,000 × 0.000033 = 66,000.
Hence it appears that Zipf's law overestimates the frequency with which uncommon words appear on the World Wide Web, with simoniac, for example, only mustering around 1,500 appearances. Maybe the constant "a" needs a serious tweak, or perhaps this is pushing the envelope too far for the law.
Can anyone see a way to rescue the calculation? Notwithstanding these comments, it remains a canny piece of work anyway.
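One way to attempt a rescue might be to keep Zipf's form p(r) = k/r^a but let the exponent a float, and ask which a would bring the expected page count for the rarest word down to the ~1,500 hits quoted above. A rough sketch, with the big assumption that simoniac sits right at the bottom of the vocabulary (rank ≈ V) and reusing the 2×10^9 English-page figure:

[code]
# Keep Zipf's p(r) = k(a) / r^a, vary the exponent a, and see what the expected
# page count for a rank-V word looks like. Assumes simoniac has rank ~ V and
# ~2e9 English-language pages. Purely illustrative; takes a few seconds to run.
V = 2_000_000        # vocabulary size
L = 1_000            # words per typical page
N_eng = 2e9          # assumed English-language pages in the index

def expected_pages(a):
    k = 1.0 / sum(r ** -a for r in range(1, V + 1))   # normalization for this a
    p_word = k / V ** a                                # per-word probability at rank V
    return N_eng * L * p_word                          # expected pages containing it

for a in (1.0, 1.1, 1.2, 1.3, 1.4, 1.5):
    print(f"a = {a:.1f}: ~{expected_pages(a):,.0f} pages for the rarest word")
[/code]

With a = 1 this reproduces the ~66,000 figure above; the expected count only falls to the observed ~1,500 somewhere around a ≈ 1.3–1.4, which may simply mean the law shouldn't be stretched this far down the rank list.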