A googlewhack is a two-word query that, when submitted to www.google.com, returns exactly one result. Both words must be real words (and listed on www.dictionary.com); quotation marks are not allowed; and if the single result returned is a word list, it does not count. More information about this curious hobby can be found at www.googlewhack.com, a site which also accepts contributors' googlewhack discoveries. As I write, I notice the latest find to be simoniac moose.
Now I was wondering: with Google currently indexing 4 point something billion web pages, is the total number of possible googlewhacks out there increasing or decreasing with time? More interestingly (?!), can anyone see a way of estimating the number of indexed (English-language) web pages at which the number of possible googlewhacks is a maximum?
I should explain: I went to see Dave Gorman's Googlewhack Adventure at the theatre. Of course, as soon as I got home I had a go - and after an hour of trying I stumbled on "Invidious Courgette".
… and of course that got me wondering whether the World Wide Web was becoming richer or poorer with respect to googlewhacks. Hence the post!
Of course, next time Google updates its database we can tick off two googlewhacks, "simoniac moose" and "invidious courgette", as they will now appear on these pages as well.
I cannot believe I got the blue screen of death (in XP!) just before I finished this post! Starting over…
Here’s my take… There are several numbers involved in the scaling law of interest:
V = the size of the vocabulary of the language, in number of words. The FAQ at dictionary.com puts the number of words in English (including scientific terms) at two million, give or take. I'll use V = 2×10^6.
L = the length of a typical web page, in number of words. The top story at cnn.com right now has 788 words, so I'll use L = 1,000.
N = the number of web pages.
A rule of thumb known as Zipf's law tells us that the r-th most common word in a language occurs with frequency k/r^a, with a near 1. I'll take a = 1. For V = 2×10^6, k = 0.066 (obtained by requiring the sum of probabilities to be 1).
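For anyone who wants to check that normalization, a quick sum in Python does it (just summing 1/r up to V; purely illustrative):

[code]
# Zipf normalization check: with p(r) = k / r for r = 1..V, requiring the
# probabilities to sum to 1 gives k = 1 / H_V (the V-th harmonic number).
V = 2_000_000  # vocabulary size, per dictionary.com's ~2 million words

H_V = sum(1.0 / r for r in range(1, V + 1))
k = 1.0 / H_V
print(f"H_V = {H_V:.2f}, k = {k:.4f}")  # roughly 15.09 and 0.066
[/code]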
Now let's take two specific uncommon words – words whose ranks r are near V. The probability that those two words occur together on a web page can be approximated by

P_combo ≈ (kL/V)^2,

since each such word shows up on a given page of L words with probability roughly kL/V. The doublet is a googlewhack when exactly one of the N pages contains it, which happens with probability roughly N·P_combo·(1 − P_combo)^(N−1), and that is largest when N ≈ 1/P_combo.
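Putting numbers in (a rough sketch only; N = 4×10^9 is the ballpark index size from the OP, and the Poisson form stands in for the binomial):

[code]
# Rough numbers for a representative rare doublet (two words of rank ~ V).
import math

V, L, k = 2_000_000, 1_000, 0.066
N_now = 4e9                   # ballpark size of Google's index, from the OP

p_word = k * L / V            # chance a rank-V word appears on a given page
P_combo = p_word ** 2         # chance both rare words appear on the same page

def p_whack(N, P=P_combo):
    # Probability that exactly one of N pages contains the doublet
    # (binomial with tiny P, so use the Poisson form N*P*exp(-N*P)).
    return N * P * math.exp(-N * P)

N_best = 1.0 / P_combo        # the N that maximizes N*P*exp(-N*P)
print(f"P_combo        = {P_combo:.2e}")
print(f"P(whack) today = {p_whack(N_now):.3f}")
print(f"whack-friendliest index size ~ {N_best:.1e} pages")
[/code]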
Aha! How about if we take into account a distribution for L? I can't be bothered to do the maths, but my gut tells me that for large N there's going to be a long tail of web pages with only a few words on them, making googlewhacks quite likely. But there's probably still going to be some maximum…
Oh, there’s plenty of handwaving to nitpick, if you wanted. As you point out, L is not a fixed number, and the tails will play a role. Actually, in trying out a few googlewhacks of my own, I found that a successful whack tended to return a very long, disjointed document.
Something else to play with would be to perform the sum over all possible doublets (rather than just taking a representative one) and then maximize. You'd have to first complicate my expression for P_combo, though, since my approximation won't work for all doublets. As soon as you write anything of the sort, though, it becomes impossible to calculate anything analytically (a crude numerical stab is sketched below).
If there were a “typical set” of doublets (in the statistical mechanics sense), maybe one could concoct a more robust calculation, but I don’t think such a thing will work here. Perhaps if I find time later I’ll play with this idea.
Also, there’s the issue of a dynamic language. There’s really no reason to fix the number of words available, and one could fold in the frequency at which new words are added to the lexicon.
But ignoring all of that, it’s still a neat question.
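For what it's worth, here's a crude numerical version of that doublet sum: log-spaced rank bins stand in for the full V² pairs, min(1, Lk/r) stands in for the per-page probability, and the Poisson "exactly one page" form does the rest. Everything in it is an approximation:

[code]
# Crude version of the "sum over all doublets" idea: bin the 2 million ranks
# into log-spaced bins, treat every word in a bin as having the bin's midpoint
# rank, and sum N*P*exp(-N*P) ("exactly one page") over all bin pairs.
import math

V, L, k = 2_000_000, 1_000, 0.066
n_bins = 300

edges = sorted(set(int(round(math.exp(math.log(V) * i / n_bins)))
                   for i in range(n_bins + 1)))
edges[-1] += 1                      # make the last bin include rank V itself

bins = []                           # (number of ranks in bin, per-page probability)
for lo, hi in zip(edges, edges[1:]):
    r_mid = math.sqrt(lo * hi)
    bins.append((hi - lo, min(1.0, L * k / r_mid)))

def expected_whacks(N):
    """Expected number of googlewhack doublets when the index holds N pages."""
    total = 0.0
    for c1, p1 in bins:
        for c2, p2 in bins:
            P = p1 * p2             # both words of the doublet on the same page
            total += c1 * c2 * N * P * math.exp(-N * P)
    return total / 2                # unordered pairs

for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {N:.0e}: ~{expected_whacks(N):.2e} candidate whacks")
[/code]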
True, but if you report your googlewhack on some other site that Google does index, it’ll make it no longer a googlewhack. You’d think that most googlewhackers would know not to do this, but if even a few do, the number of googlewhacks will decrease. OTOH, more are discovered all the time (how frequently? I don’t know…)
The Whack Stack on www.googlewhack.com has had 358,000 googlewhacks posted in the last two-and-a-bit years… that is over 400 whacks per day. To that one would have to factor in unreported whacks. Fortunately googlewhack.com itself is not indexed by Google.
A very nice bit of maths/stats, Pasta. I am sure a lot of googlewhackers will be able to sleep easier now, knowing that they are probably living in a golden age for whacking. Maybe in 20 years' time a googlewhack will be a much rarer find indeed.
When I first considered the problem, it occurred to me that the frequencies of the individual words that constitute a googlewhack would be important. For example, Google lists the following frequencies:
I was thinking a little more about this problem :o
I don't think that Zipf's law can hold for the population we are considering, in the region we are considering - i.e. fairly uncommon words.
Using Pasta’s numbers of
V = 2 million
k = 0.066
L = 1000
Then the chance of the most uncommon word appearing in any given web page would be L·k/V = 1000 × 0.066 / 2,000,000 = 0.000033.
If we assume that, of the 4 billion pages that Google indexes, half are English-language, then the number of web pages we might expect to contain the most uncommon word in English would be 2,000,000,000 × 0.000033 = 66,000.
Hence it appears that Zipf's law overestimates the frequency with which uncommon words appear on the World Wide Web, with simoniac, for example, only mustering around 1,500 appearances. Maybe the constant "a" needs a serious tweak, or perhaps this is pushing the envelope too far for the law.
Can anyone see a way to rescue the calculation? Notwithstanding these comments, it remains a canny piece of work anyway.
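One way to attempt a rescue might be to keep Zipf's form p(r) = k/r^a but let the exponent a float, and ask which a would bring the expected page count for the rarest word down to the ~1,500 hits quoted above. A rough sketch, with the big assumption that simoniac sits right at the bottom of the vocabulary (rank ≈ V) and reusing the 2×10^9 English-page figure:

[code]
# Keep Zipf's p(r) = k(a) / r^a, vary the exponent a, and see what the expected
# page count for a rank-V word looks like. Assumes simoniac has rank ~ V and
# ~2e9 English-language pages. Purely illustrative; takes a few seconds to run.
V = 2_000_000        # vocabulary size
L = 1_000            # words per typical page
N_eng = 2e9          # assumed English-language pages in the index

def expected_pages(a):
    k = 1.0 / sum(r ** -a for r in range(1, V + 1))   # normalization for this a
    p_word = k / V ** a                                # per-word probability at rank V
    return N_eng * L * p_word                          # expected pages containing it

for a in (1.0, 1.1, 1.2, 1.3, 1.4, 1.5):
    print(f"a = {a:.1f}: ~{expected_pages(a):,.0f} pages for the rarest word")
[/code]

With a = 1 this reproduces the ~66,000 figure above; the expected count only falls to the observed ~1,500 somewhere around a ≈ 1.3–1.4, which may simply mean the law shouldn't be stretched this far down the rank list.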