These things have always left me a little curious.
When Google says “Did you mean…” and corrects your typo, or gives you a more common spelling of something, where does it get that information? Is it from what other users search for most, or what’s most common in Google’s database? Example: search for “Cecil Addams” and Google quite rightly corrects it to “Did you mean: Cecil Adams”. Evidence that it’s based on other people’s searches: sometimes, when you click the “did you mean…” alternative, it gets no hits, meaning it’s not in any of the pages indexed by Google. So it must have come from somewhere else.
Sometimes in a search, the Google results contain pages that look like someone filled in a form on a website and hit “enter”. I don’t have a real example at hand, but what I mean is, you search for something like ‘foo’, and Google brings up a bunch of links to pages containing foo, but also links to pages containing search results for foo. So you’ll have:
I’d imagine they have an algorithm similar to a basic spell checker, except that it compares what you typed against alternatives that would yield a large number of hits if you substituted them.
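To make that guess concrete, here is a minimal sketch of a hit-count-driven suggester. Nothing here is Google's actual method: the `HIT_COUNTS` table, the `ratio` threshold, and the single-edit candidate generation are all assumptions made for illustration.

```python
import string

# Hypothetical hit counts: how many indexed pages contain each word.
HIT_COUNTS = {"cecil": 300_000, "adams": 900_000, "addams": 40_000}


def single_edits(word):
    """Return every string one delete, swap, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def suggest(word, hit_counts=HIT_COUNTS, ratio=10):
    """Suggest a nearby word only if it has far more hits than what was typed."""
    word = word.lower()
    own_hits = hit_counts.get(word, 0)
    candidates = [w for w in single_edits(word) if w in hit_counts and w != word]
    if not candidates:
        return None
    best = max(candidates, key=lambda w: hit_counts[w])
    # Assumed rule: the alternative must beat the typed word by `ratio` times.
    if hit_counts[best] >= ratio * max(own_hits, 1):
        return best
    return None


print(suggest("addams"))  # the more common spelling wins
print(suggest("adams"))   # already the common spelling, so no suggestion
```

This sketch only looks one edit away, so it would not get from “mxptylk” to “Mxyzptlk”; a real system would need a wider search or a different similarity measure.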
This type of URL encoding is common on many webpages, especially ones that involve database lookup. The server reads the “term” variable out of the URL and generates a page by getting the information for “foo” from its database. No form is needed. Just browsing through the menus on the site will get you those same URLs, though much more slowly.
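As a rough illustration of the server side, the sketch below pulls the “term” variable out of a URL and builds a page from a lookup table. The `DATABASE` dict, the `render_page` function, and the example.com URL are all made up for the example; only the `urllib.parse` calls are real library functions.

```python
from urllib.parse import urlparse, parse_qs

# Stand-in for the site's database (hypothetical content).
DATABASE = {"foo": "Everything we know about foo."}


def render_page(url):
    """Build a page from the 'term' variable in the URL's query string."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each variable to a list of values; take the first one.
    term = query.get("term", [""])[0]
    entry = DATABASE.get(term)
    if entry is None:
        return f"No entry for {term!r}"
    return f"<h1>{term}</h1><p>{entry}</p>"


print(render_page("http://www.example.com/lookup?term=foo"))
```

Because the page is determined entirely by the URL, any plain link with the same query string produces the same page, whether or not a form was ever involved.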
I did a search for “mxptylk” and it correctly suggested that I was looking for “Mxyzptlk”. How it knew that is just one of the mysteries of Google, IMO. My WAG is that it gets the terms from the webpages it searches, but I have never had the “did you mean” alternatives yield zero results. Can you give us an example?
In the cases I’m thinking of, the “searched” thing is for an item that is just one example among millions, e.g. a particular model of DVD player or a word definition. There’s no link to it from the main site with an attached query string that I can see.
I’ll try to come up with an example of the “no results found”. I’ve usually encountered it when I search for something obscure. It goes like this:
me: Google, search for “SomeObscureThing”
Google: Did you mean “SomeOtherObscureThing?”
me: maybe. (clicks it)
Google: No results found.
Ah, I think I got it. If you search for a phrase in quotes and misspell one of the words, it will give you a “did you mean” that is still in quotes, and the corrected phrase as a whole could give zero results.
My WAG is still that it bases its substitutions on how many hits your word and the alternatives would yield, and doesn’t take phrases into consideration.
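Here is a toy version of that guess, just to show how a word-by-word fix can produce a quoted phrase with no hits. The `PAGES` list and the `CANDIDATES` table are invented, and the substring-based phrase match is deliberately crude; none of this is claimed to be how Google does it.

```python
# A tiny pretend index (hypothetical pages).
PAGES = [
    "gomez addams and morticia addams",
    "john adams was president",
    "adams county fair",
    "sam adams brewery",
]

# Hypothetical spelling alternatives for a misspelled word.
CANDIDATES = {"adamz": ["adams", "addams"]}


def word_hits(word):
    """Count pages that contain the word on its own."""
    return sum(word in page.split() for page in PAGES)


def phrase_hits(phrase):
    """Count pages that contain the exact quoted phrase (crude substring match)."""
    return sum(phrase in page for page in PAGES)


def correct_phrase(phrase):
    """Fix each word independently, ignoring the phrase as a whole."""
    fixed = []
    for word in phrase.split():
        options = CANDIDATES.get(word, [])
        if word_hits(word) == 0 and options:
            # Pick whichever alternative appears on the most pages.
            word = max(options, key=word_hits)
        fixed.append(word)
    return " ".join(fixed)


suggestion = correct_phrase("gomez adamz")
print(suggestion, phrase_hits(suggestion))
```

In this toy index “adams” is the more common word, so it wins the substitution, yet the phrase “gomez adams” appears on no page. That matches the “did you mean” followed by “No results found” sequence described above.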
It may be buried in the Site Map or Site Index. These are automatically generated pages that link to every page on the site, including result pages for every product in the database. Websites can also submit their sitemaps to Google, which causes the Googlebot to update the links. Google does not fill out forms. However, if someone fills out a form that generates a query string, and other websites then link to the resulting URL, Google could pick up that link.
Now, that site must have tens of thousands of words - how could there be a link to each possible word? It seems that somehow, Google is indexing the deep web.
There is a link to every possible word right in the link you provide, because every word has links to the nearest 20. All it takes is for one of those links to make it to the Googlebot (through another page or a link submission), and every word on the site gets indexed.
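A quick sketch of why one link is enough: model the site as a sorted word list where each page links to its nearest neighbours, then follow links breadth-first from a single starting page. The split of 10 neighbours on each side and the page counts are assumptions; the crawler is a generic breadth-first search, not the Googlebot's real logic.

```python
from collections import deque


def neighbor_links(index, total, span=10):
    """Pages linked from page `index`: up to `span` neighbours on each side."""
    lo = max(0, index - span)
    hi = min(total, index + span + 1)
    return [i for i in range(lo, hi) if i != index]


def crawl(seed, total, span=10):
    """Follow links breadth-first from one seed page; return every page reached."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for nxt in neighbor_links(page, total, span):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


total_words = 5000
reached = crawl(seed=1234, total=total_words)
print(len(reached) == total_words)
```

Each page's links overlap with its neighbours' links, so the crawl spreads outward in both directions until it hits both ends of the word list.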
I suppose we could keep going back and forth like this so I’m gonna go ahead and shut my yap until a more credible source chimes in.
A lot of these are “search-related advertising”: sites full of paid links that exist only to show up in Google searches. Here is a column discussing this annoying phenomenon (go to www.bugmenot.com if you need a login to read the article).
That sounds exactly like what I’m talking about, Geobabe. Just another form of spam, I guess. I can imagine Google’s engineers are working on depolluting the search index as we speak.