What's up with the way Google indexes the SDMB?

This issue came up in this thread. It seems like Google is doing something different in the way it handles the SDMB domain. For example, if I do a Google search on the following string:

I get these results. (The “site:” qualifier tells it to only search that domain.) If I do the same domain search for “Cecil”, I only get 2 hits. The same domain searches for “1920”, “1920s”, or “1920’s” (as in “1920’s style raygun”) turn up no results. Searching for “Opal” also turns up no results. Presumably it’s only indexing threads that are visible without a search at the time it happens to crawl the site. But shouldn’t it continue to return and reindex old threads? The 4th result from my “sdmb” search is a thread from 2000, which would seem to indicate that it does continue indexing old threads. But wouldn’t that guarantee that common things like “1920” and “Opal” would turn up something?
Do the URLs of old threads change? Does SDMB use norobots.txt to prevent crawling of the entire site (presumably to save the hamsters)? If either of those is what’s occurring, then why does the thread from 2000 show up?

Those would appear to be threads that were linked on other websites for some reason.

Oops, hit submit too soon.

Anyway, note that a couple of the top results were popular threads (Horror of Blimps, LOTR) that were heavily linked to on other websites. That’s probably the only reason Google indexed them.

That’s “robots.txt”, and yes it does. The “robots.txt” file appears at the root of a domain, and you can read it yourself:

http://boards.straightdope.com/robots.txt

contains:

In other words, “no bots, please”. As noted, it still apparently crawls into a few pages linked from elsewhere. If that’s what’s going on, I could argue that if it REALLY obeyed the robots.txt file, it should check that file before fetching any page at all, even one it reached through an outside link.
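For what it’s worth, here is a rough Python sketch of what “checking robots.txt first” looks like from a crawler’s point of view. The robots.txt URL is the real one above, but the thread URL in it is a made-up example, and how Google’s own crawler is actually organized is pure guesswork on my part:

    # Minimal sketch of a polite crawler's robots.txt check (Python 3).
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://boards.straightdope.com/robots.txt")
    rp.read()  # fetch and parse the file before touching anything else

    # A blanket "no bots, please" file is just:
    #   User-agent: *
    #   Disallow: /
    # and against a file like that, can_fetch() is False for every page.
    # (The threadid below is a made-up example, not a real thread.)
    print(rp.can_fetch("Googlebot",
                       "http://boards.straightdope.com/sdmb/showthread.php?threadid=1"))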

Amend that. It probably lists the links, but doesn’t crawl them.

It might be crawling boardreader links.

yabob, you’re correct. It’s “robots” rather than “norobots”. I did know that; I’ve created a few “robots.txt” files myself. I guess that’s what I get for staying up too late on the computer. But you would think that if it were obeying the file, it shouldn’t index the boards at all. I’m not sure what the rules are in this case. Is there a separate command to tell a crawler not to index anything, or is it only possible to stop it from crawling the site?

If it doesn’t actually fetch the page, but just keeps the board URL in its index, there’s no load on this site from the robot (other than fetching the robots.txt file, which I assume it caches). It might generate actual human traffic from people who do Google searches, but this is presumably desirable. The robots.txt file just allows a site to give permissions for access by automated programs, mainly as a traffic control measure. How URLs pointing to the site are to be published (or not) is a separate issue.

Off the top of my head, I don’t know of a formal mechanism to tell page-generating programs and/or page designers “don’t link to this”, but I’m sure somebody is going to tell me about one. There IS the preventative measure of refusing requests whose Referer headers point outside your site, as a lot of sites do to keep other sites from hotlinking their graphics, but that simply causes the requests to fail; it doesn’t advise page generators not to construct them.
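Something like this, just as a toy illustration of that referer check (the function name and the example.com domain are mine and purely hypothetical; in practice this check usually lives in the web server configuration rather than in application code):

    # Serve an image only if the page requesting it is on our own site.
    def allow_image_request(referer, own_host="example.com"):
        if not referer:
            # no Referer header at all (direct visit, privacy settings): allow it
            return True
        # otherwise the linking page must be on our own host
        return ("://" + own_host) in referer or ("://www." + own_host) in referer

    print(allow_image_request("http://example.com/gallery.html"))    # True
    print(allow_image_request("http://othersite.com/hotlink.html"))  # False -> answer with a 403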

I’m not talking about merely linking to it or storing the URL. It must be actually fetching the page; it couldn’t index it if it didn’t know the content. Searching on “the” site:boards.straightdope.com returns a handful of seemingly random pages, none of which have the string “the” anywhere in the URL as far as I can see. So it must have at some point retrieved and indexed those pages as containing, among other things, the word “the”. But I would assume that every page of every thread contains “the”, so what gives? Maybe it happened to hit the site at a time or times when “robots.txt” was momentarily unavailable. But then why does it continue to maintain those pages in its index? I’ve always been under the impression that Google checks on previously indexed pages periodically and that pages that become unfetchable eventually time out and are removed. That doesn’t appear to be happening here. Maybe this gives some interesting clues about what’s going on under the hood at Google.

Google is reluctant to publish detailed descriptions of what they actually do, of course, but I would not be surprised if they indexed URLs partially by the link text on pages linking to them - “sdmb” is highly likely to appear in the linking pages. When you think about it, link text should be highly relevant. “the” is a bit too common a word to use for any tests, but you might note that searches on content words that appear on practically all SDMB pages come up with nothing:

“welcome” site:boards.straightdope.com
“subscribe” site:boards.straightdope.com
“visited” site:boards.straightdope.com
“calendar” site:boards.straightdope.com

etc. Google states on its webmaster info pages that it obeys the robots.txt file. If somebody whose robots.txt file had been in place for a sufficient time found the Google search bot fetching pages from their site, Google would be risking a lot of adverse publicity.

In fact, if Google didn’t index by link text, “google bombing” wouldn’t work:

I think that you’ve explained it. Google does in fact index by link text. This is what makes “Google Bombing” possible, the most famous example being “Miserable Failure”, which returns the whitehouse.gov biography of George W. Bush as the first link. There now appears to be a counter-bombing going on, as Jimmy Carter’s biography comes up second and Michael Moore’s homepage comes up third. I can understand Carter, but I don’t see how anyone can consider someone whose movie has made $150,000,000 (regardless of his politics) a “miserable failure”. But let’s not turn this into a political thread.
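Just to make the mechanism concrete, here is a toy sketch of link-text indexing. This is entirely my own guess at the idea, not anything Google has published, and the thread URL in it is made up:

    # Toy link-text index: credit the *target* of a link with the words in
    # the link's anchor text, without ever fetching the target page.
    from collections import defaultdict

    anchor_index = defaultdict(set)  # word -> set of URLs credited with that word

    def index_link(target_url, anchor_text):
        for word in anchor_text.lower().split():
            anchor_index[word].add(target_url)

    # Somebody on another site links to a board thread as
    # "the Horror of Blimps thread on the SDMB" ...
    index_link("http://boards.straightdope.com/sdmb/showthread.php?threadid=1",
               "the Horror of Blimps thread on the SDMB")

    # ... so a search for "sdmb" can return that URL even though robots.txt
    # kept the crawler from ever reading the page itself, which is also why
    # words that only appear on the pages, like "welcome" or "calendar",
    # find nothing.
    print(anchor_index["sdmb"])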

You beat me to it!

Also note that all of the Googled threads are referenced just by the URL, and not by the page title the way most Google results are. This would also seem to be evidence that Google is not actually checking the page itself (in compliance with robots.txt).