SDMB not crawled by search engines?

Ever notice that no matter how many times you Google something, it never picks up any pages from SDMB? Is there some kind of firewall here to keep search engines out?

I HAVE seen SDMB pages in Google occasionally. Not often, I'll admit.

There is a recognized standard for precluding crawling by web robots:

http://www.searchengineworld.com/robots/robots_tutorial.htm

http://boards.straightdope.com/robots.txt does not exist, but http://www.straightdope.com/robots.txt does, and contains:


User-agent: *
Disallow: /bonus/

This suffices to keep well-behaved robots from crawling past the published straight dope “front door”, which might be how they would normally reach the message boards.

There could be some blocks placed against known search engines at other levels, of course, either at a firewall or simply by IP blocking in vBulletin.

BTW, that “Disallow:” line advertises another path under www.straightdope.com, which produces the Straight Dope banner and footer with the text “Hey! You’re not supposed to be rooting around in here!” as its content. Does any admin care to comment on what’s in /bonus/?

DUH!

Excuse me. I’m an idiot. I stated that backwards. The www.straightdope.com/robots.txt file lets all robots in, except that it disallows them from crawling /bonus/. So it DOESN’T stop robots from crawling the links from the front page, only the ones down that mysterious /bonus/ path.
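To double-check that reading, here’s a minimal sketch using Python’s standard-library robots.txt parser, fed the exact two rules quoted above. It confirms that well-behaved robots are blocked only under /bonus/, not from the rest of the site:

```python
# Verify the semantics of "User-agent: * / Disallow: /bonus/" with the
# standard-library parser, feeding the rules directly instead of fetching.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /bonus/",
])

print(rp.can_fetch("*", "http://www.straightdope.com/"))         # -> True (allowed)
print(rp.can_fetch("*", "http://www.straightdope.com/bonus/x"))  # -> False (blocked)
```

So the front-page links are fair game for crawlers as far as robots.txt is concerned.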

More on this. There is also a robots <META> tag which search engines are supposed to honor:

http://searchengineworld.com/metatag/robots.htm

The tag doesn’t seem to be present in the SDMB pages. The IPs of known search engines could still be blocked by other mechanisms, as I said.
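For anyone who wants to check a page themselves, here’s a hedged sketch that scans an HTML page’s markup for a robots <META> tag using only the standard library. The sample page below is hypothetical, not actual SDMB output:

```python
# Scan HTML for a <meta name="robots" content="..."> tag.
# The sample markup is hypothetical, not taken from the SDMB.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = None  # content of the robots meta tag, if found

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.directives = a.get("content")

sample = """
<html><head>
<title>Hypothetical page</title>
<meta name="robots" content="noindex, nofollow">
</head><body></body></html>
"""

finder = RobotsMetaFinder()
finder.feed(sample)
print(finder.directives)  # -> noindex, nofollow
```

If a page carried `content="noindex, nofollow"`, a well-behaved engine would neither index it nor follow its links; the absence of the tag on SDMB pages means they aren’t opting out this way either.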

Hehe, I found that, too. But you dare to ask? :eek:

This thread may be interesting to you: Why isn’t the SDMB indexed on Google?

No limiting on our side, it seems. Google does restrict its crawling of dynamically generated sites, but we know by now of at least two archiving sites (www.archive.org and www.boardreader.com) that didn’t encounter any form of resistance when spidering our site.

I’m still eager to hear from our ‘board officials’ whether they think losing some bandwidth to crawler traffic is an issue.

And about the /bonus/ thing. :slight_smile: