Google Sitemaps for board crawling woes

alterego · October 22, 2008, 1:34am

I see that letting the spiders on was crashing the board. I would suggest only allowing Google on during the test period. You can submit the board content to Google through a Sitemap and specify certain properties for each page. The particularly interesting ones for reducing crawling load are:

[ul]
[li]How often the pages on your site change. For example, you might update your product page daily, but update your About Me page only once every few months.[/li]
[li]The date each page was last modified.[/li]
[/ul]

Mark threads that are too old to be posted in, per board policy, as never updating. Mark threads that can still be posted in as updated every so often. Submit all new and updated threads to Google each night, or however often, using Sitemaps. Googlebot should come in once, download all the data, and then only come back for the updates after that. This should significantly reduce crawling load because you are only dealing with Google, and Google knows exactly what content to get. In theory.
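To make the idea concrete, here is a rough Python sketch of the nightly Sitemap build. It is only an illustration, not how any vBulletin extension actually does it: the `Thread` class, the `build_sitemap` function, the example URLs, and the 90-day cutoff are all made up for the example. The `changefreq` and `lastmod` elements and the namespace come from the Sitemap protocol.

```python
from dataclasses import dataclass
from datetime import date, timedelta
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Thread:
    """Hypothetical stand-in for a row from the board's thread table."""
    url: str
    last_post: date


def build_sitemap(threads, today, lock_after_days=90):
    """Return Sitemap XML as a string.

    Threads whose last post is older than lock_after_days (an assumed
    board policy) are marked 'never'; everything else is marked 'daily'.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for thread in threads:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = thread.url
        ET.SubElement(url, "lastmod").text = thread.last_post.isoformat()
        too_old = (today - thread.last_post) > timedelta(days=lock_after_days)
        ET.SubElement(url, "changefreq").text = "never" if too_old else "daily"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    threads = [
        Thread("http://example.com/showthread.php?t=1", date(2008, 1, 1)),
        Thread("http://example.com/showthread.php?t=2", date(2008, 10, 20)),
    ]
    print(build_sitemap(threads, today=date(2008, 10, 22)))
```

Keep in mind that `changefreq` is a hint to the crawler rather than a command, so how much load this saves depends on how closely Googlebot follows it.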

I see that some vBulletin Sitemap extensions have already been developed.

http://www.google.com/support/webmasters/bin/answer.py?answer=40318
http://www.google.com/search?q=vbulletin+sitemap+google

alterego · October 26, 2008, 10:32pm

I was honestly surprised when Jerry flipped the switch allowing the spiders on. It’s almost as if he hasn’t analyzed the web logs to see just how many spiders are hitting this site. If you don’t put proper limits in place, such as the Sitemap approach above, your only choice will be to add more nodes to the cluster.
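For anyone who wants to check the logs, something like this Python sketch would give a rough count of spider hits. It assumes the access log is in Apache combined format, where the User-Agent is the last quoted field. The `count_spider_hits` name and the token list are just illustrative, and the list is nowhere near complete.

```python
import re
from collections import Counter

# Illustrative substrings only; real spider lists are much longer.
SPIDER_TOKENS = ("googlebot", "slurp", "msnbot")

# In combined log format the User-Agent is the last quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_spider_hits(lines):
    """Count requests per spider token across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not end in a quoted field
        user_agent = match.group(1).lower()
        for token in SPIDER_TOKENS:
            if token in user_agent:
                counts[token] += 1
                break
    return counts
```

Running it over a day's log, for example `count_spider_hits(open("access.log"))` with whatever path the server actually uses, should show quickly which spiders account for most of the traffic.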

Have you also considered showing spiders the static archive of a given page and users the dynamically generated version? This kind of cloaking is completely legitimate and saves on CPU cycles.
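A minimal Python sketch of the spider/user split, just to show the shape of it. The function names, the token list, and the URL patterns are assumptions for illustration (the archive path is modeled on the usual vBulletin archive layout, but check it against the actual board). Matching on User-Agent is also easy to spoof, so treat this as a load-saving measure rather than access control.

```python
# Illustrative substrings only; extend to match the spiders seen in the logs.
SPIDER_TOKENS = ("googlebot", "slurp", "msnbot")


def is_spider(user_agent):
    """Return True if the User-Agent header looks like a known spider."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in SPIDER_TOKENS)


def pick_url(thread_id, user_agent):
    """Choose the static archive page for spiders, the dynamic page for users.

    Both URL patterns are assumptions made for this example.
    """
    if is_spider(user_agent):
        return f"/archive/index.php/t-{thread_id}.html"
    return f"/showthread.php?t={thread_id}"
```

The point is that both versions carry the same thread content, so the spider gets the cheap page and the user gets the full one.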