How deep could I make the Google Crawler dig?

Let’s say hypothetically (and I may still do this if I can be bothered), I build a webpage and put it on the internet. The page contains random words, and has a single link on it to another page. When that link is clicked, “page 2” loads with another random assortment of words. “Page 2” will also contain a single link to “page 3”, which, if clicked, will show another page with a random assortment of words. This cycle would repeat itself “theoretically” forever.

I would code the pages to load dynamically - I wouldn’t need to actually sit there and create a gazillion pages. The whole thing would be dynamic, and each page would have its own absolute URL.
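Just to make the idea concrete, here’s a rough sketch of what that could look like. Flask and the word list are arbitrary choices of mine, not anything prescribed; any framework that can map a URL pattern to a handler would do the same job.

```python
# Rough sketch: every request to /page/<n> returns random words plus a single
# link to /page/<n+1>, so the "site" is bottomless without any pages on disk.
import random
from flask import Flask

app = Flask(__name__)

# Made-up word list; a real version might use a dictionary file.
WORDS = ["apple", "quartz", "meander", "lantern", "sepia", "gravel", "thistle", "orbit"]

@app.route("/page/<int:n>")
def page(n):
    body = " ".join(random.choices(WORDS, k=200))
    return (f"<html><body><p>{body}</p>"
            f"<a href='/page/{n + 1}'>Next page</a></body></html>")

if __name__ == "__main__":
    app.run()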

How long before the Google Crawler says “screw this”?

Google’s crawler seems pretty smart about indexing dynamic pages, and about detecting useless content. If the content were just random words, it’s very likely that nothing would even get indexed.

I ran a site with dynamically generated pages backed by a large database of geographic points of interest, and Google seemed to index it inconsistently. I surmised that there is likely a limit on how many pages it will index.

If you want a large dataset indexed properly, you use Google’s sitemap tool and submit an XML file that contains all the URLs on your site. This seems to work much better than expecting the crawler to follow links forever.
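For anyone curious, generating that XML is trivial. The sketch below is just an illustration (the domain and URL scheme are made up); note that the sitemap protocol caps a single file at 50,000 URLs, so a really big site needs a sitemap index file on top of this.

```python
# Rough sketch: write a sitemap.xml listing every dynamic URL on the site.
from xml.sax.saxutils import escape

def write_sitemap(urls, path="sitemap.xml"):
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        f.write("</urlset>\n")

# Hypothetical URL pattern for the points-of-interest pages.
write_sitemap(f"https://example.com/poi/{i}" for i in range(1, 1001))
```

You then point Google at the file via Search Console (or a Sitemap: line in robots.txt) instead of hoping the crawler walks every link.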

Thanks for the answer.

Yes I’ve seen limits too. Google could get over 700,000 unique URLs from one site I administer, but doesn’t.

A huge part depends on where the links sit and who’s linking to you.

If you have a link on page 100 and that link comes from The New York Times or Time Magazine, it’s highly likely Google is gonna crawl deeper and work its way backwards and forwards from there.

This is why it’s useful to have sites link directly deep into your site rather than just the home page.

It would be an interesting (and easy) experiment. So that it’s not “random content,” pull a bunch of books from Project Gutenberg. Set up a site that looks like it has static pages, but have each deliver one page (or, heck, one paragraph) from a book, with next & previous links at the bottom. You could swiftly create a site with many thousands of serially-linked pages.
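A rough sketch of how cheaply that could be done, purely as an illustration (the filename and URL scheme are placeholders, and Flask is just my choice here):

```python
# Sketch: split one Project Gutenberg plain-text book into paragraphs and serve
# each paragraph as its own "static-looking" page with previous/next links.
from flask import Flask, abort

app = Flask(__name__)

# Placeholder filename for a downloaded Gutenberg text.
with open("pg_book.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

@app.route("/book/<int:n>.html")
def paragraph(n):
    if not 1 <= n <= len(paragraphs):
        abort(404)
    prev_link = f"<a href='/book/{n - 1}.html'>Previous</a>" if n > 1 else ""
    next_link = f"<a href='/book/{n + 1}.html'>Next</a>" if n < len(paragraphs) else ""
    return (f"<html><body><p>{paragraphs[n - 1]}</p>"
            f"{prev_link} {next_link}</body></html>")
```

One average-length novel would already give you a few thousand serially-linked pages; a shelf of them gets you to tens of thousands.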

No need to do the experiment. Bad guys try exactly that all the time, have done for years. I don’t know precisely what countermeasures Google use, but detecting duplicate content is certainly one of them. They’re generally pretty good at catching and de-indexing artificial sites like those described in this thread, particularly large ones.

Related: Official Google Webmaster Central Blog: To infinity and beyond? No!

I’d never really thought of this before. My site has about 42,000 pages (all static HTML, no JavaScript-generated links or pages), most linking to many other pages. Google apparently only sees about 8,500 of them.

Thinking about it further, this helps explain why Google sometimes shows a certain page as the “hit” when a better page exists: it simply hasn’t bothered to index the better one.

The experiment I proposed didn’t have any duplicate content.

Very interesting and very to-the-point.