How deep could I make the Google Crawler dig?

Let’s say hypothetically (and I may still do this if I can be bothered), I build a webpage and put it on the internet. The page contains random words, and has a single link on it to another page. When that link is clicked, “page 2” loads with another random assortment of words. “Page 2” will also contain a single link to “page 3”, which, if clicked, will show another page with a random assortment of words. This cycle would repeat itself “theoretically” forever.

I would code the pages to load dynamically - I wouldn’t need to actually sit there and create a gazillion pages. The whole thing would be dynamic, and each page would have its own absolute URL.
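Just to make the idea concrete, here’s a rough sketch of what that could look like. Flask and the word list are arbitrary choices of mine, not anything prescribed; any framework that can map a URL pattern to a handler would do the same job.

```python
# Rough sketch: every request to /page/<n> returns random words plus a single
# link to /page/<n+1>, so the "site" is bottomless without any pages on disk.
import random
from flask import Flask

app = Flask(__name__)

# Made-up word list; a real version might use a dictionary file.
WORDS = ["apple", "quartz", "meander", "lantern", "sepia", "gravel", "thistle", "orbit"]

@app.route("/page/<int:n>")
def page(n):
    body = " ".join(random.choices(WORDS, k=200))
    return (f"<html><body><p>{body}</p>"
            f"<a href='/page/{n + 1}'>Next page</a></body></html>")

if __name__ == "__main__":
    app.run()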

How long before the Google Crawler says “screw this”?

Google’s crawler seems pretty smart about indexing dynamic pages, and about detecting useless content. If the content were just random words, it’s very likely that nothing would even get indexed.

I ran a site with dynamically generated pages backed by a large database of geographic points of interest, and Google seemed to index it inconsistently. I surmised that there is likely a limit on how many pages it will index.

If you want a large dataset indexed properly, you use Google’s sitemap tool and submit an XML file that contains all the URLs on your site. This seems to work much better than expecting the crawler to follow links forever.
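For anyone curious, generating that XML is trivial. The sketch below is just an illustration (the domain and URL scheme are made up); note that the sitemap protocol caps a single file at 50,000 URLs, so a really big site needs a sitemap index file on top of this.

```python
# Rough sketch: write a sitemap.xml listing every dynamic URL on the site.
from xml.sax.saxutils import escape

def write_sitemap(urls, path="sitemap.xml"):
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        f.write("</urlset>\n")

# Hypothetical URL pattern for the points-of-interest pages.
write_sitemap(f"https://example.com/poi/{i}" for i in range(1, 1001))
```

You then point Google at the file via Search Console (or a Sitemap: line in robots.txt) instead of hoping the crawler walks every link.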

Thanks for the answer.

Yes I’ve seen limits too. Google could get over 700,000 unique URLs from one site I administer, but doesn’t.

A huge part depends on where the links sit and who’s linking to you.

If you have a link on page 100 and that link comes from The New York Times or Time Magazine, it’s highly likely Google is gonna crawl deeper and work its way backwards and forwards from there.

This is why it’s useful to have sites link directly deep into your site rather than just the home page.

It would be an interesting (and easy) experiment. So that it’s not “random content,” pull a bunch of books from Project Gutenberg. Set up a site that looks like it has static pages, but have each deliver one page (or, heck, one paragraph) from a book, with next & previous links at the bottom. You could swiftly create a site with many thousands of serially-linked pages.
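A rough sketch of how cheaply that could be done, purely as an illustration (the filename and URL scheme are placeholders, and Flask is just my choice here):

```python
# Sketch: split one Project Gutenberg plain-text book into paragraphs and serve
# each paragraph as its own "static-looking" page with previous/next links.
from flask import Flask, abort

app = Flask(__name__)

# Placeholder filename for a downloaded Gutenberg text.
with open("pg_book.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

@app.route("/book/<int:n>.html")
def paragraph(n):
    if not 1 <= n <= len(paragraphs):
        abort(404)
    prev_link = f"<a href='/book/{n - 1}.html'>Previous</a>" if n > 1 else ""
    next_link = f"<a href='/book/{n + 1}.html'>Next</a>" if n < len(paragraphs) else ""
    return (f"<html><body><p>{paragraphs[n - 1]}</p>"
            f"{prev_link} {next_link}</body></html>")
```

One average-length novel would already give you a few thousand serially-linked pages; a shelf of them gets you to tens of thousands.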

No need to do the experiment. Bad guys try exactly that all the time, have done for years. I don’t know precisely what countermeasures Google use, but detecting duplicate content is certainly one of them. They’re generally pretty good at catching and de-indexing artificial sites like those described in this thread, particularly large ones.

Related: Official Google Webmaster Central Blog: To infinity and beyond? No!

I’d never really thought of this before. My site has about 42,000 pages (all static HTML, no JavaScript-generated links or pages), most linking to many other pages. Google apparently only sees about 8,500 of them.

Thinking about it further, this helps explain why Google sometimes shows a certain page as the “hit” when a better page exists: it simply hasn’t bothered to index the better one.

The experiment I proposed didn’t have any duplicate content.

Very interesting and very to-the-point.