How does Google cache web pages that require a subscription?

Sometimes in the results of a Google search, I’ll see pages that require you to subscribe to see the info. When I click, I get the prompt to register (I think the New York Times is a classic example). Yet, if I click the “cached” button, Google has a copy of the content. How does Google do it? Does it have a subscription to every web site on the planet, or do the websites allow “robots” in by default?

When certain web pages are first published (e.g. New York Times pieces), they can be read for free for a certain period of time. In the case of the NY Times, it’s one week. After that they’re moved into the archive and you have to pay to see them. Google caches the pages before they are archived. Dunno about the legality of it though.


IIRC, you can also append “&partner=Google” to the end of a NYTimes URL to view pages reserved for registered users.

I always guessed sites allowed Google’s spiders to browse/archive them so that the pages will turn up in search engines and people will be inclined to subscribe. But if they had their choice they probably wouldn’t want their pages cached.

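Sites do have the option to not have their pages cached. I believe they can just use the NOARCHIVE option, i.e. a robots meta tag in the page’s head with “noarchive” in its content, which asks search engines not to offer a cached copy.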
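
For anyone curious what that looks like in practice, here is a rough sketch in Python that checks whether a page's HTML carries a noarchive directive. It only uses the standard library's html.parser. The names RobotsMetaParser and allows_caching are made up for this example, and treating a "googlebot" meta tag the same as "robots" is my assumption about how a crawler would read it, not a description of what Google actually runs.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        # Assumption: a "googlebot" meta tag is treated like a "robots" one.
        if attr_map.get("name", "").strip().lower() in ("robots", "googlebot"):
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def allows_caching(html):
    """Return False if the page asks crawlers not to keep a cached copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noarchive"></head></html>'
    print(allows_caching(page))
```

So a site that wants to show up in search results but not in the cache would leave indexing alone and just add that one directive to its pages.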