How much of the internet does Google have cached?

When you do a search on Google, most, if not all, of the results have a cached link. The same goes for images.

I assume this to mean that Google has a copy of the site stored on their local servers from the most recent time they crawled it.

Which sorta makes sense. To be able to do a keyword search they have to have the entire text readily available.
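The usual way to make keyword search fast is an inverted index: a map from each word to the documents that contain it. This is a toy sketch of the idea, nothing like Google's actual data structures (the document names and text here are made up):

```python
# Toy inverted index: maps each word to the set of documents containing it.
# Purely illustrative -- not Google's real system.
from collections import defaultdict

docs = {
    "page1": "google caches a copy of each page",
    "page2": "the cached copy is served from their servers",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# A keyword lookup is now just a dictionary hit plus set operations;
# there's no need to rescan the raw pages at query time.
print(index["copy"])    # both pages contain "copy"
print(index["cached"])  # only page2
```

The point is that building the index requires having had the full text at crawl time, which is exactly the copy that can also be served as the "cached" link.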

Just how many millions of gigabytes does Google have in their data centers? And how much of the internet is stashed on them?

First link … Google CEO Eric Schmidt estimates they have indexed 0.004% of an estimated 5 million terabytes.
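Taking those two numbers at face value, the back-of-the-envelope arithmetic works out to a surprisingly small figure:

```python
# Plugging in the figures from the Schmidt quote above.
total_tb = 5_000_000             # estimated size of the internet, in terabytes
indexed_fraction = 0.004 / 100   # 0.004 percent, as a fraction
indexed_tb = total_tb * indexed_fraction
print(indexed_tb)                # 200.0 -- i.e. about 200 terabytes indexed
```

So by that estimate Google had indexed on the order of 200 TB at the time, which sounds tiny, but the 5-million-terabyte figure counts everything on the internet, not just the searchable web.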

Probably a rather small amount. Google focuses on the Web, publicly accessible FTP sites, and the text-only component of Usenet, with a relatively small foray into email. It ignores things like VPNs, private FTP sites, all the binaries being shot around Usenet, BitTorrent and all the other ways to share files online, most of the rest of email, and all the other things the Internet is being used for these days.

The Internet is somewhere between an ocean, a cesspool, and an iceberg: most of its mass is unseen, hidden in the depths and relatively inaccessible.

This is a much harder question than you may have anticipated. First of all, Google doesn't index the majority of the web, let alone the entire internet. Figuring out what percentage of the web they index is step one to answering the question, and even that isn't known. I have read a guesstimate that Google indexes maybe 10% of the web. Most of the web is the "Deep Web": content on sites that Google can't see or doesn't have a good way to index. The SDMB is a good example. Some threads are indexed, but far from most of them, so Google doesn't make a good search tool for the hundreds of thousands or millions of websites like this out there, whether they are recreational or business-related.
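One of the mechanisms behind this (besides login walls and dynamically generated pages, which crawlers simply can't reach) is that site operators can opt out of crawling with a robots.txt file at the root of the site. A hypothetical example of one (not the SDMB's actual configuration) might look like:

```
# Hypothetical robots.txt: let Google crawl most pages,
# but keep member-only areas out of the index.
User-agent: Googlebot
Disallow: /members/
Disallow: /private/

# Block all other crawlers entirely.
User-agent: *
Disallow: /
```

Anything disallowed this way, plus everything behind a login or generated on the fly, never makes it into the index at all.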

Google has always been secretive about how many data centers they even have, let alone how much capacity they have. We do know that they run on hot-swappable, plain-vanilla computers similar to a regular desktop computer, and that they have at least tens of thousands of them in their larger data centers, if not more. This means that they can upgrade their capacity easily by switching out servers, and they can do it on the fly, especially if they have extra rack space at that data center. It is an extremely efficient and flexible design. They cache most of what they do index because they need it to build their search algorithms or to recalculate them in the future.

There are YouTube videos and articles available covering the Google data centers. They are impressive, but the information is almost always out of date by the time it is released because Google moves so fast.