Google's capacity - saving every search

I had an argument with a friend yesterday evening, which led to us making a bet.

He claims that Google records every search ever made - including all search data.

That is, given a search for “cats and dogs”, both the query and the result set are stored. If another person searches for “cats and dogs”, the same (saved) data set will be served. This, he says, is backed up by the fact that there is an “auto-suggest” drop-down in the Google search textbox that suggests search terms as you type. He infers that these suggested terms are stored queries that other people have made, and for which a dataset is immediately available.

I claim that this is impossible due to the sheer quantity of searches made (even in a single day), which would create a massive storage problem (apart from being somewhat pointless, since the internet is very dynamic).

I accept that Google collects metadata and statistics on all searches, but not the complete data - this is how the auto-suggest drop-down contents are obtained.
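To illustrate what I mean: auto-suggest can be driven by aggregated query counts alone, with no stored result sets anywhere. A minimal sketch (the queries, counts, and `suggest` function are all made up):

```python
# Hypothetical sketch: auto-suggest from aggregated query counts alone.
# No result sets are stored -- only how often each query string was seen.
from collections import Counter

# Aggregated statistics (made-up numbers for illustration).
query_counts = Counter({
    "cats and dogs": 9120,
    "cats and dogs movie": 4711,
    "catskill mountains": 2002,
    "category theory": 1337,
})

def suggest(prefix, k=3):
    """Return the k most frequent past queries starting with prefix."""
    matches = [(q, n) for q, n in query_counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda pair: -pair[1])
    return [q for q, _ in matches[:k]]

print(suggest("cat"))
# ['cats and dogs', 'cats and dogs movie', 'catskill mountains']
```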

Now, my friend is most argumentative, and in the end I would expect that neither of us will accept defeat. However, I would like to pose the following questions, to bolster my argument:

What is the estimated capacity of the Google servers? (specifically those that deal with searches, rather than Google maps, Gmail etc etc)
What is the estimated average amount of data processed via Google search every day/month/year?

I can’t speak for Google, but my company does recommendations (“People who bought this also bought this”) and we record every single request and every answer set. It’s a huge volume of data and we keep every single record. It’s some of the most valuable data we own. I have no problem believing that Google keeps their data in the same way. I’m sure, like our data, it gets processed and manipulated so the original raw data isn’t used after a short while, but we keep the raw data around in case we need it.

Here’s an article from Jan 2008 - Google Processing 20,000 Terabytes A Day, And Growing | TechCrunch. Surprisingly, I couldn’t find a more recent article in a quick Google search.

According to the privacy FAQ, they do store every single search you make.

***What is the estimated capacity of the Google servers? (specifically those that deal with searches, rather than Google maps, Gmail etc etc)***
***What is the estimated average amount of data processed via Google search every day/month/year?***

Yes, Google does store everything. They have multiple data centers around the world and consume enormous amounts of electricity. For example, they opened a data center in the Columbia River valley in Oregon because of the cheap hydropower.

A couple of years ago, there was a NYT article that detailed in-depth Google’s data storage needs and power consumption requirements. I’m too lazy to find it but it should be easy to find via Google and/or nyt.com.

It’s assumed they do, but no one knows anything for sure. One thing Google uses the searches for is to provide you with search results faster. For example, if you look for Markxxx, Google can look back at the last search for Markxxx and bring up the cached results of that search.
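Something like this, in miniature (the TTL and `run_search` are made up; this is just the shape of a query cache, not Google’s implementation):

```python
# Hypothetical sketch of a query-result cache with a time-to-live (TTL).
# A repeated query within the TTL is served from the stored result set;
# after that, results are recomputed (and the crawl can update them).
import time

TTL_SECONDS = 3600          # made-up freshness window
_cache = {}                 # query -> (timestamp, results)

def run_search(query):
    """Stand-in for the real (expensive) ranking pipeline."""
    return [f"result for {query!r} #{i}" for i in range(3)]

def cached_search(query):
    now = time.time()
    hit = _cache.get(query)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]               # served from the saved data set
    results = run_search(query)     # fresh computation
    _cache[query] = (now, results)
    return results
```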

Then later on, when no one is looking, the Google spiders can crawl the web, look for information and links relating to Markxxx, and add that into their information and search results.

This way they can also remove spammy sites and more easily prevent those sites from being reindexed into their search results.

They also just opened a server farm in Lenoir, NC because of the spare electricity capacity in that area - they used to make furniture there but most of those places have closed.

The Google center is in The Dalles, Oregon. And it’s truly huge and sucks a lot of energy.

I worry about the amount of info that Google has. I don’t use Gmail because the searches are already too much. If you’re a heavy Internet user, the amount of personal info they have is staggering. If you ever Google yourself, they have your name. In most cases you’ll at least Google some friends and family, so they’ll have those. If you use Maps for directions, they have your street address and probably the addresses of some of your friends, or other places you drive to such as doctors’ offices. They have info on all your sexual fetishes, your political beliefs, your hobbies, and what kinds of products you shop for.

I’m not clear on the “anonymizing IP addresses after 9 months” part. Does that mean they keep all your searches associated with each other, but under some anonymous identifier instead of the IP address? If so, then you could still be identified if you’ve Googled yourself or your family, or by other means. Or do they just keep one master log of all searches performed, without associating one search with another?
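If it’s the former, one common technique is to swap the IP for a keyed hash, so searches stay linked to each other without the address itself. A minimal sketch of that interpretation (the key and function are invented; this is not Google’s documented scheme):

```python
# Hypothetical sketch of the first interpretation: the IP address is
# replaced by a pseudonymous identifier (a keyed hash), so all searches
# from one address stay linked to each other without storing the address.
# This is NOT Google's documented scheme -- just one common technique.
import hashlib
import hmac

SECRET_KEY = b"rotate-me-periodically"   # made-up key

def anonymize(ip):
    return hmac.new(SECRET_KEY, ip.encode(), hashlib.sha256).hexdigest()[:16]

log = [
    (anonymize("203.0.113.7"), "my own name"),
    (anonymize("203.0.113.7"), "cats and dogs"),
]
# Both entries share an identifier, so -- as the worry above goes -- a user
# who has Googled themselves could still be re-identified from the linked log.
```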

Let’s say a search is 2 kB of data. That should be plenty for the search string, client metadata, and results (stored as references to information archived elsewhere).

If you’ve got $100M to spend on storage each year, and a terabyte costs $2000, this means you can store about 800,000 searches per second. That would be like one billion users submitting a search every 20 minutes.
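Working that through (all inputs are the assumptions above):

```python
# Back-of-envelope check of the figures above (all inputs are assumptions).
bytes_per_search = 2_000                 # 2 kB per logged search
budget_per_year = 100e6                  # $100M storage budget per year
dollars_per_tb = 2_000                   # $2000 per terabyte
seconds_per_year = 365 * 24 * 3600       # ~3.15e7

tb_per_year = budget_per_year / dollars_per_tb            # 50,000 TB/year
bytes_per_second = tb_per_year * 1e12 / seconds_per_year  # ~1.6 GB/s
searches_per_second = bytes_per_second / bytes_per_search
print(f"{searches_per_second:,.0f} searches/s")           # ~793,000

# Sanity check: one billion users searching once every 20 minutes
print(f"{1e9 / (20 * 60):,.0f} searches/s")               # ~833,333
```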

These are all obviously way ballpark numbers, but it seems pretty feasible to me.

So?

(Serious question. Why is that a big deal? My credit card issuer has most of the same info. My health insurance company has the rest. All of that can be subpoenaed, I believe, and so the government has all that too. Hasn’t seemed to make life miserable so far.)

I wonder how much Google pays per 1TB of storage. You can buy 1TB drives for ~$100 these days. Keeping it powered is probably the biggest cost.

As for how they process that much data, they figured out a really cool way of doing it called MapReduce.
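In miniature, the model looks like this (a toy single-process illustration of the programming model, not Google’s code):

```python
# Toy illustration of the MapReduce programming model: count how often
# each query appears in a pile of logs. The real system shards the map
# and reduce phases across thousands of machines; this runs in one process.
from itertools import groupby

def map_phase(log_lines):
    for line in log_lines:
        yield (line.strip().lower(), 1)          # emit (key, value) pairs

def reduce_phase(pairs):
    pairs = sorted(pairs)                        # "shuffle": group by key
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (key, sum(v for _, v in group))

logs = ["cats and dogs", "weather", "Cats and Dogs", "weather"]
print(dict(reduce_phase(map_phase(logs))))
# {'cats and dogs': 2, 'weather': 2}
```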

Wanna take it a step further? According to my brother, the iPhone’s ‘GPS’ ability is partially based on finding routers in the area, getting their MAC addresses, and cross-referencing them against where those routers were originally found (a company called Skyhook drove all over the country picking up router signals and their unique IDs). My brother has even seen this confirmed: his roommate has an old router that hasn’t been plugged in in several years. He plugged it in, and his iPhone thought it was at his old apartment.
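The lookup itself is dead simple, which is exactly why a moved router fools it. A made-up sketch of the idea (BSSIDs and coordinates invented; not Skyhook’s actual database):

```python
# Hypothetical sketch of Wi-Fi positioning: a survey database maps each
# router's MAC address (BSSID) to where it was observed. A moved router
# keeps its MAC, so the lookup confidently returns the OLD location --
# exactly the roommate's-old-apartment effect described above.
survey_db = {
    "00:1a:2b:3c:4d:5e": (40.7484, -73.9857),   # made-up BSSID -> (lat, lon)
    "66:77:88:99:aa:bb": (45.5946, -121.1787),
}

def locate(visible_bssids):
    hits = [survey_db[b] for b in visible_bssids if b in survey_db]
    if not hits:
        return None
    # naive average of the known router positions
    return (sum(lat for lat, _ in hits) / len(hits),
            sum(lon for _, lon in hits) / len(hits))

print(locate(["00:1a:2b:3c:4d:5e"]))   # the router's OLD surveyed location
```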

They obviously can’t store the results, since then every time anyone ever searched for “cats and dogs” again they’d get the same set of hits, and new pages would never be added.

Google datacenters are generally located near hydroelectric dams, precisely because the electricity is cheaper there. That, and being near a river simplifies their considerable cooling problem.

I don’t think that follows, and I’m pretty sure they must store the search results. You have to know what results were presented to the user, then which ones were clicked on and which weren’t, to refine your algorithm. There’s a huge amount of valuable information (in aggregate, no one cares what any individual user does) in being able to capture search results and see what is clicked on.

This doesn’t mean static results, since previous results can make up a part of the scoring algorithm. The user’s search history, geographic location, and preferences can also factor into the results.
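A minimal sketch of the kind of impression/click record that argument implies (all field names invented):

```python
# Hypothetical sketch of the impression/click logging described above:
# record which results were SHOWN and which were CLICKED, so the ranking
# algorithm can be refined in aggregate. Field names are invented.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchImpression:
    query: str
    shown_urls: List[str]               # the result set actually presented
    clicked_urls: List[str] = field(default_factory=list)
    location: str = ""                  # coarse geography, per the post

impression = SearchImpression(
    query="cats and dogs",
    shown_urls=["example.org/pets", "example.com/cats", "example.net/dogs"],
    location="Oregon",
)
impression.clicked_urls.append("example.com/cats")
# The rank-2 result beat rank 1 here; in aggregate, signals like this
# feed back into the scoring algorithm.
```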

You do get the same pages every time you search for “cats and dogs.” You’ll get a different set next month or next week or possibly tomorrow, but for today the servers have to do a vastly smaller amount of work.

Here’s an article about Google’s data facilities: link.

The article is more than a year old. They explicitly have technology for handling and parallel-processing datasets measured in petabytes.

Their total storage capacity is probably in the high-petabyte, if not low-exabyte, range (speculation).

They’ve got plenty of space to store those trivial little result sets, who sent them, and probably your individual DNA sequences if they wanted to.