Simple enough question - I’m rather versed in computer matters, but I don’t know squat about database design etc. Can someone explain in layman’s terms how a database like Google can search for “monkey poop” among 50 billion indexed Web pages, not to mention organize them and present me the results, in 0.03 seconds?
Most of their processing power goes towards creating an index that tells them the contents of webpages. The SDMB uses a similar mechanism for its searches, so when it processes this post overnight, it’ll record in its table that it contains the words “most”, “their”, “processing”, “power”, and so on. When it goes to search, it just looks in that table.
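To make that concrete, here is a toy sketch in Python of what such a word-to-post table (an “inverted index”) might look like. The sample posts are invented, and this is not the SDMB’s or Google’s actual code, just the general idea of trading an overnight pre-processing pass for fast lookups later.

```python
# Minimal inverted-index sketch: map each word to the set of post IDs containing it.
# The posts below are invented examples.
from collections import defaultdict

posts = {
    101: "most of their processing power goes towards creating an index",
    102: "where do they put four hundred thousand servers",
    103: "monkey poop among fifty billion indexed web pages",
}

index = defaultdict(set)
for post_id, text in posts.items():
    for word in text.lower().split():
        index[word].add(post_id)

# A search is now a table lookup instead of rereading every post.
print(sorted(index["index"]))    # [101]
print(sorted(index["monkey"]))   # [103]
```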
The other part of the equation is that Google, unlike the SDMB, has an estimated 400,000 servers dedicated to the task, with the most popular portions of the index loaded into RAM.
Exactly how Google works is a closely guarded secret, and it likely evolves over time. However, the general idea of indexing covers most of it. You have probably run a search on your own computer that took ten minutes to find the files containing “SDMB” or something. That is because Windows literally searches every single byte of every file to find that string. Imagine if Windows pre-processed all those files so that it had a table, much like the index of a book, that already knows which files contain that string. That would be extremely fast, and it wouldn’t have to be just for that string. It could look up multiple strings in the index and return only those files that contain both of them, or any combination you want. Google is successful because it seems to have invented some types of super-indexes that work really well.
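To illustrate the “any combination you want” part, here is a small sketch under the same assumptions: once a word-to-file table exists, an AND query is just an intersection of the sets listed under each word. The file names in the index are invented.

```python
# Toy sketch of querying a pre-built word -> file-ID index. The index contents
# are invented; a real one would be built by scanning the files once up front.
index = {
    "sdmb":   {"a.txt", "c.txt", "f.txt"},
    "google": {"c.txt", "d.txt"},
    "monkey": {"c.txt", "f.txt"},
}

def files_with_all(index, words):
    """Return the files that contain every word (unknown words match nothing)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(files_with_all(index, ["SDMB", "google"]))   # {'c.txt'}
print(files_with_all(index, ["sdmb", "monkey"]))   # {'c.txt', 'f.txt'} (order may vary)
```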
http://www.monstersmallbusiness.com/grow/grow-how-google-works.asp
It’s not just the database that makes it fast:

- As ultrafilter mentioned, indexing documents by words (and by pairs of terms, triplets of terms, …). This is called an “inverted index.” It’s very fast, but not 0.03-seconds fast if you have billions of documents indexed.
- As friedo mentioned, distributing indexes over thousands of machines allows that entire index (which is truly huge) to be stored, and gives redundancy if any machine fails. But they still have to recombine the records from the different indexes in some way (see the sketch after this list). This is a hard problem, since relevance score is based on properties of individual documents (which may be split up over many indexes) as well as properties of the entire web.
- They have developed their own network protocols to optimize the transfer of data from those thousands of servers.
- Their indexes and recombination algorithms are very cleverly optimized.
- They precompute many things for each document (such as PageRank, which is based on outgoing and incoming links).
- Anytime and approximation algorithms: the ranking algorithm is stopped early, so the results you see are just an approximation of what the “true” results would be if the algorithm were allowed to run to completion. They can get away with that because the web is so big and has so many redundant documents.

This is a company that has “more Ph.D.s per square foot” than any other. That could be hype, of course, but they do hire many, many Ph.D.s. It’s more than engineering that makes it fast.
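To make the “recombine the records from the different indexes” point concrete, here is a rough sketch of the scatter/gather idea behind a sharded index: every shard ranks only the documents it holds, and a frontend merges the partial results. The shards, document names, and scores below are invented for illustration; this is not Google’s actual protocol or ranking.

```python
# Scatter/gather over index shards: each shard returns its own top hits with a
# relevance score, and a frontend merges them into one global ranking.
# All data here is made up for illustration.
import heapq

shards = [
    {"doc1": 0.9, "doc4": 0.2},   # shard 0: document -> score for this query
    {"doc7": 0.7, "doc2": 0.6},   # shard 1
    {"doc9": 0.4, "doc3": 0.8},   # shard 2
]

def search_shard(shard, k):
    """Scatter step: each shard ranks only the documents it holds."""
    return heapq.nlargest(k, ((score, doc) for doc, score in shard.items()))

def gather(shards, k):
    """Gather step: merge the per-shard top-k lists into one global top-k."""
    partial = [hit for shard in shards for hit in search_shard(shard, k)]
    return heapq.nlargest(k, partial)

print(gather(shards, 3))   # [(0.9, 'doc1'), (0.8, 'doc3'), (0.7, 'doc7')]
```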
I’d like to focus on a small part of this question. friedo says they have 400,000 servers working on this. That might even be on the low side, considering that these indexes contain copies of a significant portion of the rest of the world’s WWW pages. But let’s take that number as a given. Where the heck do they put four hundred thousand servers? At two square feet each, that’s sixteen football fields! How many buildings, and how many floors in each?
Man, forget all the PhD programmers working on fast algorithms and such. I shudder just to think of all the maintenance staff it must take to keep those suckers running!
I don’t know the answer to your question, but I have read that Google takes a once-novel approach to its servers. Back in the day, companies that needed massive computing power usually bought really massive and expensive computers. In an alternate universe, the un-Google company would have built the world’s largest computer to finally cache the whole web, and that computer would have cost billions to develop. Meanwhile, the real Google people realized just how many PCs they could buy with all that money, so that is what they did. Google uses really plain, not very great boxes and treats them as semi-disposable. I don’t know how they network new ones fast enough, but it must be some plug-and-go system.
And where does all the heat go? Presumably you could duct it off and heat a small town.
For a good look at how the massiveness of the Web (and the servers that index it) is stored, including some info on “how do they get all that power?” and “where does all the heat go?”, check out this article from last month’s Wired magazine. I found it fascinating.
What you left out is that not all Google searches are equal.
Google has data centers, lots and lots of them. When you go to google.com you get assigned to a data center, and it gives you Google’s results from THAT particular center. If you go to a different data center you will often (but not always) get different results, depending on the keywords.
Here is a very good tool
http://www.webrankinfo.com/english/tools/google-data-centers.php
You type in your keyword and see how the Google search results vary.
Back when Google first hit it big, I remember reading an article about them which stated that they had an entire floor of a high-rise building that was nothing but servers (I would strongly suspect they’re blade servers). Since then, Google has spread its operations out to different facilities around the world. If you go to google.ca, you’re connecting to machines located in Canada, while if you go to google.com, you’re most likely connecting to machines in the US. This is one of the reasons why Google moving into China was significant: they didn’t simply buy the domain name google.cn, they actually set up shop in China.
I wish I could find an article on it, but I can’t at the moment. The power requirements for data centers are simply unbelievable. As it turns out, a data center with a measly 200 or so machines (plus needed accessories like air conditioning) starts to eclipse what the local power company can supply.
As with the new Google center being built in Oregon, really large data centers need to be close to a power plant and a water source for cooling.
I would guess that they use the same servers they sell to parties that want to search their internal databases. It looks like the racks they’re selling hold 20 servers each, which means a paltry 20,000 racks to house.
Of course they may also be using far more sophisticated servers for themselves, since they know what sort of demands are being put on them.
I second ZipperJJ’s recommendation of that Wired article. One amazing claim it makes, which seems hard to believe, is that “the planetary machine [the network of huge datacentres used by Google et al] is on track to be consuming half of all the world’s output of electricity by the end of this decade.”
I doubt it. Blade servers are still quite expensive, and they’re all proprietary architectures. Once you’re locked into a blade architecture there’s no switching, and it’s very difficult to customize blades for specific applications.
Google has been able to build their server farms quickly by buying cheap, interchangeable commodity parts. I wouldn’t be surprised if the bulk of their machines were no-name 1U rackmounts.
A wizard did it.
Last I saw, Google uses a kabillion (around half a million, all together) “white box” machines running a Linux-based OS that they crafted for the purpose. Very basic machines, not the latest whiz-bang stuff.
Think along the lines of the “Linspire” boxes that Walmart sells, but with rackmount ears.
You’re overestimating the floor space by an order of magnitude. A rack takes up less than 4 square feet of space, and you can easily pack two dozen servers into one rack. Even doubling the space taken (so you have aisles between the racks), a server should take up significantly less than a square foot of floor space.
If they’re 1U servers, you can fit 42 of them in a standard rack (about six feet tall). And you can build racks taller than that if you want.
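Putting the thread’s own numbers together, a quick back-of-the-envelope check (assuming 400,000 servers, 42 1U servers per rack, and roughly 8 square feet per rack once you include aisle space) suggests the footprint is closer to a couple of football fields than sixteen:

```python
# Back-of-the-envelope floor-space estimate using the figures from this thread.
servers = 400_000
servers_per_rack = 42          # 1U servers in a standard ~6 ft rack
sq_ft_per_rack = 8             # ~4 sq ft footprint, doubled for aisle space
field_sq_ft = 48_000           # playing field only, 300 ft x 160 ft

racks = servers / servers_per_rack
floor_space = racks * sq_ft_per_rack
fields = floor_space / field_sq_ft

print(round(racks), round(floor_space), round(fields, 1))
# -> 9524 racks, 76190 sq ft, about 1.6 football fields
```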
They are using gigi servers with fizzle stack bypass valves.