How complex, from a programming standpoint, is Google's server farm?

Let’s suppose some grad students decide to build an emulator. They have a simulation of a thousand computers networked together, running a distributed algorithm reminiscent of Google’s distributed database. Some of the virtual computers periodically break down, the load changes with the time of day and so on, so the simulation tries to approximate what really happens on a big server farm. Once they get it all working virtually and find themselves in the good graces of Discworld’s Lady Luck, let’s say they get a grant to buy 1,000 machines and try out how it all works in reality.

So how complex would that be? Could a small group of engineers make it work in a year, perhaps not as efficiently and elegantly as Google does it, accepting lower utilization of the hardware in exchange for a simpler programming task?

How many engineers, and how much time, did it take Google to build its server farm in the first place? How about its competitors, like Baidu or MSN Search?

I can’t really comment on the hardware or the overall software system that Google uses, since it’s several orders of magnitude larger than anything I’ve worked with, but I have designed and worked with smallish (8 to 20 node) distributed application frameworks, so I’m going to throw my assessment in anyway.

It’s not that difficult from an engineering point of view, but most programmers just aren’t used to doing things massively in parallel (even on a single machine with many processors), and unreliable networks and nodes complicate things quite a bit if you want to do it right.

It does require a “paradigm shift” in the way you design your systems. You can’t rely on any complex operation to succeed, to fail, or even to give you any indication of failure, so you have to rely on timers, over-parallelizing and other tricks, preferably using some general framework that watches all the tasks being done.
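To make the “timers and over-parallelizing” point concrete, here’s a minimal single-machine sketch in Python (my own toy, not anything Google publishes): fire the same request at several replicas, take the first answer that arrives in time, and treat the stragglers as disposable. The names fetch_shard and robust_fetch are made up, and the “replicas” are just threads that randomly stall or die.

import concurrent.futures as cf
import random
import time

def fetch_shard(shard_id):
    """Hypothetical remote call: sometimes slow, sometimes dead."""
    time.sleep(random.uniform(0.1, 2.0))    # simulate variable latency
    if random.random() < 0.2:               # simulate a failed node
        raise RuntimeError(f"node for shard {shard_id} is down")
    return f"data for shard {shard_id}"

def robust_fetch(shard_id, copies=3, timeout=1.5):
    """Send the same request to several replicas; first good answer wins."""
    pool = cf.ThreadPoolExecutor(max_workers=copies)
    futures = [pool.submit(fetch_shard, shard_id) for _ in range(copies)]
    try:
        for fut in cf.as_completed(futures, timeout=timeout):
            try:
                return fut.result()          # first replica that succeeded
            except RuntimeError:
                continue                     # that replica failed; wait for the next
        raise RuntimeError(f"all replicas failed for shard {shard_id}")
    finally:
        pool.shutdown(wait=False)            # a real framework would retry or cancel stragglers

print(robust_fetch(42))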

Really large-scale distributed operations can only be done with fairly restricted algorithms. For instance, you can’t use any operation that needs to examine every data point from a single node, since you realistically can’t even get at all that data in one place.

A relatively easy-to-understand technique (which Google has helped popularize) is a map/reduce system. Map operations transform each data point into some (or no) output data without looking at any other data point, and with no side effects such as counting data points. Reduce operations aggregate a bunch of data points (which may themselves be the output of earlier map or reduce operations, and vice versa) into single new data points, again with no side effects outside the particular bunch of data points fed into that reduce operation.
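To make the shape concrete, here’s a toy single-process word count in plain Python (nothing like Google’s actual implementation): the map step looks at exactly one record, a “shuffle” step groups the mapper output by key, and the reduce step folds the values for one key into a single result. In a real framework the shuffle happens over the network across many nodes.

from collections import defaultdict

def map_fn(document):
    """Looks at exactly one input record and emits (key, value) pairs."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Aggregates the values for one key; no state outside its inputs."""
    return (word, sum(counts))

def run_job(documents):
    grouped = defaultdict(list)              # the "shuffle": group map output by key
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(run_job(["the cat sat", "the dog sat down"]))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}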

A design like that places a bunch of restrictions on the data model, and it also implies a data model that isn’t immediately consistent: a new site entered into Google’s search engine won’t show up in the very next search, but only once all the processing is done.

That said, there are fairly easy-to-understand frameworks, freely available, that do much of the coordination required to build these large systems. Using those, a talented and motivated small group of programmers could certainly build a system that is at least comparable at the design level to Google’s search or Gmail in, say, a year (depending on how complex Google’s ranking system has become), and the distributed frameworks themselves could be built by a handful of talented programmers in a year or so (including putting them into production, fixing the bugs and making them reasonably easy to maintain).

Replicating all the interesting stuff Google has built, at the scale it runs at, would be a multi-year, multi-billion-dollar project at the very least, though.

ETA: if you’re a programmer and interested in seeing how map/reduce works, with or without actual distribution, I suggest you take a look at CouchDB.

The difficulty of writing programs for a server farm depends on what problem you’re trying to solve. Google has its own proprietary map-reduce infrastructure, but Hadoop is an open-source equivalent. Their first step would probably be to install it on their machines and familiarize themselves with it.
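For a sense of what that first step might look like, here is a hedged sketch of a “hello world” job for Hadoop Streaming, which runs ordinary scripts as the map and reduce steps, feeding them records on stdin and collecting tab-separated key/value lines from stdout, with Hadoop sorting the map output by key in between. The exact jar path and flags vary by Hadoop version, so treat the submit command in the comment as illustrative; the script can also be tested locally with a plain shell pipeline (map, sort, reduce) before going anywhere near a cluster.

#!/usr/bin/env python
# wordcount.py - sketch of a Hadoop Streaming job written as one script that
# acts as mapper or reducer depending on its argument. Illustrative submit
# command (jar location differs between Hadoop versions):
#   hadoop jar hadoop-streaming-*.jar \
#     -input /docs -output /counts \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -file wordcount.py
# Local test: cat some_file | python wordcount.py map | sort | python wordcount.py reduce
import sys

def do_map():
    # One input line in, zero or more "word<TAB>1" lines out; no other state.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def do_reduce():
    # Hadoop sorts map output by key, so all counts for a word arrive together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    do_map() if sys.argv[1:] == ["map"] else do_reduce()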

I’m a little uncertain about what you’re asking. What sort of problem are they supposed to solve?

Wrt map-reduce, I have seen it mentioned specifically in association with Google, along the lines of “they took to heart the power of functional programming and implemented this wonderful technique”.

So is this the dominant and best distributed architecture nowadays? Before Google, say, how did AltaVista run their servers? Did they have map-reduce too? If they didn’t, did they suffer some major weaknesses that Google avoided by adopting map-reduce? Does Baidu use map-reduce today?

Also, I am guessing that there is more to a server farm than just the basic algorithm. You also need to administer it, track what is happening, detect hardware failures and so on. Do the existing open-source frameworks already incorporate that?

What algorithm? Map-Reduce is not an algorithm, it’s a design paradigm. There are many different algorithms that are map-reduce algorithms.

I’m convinced that any large-scale distributed system right now implements at least some sort of map/reduce technique to make stuff work - just because, if you can use it for your problem, it’s demonstrably the right thing to do - but there’s a lot of other stuff too, especially on the data storage/retrieval side. Google, Amazon, Facebook and many others all have their own custom-built distributed data stores of varying complexity. Then there’s all kinds of virtualization - Amazon even lets you rent virtual machines and data stores on its server farm. The more functionality you add, the more complex it becomes.

The point is: once you need to spread your data over a certain number of nodes (which happens fairly quickly), or your system stretches over a country or two (which probably happens later), you will have to implement asynchronous and “side-effect-free” solutions, simply because you can’t access all the data “at once” with any efficiency.

Functional programming makes it much easier to do this at a low level, and it does help you internalize the whole technique, but unless you’re running on a massively multithreaded machine or a very tight cluster, you’ll have to lift the ideas out of the local algorithm and into a much more loosely coupled, parallel “batch job” kind of design. At that point, whether or not you’re using functional programming on the local nodes really isn’t all that relevant - at least not for the kind of stuff Google is doing most of the time (AFAIK Google mainly uses Python and Java, neither of which could be considered really useful for functional programming. Especially not Java).
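To show the local “functional flavour” I mean, here’s the same toy word count written with map and functools.reduce on one machine (again my own sketch, nothing Google-specific). Nothing here is distributed; the point is just that the side-effect-free shape is easy to express locally, and the hard part is lifting it into loosely coupled batch jobs.

from functools import reduce

docs = ["the cat sat", "the dog sat down"]

# map step: each document independently becomes a list of (word, 1) pairs
pairs = map(lambda doc: [(w.lower(), 1) for w in doc.split()], docs)

# reduce step: fold the per-document pair lists into one dictionary of counts
def merge(acc, doc_pairs):
    for word, n in doc_pairs:
        acc[word] = acc.get(word, 0) + n
    return acc

print(reduce(merge, pairs, {}))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}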