would Google App Engine + MySQL Cluster constitute a silver-bullet scalable web app framework?

I may be wrong, but it seems that GAE encourages people to use some weird Google NoSQL datastore.

Now, suppose we don't feel like doing that. Suppose we want to build a platform for web programming the traditional way, but one where the app can scale in a commoditized, built-in fashion, load balancing and all. So that, let's say, if I manage to write an equivalent of MediaWiki on this platform, it could handle out of the box, if not quite Wikipedia, then at least a pretty darn popular, heavily visited wiki. Without me dealing with any load balancing whatsoever.

So could something like that be achieved by combining GAE and an existing clustered database platform like MySQL Cluster? Or by using some other existing platforms?

If the answer is no, is the "why not" a matter of common knowledge? Are there some inherent limitations in how GAE and any other such system handle load balancing that prevent these platforms from giving us a truly scalable hosting experience in a seamless fashion?

Or are there perhaps some inherent limitations in the sheer structure of computer-science reality, e.g. in the same way that NP-completeness predicts fundamental limits on the power of platforms that deal with the relevant problems?

The common knowledge, and the insurmountable reality, is that "scalability" is accomplished by many distributed computers, possibly hundreds or thousands of them. Standard RDBMSes like MySQL and Oracle are not architected for this out of the box. These classic relational databases expect to be close to the operating system's file system; even in a clustered "shared disk" setup, the RDBMS is not translating through many layers to hit the raw disks. In a highly scalable environment, by contrast, the individual rows of a 100-terabyte database would be located on different machines. These are the realities that the NoSQL type of databases are addressing. The NoSQL databases are not "weird"; in fact, if you were to start a computer science thesis to write a database to handle huge web workloads, you'd end up reinventing NoSQL. There are PhDs studying this space, and they've all converged on the same conclusion: a classic monolithic database is not the right solution.
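To make the row-distribution point concrete, here's a rough sketch in Python of the core trick those NoSQL stores rely on (the node names are made up, and real systems add virtual nodes and replication on top of this): hash each row key onto a ring of machines, so any client can compute which machine owns which row without asking a central server.

```python
import hashlib
from bisect import bisect

# Minimal consistent-hash ring: maps each row key to one of N storage
# nodes, so no single machine needs to hold the whole table. Node
# names are hypothetical.
class HashRing:
    def __init__(self, nodes):
        # Place each node at a fixed point on a 0..2^32 ring.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def node_for(self, row_key):
        # Walk clockwise to the first node at or past the key's point.
        points = [p for p, _ in self.ring]
        i = bisect(points, self._hash(row_key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["db-a", "db-b", "db-c"])
print(ring.node_for("user:42"))  # every key lands on exactly one node
```

Adding a fourth node only moves the keys that fall between it and its neighbour on the ring, which is why this kind of scheme suits clusters that grow and shrink.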

Yes, the big sites like MySpace, Facebook and Flickr use MySQL, but they rig it with complicated custom software to replicate terabytes of data across thousands of machines in datacenters in various parts of the world. On top of that, they put dedicated RAM cache servers in front of MySQL to minimize database i/o.
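Those cache tiers usually follow a simple cache-aside pattern. Here's a minimal sketch in Python, where a plain dict stands in for a memcached-style RAM cache and query_mysql is a placeholder I invented for the real database call:

```python
import time

# Cache-aside sketch: the dict stands in for a memcached-style RAM
# cache tier, and query_mysql() is a made-up placeholder for the
# real (expensive) database call.
cache = {}
TTL = 60  # seconds to keep an entry before re-reading the database

def query_mysql(user_id):
    # Imagine an expensive SELECT against the real database here.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = "user:%d" % user_id
    hit = cache.get(key)
    if hit is not None and hit[0] > time.time():
        return hit[1]                 # served from RAM: no database i/o
    value = query_mysql(user_id)      # cache miss: one real query
    cache[key] = (time.time() + TTL, value)
    return value
```

If the hot rows fit in RAM, the database sees only the misses, which is a tiny fraction of the traffic.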

Check out the good articles at http://highscalability.com.

first of all, let me clarify my OP. I am not focused here on what happens to hundreds of terabytes of data. Suppose we only have 50 gigs of data. We just want to spread it out across a few database servers so that the thing runs faster, i.e. scales decently under heavy usage.

If my assumption here is invalid, i.e. if spreading out a 50G database into multiple servers will not speed things up under any setup, please say so.

I do not quite understand Ruminator’s point about the “shared disk” setup.

Is the main limitation on such systems the inability to do fine-grained locking of rows in use on one server, such that it would not interfere with query processing on the other servers?

Or maybe my question here is too vague to be usable. How about this: suppose we were working not with our real servers but with a MagicServer™ with some arbitrarily different set of parameters. E.g. maybe MagicServer has a disk that is 20 times faster than any currently existing disk. Or a much faster processor. Or whatever other major deviation from current machine reality.

Well, so is there any such hypothetical parameter tweak that could get a distributed SQL database to run a lot faster on a cluster than the equivalent database on a single server? Or are “cluster” and “SQL” incapable of fruitfully mixing under any set of parameters?

The amount of data you have isn't as relevant as the nature of the data and the load patterns. If you're trying to serve 10,000 qps and each query requires only some small subset of the data, it can make a monumental difference to spread that data across servers, so each takes some of the load. Of course, if every query requires the same subset of the data, it won't really help you to distribute the database (multiple replicas of the same db would, and caching would really be important).
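To illustrate that distinction with a toy Python sketch (the server names are invented): partitioning sends each query to the one server that owns that slice of the data, while replication lets any server answer any read.

```python
import random

# Hypothetical server names; in practice these would be connection
# handles to separate database machines.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]
REPLICAS = ["replica0", "replica1", "replica2"]

def shard_for(user_id):
    # Partitioned data: each server owns 1/4 of the rows and sees
    # only the queries for its own slice of the keyspace.
    return SHARDS[user_id % len(SHARDS)]

def replica_for_read():
    # Replicated data: every server holds everything, so a read can
    # go anywhere; this spreads the load but not the data itself.
    return random.choice(REPLICAS)
```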

As with most of your questions, this one is unanswerable, because the design necessities of a database backend depend entirely on the specifics of the application in question.

But if you're interested generally in scalable cloud solutions for relational databases, check out Amazon RDS, which supports MySQL and Oracle.
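One nice property of RDS is that, from the application's point of view, it's just an ordinary MySQL endpoint. Something like this (the hostname, user and database are made-up placeholders) is the whole integration, with the provisioning and backups handled on Amazon's side:

```python
import pymysql  # any standard MySQL client library would do

# The endpoint, user and database below are made-up placeholders.
conn = pymysql.connect(
    host="mydb.abc123.us-east-1.rds.amazonaws.com",
    user="appuser",
    password="...",  # placeholder
    database="wiki",
)
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")  # ordinary SQL; nothing cloud-specific
```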

This. I’m not quite sure what to make of the original question.

In my experience, caching is going to become important much sooner than clustering for performance. Clustering can help performance but it’s not magic, and it introduces a whole bunch of other complications (namely replication). You need to analyse the specifics, adjust things to suit the load patterns of your particular application, and generally know exactly what you’re doing (or have experts in that stuff covering your arse, as I do).

Ruminator, FYI the software used by WordPress.com for its clustering is freely available. It was complicated to get to that point, but the results can be replicated (if you’ll pardon the pun) without quite so much difficulty.

ok, that’s fine. So in essence no generalized distributed SQL database for all web apps is possible. So, perhaps, the next step would be to consider the possibility of a distributed SQL database framework for a particular class of web application. E.g. a framework for running wikis, or forums, or dating sites. Or is the difference even between two wikis, like between MediaWiki and TikiWiki, sufficiently large that the same clustered database optimization engine would not cover both of them in a “good enough” fashion?

Your question is worded in such a way as to assume one of two answers, and it’s not that simple.

For one, clustering is not optimization. I'd guess that the usual initial result of clustering is to make performance worse, not better. It gives you the means to scale, but scalable and fast are not the same thing. This might be the answer to your original question.

Second, the software itself is only part of the equation. Traffic and usage patterns are important. Does your content change frequently or rarely? Are your requests distributed broadly across many different content pages, or concentrated on just a few? Things like that will determine the kinds of optimizations you need.

Third, your network topology matters. Are you in one DC or several? What sort of replication links do you have? Do you need to replicate everything?

Fourth, writing software for a distributed environment is quite a lot harder than writing typical web software. Your attention is on DB distribution, but you're also working with distributed application code. It is possible to drop in a new DB layer to add clustering (see the HyperDB link I provided above), but that's just step one. You need to understand what's going on. Testing, debugging and managing the application just got a lot harder. (Think about cache invalidation in a distributed environment, for example; one common approach is sketched at the end of this post.)

Fifth, DB clustering is not the only thing you need to do at this scale. Caching is important, as Ruminator said. Multiple types of caching at different layers. You may also need to handle certain types of writes as special cases. Different types of data, and different access patterns, need different database optimizations - you may need to put certain tables in separate databases with different configurations to handle them efficiently.

Sixth, your sense of scale is off. There’s no such thing as “only” 50 gigs of data when you’re distributing it amongst data centers.

And so on. I’m no expert, this is just the stuff that comes to mind.

I suppose what I’m getting at is, this isn’t a question of whether or not you have the right framework. A generalized distributed SQL database for all web apps is possible - HyperDB is one example of such a framework, and it’s not that hard to get it up and running with a distributed database. But that’s just the first step. Having a distributed database doesn’t solve anything by itself; it’s necessary but not sufficient for scaling up.
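To flesh out the cache-invalidation example from my fourth point, here's one common trick, sketched in Python with a dict standing in for a shared cache like memcached (all the names are invented): bake a per-object generation number into every cache key, so that bumping one counter implicitly invalidates every cached entry for that object on every server.

```python
# Generation-key invalidation sketch: the dict stands in for a shared
# cache like memcached, and all the key names are invented.
cache = {}

def _generation(user_id):
    # Current generation counter for this user, defaulting to 1.
    return cache.setdefault("gen:user:%d" % user_id, 1)

def cache_key(user_id):
    # The generation number is baked into every key for this user.
    return "user:%d:v%d" % (user_id, _generation(user_id))

def invalidate_user(user_id):
    # One counter bump makes every old key for this user unreachable,
    # on every web server, without deleting anything explicitly.
    cache["gen:user:%d" % user_id] = _generation(user_id) + 1
```

The stale entries are never deleted; they simply stop being looked up and eventually expire out of the cache.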

got it. So basically the issue is really complex and so we should not expect any neat breakthroughs on the software level.

Well, so let’s just sit tight and wait until the database throughput problem gets solved in hardware, such as with faster hard drives.

I’d say this is such a vast oversimplification that I don’t even know how to reply to it. Oracle, for example, launched in the 1970s. There’s been over 30 years for the database throughput problem to be “solved in hardware.”

Scaling very large applications will probably always be very complex. The definition of "large" will change, as will the nature of the applications. But don't hold your breath waiting for hardware improvements to make scaling at the limits easy (meaning the kinds of things Google, Twitter, Facebook and the new players who follow them will be doing).

Wait, who said database throughput was the problem?

do you think it isn’t? Is the main problem the complexity of load balancing between php/Ruby/etc scripts on different servers? Or what is the biggest issue in terms of scalability for a web app?

I don’t know what the main problem is, or even if there is just one big issue. I wouldn’t bet that’s it - raw disk I/O usually isn’t the problem, in my experience. (I assume that’s what you mean).