MySQL best practice: How to combine multiple queries into one to save on network overhead?

Right now, there’s an old table (rendered in PHP) like this:

Product 1: $20
Product 2: $14
Product 3: $15

Product 134: $194

Each product calls a getPrice() function that opens a new MySQL connection over the internet to a remote database. With 134 products, that's 134 requests made one after another. If I could somehow combine all of those calls into one, it'd only be a few kB of data, sent over the network in about a second.

However, what if I can’t rewrite the getPrice() function? Is there any sort of client-caching layer or proxy that can aggregate all those queries into one and make only one network request?

I’ve looked into memcache, MySQL’s query_cache, and several others, but they seem to only cache query results in server-side memory, which does nothing for network overhead if the client table makes all those separate requests to begin with.

Any ideas? =/

Let’s start with why you can’t rewrite getPrice(). There may be a solution that’s far less stupid than trying to do this client-side.

I’m an Oracle DBA so what I’m suggesting might not be do-able in MySQL.

  1. Create an ODBC link to the table and get to the data that way.

  2. Create a view in MySQL and access the data from the view.

  3. If the data is pretty static, create a temp table locally and update that table once a day or so.

Memcached is only “server side” in the sense that it lives wherever the Memcached daemon runs. There is no reason why that can’t be the same machine the client sits on.

This is a good argument for doing stuff in house as much as you can. Doesn’t always work, though, I know. If you can’t change the server-side getPrice() function, I think you’re stuck. I was going to suggest having it simply return XML and letting the client process it.

How exactly is GetPrice used? Do you call it once and it returns the data for all 134 records after making 134 calls? If so, I would suggest replacing it with a select *.

If, however, you fetch the price only when the parent record is referenced, I’d get rid of GetPrice and use a join to the price table.
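
To make that concrete, the join version would be something along these lines, with every table and column name being a guess at your schema, and one round trip covering the whole category page:

```php
<?php
// Sketch only: assumes a `products` table and a `prices` table keyed by
// product_id. Swap in the real host, credentials, and schema.
$pdo = new PDO('mysql:host=remote-db.example.com;dbname=shop', 'user', 'pass');

$categoryId = 3; // whichever category the page is rendering

$stmt = $pdo->prepare(
    'SELECT p.id, p.name, pr.price
       FROM products p
       JOIN prices pr ON pr.product_id = p.id
      WHERE p.category_id = ?'
);
$stmt->execute([$categoryId]);

// One network round trip returns every price row the page needs.
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```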

What do you mean by “can’t rewrite”? Can’t rewrite the SQL end, or can’t rewrite the client end? What do you mean by “aggregate into one” when you “can’t rewrite”?

That would be my suggestion as well: unless that price is changing often, that data is likely to be static enough to query and drop into a local table once, and then query that for the rest of the day/session/transaction or whatever.

Sorry for not being clearer about this! I mis-spoke when I said getPrice() couldn’t be modified – it CAN, but it has to keep working wherever it’s currently used (some 10,000+ pages).

What it does right now is make a MySQL connection every time it’s called, to return the price of one item at a time. Sometimes this function is used for a single product on a product page (An apple is $4), sometimes it’s used dozens of times in one page to pull the prices for a lot of products in a category page (All fruits: apples $4, cherries $2, etc.).

It works fine when it just has to make that one call on the Apples page, but when it’s used 50 times in a row on the All Fruits page, with a separate network connection for each one, it’s very slow.

Is there a way to rewrite getPrice() so that it’ll make one call per page, whether it’s getting 1 price or 1000?

I thought about:
[ol]
[li]Changing the category pages to use a different function altogether that pulls MySQL data in bulk, while leaving individual product pages to still use getPrice().[/li]
This would still require rewriting dozens of category tables.
[li]Modifying the getPrice() function to send all its calls to some sort of aggregator function that collects and combines individual getPrice() requests, makes the actual network query (in a batch) no more than once per second, and then returns the prices to each individual getPrice() call.[/li]
I don’t know if this would work even in theory, though – would the first getPrice() call just stall the entire script while it’s waiting for the aggregator function?
[/ol]


Not sure I follow, sorry. So Memcache puts queries into memory so they don’t have to be run through the actual DB again, right?

So are you saying in a situation like this:
[web server][memcached] <----> [database server]

The web server would make a call to the DB server the first time, but subsequently the same queries could be served directly from memcached and wouldn’t have to go over the internet again?

Isn’t ODBC a translation layer for databases and applications, kinda abstracting away the protocols and such?

If so, how does it help with multiple queries? Would it be smart enough to notice 50 different getPrice queries, combine them into one actual query, and fetch it all at once?

That’s our fallback solution. We’re looking into MySQL master-slave replication, or maybe just rsyncing the database to the client. Is there a better way to do this (creating live backups of the database while still keeping it online)?

One possible approach to this:

The first time getPrice is called during a page execution, it loads all the data necessary for calculating prices on all products and caches it somewhere in page memory, in an array perhaps. Then subsequent calls check whether that page array has already been loaded and use it instead of hitting MySQL again.
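
A minimal sketch of that idea (I’m making up the table and column names, since I haven’t seen your schema):

```php
<?php
function getPrice($productId)
{
    // Per-request cache: filled on the first call, reused by every later
    // call on the same page. Table/column names are assumptions.
    static $prices = null;

    if ($prices === null) {
        $pdo  = new PDO('mysql:host=remote-db.example.com;dbname=shop', 'user', 'pass');
        $stmt = $pdo->query('SELECT product_id, price FROM prices');

        $prices = array();
        foreach ($stmt as $row) {
            $prices[$row['product_id']] = $row['price'];
        }
    }

    return isset($prices[$productId]) ? $prices[$productId] : null;
}
```

The single network call happens on the first getPrice() of the page, whether the page ends up asking for 1 price or 1,000, and none of the existing call sites have to change.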

Overall, I think you may be better off developing a second function. That’d be my approach.

It looks like the data needed for ALL products could be downloaded in a 40kB CSV. Would it be bad practice to just load that into an array? And then the CSV could be delta-synced from the live DB once a minute or so?

Is there any difference between this and a “real” MySQL replication?

I’m not familiar with Memcached (looking into it more), but now I’m wondering if that’s exactly what it does. I’ll look more into it before reinventing that functionality, if so…

Memcached or another key-value store is certainly capable of solving your problem. I’d still try to create a function or set of functions that can return information for multiple products at once but I’m also not always the one deciding what the best use of my time is.

I have to say that I’m very curious about a site that has more than ten thousand files to give pricing information for a catalog that could fit in a 40 kilobyte file. That does not sound ideal.

After some research, we are going to try implementing this as a Redis data cache – seemed simpler than Memcached, if only for the hashes instead of key-value pairs. For now MySQL will remain the canonical database, with Redis periodically pulling changes from it and then replicating it to local Redis stores at each server (local development machines, test server, etc.). If that works well, at some point we might just store all the data in Redis and skip MySQL altogether.
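
Here’s a rough sketch of what the page-facing side might look like with Predis; the single “prices” hash keyed by product ID is just how I’m imagining the layout, nothing final:

```php
<?php
require 'vendor/autoload.php';

// Local Redis replica on the web server itself, so no internet round trip.
$redis = new Predis\Client(array('host' => '127.0.0.1', 'port' => 6379));

function getPrice($productId)
{
    global $redis;

    // HGET from the "prices" hash; the sync job keeps the hash up to date.
    $price = $redis->hget('prices', $productId);

    return $price !== null ? (float) $price : null;
}
```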

I’m already doing this for newly-added category pages and tables, but I don’t really want to go through hundreds of existing pages to fix every place where getPrice() is potentially used more than once. At least having it pull from a local Redis memory cache would be a lot faster than going over the Internet for every single getPrice().

We are in the process of simultaneously developing a modern website; these are just band-aid fixes for the existing site until the new one is ready.

Heh. “Curious” is a… polite… word for it. It’s an older website where most of the data is stored in flat HTML files. Whoever (or whichever lineage of unlucky heirs) coded it decided to manually make a new page for every product instead of using a database.

So now, before we can make it 1 dynamic page instead of 10000 static ones, we first have to migrate all the data (product images, descriptions, related downloads, etc.) to an actual database. It’s definitely not ideal :slight_smile: Right now we can only count on price being accurate in the database. Some items don’t even have their names filled out yet.

Yes, but that’s still server side, isn’t it? I thought you were looking for something more like this:

…as discussed here Client data management and caching or here

I initially wanted to do client-side caching, but I was led to believe Google would never see it if everything was JavaScript-injected instead of rendered server-side by PHP – so scratch that idea.

However, by replacing local SQL queries with local Redis calls (no network traffic in either case), we saw a 5x speed increase. Combined with having a local Redis cache per web server, it will be much faster than querying MySQL over the Internet. It might be even faster if we use the phpredis library (written in C) instead of Predis, but that’s another test for later.

Thank you all for the help, especially amanset, who made me realize that Memcached can actually live on other machines, not just the MySQL server. I mistook it for an origin-side HTTP cache. We ultimately chose Redis because it seemed more modern and had cool features we might one day use, but Memcached would’ve worked fine for this basic scenario too.

Sorry, I’ve had a weird few days and so dropped off, well, the internet.

That’s pretty much how it works, yes. You can go more advanced as well, with a separate Memcached server which serves stuff globally, over the network, and then each web server has its own local memcached with a much lower TTL on keys. That only really works with static data (like here) but is pretty handy for a distributed/load balanced environment.

Check if the key is in the local MC; if so, serve that data.
If not, check the remote MC; if so, serve that data and update the local MC.
If not, get it from the database, serve the data, update the remote MC and then update the local MC.
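
In PHP that’s roughly the following with the Memcached extension; hostnames, key names, and TTLs are just placeholders:

```php
<?php
// Two-tier lookup sketch: local Memcached with a short TTL in front of a
// shared remote Memcached, falling back to the database.
$local = new Memcached();
$local->addServer('127.0.0.1', 11211);

$remote = new Memcached();
$remote->addServer('cache.internal.example.com', 11211);

function getCachedPrice($productId, Memcached $local, Memcached $remote, PDO $pdo)
{
    $key = 'price:' . $productId;

    // 1. Local cache, short TTL.
    $price = $local->get($key);
    if ($price !== false) {
        return $price;
    }

    // 2. Remote cache; refill the local cache on a hit.
    $price = $remote->get($key);
    if ($price !== false) {
        $local->set($key, $price, 60);       // e.g. 1 minute locally
        return $price;
    }

    // 3. Database; refill remote, then local.
    $stmt = $pdo->prepare('SELECT price FROM prices WHERE product_id = ?');
    $stmt->execute([$productId]);
    $price = $stmt->fetchColumn();

    $remote->set($key, $price, 3600);        // e.g. 1 hour remotely
    $local->set($key, $price, 60);
    return $price;
}
```

The short local TTL means a stale local copy ages out quickly, so the remote Memcached stays the source of truth for the cache layer.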

Thanks for checking back in.

In these cases, how is the cache typically kept up to date?

So far, the script I’m working on would query MySQL every X minutes for updates (by timestamp) and update the cache. Everything in the cache lives forever (no expiration) and only gets changed by this recurring delta sync. Then this master cache would get replicated to slaves via some arcane black magic built into Redis that I’ll have to learn more about on Monday.
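
Roughly what I have in mind for the sync script (the column names, hash layout, and how I store the last-sync timestamp are all placeholders at this point):

```php
<?php
require 'vendor/autoload.php';

// Periodic delta sync: copy rows changed since the last run from MySQL into
// a Redis hash. Run from cron every X minutes.
$pdo   = new PDO('mysql:host=remote-db.example.com;dbname=shop', 'user', 'pass');
$redis = new Predis\Client();

// Last successful sync time is kept in Redis itself (assumption).
$since = $redis->get('prices:last_sync') ?: '1970-01-01 00:00:00';

$stmt = $pdo->prepare('SELECT product_id, price FROM prices WHERE updated_at > ?');
$stmt->execute([$since]);

foreach ($stmt as $row) {
    $redis->hset('prices', $row['product_id'], $row['price']);
}

$redis->set('prices:last_sync', date('Y-m-d H:i:s'));
```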

Is that ok, or is there a better/more efficient way to do this?

With fairly static data you could really just rely on the TTL (best practices would dictate always having one just in case, you don’t want old data hanging around forever).

For data that changes fairly regularly, it’s a case of having whatever script updates the database delete the key from Memcached right after the database update is done. PHP, for example, has a class that makes this fairly trivial:
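
Something along these lines, using the Memcached class (key naming and connection details are just examples):

```php
<?php
// After writing the new price to MySQL, delete the cache entry so the next
// read repopulates it with fresh data.
$pdo = new PDO('mysql:host=remote-db.example.com;dbname=shop', 'user', 'pass');

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$productId = 42;    // example values
$newPrice  = 3.50;

$stmt = $pdo->prepare('UPDATE prices SET price = ? WHERE product_id = ?');
$stmt->execute([$newPrice, $productId]);

$mc->delete('price:' . $productId);
```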

This is a bit better than polling the database as otherwise you can have up to X minutes where your cache is out of date.