If I’m dealing with a table of millions of rows, which one of the two select statements would be the better one to run? Please discuss.
select sum(ColumnA + ColumnB)
from enrightsTable

select sum(ColumnA) + sum(ColumnB)
from enrightsTable
My initial reaction was that the first choice is more efficient, as it only has to go through the table once, since (I assume) it adds up the two column values for each row before moving on to the next one.
In any sensible implementation, both queries will result in just a single pass through the entire table. Any DBMS which insists on two passes (one for each column) would probably not make it out of the freshman IT lab it was dreamed up in.
Yeah, it’s six of one or half a dozen of the other, even if you start counting addition operations. sum(ColumnA + ColumnB) does one addition per row plus the running-total additions; sum(ColumnA) + sum(ColumnB) does the running-total additions for each column plus one final addition at the end. Either way it works out to roughly twice the number of records.
Given that both perform the same, which one will make sense to you six months from now? If I saw the second with no commenting, I’d wonder what the hell the DBA was smokin’ when he wrote that.
Database optimization is not my specialty, but isn’t there a chance that the results of the basic aggregation operators are cached per column, so that a full table scan is not needed at all? In that case, sum(A) + sum(B) would be the fastest, unless the DB engine is smart enough to see the equivalence.
I know that most real-world DBs store quite a lot of statistics per column in order to aid the optimizer. Storing the results of the basic aggregation operators would seem like a logical thing to do, but I don’t know if it actually happens. Can’t you look at the query plan to find out?
OK, I just tried it on an MSSQL2K database (the only one I have access to on short notice, apart from MySQL and PostgreSQL, but I don’t know an easy way to view the execution plan for a query there and I’m a little pressed for time; any takers?) and it does indeed do a full index scan. And no, it does not go through the table twice for the sum(A) + sum(B) query.
So on MSSQL at least, the answer is: no, it does not matter, so pick the most readable one.
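In case anybody else wants to poke at this on their own box, viewing the plan is easy enough; if I’ve got the syntax right, it’s something like this (using the OP’s table; SHOWPLAN doesn’t execute the query, it just prints the plan):

SET SHOWPLAN_TEXT ON
GO
select sum(ColumnA + ColumnB) from enrightsTable
GO
SET SHOWPLAN_TEXT OFF
GO

and on MySQL or PostgreSQL you just stick EXPLAIN in front of the query:

EXPLAIN select sum(ColumnA) + sum(ColumnB) from enrightsTable;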
I would expect the two forms of the query to have the same performance. Both involve a table or index scan and a number of additions. The number of rows scanned and the number of additions will be the same regardless of which form of query you use.
Optimizer statistics keep track of things like row counts and data distributions, not column sums. Also, optimizer stats are often inaccurate: keeping an absolutely accurate row count at all times (for example) would tend to create a synchronization bottleneck in the system. Most of the time an optimizer can come up with a good query plan with statistics that are somewhat out of date. So most optimizers don’t store accurate sums or other aggregate values.
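If you want to see for yourself what the optimizer keeps, SQL Server will at least dump its statistics for an index; something like the following, where the index name is just an example:

DBCC SHOW_STATISTICS ('enrightsTable', 'IX_enrightsTable_ColumnA')

What comes back is row counts, sampling info, densities and a histogram of value ranges; no sums anywhere.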
There are database systems called data warehouses that are used for statistical analysis of large masses of data. The data in a warehouse tends to be static - it is only updated when a new mass of data is loaded in, which may happen once a day (or even less frequently). The DBA of a data warehouse can specify that certain aggregates (counts, sums and averages) be pre-calculated when the data are loaded. This greatly improves the performance of queries that can use the pre-calculated aggregates. I assume, however, that the OP isn’t using a data warehouse.
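To give an idea of what that looks like, the pre-calculated aggregates are typically declared as something like this (Oracle-style syntax; other products spell the same idea differently, e.g. indexed views on SQL Server, and the view name here is made up):

CREATE MATERIALIZED VIEW enrights_sums AS
select sum(ColumnA) as sum_a, sum(ColumnB) as sum_b
from enrightsTable;

A query that only needs the totals can then read that one-row view instead of scanning millions of rows, at the price of refreshing the view whenever new data is loaded.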
If your DBMS supports triggers, it should be possible to cache your own sums for any columns you like. Just make the trigger add any inserted value to the sum and subtract any deleted value. You should be aware, though, that the table where you store the cached sum can become a concurrency bottleneck. Every time you change the sum you will have to get a lock on the row (or table) holding the sum, which will block any other transaction that tries to update the sum at the same time.
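A rough T-SQL sketch of the idea, since MSSQL is what got tested above (the sum table and trigger names are invented, and you’d need to seed ColumnSums with the current totals before relying on it):

CREATE TABLE ColumnSums (SumA bigint NOT NULL, SumB bigint NOT NULL)
GO
CREATE TRIGGER trg_enrights_sums ON enrightsTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- add the contribution of new rows and back out the contribution of removed rows
    UPDATE ColumnSums
    SET SumA = SumA + ISNULL((SELECT SUM(ColumnA) FROM inserted), 0)
                    - ISNULL((SELECT SUM(ColumnA) FROM deleted), 0),
        SumB = SumB + ISNULL((SELECT SUM(ColumnB) FROM inserted), 0)
                    - ISNULL((SELECT SUM(ColumnB) FROM deleted), 0)
END
GO

An UPDATE of the base table fires with rows in both inserted and deleted, so the arithmetic above handles all three cases.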
I can imagine that one form of the query might cause an overflow while the other would not. For example, suppose ColumnA contains mostly large positive numbers, while ColumnB contains mostly large negative numbers (large in absolute value, that is). The form sum(ColumnA) + sum(ColumnB) might cause an overflow, while sum(ColumnA + ColumnB) might not.
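A made-up illustration: say both columns are 32-bit ints, every row has ColumnA = 2,000,000,000 and ColumnB = -2,000,000,000, and the engine keeps the running sum in the same 32-bit type (SQL Server does this for int columns). With just two rows, sum(ColumnA) is already 4,000,000,000, well past the int ceiling of 2,147,483,647, so sum(ColumnA) + sum(ColumnB) dies with an overflow; but ColumnA + ColumnB is 0 in every row, so sum(ColumnA + ColumnB) happily returns 0.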
I’d be interested to know whether the results look exactly the same if these three queries were executed in the reverse order to the above; I’m pretty sure it won’t make a difference, but there’s just a slim chance (and I mean slim) that the apparently better performance of the later queries was in some way prejudiced by the execution of the earlier ones (if, for example, some index or cached object didn’t need rebuilding).
It’s likely that the first query caused the data to be faulted into the page cache, so the remaining two executions were against in-memory data. When measuring query execution time it’s always a good idea to run each query until the execution time stabilizes.
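On SQL Server you can also force every run to start cold by flushing the caches between executions (not something to do on a production box):

DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

Though for a question like this one, repeating each query until the warm timings level off, as described above, is usually the fairer comparison.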