If I’m dealing with a table of millions of rows, which one of the two select statements would be the better one to run? Please discuss.
select sum(ColumnA + ColumnB)
from enrightsTable

select sum(ColumnA) + sum(ColumnB)
from enrightsTable
My initial reaction was that the first choice is more efficient, as it only has to go through the table once, since (I assume) it adds up the two column values for each row before moving on to the next one.
In any sensible implementation, both queries will result in just a single pass through the entire table. Any DBMS which insists on two passes (one for each column) would probably not make it out of the freshman IT lab it was dreamed up in.
Yeah, it’s six of one or half a dozen of the other, even if you start counting addition operations. sum(ColumnA + ColumnB) does one addition per row plus the running-total additions; sum(ColumnA) + sum(ColumnB) does the running-total additions for each column plus one final addition at the end. Either way it works out to roughly twice the number of records.
Given that both perform the same, which one will make sense to you six months from now? If I saw the second with no commenting, I’d wonder what the hell the DBA was smokin’ when he wrote that.
Database optimization is not my specialty, but isn’t there a chance that the results of the basic aggregation operators are cached per column, so that a full table scan is not needed at all? In that case, sum(A) + sum(B) would be the fastest, unless the DB engine is smart enough to see the equivalence.
I know that most real-world DBs store quite a lot of statistics per column in order to aid the optimizer. Storing the results of the basic aggregation operators would seem like a logical thing to do, but I don’t know if it actually happens. Can’t you look at the query plan to find out?
OK, I just tried it on an MSSQL2K database (the only one I have access to on short notice, apart from MySQL and PostgreSQL, but I don’t know an easy way to view the execution plan for a query there and I’m a little pressed for time; any takers?) and it does indeed do a full index scan. And no, it does not go through the table twice for the sum(A) + sum(B) query.
So on MSSQL at least, the answer is: no, it does not matter, so pick the most readable one.
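In case anybody else wants to poke at this on their own box, viewing the plan is easy enough; if I’ve got the syntax right, it’s something like this (using the OP’s table; SHOWPLAN doesn’t execute the query, it just prints the plan):

SET SHOWPLAN_TEXT ON
GO
select sum(ColumnA + ColumnB) from enrightsTable
GO
SET SHOWPLAN_TEXT OFF
GO

and on MySQL or PostgreSQL you just stick EXPLAIN in front of the query:

EXPLAIN select sum(ColumnA) + sum(ColumnB) from enrightsTable;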
I would expect the two forms of the query to have the same performance. Both involve a table or index scan and a number of additions. The number of rows scanned and the number of additions will be the same regardless of which form of query you use.
Optimizer statistics keep track of things like row counts and data distributions, not column sums. Also, optimizer stats are often inaccurate: keeping an absolutely accurate row count at all times (for example) would tend to create a synchronization bottleneck in the system. Most of the time an optimizer can come up with a good query plan with statistics that are somewhat out of date. So most optimizers don’t store accurate sums or other aggregate values.
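If you want to see for yourself what the optimizer keeps, SQL Server will at least dump its statistics for an index; something like the following, where the index name is just an example:

DBCC SHOW_STATISTICS ('enrightsTable', 'IX_enrightsTable_ColumnA')

What comes back is row counts, sampling info, densities and a histogram of value ranges; no sums anywhere.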
There are database systems called data warehouses that are used for statistical analysis of large masses of data. The data in a warehouse tends to be static - it is only updated when a new mass of data is loaded in, which may happen once a day (or even less frequently). The DBA of a data warehouse can specify that certain aggregates (counts, sums and averages) be pre-calculated when the data are loaded. This greatly improves the performance of queries that can use the pre-calculated aggregates. I assume, however, that the OP isn’t using a data warehouse.
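To give an idea of what that looks like, the pre-calculated aggregates are typically declared as something like this (Oracle-style syntax; other products spell the same idea differently, e.g. indexed views on SQL Server, and the view name here is made up):

CREATE MATERIALIZED VIEW enrights_sums AS
select sum(ColumnA) as sum_a, sum(ColumnB) as sum_b
from enrightsTable;

A query that only needs the totals can then read that one-row view instead of scanning millions of rows, at the price of refreshing the view whenever new data is loaded.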
If your DBMS supports triggers, it should be possible to cache your own sums for any columns you like. Just make the trigger add any inserted value to the sum and subtract any deleted value. You should be aware, though, that the table where you store the cached sum can become a concurrency bottleneck. Every time you change the sum you will have to get a lock on the row (or table) holding the sum, which will block any other transaction that tries to update the sum at the same time.
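A rough T-SQL sketch of the idea, since MSSQL is what got tested above (the sum table and trigger names are invented, and you’d need to seed ColumnSums with the current totals before relying on it):

CREATE TABLE ColumnSums (SumA bigint NOT NULL, SumB bigint NOT NULL)
GO
CREATE TRIGGER trg_enrights_sums ON enrightsTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- add the contribution of new rows and back out the contribution of removed rows
    UPDATE ColumnSums
    SET SumA = SumA + ISNULL((SELECT SUM(ColumnA) FROM inserted), 0)
                    - ISNULL((SELECT SUM(ColumnA) FROM deleted), 0),
        SumB = SumB + ISNULL((SELECT SUM(ColumnB) FROM inserted), 0)
                    - ISNULL((SELECT SUM(ColumnB) FROM deleted), 0)
END
GO

An UPDATE of the base table fires with rows in both inserted and deleted, so the arithmetic above handles all three cases.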
I can imagine that one form of the query might cause an overflow while the other would not. For example, suppose ColumnA contains mostly large positive numbers, while ColumnB contains mostly large negative numbers (large in absolute value, that is). The form sum(ColumnA) + sum(ColumnB) might cause an overflow, while sum(ColumnA + ColumnB) might not.
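A made-up illustration: say both columns are 32-bit ints, every row has ColumnA = 2,000,000,000 and ColumnB = -2,000,000,000, and the engine keeps the running sum in the same 32-bit type (SQL Server does this for int columns). With just two rows, sum(ColumnA) is already 4,000,000,000, well past the int ceiling of 2,147,483,647, so sum(ColumnA) + sum(ColumnB) dies with an overflow; but ColumnA + ColumnB is 0 in every row, so sum(ColumnA + ColumnB) happily returns 0.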
I’d be interested to know whether the results look exactly the same if these three queries were executed in the reverse order to the above; I’m pretty sure it won’t make a difference, but there’s just a slim chance (and I mean slim) that the apparently better performance of the later queries was in some way prejudiced by the execution of the earlier ones (if, for example, some index or cached object didn’t need rebuilding).
It’s likely that the first query caused the data to be faulted into the page cache, so the remaining two executions were against in-memory data. When measuring query execution time it’s always a good idea to run each query until the execution time stabilizes.
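On SQL Server you can also force every run to start cold by flushing the caches between executions (not something to do on a production box):

DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

Though for a question like this one, repeating each query until the warm timings level off, as described above, is usually the fairer comparison.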