Why does this SQL query take so long?

I came here to suggest this. Query plans are the first thing to look at when analysing slow-running queries. In Management Studio, before running your query, click on the button labelled “Include Actual Execution Plan”. Then, once your query has run, the results pane will have two tabs: one with the actual results, one with the execution plan. If you can paste a screenshot of your original query’s execution plan, we might be able to give you pointers on why it’s running slowly and how to potentially speed it up.
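For readers without Management Studio handy, a rough cross-engine illustration of the same idea: SQLite exposes its chosen strategy via EXPLAIN QUERY PLAN, which plays a role similar to SSMS’s plan display. The table and column names below are made up for the sketch.

```python
# Sketch only: SSMS shows SQL Server plans graphically; SQLite's
# EXPLAIN QUERY PLAN is used here as a portable stand-in.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bets (account_id INTEGER, amount REAL)")
con.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY)")

# Prefix any query with EXPLAIN QUERY PLAN to see the chosen strategy.
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT account_id FROM bets "
    "WHERE account_id NOT IN (SELECT account_id FROM accounts)"
).fetchall()
for row in plan:
    print(row[-1])  # each row's last field is a human-readable plan step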

I would also point out that doing a SELECT * in any situation where performance is vital is a bad idea, particularly in subqueries. SQL Server’s built-in optimisation algorithms work much better when given an explicit list of columns to select. Listing the columns also makes your original query more readable and easier for someone else to maintain.

Too late for the edit window: this page has a good overview of how using SELECT * can adversely affect query performance:

Also: sql - Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc - Stack Overflow

I’m not saying that in this particular case the query plan will necessarily be different (since there’s only one column returned in the view anyway), but in general, naming specific columns is strongly advised.

I don’t think the problem is the “select *” in the subquery. There is only one column in distinctaccountsbet (it wouldn’t be allowed in the subquery if there were more). The “*” will be changed into that single column early in the query compilation process.

The performance problem is more likely to be the “distinct” in the definition of the distinctaccountsbet view. The “not in” subquery amounts to a kind of join where, for each row in the outer query, the DBMS checks whether there is a match in the inner query. The “distinct” in the view definition doesn’t have any effect on the result, and a smart query optimizer should recognize this and eliminate the “distinct” as irrelevant. It’s possible, though, that the query plan does a sort to eliminate duplicates every time it runs the subquery. This could take a lot of time, depending on the data.

You can test my theory by rewriting the definition of distinctaccountsbet without the “distinct” and running the query. If it runs quickly, you’ll know where the bottleneck is.
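If rewriting the view isn’t convenient, the same theory can be checked on toy data. Here is a small Python/SQLite sketch (the table names are invented stand-ins for the view and its base table) showing that the DISTINCT inside a NOT IN subquery cannot change the result, only possibly the plan:

```python
# Illustration with made-up names; SQLite stands in for SQL Server.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE all_accounts (acct INTEGER)")
con.execute("CREATE TABLE accounts_bet (acct INTEGER)")  # the view's base table
con.executemany("INSERT INTO all_accounts VALUES (?)", [(i,) for i in range(1, 6)])
con.executemany("INSERT INTO accounts_bet VALUES (?)", [(2,), (2,), (4,), (4,), (4,)])

with_distinct = con.execute(
    "SELECT acct FROM all_accounts "
    "WHERE acct NOT IN (SELECT DISTINCT acct FROM accounts_bet)"
).fetchall()
without_distinct = con.execute(
    "SELECT acct FROM all_accounts "
    "WHERE acct NOT IN (SELECT acct FROM accounts_bet)"
).fetchall()

print(with_distinct)     # [(1,), (3,), (5,)]
print(without_distinct)  # same rows, with or without DISTINCT
```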

If your query takes too long to wait for this, you can also ask SSMS for an estimated execution plan (I believe it’s Ctrl+L) so you can see what method it is using. The trick, of course, is being able to recognize inefficient plans and knowing other ways of getting the same results.

In SQL Server, IN/NOT IN re-runs the subquery for every row in the driving table instead of resolving it just once (I’m pretty sure of this, but there is the possibility that it is situation-dependent), so it is not the same as a join. In addition, because it’s NOT IN, the entire subquery result has to be scanned to verify the value is NOT IN it (unless there is an index on that field).

I am almost certain NOT IN is the problem in this case.

If I understand correctly, if you use NOT EXISTS, the subquery stops processing the moment it finds a single exception in its result set, whereas with NOT IN, the subquery must return all results before returning to the main query. This also contributes to the slowness.

This is not true. NOT IN and NOT EXISTS do the same thing, and a well-written DBMS will use the same processing methods. There is no reason to continue processing a NOT IN subquery once a match has been found.

As for the comparison I made to joins: in terms of processing techniques, a NOT IN subquery really is like a join. The outer query can be thought of as a table, and the subquery as another table in which the DBMS looks for qualifying rows. In fact, some advanced database systems convert almost all subqueries to joins in their internal processing. NOT IN and NOT EXISTS subqueries can be converted to outer joins, like so:



select c1
from o
where c2 not in
    (select c3
     from i)


becomes (please excuse any syntax problems on my part):



select c1
from o left outer join i on o.c2 = i.c3
where c3 is not null
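As an executable check of this rewrite, here is a small Python/SQLite sketch using the same o/i/c1/c2/c3 names. Two caveats: the join form needs “i.c3 IS NULL” rather than IS NOT NULL, and the equivalence only holds when i.c3 contains no NULLs:

```python
# Sketch of the NOT IN -> outer join rewrite; SQLite stands in for SQL Server.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE o (c1 INTEGER, c2 INTEGER)")
con.execute("CREATE TABLE i (c3 INTEGER)")
con.executemany("INSERT INTO o VALUES (?, ?)", [(10, 1), (20, 2), (30, 3)])
con.executemany("INSERT INTO i VALUES (?)", [(2,)])

not_in = con.execute(
    "SELECT c1 FROM o WHERE c2 NOT IN (SELECT c3 FROM i)"
).fetchall()
# Unmatched outer rows come back with i.c3 NULL, which is exactly "not in".
outer_join = con.execute(
    "SELECT c1 FROM o LEFT OUTER JOIN i ON o.c2 = i.c3 WHERE i.c3 IS NULL"
).fetchall()

print(not_in)      # [(10,), (30,)]
print(outer_join)  # [(10,), (30,)]
```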


BTW, I worked on query processing for years for Relational Technology (makers of Ingres), Sybase, Cloudscape and Informix.

Checking in to my own thread :smiley:
My requirements were a one-off query. I didn’t mind it taking 0-5 minutes, but I was concerned that it had got to 30+ minutes and still hadn’t finished (normally I’m pretty good at complex one-off queries). But my previous post indicates that I achieved my goal, and specifically how I did so.

To use an oft-used phrase in non-fiction books I’ve been reading, the continued replies go “beyond the scope of” my purposes.

But it’s interesting reading. Nothing against it! I just wanted to point out that I don’t need help any more :slight_smile:

Thanks again for the help I did get. And for the interesting further replies.

Yes, conceptually they should be the same, but I have had examples in which the SQL Server optimizer created a different execution plan depending on whether I used JOIN or IN, and the performance differences were dramatic. I’ll try to find an example.

Here’s a tip on this subject; it’s that nested table scan that’s the problem:

http://www.codeproject.com/KB/database/SQLServerTipsGuidelines.aspx

"If you currently have a query that uses NOT IN, which offers poor performance because the SQL Server optimizer has to use a nested table scan to perform this activity, instead try to use one of the following options, all of which offer better performance:

* Use EXISTS or NOT EXISTS
* Use IN
* Perform a LEFT OUTER JOIN and check for a NULL condition"
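The first of those alternatives, NOT EXISTS, looks like this as a correlated rewrite. A hedged Python/SQLite sketch with illustrative table names:

```python
# NOT IN rewritten as NOT EXISTS; names are invented, SQLite for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER)")
con.execute("CREATE TABLE bets (account_id INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?)", [(1,), (2,), (3,), (4,)])
con.executemany("INSERT INTO bets VALUES (?)", [(2,), (4,)])

not_in = con.execute(
    "SELECT id FROM accounts WHERE id NOT IN (SELECT account_id FROM bets)"
).fetchall()
# Correlated form: for each outer row, ask whether any matching bet exists.
not_exists = con.execute(
    "SELECT id FROM accounts a "
    "WHERE NOT EXISTS (SELECT 1 FROM bets b WHERE b.account_id = a.id)"
).fetchall()

print(not_in)      # [(1,), (3,)]
print(not_exists)  # [(1,), (3,)]
```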

Heh.

Not sure what your ‘Heh’ means.

What I mean is that usually, when my boss asks for an awkward set of data that is highly specific, I can construct a query that gets the correct data fairly quickly.

I realized about an hour after I posted this that the outer join version of the query should say IS NULL, like this:

select c1
from o left outer join i on o.c2 = i.c3
where c3 is null

I had a big long response to this typed up, but it was too involved for here. Instead I’ll just say that I consider myself an experienced beginner after 9 years as a DBA, and I can and have commented on/offered solutions to many of the requests for help you have had. None of them have been that complex.

Well, a well-written DBMS, sure. But this is a Microsoft product! :wink:

Actually, I like SQL Server. But this is a fairly well-known bug … er, I mean quirk, in the optimizer.

Oh, BTW, here’s Joe Celko’s explanation; this is from SQL for Smarties, Second Edition, pg. 197:

Take that for what it’s worth. :wink:

I think I’ve figured out why NOT IN can be slow when using Microsoft SQL Server. This product is based on the original Sybase DBMS (they even had the same name). Sybase and Microsoft had a joint marketing deal, but Microsoft eventually found a reason to cancel the contract and take the existing code as their own product.

Sybase’s original subquery processing strategies were, um, unique. They had a lot of performance problems and semantic problems (i.e. queries with subqueries would sometimes give the wrong results). We patched the existing system for many years. One problem I remember had to do with the handling of NULLs with NOT IN subqueries (the question is whether NULL should be considered NOT IN an empty set). We had to put in some special logic to deal with this, which caused performance problems. Eventually we (i.e. Sybase) re-wrote the entire subquery system to make it both perform better and always give the right results. Microsoft didn’t benefit from this re-write because they had already split off their own product. Apparently, they still haven’t fixed it.
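The NULL problem described above is easy to make concrete. Under standard three-valued logic (which SQLite also follows), a single NULL in the subquery makes every NOT IN comparison UNKNOWN, so NOT IN returns no rows at all, while NOT EXISTS is unaffected:

```python
# Demonstrating the NULL quirk of NOT IN; toy names, SQLite for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE o (v INTEGER)")
con.execute("CREATE TABLE i (v INTEGER)")
con.executemany("INSERT INTO o VALUES (?)", [(1,), (2,)])
con.executemany("INSERT INTO i VALUES (?)", [(2,), (None,)])

# 1 NOT IN (2, NULL): 1=2 is false, 1=NULL is unknown, so the whole
# predicate is unknown and the row is filtered out.
not_in = con.execute(
    "SELECT v FROM o WHERE v NOT IN (SELECT v FROM i)"
).fetchall()
not_exists = con.execute(
    "SELECT v FROM o WHERE NOT EXISTS (SELECT 1 FROM i WHERE i.v = o.v)"
).fetchall()

print(not_in)      # [] -- the NULL poisons every NOT IN comparison
print(not_exists)  # [(1,)]
```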

BTW, Joe Celko’s explanation makes no sense to me, and I’ve worked on the internals of three different database systems.

Not a DBA, but I have worked with SQL in both Oracle and MS SQL environments for several years… and when writing queries, DBAs have always steered us away from using NOT IN and toward NOT EXISTS for efficiency.

NOT IN with a subquery will run the subquery every time for each row of the main query and require table scans, and this can be quite expensive in terms of processing.

NOT EXISTS with the same subquery will run the subquery once and use whatever indexes are available.

At least that’s how it was explained to me, but whatever the reason NOT EXISTS has always run faster for me.

Indexing also makes a huge difference – make sure to have indexes at least on columns you are joining from.
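A quick sketch of that indexing advice, again using SQLite’s EXPLAIN QUERY PLAN as a stand-in for a real plan display (index and table names are invented; exact plan text varies by engine and version):

```python
# Before/after: the same query's plan with and without an index.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bets (account_id INTEGER)")
con.executemany("INSERT INTO bets VALUES (?)", [(i % 10,) for i in range(100)])

def plan(sql):
    # Collect the human-readable plan steps for a query.
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT account_id FROM bets WHERE account_id = 7"
before = plan(query)
con.execute("CREATE INDEX idx_bets_account ON bets (account_id)")
after = plan(query)

print(before)  # e.g. a full scan of bets
print(after)   # e.g. a search using idx_bets_account
```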

I found it somewhat puzzling as well–hence the FWIW … I know the OP’s problem is long resolved, but I’m still almost interested enough to build some test queries & look at the execution plans on my friendly neighborhood SQL Server. If I can remember how to view the execution plans, that is. (Crud! I do this almost every day on Teradata–but this problem doesn’t exist on Teradata. :smack:)

I thought I read that the NOT IN subquery must be run each time to determine whether the results are null, which is a valid result for NOT IN (resulting in a TRUE for all values because IN returns a FALSE and the not changes that to TRUE).

Either way, I don’t see why the optimizer can’t be smart enough to use an index for that check.