With NOT IN, if the subquery returns a NULL, the outer query will never return a match. Kind of odd, but that’s the way it works. It has something to do with the way NULLs are compared. Hopefully someone can explain it better.
To explain further: SQL interprets NULL to mean “value unknown.” If a set contains a NULL, we can’t say whether a given value is NOT IN that set, because the value might match the unknown NULL. So the answer to whether a value is NOT IN a set containing a NULL is neither TRUE nor FALSE - it is unknown. A query will return rows only in those cases in which the WHERE clause evaluates to TRUE - it will not return rows for which the WHERE clause evaluates to “unknown.”
NOT EXISTS solves the problem because it asks a slightly different question than NOT IN. NOT EXISTS evaluates to TRUE when the subquery returns an empty set. In the OP’s query, a NULL in Clients.SuburbID will not result in a match in the subquery’s WHERE clause (again, because the comparison will result in “unknown”), so it won’t cause the subquery to evaluate to TRUE.
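A minimal sketch of the difference, using Python’s sqlite3 and a made-up two-table schema (the table and column names echo the thread but are assumptions, not the OP’s actual tables):

```python
import sqlite3

# Toy schema echoing the thread's Suburbs/Clients example (illustrative only).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suburbs (SuburbID INTEGER PRIMARY KEY);
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY, SuburbID INTEGER);
    INSERT INTO Suburbs VALUES (1), (2), (3);
    INSERT INTO Clients VALUES (10, 1), (11, NULL);  -- one NULL SuburbID
""")

# NOT IN: the NULL makes "SuburbID NOT IN (...)" evaluate to unknown for
# every non-matching outer row, so no row is ever returned.
not_in = con.execute(
    "SELECT SuburbID FROM Suburbs "
    "WHERE SuburbID NOT IN (SELECT SuburbID FROM Clients) "
    "ORDER BY SuburbID").fetchall()
print(not_in)  # -> []

# NOT EXISTS: the NULL row simply never matches the correlated comparison,
# so the subquery is empty for suburbs 2 and 3 and they are returned.
not_exists = con.execute(
    "SELECT SuburbID FROM Suburbs AS s "
    "WHERE NOT EXISTS (SELECT 1 FROM Clients AS c "
    "                  WHERE c.SuburbID = s.SuburbID) "
    "ORDER BY SuburbID").fetchall()
print(not_exists)  # -> [(2,), (3,)]
```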
BTW, the DISTINCT in the OP’s query is unnecessary. Duplicate values returned from a subquery under a NOT IN do not affect the result, which is a single TRUE, FALSE or unknown for each outer row. A good database engine will ignore the DISTINCT in this case. A not-so-good database engine will do a sort to eliminate duplicates.
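To illustrate (again sqlite3, with assumed names): the DISTINCT changes neither the result nor its cardinality, because the subquery feeds a membership test, not the output.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suburbs (SuburbID INTEGER PRIMARY KEY);
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY,
                          SuburbID INTEGER NOT NULL);
    INSERT INTO Suburbs VALUES (1), (2), (3);
    INSERT INTO Clients VALUES (10, 1), (11, 1), (12, 1);  -- duplicate SuburbIDs
""")

plain = con.execute(
    "SELECT SuburbID FROM Suburbs WHERE SuburbID NOT IN "
    "(SELECT SuburbID FROM Clients) ORDER BY SuburbID").fetchall()

deduped = con.execute(
    "SELECT SuburbID FROM Suburbs WHERE SuburbID NOT IN "
    "(SELECT DISTINCT SuburbID FROM Clients) ORDER BY SuburbID").fetchall()

# Same answer either way: one TRUE/FALSE/unknown per outer row.
print(plain, deduped)  # -> [(2,), (3,)] [(2,), (3,)]
```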
I spent years as a software engineer on three different database engines: Ingres, Sybase and Cloudscape (now known as Derby). NULL values were always one of the greatest sources of confusion for users, support analysts, engineers and doc writers. The SQL standard is (mostly) logically consistent in its treatment of NULLs, but it still leads to surprises even for people who understand it.
There are two ways that come to mind to get the results you want.
SELECT DISTINCT Suburbs.SuburbID
FROM Suburbs LEFT JOIN Clients
ON Suburbs.SuburbID = Clients.SuburbID
WHERE Clients.SuburbID IS NULL
-OR-
SELECT DISTINCT Suburbs.SuburbID
FROM Suburbs
WHERE NOT EXISTS (SELECT Clients.SuburbID
FROM Clients
WHERE Clients.SuburbID = Suburbs.SuburbID)
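Both forms can be checked against the same toy data (sqlite3; the schema and values are assumptions), including a client with a NULL SuburbID to show neither query is tripped up by it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suburbs (SuburbID INTEGER PRIMARY KEY);
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY, SuburbID INTEGER);
    INSERT INTO Suburbs VALUES (1), (2), (3), (4);
    INSERT INTO Clients VALUES (10, 1), (11, 4), (12, NULL);
""")

# Form 1: outer join, keep suburbs with no matching client row.
left_join = con.execute("""
    SELECT DISTINCT Suburbs.SuburbID
    FROM Suburbs LEFT JOIN Clients
    ON Suburbs.SuburbID = Clients.SuburbID
    WHERE Clients.SuburbID IS NULL
    ORDER BY Suburbs.SuburbID""").fetchall()

# Form 2: correlated NOT EXISTS.
not_exists = con.execute("""
    SELECT DISTINCT Suburbs.SuburbID
    FROM Suburbs
    WHERE NOT EXISTS (SELECT Clients.SuburbID
                      FROM Clients
                      WHERE Clients.SuburbID = Suburbs.SuburbID)
    ORDER BY Suburbs.SuburbID""").fetchall()

# Both report the suburbs with no clients, NULLs notwithstanding.
print(left_join)   # -> [(2,), (3,)]
print(not_exists)  # -> [(2,), (3,)]
```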
Personally I think the first choice is more elegant.
I’m curious about your logic on the not-so-good DB. Consider a scenario in which the subquery returns 999,999,999 rows with one value and 50 rows with other values, and a main query that returns that one value and the 50 other values.
Without DISTINCT, almost every row in the main query will have to search that entire set (unless the DB engine builds a temp index, which is basically the same as a sort with DISTINCT), but with DISTINCT the set to be checked will only be 51 rows.
So, from a performance perspective, what do you think a good db engine should do in this case? (I realize there are trade-offs and my scenario might gain from distinct and other scenarios might be hurt from it, just curious what your logic is).
The entire problem only exists if someone is so stupid as to create the database without a not null constraint on Clients.SuburbID regardless of the DBMS. Whoever did it deserves a kick in the balls.
And why would anyone reduce the size of the Suburbs table if it only contains 4,885 rows? Even a 20-year-old PC could hold this in memory. You realize that any screen you introduce to allow users to add suburbs will have to validate the suburb name by looking at a table of suburbs; otherwise you will end up with 10 suburb names for the same suburb.
One of the systems I run reports on has 27 versions of one suburb thanks to misspellings, abbreviations etc.
This is a question of processing strategy, not semantics. The “distinct” in the subquery doesn’t change the meaning of the query. The query describes the desired result, and the database engine is supposed to figure out the best way to get that result. A query should not be seen as a set of instructions to the DBMS on how to get the data.
The query optimizer should evaluate the cost of running the query in different ways and choose the execution strategy that minimizes cost. In this case the DBMS could have statistics telling it that there is a large proportion of duplicates in the subquery column, in which case a good optimizer could decide on a strategy that minimizes the cost of the duplicate values. For example, if there is an index on the column in question, it could use that rather than do a table scan. If there is no index, it could use a hash join or a sort-merge join.
Generally, performance tuning should be kept separate from query semantics as much as possible. Rather than put hints about a desired processing strategy into your queries, it’s better to use the database system’s performance tools. This means doing things like creating indexes and statistics where appropriate. In the example you gave, creating an index on the subquery’s column would greatly help performance even if there weren’t a huge number of duplicates - you really don’t want to force the system to do a scan of a million-row table.
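As a rough illustration of "tune with indexes, not with query rewrites" (sqlite3; the index name and schema are made up, and plan wording varies by engine and version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY, SuburbID INTEGER);
    INSERT INTO Clients VALUES (1, 7), (2, 7), (3, 9);
""")

query = "SELECT ClientID FROM Clients WHERE SuburbID = 7"

# Without an index, the engine has no choice but to scan the table.
before = " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + query))
print(before)  # e.g. "SCAN Clients" (exact wording varies by SQLite version)

# Adding an index changes the chosen plan; the query text is untouched.
con.execute("CREATE INDEX idx_clients_suburb ON Clients (SuburbID)")
after = " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + query))
print(after)  # e.g. "SEARCH Clients USING ... INDEX idx_clients_suburb (SuburbID=?)"
```

The point is the same one made above: the declarative query stays fixed while the physical strategy is steered by the performance tools (here, CREATE INDEX).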
The purpose of allowing or disallowing NULLs in a column should not be to save space. If there are cases where you really don’t know the SuburbID of a Client, then you have to make the column nullable; otherwise there won’t be any way to represent the Client in the database system.
Yes, agreed - semantics are the same. When you said a good db would not perform that operation, it seems that what you were really saying is that a good db might or might not perform that operation depending on whether that was the most efficient way to process the query - correct?
I understand the point you are making: let the engine figure out the best way to do it. However, in the real world, no optimizer I’ve used (Oracle, DB2, SQL Server) is good enough to just let it figure everything out (simpler queries and smaller tables, sure, but not the complex stuff). The structure of the SQL plays a huge part in whether the optimizer can make use of its tricks or not.
You are correct, it did not work but I am still grateful to K364 for teaching me a new trick.
The table does actually have a constraint to disallow nulls. There are and were no null SuburbIDs. It is the primary key and as such does not allow nulls.
Thank you so much.
Yes I do realise that, and of course I will deal with it appropriately. The Client screen allows the selection of a suburb using a combo-box. 4,885 choices in a combo-box is way too many.
I downloaded the suburbs data from the Australia Post website. It already had mis-spelt duplicates as downloaded. It doesn’t now.
Thanks to everyone who offered advice and especially to don’t ask for reminding me why I rarely dare to ask a question.
Really. That’s no good. We downloaded it last week to update our lookup tables and I had assumed it would be clean. It’s amazing how much crap data you have to deal with. Oh well, something to put someone on to tomorrow.
One example where Nulls could be quite legitimate in a Suburb column is if you wanted to know the actual town they are in as opposed to a mailing address. I live in unincorporated Cook County, but my mailing address is in a suburb.
Yes. What I meant was that a good DBMS will ignore the DISTINCT keyword in this case and decide whether to perform a sort based on cost. It will not take the DISTINCT keyword as a command to perform a sort.
Yes, that’s true, for several reasons. The SQL language is powerful enough that there are often many ways of saying the same thing (for example, you can re-word an EXISTS subquery to use something like “0 < (SELECT COUNT(*) . . .)”). There are often things you know about patterns in your own data that the DBMS doesn’t know to optimize for. Deeply-nested subqueries present problems for query optimizers that are hard to solve.
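For instance (sqlite3, toy data assumed), the two spellings return the same rows, though a real optimizer may treat them very differently: EXISTS can stop at the first match, while a naive COUNT(*) tallies every match before comparing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suburbs (SuburbID INTEGER PRIMARY KEY);
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY, SuburbID INTEGER);
    INSERT INTO Suburbs VALUES (1), (2), (3);
    INSERT INTO Clients VALUES (10, 1), (11, 1);
""")

# Spelling 1: correlated EXISTS.
exists_form = con.execute("""
    SELECT SuburbID FROM Suburbs s
    WHERE EXISTS (SELECT 1 FROM Clients c WHERE c.SuburbID = s.SuburbID)
    ORDER BY SuburbID""").fetchall()

# Spelling 2: the equivalent COUNT(*) comparison.
count_form = con.execute("""
    SELECT SuburbID FROM Suburbs s
    WHERE 0 < (SELECT COUNT(*) FROM Clients c WHERE c.SuburbID = s.SuburbID)
    ORDER BY SuburbID""").fetchall()

print(exists_form, count_form)  # -> [(1,)] [(1,)]
```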
The OP’s example was pretty simple, though, and I would expect any decent DBMS to ignore the DISTINCT keyword and to make a cost-based decision as to whether to sort.
I would expect SuburbID to be the primary key of the Suburbs table, not the Clients table. The problem was caused by NULLs in Clients.SuburbID, which I would expect to be a foreign key.