With NOT IN, if the subquery returns a NULL, the outer query will never return a match. Kind of odd, but that’s the way it works. It has something to do with the way NULLs are compared. Hopefully someone can explain it better.
To explain further: SQL interprets NULL to mean “value unknown.” If a set contains a NULL, we can’t say whether a given value is NOT IN that set, because the value might match the unknown NULL. So the answer to whether a value is NOT IN a set containing a NULL is neither TRUE nor FALSE - it is unknown. A query will return rows only in those cases in which the WHERE clause evaluates to TRUE - it will not return rows for which the WHERE clause evaluates to “unknown.”
NOT EXISTS solves the problem because it asks a slightly different question than NOT IN. NOT EXISTS evaluates to TRUE when the subquery returns an empty set. In the OP’s query, a NULL in Clients.SuburbID will not result in a match in the subquery’s WHERE clause (again, because the comparison will result in “unknown”), so it won’t cause the subquery to evaluate to TRUE.
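A minimal sketch of the difference, using Python’s sqlite3 and a made-up two-table schema (the table and column names echo the thread but are assumptions, not the OP’s actual tables):

```python
import sqlite3

# Toy schema echoing the thread's Suburbs/Clients example (illustrative only).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suburbs (SuburbID INTEGER PRIMARY KEY);
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY, SuburbID INTEGER);
    INSERT INTO Suburbs VALUES (1), (2), (3);
    INSERT INTO Clients VALUES (10, 1), (11, NULL);  -- one NULL SuburbID
""")

# NOT IN: the NULL makes "SuburbID NOT IN (...)" evaluate to unknown for
# every non-matching outer row, so no row is ever returned.
not_in = con.execute(
    "SELECT SuburbID FROM Suburbs "
    "WHERE SuburbID NOT IN (SELECT SuburbID FROM Clients) "
    "ORDER BY SuburbID").fetchall()
print(not_in)  # -> []

# NOT EXISTS: the NULL row simply never matches the correlated comparison,
# so the subquery is empty for suburbs 2 and 3 and they are returned.
not_exists = con.execute(
    "SELECT SuburbID FROM Suburbs AS s "
    "WHERE NOT EXISTS (SELECT 1 FROM Clients AS c "
    "                  WHERE c.SuburbID = s.SuburbID) "
    "ORDER BY SuburbID").fetchall()
print(not_exists)  # -> [(2,), (3,)]
```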
BTW, the DISTINCT in the OP’s query is unnecessary. Duplicate values returned from a subquery under a NOT IN do not affect the result, which is a single TRUE, FALSE or unknown for each outer row. A good database engine will ignore the DISTINCT in this case. A not-so-good database engine will do a sort to eliminate duplicates.
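To illustrate (again sqlite3, with assumed names): the DISTINCT changes neither the result nor its cardinality, because the subquery feeds a membership test, not the output.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suburbs (SuburbID INTEGER PRIMARY KEY);
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY,
                          SuburbID INTEGER NOT NULL);
    INSERT INTO Suburbs VALUES (1), (2), (3);
    INSERT INTO Clients VALUES (10, 1), (11, 1), (12, 1);  -- duplicate SuburbIDs
""")

plain = con.execute(
    "SELECT SuburbID FROM Suburbs WHERE SuburbID NOT IN "
    "(SELECT SuburbID FROM Clients) ORDER BY SuburbID").fetchall()

deduped = con.execute(
    "SELECT SuburbID FROM Suburbs WHERE SuburbID NOT IN "
    "(SELECT DISTINCT SuburbID FROM Clients) ORDER BY SuburbID").fetchall()

# Same answer either way: one TRUE/FALSE/unknown per outer row.
print(plain, deduped)  # -> [(2,), (3,)] [(2,), (3,)]
```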
I spent years as a software engineer on three different database engines: Ingres, Sybase and Cloudscape (now known as Derby). NULL values were always one of the greatest sources of confusion for users, support analysts, engineers and doc writers. The SQL standard is (mostly) logically consistent in its treatment of NULLs, but it still leads to surprises even for people who understand it.
There are two ways that come to mind to get the results you want.
SELECT DISTINCT Suburbs.SuburbID
FROM Suburbs LEFT JOIN Clients
ON Suburbs.SuburbID = Clients.SuburbID
WHERE Clients.SuburbID IS NULL
-OR-
SELECT DISTINCT Suburbs.SuburbID
FROM Suburbs
WHERE NOT EXISTS (SELECT Clients.SuburbID
FROM Clients
WHERE Clients.SuburbID = Suburbs.SuburbID)
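Both forms can be checked against the same toy data (sqlite3; the schema and values are assumptions), including a client with a NULL SuburbID to show neither query is tripped up by it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suburbs (SuburbID INTEGER PRIMARY KEY);
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY, SuburbID INTEGER);
    INSERT INTO Suburbs VALUES (1), (2), (3), (4);
    INSERT INTO Clients VALUES (10, 1), (11, 4), (12, NULL);
""")

# Form 1: outer join, keep suburbs with no matching client row.
left_join = con.execute("""
    SELECT DISTINCT Suburbs.SuburbID
    FROM Suburbs LEFT JOIN Clients
    ON Suburbs.SuburbID = Clients.SuburbID
    WHERE Clients.SuburbID IS NULL
    ORDER BY Suburbs.SuburbID""").fetchall()

# Form 2: correlated NOT EXISTS.
not_exists = con.execute("""
    SELECT DISTINCT Suburbs.SuburbID
    FROM Suburbs
    WHERE NOT EXISTS (SELECT Clients.SuburbID
                      FROM Clients
                      WHERE Clients.SuburbID = Suburbs.SuburbID)
    ORDER BY Suburbs.SuburbID""").fetchall()

# Both report the suburbs with no clients, NULLs notwithstanding.
print(left_join)   # -> [(2,), (3,)]
print(not_exists)  # -> [(2,), (3,)]
```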
Personally I think the first choice is more elegant.
I’m curious about your logic on the not-so-good DB. Consider a scenario in which the subquery returns 999,999,999 rows with one value and 50 rows with other values, and a main query that returns that one value and the 50 other values.
Without DISTINCT, almost every row in the main query will have to search that entire set (unless the DB engine builds a temp index, which is basically the same as a sort with DISTINCT), but with DISTINCT the set to be checked will only be 51 rows.
So, from a performance perspective, what do you think a good db engine should do in this case? (I realize there are trade-offs and my scenario might gain from distinct and other scenarios might be hurt from it, just curious what your logic is).
The entire problem only exists if someone is so stupid as to create the database without a not null constraint on Clients.SuburbID regardless of the DBMS. Whoever did it deserves a kick in the balls.
And why would anyone reduce the size of the Suburbs table if it only contains 4,885 rows? Even a 20-year-old PC could hold this in memory. You realize that any screen you introduce to allow users to add suburbs will have to validate the suburb name by looking at a table of suburbs; otherwise you will end up with 10 suburb names for the same suburb.
One of the systems I run reports on has 27 versions of one suburb thanks to misspellings, abbreviations etc.
This is a question of processing strategy, not semantics. The “distinct” in the subquery doesn’t change the meaning of the query. The query describes the desired result, and the database engine is supposed to figure out the best way to get that result. A query should not be seen as a set of instructions to the DBMS on how to get the data.
The query optimizer should evaluate the cost of running the query in different ways and choose the execution strategy that minimizes cost. In this case the DBMS could have statistics telling it that there is a large proportion of duplicates in the subquery column, in which case a good optimizer could decide on a strategy that minimizes the cost of the duplicate values. For example, if there is an index on the column in question, it could use that rather than do a table scan. If there is no index, it could use a hash join or a sort-merge join.
Generally, performance tuning should be kept separate from query semantics as much as possible. Rather than put hints about a desired processing strategy into your queries, it’s better to use the database system’s performance tools. This means doing things like creating indexes and statistics where appropriate. In the example you gave, creating an index on the subquery’s column would greatly help performance even if there weren’t a huge number of duplicates - you really don’t want to force the system to do a scan of a million-row table.
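As a rough illustration of "tune with indexes, not with query rewrites" (sqlite3; the index name and schema are made up, and plan wording varies by engine and version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY, SuburbID INTEGER);
    INSERT INTO Clients VALUES (1, 7), (2, 7), (3, 9);
""")

query = "SELECT ClientID FROM Clients WHERE SuburbID = 7"

# Without an index, the engine has no choice but to scan the table.
before = " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + query))
print(before)  # e.g. "SCAN Clients" (exact wording varies by SQLite version)

# Adding an index changes the chosen plan; the query text is untouched.
con.execute("CREATE INDEX idx_clients_suburb ON Clients (SuburbID)")
after = " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + query))
print(after)  # e.g. "SEARCH Clients USING ... INDEX idx_clients_suburb (SuburbID=?)"
```

The point is the same one made above: the declarative query stays fixed while the physical strategy is steered by the performance tools (here, CREATE INDEX).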
The purpose of allowing or disallowing NULLs in a column should not be to save space. If there are cases where you really don’t know the SuburbID of a Client, then you have to make the column nullable; otherwise there won’t be any way to represent the Client in the database system.
Yes, agreed - semantics are the same. When you said a good db would not perform that operation, it seems that what you were really saying is that a good db might or might not perform that operation depending on whether that was the most efficient way to process the query - correct?
I understand the point you are making: let the engine figure out the best way to do it. However, in the real world, no optimizer I’ve used (Oracle, DB2, SQL Server) is good enough to just let it figure everything out (simpler queries and smaller tables, sure, but not the complex stuff). The structure of the SQL plays a huge part in whether the optimizer can make use of its tricks or not.
You are correct, it did not work but I am still grateful to K364 for teaching me a new trick.
The table does actually have a constraint to disallow nulls. There are and were no null SuburbIDs. It is the primary key and as such does not allow nulls.
Thank you so much.
Yes I do realise that, and of course I will deal with it appropriately. The Client screen allows the selection of a suburb using a combo-box. 4,885 choices in a combo-box is way too many.
I downloaded the suburbs data from the Australia Post website. It already had mis-spelt duplicates as downloaded. It doesn’t now.
Thanks to everyone who offered advice and especially to don’t ask for reminding me why I rarely dare to ask a question.
Really. That’s no good. We downloaded it last week to update our lookup tables and I had assumed it would be clean. It’s amazing how much crap data you have to deal with. Oh well, something to put someone on to tomorrow.
One example where Nulls could be quite legitimate in a Suburb column is if you wanted to know the actual town they are in as opposed to a mailing address. I live in unincorporated Cook County, but my mailing address is in a suburb.
Yes. What I meant was that a good DBMS will ignore the DISTINCT keyword in this case and decide whether to perform a sort based on cost. It will not take the DISTINCT keyword as a command to perform a sort.
Yes, that’s true, for several reasons. The SQL language is powerful enough that there are often many ways of saying the same thing (for example, you can re-word an EXISTS subquery to use something like “0 < (SELECT COUNT(*) . . .)”). There are often things you know about patterns in your own data that the DBMS doesn’t know to optimize for. Deeply-nested subqueries present problems for query optimizers that are hard to solve.
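For instance (sqlite3, toy data assumed), the two spellings return the same rows, though a real optimizer may treat them very differently: EXISTS can stop at the first match, while a naive COUNT(*) tallies every match before comparing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suburbs (SuburbID INTEGER PRIMARY KEY);
    CREATE TABLE Clients (ClientID INTEGER PRIMARY KEY, SuburbID INTEGER);
    INSERT INTO Suburbs VALUES (1), (2), (3);
    INSERT INTO Clients VALUES (10, 1), (11, 1);
""")

# Spelling 1: correlated EXISTS.
exists_form = con.execute("""
    SELECT SuburbID FROM Suburbs s
    WHERE EXISTS (SELECT 1 FROM Clients c WHERE c.SuburbID = s.SuburbID)
    ORDER BY SuburbID""").fetchall()

# Spelling 2: the equivalent COUNT(*) comparison.
count_form = con.execute("""
    SELECT SuburbID FROM Suburbs s
    WHERE 0 < (SELECT COUNT(*) FROM Clients c WHERE c.SuburbID = s.SuburbID)
    ORDER BY SuburbID""").fetchall()

print(exists_form, count_form)  # -> [(1,)] [(1,)]
```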
The OP’s example was pretty simple, though, and I would expect any decent DBMS to ignore the DISTINCT keyword and to make a cost-based decision as to whether to sort.
I would expect SuburbID to be the primary key of the Suburbs table, not the Clients table. The problem was caused by NULLs in Clients.SuburbID, which I would expect to be a foreign key.