SQL query: MAX(CASE WHEN...

Hi I was wondering how the following SQL works:


SELECT
	Users.id as id,
	Users.username as username,
	MAX(CASE WHEN Users_meta.meta = 'first_name' THEN Users_meta.value ELSE NULL END) AS first_name,
	MAX(CASE WHEN Users_meta.meta = 'last_name' THEN Users_meta.value ELSE NULL END) AS last_name,
	MAX(CASE WHEN Users_meta.meta = 'user_level' THEN Users_meta.value ELSE NULL END) AS user_level
FROM
	Users
LEFT JOIN Users_meta
	ON Users_meta.user = Users.id

GROUP BY
	Users.id
HAVING
	first_name LIKE 'tes%'

The Users table includes the columns id and username. The Users_meta table includes the columns user, meta and value. Metas include “first_name”, “last_name” and “user_level”.

The actual code includes more MAX(CASE WHEN lines and the HAVING statement changes depending on the search query.

What I don’t understand is what MAX is doing. I thought MAX just found the highest value out of a set of numbers…

BTW if I leave the MAX out of a line it only returns one result rather than all of the matches.

I’m used to there just being lots of LEFT JOINs using Users_meta and aliases.

This code is somehow using 3 different sets of Users_meta which have a matching “user” column…

BTW I have 4000 Users and 60,000 Users_meta but with XAMPP it is taking 5-7 seconds…

So if someone could explain what MAX is doing and how it seems to be joining 3 rows of the Users_meta table per result that would be good.

It looks like max can be used for strings too…

e.g.
SELECT max(username) FROM Users

returns zeljko.dzambo

The query uses non-standard SQL, in that the username column in the SELECT list is not also in the GROUP BY list. Standard SQL requires that in queries that use aggregate functions (MAX, MIN, SUM, AVG), all columns in the SELECT list must either be inside an aggregate function or must also appear in the GROUP BY list.

When you use the MAX function with a GROUP BY list, it means you want to find the maximum value within each group. Suppose, for example, you have an employees table with three columns: employee_id, department_id, salary. Let’s say it contains this data:



employee_id     department_id    salary
-------------------------------------------------
1                       1                         10000
2                       1                         15000
3                       1                         12000
4                       2                         11000
5                       2                         17000


To find the maximum salary for each department, you would do this:



SELECT department_id, MAX(salary) as max_salary
FROM employees
GROUP BY department_id


This should give you:



department_id    max_salary
------------------------------------
1                         15000
2                         17000


Now suppose you want to know the ids of the employees with the highest salary in each department. You could try this:



SELECT employee_id, department_id, MAX(salary) as max_salary
FROM employees
GROUP BY department_id


You might expect to see:



employee_id    department_id    max_salary
-------------------------------------------------------
2                      1                         15000
5                      2                         17000


Standard SQL doesn’t allow this query because the employee_id in the SELECT list is neither within an aggregate function nor in the GROUP BY list. The problem is that there’s no obvious way to figure out which employee_id to associate with each result row. It’s obvious to you because you know what you want, and you understand the semantics of the data.

Some SQL engines allow this query. I know that Sybase does (I used to work there), and I think Microsoft SQL Server does also (it was originally based on Sybase’s DBMS). When I worked at Sybase, it would first create the result without the “extra” column, then join it back to the original base table(s) on the grouping column(s). In this case, it would join the result of the first query I showed you to the employees table on the department_id column, giving the following result:



employee_id    department_id    max_salary
-------------------------------------------------------
1                      1                         15000
2                      1                         15000
3                      1                         15000
4                      2                         17000
5                      2                         17000


I’ve never been able to figure out a use for these semantics. I believe it was a mistake for Sybase to allow this type of query and to interpret it this way, but once a feature is in a released product it’s hard to get rid of it.

You could get the results you want by rewriting the query like this:



SELECT e.employee_id, e.department_id, m.max_salary
FROM employees

    (SELECT department_id as m_department_id, MAX(salary) as max_salary
     FROM employees
     GROUP BY department_id) as m
INNER JOIN


I accidentally submitted my previous posting before I was done with it, and wasn’t able to fix it within the five-minute edit window. Please disregard that one and pay attention to this one:

The query uses non-standard SQL, in that the username column in the SELECT list is not also in the GROUP BY list. Standard SQL requires that in queries that use aggregate functions (MAX, MIN, SUM, AVG), all columns in the SELECT list must either be inside an aggregate function or must also appear in the GROUP BY list.

When you use the MAX function with a GROUP BY list, it means you want to find the maximum value within each group. Suppose, for example, you have an employees table with three columns: employee_id, department_id, salary. Let’s say it contains this data:



employee_id     department_id    salary
-------------------------------------------------
1                       1                         10000
2                       1                         15000
3                       1                         12000
4                       2                         11000
5                       2                         17000


To find the maximum salary for each department, you would do this:



SELECT department_id, MAX(salary) as max_salary
FROM employees
GROUP BY department_id


This should give you:



department_id    max_salary
------------------------------------
1                         15000
2                         17000


Now suppose you want to know the ids of the employees with the highest salary in each department. You could try this:



SELECT employee_id, department_id, MAX(salary) as max_salary
FROM employees
GROUP BY department_id


You might expect to see:



employee_id    department_id    max_salary
-------------------------------------------------------
2                      1                         15000
5                      2                         17000


Standard SQL doesn’t allow this query because the employee_id in the SELECT list is neither within an aggregate function nor in the GROUP BY list. The problem is that there’s no obvious way to figure out which employee_id to associate with each result row. It’s obvious to you because you know what you want, and you understand the semantics of the data.

Some SQL engines allow this query. I know that Sybase does (I used to work there), and I think Microsoft SQL Server does also (it was originally based on Sybase’s DBMS). When I worked at Sybase, it would first create the result without the “extra” column, then join it back to the original base table(s) on the grouping column(s). In this case, it would join the result of the first query I showed you to the employees table on the department_id column, giving the following result:



employee_id    department_id    max_salary
-------------------------------------------------------
1                      1                         15000
2                      1                         15000
3                      1                         15000
4                      2                         17000
5                      2                         17000


I’ve never been able to figure out a use for these semantics. I believe it was a mistake for Sybase to allow this type of query and to interpret it this way, but once a feature is in a released product it’s hard to get rid of it.

You could get the results you want by rewriting the query like this:



SELECT e.employee_id, e.department_id, m.max_salary
FROM employees as e,
    (SELECT department_id as m_department_id, MAX(salary) as max_salary
     FROM employees
     GROUP BY department_id) as m
INNER JOIN on e.department_id = m.department_id AND e.salary = m.max_salary


My SQL is rusty so it’s likely that it won’t work as written. I hope it at least gives you an idea of how to solve the problem using your data. I’m assuming, by the way, that your DBMS allows subqueries in the FROM list.

Honestly, this looks a bit like klunky database design. From what I can see, there are two tables, user, and user_meta. User contains basic user information, while user_meta contains additional user metadata, such as first name, last name and user_level. Frankly, this really should have been put into a single table, IMHO.

Anyway, so what we have is a set of tables a bit like this:

Users:



id          username
----------- ----------
1           bbrown
2           amurray
3           mobama
4           ttest


Users_meta:



user        meta       value
----------- ---------- ----------
1           first_name Bobby
1           last_name  Brown
1           user_level 1
2           first_name Andy
2           last_name  Murray
2           user_level 4
3           first_name Michelle
3           last_name  Obama
3           user_level 5
4           first_name Test
4           last_name  Test
4           user_level 6


What that query is trying to do is join these two tables together to give you a single resultset, with a list of all users, their username, first name, last name and user level.

If you do a straight join between the tables, without doing any GROUP BY, you get this query:



SELECT
	Users.id as id,
	Users.username as username,
	CASE WHEN Users_meta.meta = 'first_name' THEN Users_meta.value ELSE NULL END AS first_name,
	CASE WHEN Users_meta.meta = 'last_name' THEN Users_meta.value ELSE NULL END AS last_name,
	CASE WHEN Users_meta.meta = 'user_level' THEN Users_meta.value ELSE NULL END AS user_level
FROM
	Users
LEFT JOIN Users_meta
	ON Users_meta.[user] = Users.id


Here are the results:



id          username   first_name last_name  user_level
----------- ---------- ---------- ---------- ----------
1           bbrown     Bobby      NULL       NULL
1           bbrown     NULL       Brown      NULL
1           bbrown     NULL       NULL       1
2           amurray    Andy       NULL       NULL
2           amurray    NULL       Murray     NULL
2           amurray    NULL       NULL       4
3           mobama     Michelle   NULL       NULL
3           mobama     NULL       Obama      NULL
3           mobama     NULL       NULL       5
4           ttest      Test       NULL       NULL
4           ttest      NULL       Test       NULL
4           ttest      NULL       NULL       6


A few things to note here: we have multiple rows per user, and we also have a large number of NULL values, because the join is evaluated per row in the users_meta table, giving us 4 rows per user. What you want to do is collapse that into a set that only gives us one row per user, eliminating the NULLs at the same time. That’s where the GROUP BY clause comes in:



SELECT
	Users.id as id,
	Users.username as username,
	CASE WHEN Users_meta.meta = 'first_name' THEN Users_meta.value ELSE NULL END AS first_name,
	CASE WHEN Users_meta.meta = 'last_name' THEN Users_meta.value ELSE NULL END AS last_name,
	CASE WHEN Users_meta.meta = 'user_level' THEN Users_meta.value ELSE NULL END AS user_level
FROM
	Users
LEFT JOIN Users_meta
	ON Users_meta.[user] = Users.id

GROUP BY
	Users.id, users.username


However, you can’t run this query (at least in SQL Server), because it gives you the following error message:



Msg 8120, Level 16, State 1, Line 4
Column 'Users_meta.meta' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Msg 8120, Level 16, State 1, Line 4
Column 'Users_meta.value' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.


You can’t include a column in the SELECT portion of a GROUP BY unless it’s included in an aggregate function, or else is part of the GROUP BY clause. The logical thing to do is to wrap an aggregate function around each meta_value, which takes you back to your original query. MAX is just the most logical aggregate function to use; but MIN would work equally well, since in each column group there’s only one non-null value. MAX and MIN both eliminate NULLs when doing a column concatenation of this sort - the MAX of a column is only NULL if every value in the column in NULL, and the same applies to MIN.

Here’s your query with MAX replaced with MIN:



SELECT
	Users.id as id,
	Users.username as username,
	MIN(CASE WHEN Users_meta.meta = 'first_name' THEN Users_meta.value ELSE NULL END) AS first_name,
	MIN(CASE WHEN Users_meta.meta = 'last_name' THEN Users_meta.value ELSE NULL END) AS last_name,
	MIN(CASE WHEN Users_meta.meta = 'user_level' THEN Users_meta.value ELSE NULL END) AS user_level
FROM
	Users
LEFT JOIN Users_meta
	ON Users_meta.[user] = Users.id

GROUP BY
	Users.id, users.username


And here are the results: identical, as you can see:



id          username   first_name last_name  user_level
----------- ---------- ---------- ---------- ----------
2           amurray    Andy       Murray     4
1           bbrown     Bobby      Brown      1
3           mobama     Michelle   Obama      5
4           ttest      Test       Test       6
Warning: Null value is eliminated by an aggregate or other SET operation.

You also get a useful warning that the NULL values have been eliminated by an aggregate, which in this case is exactly what we wanted - though in other cases the warning might be an indication of something incorrect in our query.

Sorry, I meant to add this but forgot: what this query is trying to do is essentially a pivot operation, taking column values and transposing them into the structure we want. In standard SQL, the only way to do this is using the CASE and GROUP BY clauses, as we have done above. In many mainstream SQL databases, you can do with via non-standard SQL extensions. I’m most familiar with SQL Server, so this is the syntax I’d use:



SELECT 
*
FROM 
	(SELECT
		Users.id as id,
		Users.username as username,
		Users_meta.meta,
		Users_meta.value
	FROM
		Users
	LEFT JOIN Users_meta
		ON Users_meta.[user] = Users.id) users
PIVOT (
	MAX(users.value)
	FOR users.meta IN ([first_name], [last_name], [user_level])
) AS PivotTable


Results:



id          username   first_name last_name  user_level
----------- ---------- ---------- ---------- ----------
2           amurray    Andy       Murray     4
1           bbrown     Bobby      Brown      1
3           mobama     Michelle   Obama      5
4           ttest      Test       Test       6

(4 row(s) affected)


Identical results, as you can see, with more elegant syntax and no CASE functions required. I’ve also found that this approach gives you better performance, but that might well be a quirk of SQL Server. Also, there’s no warning about NULLS being eliminated: the PIVOT function automatically assumes the elimination of NULL values when an aggregate function is used in the PIVOT clause.

More reading on PIVOT and UNPIVOT in SQL Server, if you’re interested: http://technet.microsoft.com/en-us/library/ms177410(v=sql.105).aspx

Similar functionality in Oracle: ORACLE-BASE - PIVOT and UNPIVOT Operators in Oracle Database 11g Release 1

Thanks for your answers…

That’s what I thought originally but then I saw that there were at least about 5 more types of metas. The database is used on multiple sites so it would take a bit of effort to change the database. BTW the Users table also includes a password column and a column storing whether the user is archived.

On seconds thoughts it should just be one table with 12+ columns.

One more vote for poor table design.