How would you solve this database exercise?

My company is recruiting a new DBA/SQL guru, and so I had to come up with some questions to give potential candidates. I wrote the following question to judge the candidate’s SQL and relational design knowledge. Then, I wondered how I would answer it, so I did. How would you approach it? It’s pretty simple, only using a few tables, but requires you to do a lot of work in your head, since there’s no actual database to poke.

The Question:

Write a stored procedure that accomplishes the following tasks. The procedure should check for errors and be able to rollback at any point. You will need to come up with a schema that makes sense based on the requirements.

[ol]
[li]Given a product ID as input, select a product from the PRODUCTS table[/li][li]Find all reviews for that product in a REVIEWS table[/li][li]For each review, find all the votes (VOTES table) from users who have read the review and voted it helpful (+1) or unhelpful (-1)[/li][li]Each review rates the product on a scale of 1-5. Find a weighted average rating for the product, given that[/li][LIST=1]
[li]All reviews with fewer than five votes are discarded,[/li][li]All reviews with a helpful : unhelpful ratio of less than 0.5 are discarded,[/li][li]All reviews with a helpful : unhelpful ratio of between 0.5 and 1 are counted at half weight,[/li][li]All other reviews are counted at full weight.[/li][/ol]
[li]Write the product’s weighted average to its record in the PRODUCTS table.[/li]
[/LIST]

I would question writing a stored procedure in the first place.

Also, your test exercise appears to be more about “application logic” (what programmers do) vs database design/architecture.

I’d approach it from the side.
Then the front.
Then the side again.
I’d rub my beard with two fingers back and forth and nod while making “hmmmm” noises.
Then I’d stab you in the neck with the number two pencil and run out the door before you could realize I have absolutely no clue how to program a Goddamned thing in SQL.

Oh, and then I’d write you a thank you note, asking when the second round of interviews were and when a potential decision might be reached.

Completely disagree.

This is a classic problem that is best done with a stored proc as opposed to the application layer. All of the data is in the DB; there’s absolutely no reason to pump it up to the application for processing. A stored proc is going to be faster easier to optimize than application code that does the same thing.

I’m too lazy to write out the whole stored proc, but here’s more or less what I’d do:

I’d have the three tables you mention (Products, Reviews, Votes) with a foreign key connecting products and reviews, and another one connecting reviews and votes.

From there, you can do one SELECT that would pull out all the product ratings, using a where clause to discard the reviews that don’t meet the criteria. Might have to use a couple subqueries to get the ratios and such.

Then I’d tell you that I’m not going to write that number to the Products table because it’s a constantly changing value if this is a live database, and it makes more sense to calculate it when needed than it does to store it. Or make a view that has it calculated. Or if the DB is too big to do it on the fly without bogging down the server, then maintain another table with the current average review that gets updated via triggers that go off when new reviews are added.

Do I get the job?

I do like your point about the usefulness of storing such a transient value in the DB. But I’m more interested in how people will calculate the ratios and weighted average stuff; whether they will put it into separate steps or try to cram all the logic into a WHERE clause. That sort of thing.

And I agree with you about the value of doing it as a procedure vs. app logic.

Performance is not the only concern. If you need to write business logic that works across different databases, stored procs hinders more than helps.

Yes, all the data is on the database side. Some of the performance loss of the application layer can be mitigated by having an application component executable residing on the same database server to eliminate “network” latency.

Stored procs are overused. It also doesn’t help that database vendors love espousing all the “performance” benefits of stored procs because when you invest the effort to write all that custom code via their proprietary SQL inside their database, it makes it harder to migrate to something else.

Maybe friedo’s particular scenario absolutely screams for stored proc. Maybe it doesn’t. My point is that I would question the need for it. Isn’t that what a database “guru” does… use his brain instead of just a reflexively doing what most people do?

I agree with you there. But the situation in the OP never said anything about multiple databases.

My experience is the opposite of yours: I far too often come across coders who NEVER use stored procs, even when the situation is perfect for it.

Heh.

I’ve come full circle on the whole “write the DB layer generically so we can change DBs easily” thing. I’ve written DB layers multiple times; no matter how “generic” it is, it’s a pain to move to another DB.

And actually moving to another DB very rarely happens. I lament all the time & effort I’ve lost writing vanilla DB layers. From now on out, unless there is a solid plan to move from one DB to another, my opinion is that it makes no sense to spend that time & effort up front; spend it when and if you really need to change DBs.

But that’s a whole other subject.

True in the real world. Not so true in an interview when you’re trying to judge stored proc writing skills.

I’ve gone the opposite direction — away from stored procs. I used to drink the stored proc Koolaid. The “performance” aspect of stored procs is way oversold. 95% of the bottlenecks in SQL code is the WHERE clause not matching indexes. Therefore, it wouldn’t matter if that SELECT WHERE statement was sitting in a db stored proc or app code – it would be slow either way.

Also, back in the old days that I was messing with this stuff, SQL stored procedures were impossible to debug line-by-line. What good is “faster” code when you can’t even check it for correctness by watching it in the debugger?

Also, I’m not interested in writing generic db layers. Sometimes “architecture astronauts” do get carried away. I understand that. However, for some things, the basic database connectivity driver is all the abstraction you need – no fancy db layer required.

It’s true in my world because I was somewhat combative in job interviews to see if the interviewer actually knew anything. I often threw the interviewer’s question back at him to see if his assumptions were correct in the first place.

“Why are manhole covers round?” … heh… you the smug interviewer thinks you know the answer to this but you really don’t. So… let me ask you-the-interviewer why aren’t manhole covers triangular?! Triangles also don’t fall into the hole like square ones. And what’s so bad about square ones?! There still has to be a lip/flange for any shape of cover to rest on. If that lip protrudes inward enough to be less than square root of 2 * length, the cover won’t fall through.

It’s been a long time since I wrote a stored procedure, so I don’t remember exactly how to do it. I’m not sure what kind of weighting function should be used either, so I guessed. Assuming the obvious schema, and that reviews.val is 1-5 and votes.val is +1/-1, in one Oracle SQL statement:

update products set review_value =
select sum(review_val)/sum(review_weight)
from
(select r.id id,
r.product_id product_id,
sum(v.val),
count(),
(case when sum(v.val)/count(
) > 0 then r.val
else r.val * 0.5)
end) review_val,
(case when sum(v.val)/count() > 0 then 1
else 0.5
end) review_weight
from reviews r,
votes v
where v.review_id=r.id
having count(
) > 5
and sum(v1.val)/count(*) >= -1/3
group by r.id) rv
where product.id=?
and rv.product_id=product.id;

The performance aspect is way more than just matching indexes, etc. I can tell you that a vast majority of application developers out there don’t truly understand how to compose SQL and create proper joins, etc. Code written at the application layer is often much more unwieldly to change, whereas code at the DB level is much easier.

This shows that you don’t truly understand the raw power of the SQL engine…your debugger isn’t going to help you any better if you’re writnig SQL in the application layer, either. You’re going to have a multi-line query statement that generates an error…same thing as inside the proc.

Word.

CREATE MATERIALIZED VIEW
prodavg

The rest is just fluff. Then again, I have a sysadmin background.

Why would you need to rollback anywhere in this query, and what could count as an error? Maybe you’d get a no data found, or if your data was dodgy a divide by zero? Anyway here’s how I’d do it:

1. Given a product ID as input, select a product from the PRODUCTS table

    Trivial query on PRODUCTS - actually on re-reading this it occurs to me there's no need to look at PRODUCTS, PRODUCT_ID will be a foreign key on REVIEWS surely?

**2. Find all reviews for that product in a REVIEWS table **

Cursor loop though REVIEWS querying up the VOTES stuff

**3. For each review, find all the votes (VOTES table) from users who have read the review and voted it helpful (+1) or unhelpful (-1)
4. Each review rates the product on a scale of 1-5. Find a weighted average rating for the product, given that **

    Parameterise all of these conditions into bind variables in VOTES query/queries
Either sum and average over round a loop or build into one chunky query.

All reviews with fewer than five votes are discarded, 
All reviews with a helpful : unhelpful ratio of less than 0.5 are discarded, 
All reviews with a helpful : unhelpful ratio of between 0.5 and 1 are counted at half weight, 
All other reviews are counted at full weight. 

5. Write the product’s weighted average to its record in the PRODUCTS table.

Nah, create a function which you can bolt onto a PRODUCTS query as a pseudo column. Some of us are allergic to denormalisation.

You could do the whole thing in one big Select statement (that’s how my boss would do it) but the rest of us think it’s a lot easier to read and maintain a bit of iteration and a few IF THEN statements.

We’re not talking about the same type of code.

I’m speaking of SQL code that uses variables, recordset looping, conditional branching, etc. I’m not talking about a mulitine SQL SELECT statement.

Incorrect generalization.

I’m merely a lowly Access DBA, but I learned SQL on QMF before I started on Access, so my inclination would be to do it with one big SQL query.
I agree that the score shouldn’t be stored in the DB if it can be avoided. If performance turns out to be an issue, I’d investigate what options are available for caching, down the line (at the web server if it’s a web app, or the client if it’s just going across the LAN to a stand-alone app).