Explain database normalization

Suppose you need to explain database normalization to someone who is not terribly computer literate, and not at all database literate. You’d like to educate her on the concepts of one-to-one, one-to-many, and many-to-many. A metaphor or real-life example is the best way to go about it. What metaphor would you use?

The one that springs mind for me is banking. You might have a table listing customers, and another listing their transactions. That’s a pretty easy way to visualize one-to-many. But I’m having trouble thinking of a many-to-many example.

What’s a better metaphor? There’s the old bookstore example given in help files. How about a sports metaphor? Or music? Or a supermarket?

You might try some kind of calendar/scheduling situation, like classes at a university. If each class takes place in a certain room, then the room-class relationship is one to many. Each class has many students, and students take many classes, so there’s the many-to-many.

Ooh, I like that!

A good example from one of my classes was music.

So, you’ve got a bunch of songs in your MP3 player, and each song has an artist and an album. Each artist can have multiple albums. A song can be recorded by more than one artist. So there’s a many-to-many relationship there.

Normalization would be working to organize all of this information in such a way that you’ve only got each song, artist, and album listed once with their related entries “pointing back” to that single listing.

You could use this message board, actually!

We have a table of registered users [columns: user ID, real name, password {encrypted}, registration email, screen name, signature line, registration date, status], a table of forums [columns: forum ID, moderator ID, forum name, etc], a table of posts [columns: forum foreignkey, user foreignkey, post timestamp, post contents, etc],…

It should be easy to imagine how much godawful redundant info you’d need to record if you were doing this in a flatfile system, and compare it to a properly normalized structure.

How about the Internet Movie Database (or a small subset thereof)? One table for movies. One table for actors. One table contains a link to a movie, a link to an actor, and the name of the character.

The book where I learned about normalization used the example of a database of employees at a company. It needed each person’s dependents (for insurance coverage purposes), so the first normal form was a dependents table, with foreign key links to employees. There were second and third normal form examples too, but I don’t remember what the many-to-many case was.

I’ve used the metaphor of books or music collections to great effect with people who are not familiar with computers or databases. The nice thing about doing a book collection is you can start off by describing a table of books where one field is the author, then explain why that doesn’t work (authors with multiple books) and explain how you would to a 1:N relationship with the authors table. But then you realize that that also doesn’t work, because books can have multiple authors, too, so now you get to explain N:N.

I think the sales transactions metaphor is a good one (perhaps only because I have spent a fair bit of time developing systems that do that).

And I think a good way to start is to look at a hypothetical non-normalised table - so in this case, one where every record repeats the order number, the date, the full customer details, the full product details, etc.
Then examine how the repetition of the product details means that you could just omit them and refer to a single line in a product table - and the same for the customer details, etc.

For many-to-many, I like the metaphor of a pizza chain with many franchises that offer some subset of all the available specials. If you want to figure out all the ingredients needed by a particular location, there’s a very natural N:N relationship to make that query work.

Just out of curiosity, why do you need to explain many-to-many relationships to someone who’s not at all database literate? That’s not an advanced topic or anything, but it’s not completely introductory either.

Thank you all for the suggestions. They’re all awesome. I ended up using the university one.

We’re trying to put together some specs, and she’s not understanding half of what I’m writing. If I can get her to look under the hood a bit, it may deepen both our understandings of how to constuct the code.

What’s wrong with banking? Each customer is a “one”, and has a history of transactions on the account–deposits, interest payments, counter withdrawals, checks paid, and debit card transactions. Obviously unless the customer just opened the account five minutes ago, there will be more than one transaction for each customer–thus you have your “many”.

So to begin with, tell your friend that the essence of normalization is the elimination of any duplication of data, other than where it is needed to allow you to cross-reference between lists. For example, the master customer record for each account holder has some sort of unique key value to identify it. For simplicity’s sake let’s say it’s the account number, although I suspect that banks use something different in order to add an extra layer of security.

The transactions also have a unique key identifier of which the account number forms a part; in addition there might be a date and sequence number to differentiate all the transactions for one customer. You wouldn’t want to have the customer’s address and phone number duplicated in the transactions, but in the early days of computing, that is how it was often done, because accessing data on cards and tapes was so difficult.

(You ask your friend if they follow you so far.)

Going a little further, we probably want to record the kind of transaction for each instance. But we don’t want to store “Debit Card Purchase” or “Interest Payment” or whatever it is, a million times for all our customers, so we set up additional lists that assign numerals to each transaction type; then in the transaction record itself, we record just a “1”, or “2”, or whatever, as appropriate. We program the system to look up in the additional lists when it needs to display the transaction type.

Notwithstanding all this (you continue), perfect normalization is an ideal that is usually not recommended. The more closely you approach it, the more often you have to combine tables to run a single query. Depending on the processing resources you have, you have to weigh the advantages of greater normalization against the greater system cost that can ensue if such normalization is done inappropriately. This usually happens if the original data design was ill suited for how it is being used today.

(Now offer them a nice glass of wine. They (and you) deserve it!)