Database design: Meaningful primary keys?

Never say ‘never’ - it’s just another way of asking to be proven wrong.

There’s nothing to stop you repeating fields in different tables for quick reference or indexing - it goes against the principles of normalization, but it’s a lesser sin than using a real piece of data as a key, and it’s unlikely to cause dire consequences further down the road.

Now that’s a good point. IIRC, all the textbooks and tutorials I’ve seen have meaningful keys and content since they’re trying to teach set relationships and show how DBs work, not teach maintainable and proper coding practices.

I concur - all the examples I can think of that used meaningful keys were probably just using them as simple illustrative shorthand to demonstrate the concept of relational tables. I think they typically go on to suggest using an autonumber or other generated key once they start talking about real data.

I have some of my textbooks, but I don’t remember exactly where I saw it. I don’t think it said “in EVERY case always use meaningful primary keys.” We hardly discussed meaningful vs. meaningless keys at all (unfortunately). I just have a specific memory of looking at a “student” table, in either a textbook or an in-class example, that used SSN as the primary key. Shoot - I might be making up memories.

The short answer remains the same - use meaningless keys.

The only legitimate exceptions as I see it are:

(a) When you are dealing with huge databases - hundreds of millions of records - and lots of concurrent users.

(b) Data warehousing, which by its very nature sacrifices flexibility for speed.

(c) Lookup tables such as States (AL, AK, TX, NY, etc.) or Sex (M, F) where the data is reasonably immutable.
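The rule of thumb and the lookup-table exception can be sketched together. This is a toy SQLite example of my own (the table and column names are invented, not from the thread): the immutable lookup table keeps its natural key, while the mutable real-world table gets a meaningless autonumber and keeps SSN as an ordinary unique column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Immutable lookup data: a natural two-letter code is a reasonably safe PK.
cur.execute("CREATE TABLE States (code TEXT PRIMARY KEY, name TEXT NOT NULL)")
cur.executemany("INSERT INTO States VALUES (?, ?)",
                [("TX", "Texas"), ("NY", "New York")])

# Mutable real-world data: use a meaningless autonumber as the key and keep
# the natural identifier (SSN) as an ordinary unique column.
cur.execute("""
    CREATE TABLE Employees (
        emp_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- meaningless key
        ssn    TEXT NOT NULL UNIQUE,               -- real data, not the key
        state  TEXT REFERENCES States(code)
    )
""")
cur.execute("INSERT INTO Employees (ssn, state) VALUES ('123-45-6789', 'TX')")

# Correcting a mistyped SSN is now a one-column update; no keys move.
cur.execute("UPDATE Employees SET ssn = '123-45-6780' WHERE emp_id = 1")
print(cur.execute("SELECT emp_id, ssn FROM Employees").fetchall())
# -> [(1, '123-45-6780')]
```

Nothing referencing Employees ever has to know about the SSN correction, which is the whole point.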

Confession: In my previous post here, I mentioned that I once hotly debated the issue of meaningless vs. meaningful keys. I was the one arguing for meaningful keys. I was wrong.

Right, and it’s not a straightforward change, either. Depending on the DBMS, you can’t just change the SSN in the Employees table because you’ve got child tables with SSN FKs that are dependent on the SSN PK. Nor can you just change the SSN in the Department table, because the new SSN won’t point to a legitimate key in the Employees table.

You have to either turn off the FK integrity (or whatever it’s called), make your changes and turn it back on, or you have to first create a whole new Employees record with the correct SSN, cascade those changes to any child tables, and delete the original Employees record. Lots of room for error that way.
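For the curious, the update-blocking behavior described above can be reproduced in miniature with SQLite (a hypothetical Employees/Timesheets schema of my own, not anyone’s real data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs on request

conn.execute("CREATE TABLE Employees (ssn TEXT PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE Timesheets (
        id  INTEGER PRIMARY KEY,
        ssn TEXT REFERENCES Employees(ssn)   -- no ON UPDATE CASCADE declared
    )
""")
conn.execute("INSERT INTO Employees VALUES ('123-45-6789', 'Alice')")
conn.execute("INSERT INTO Timesheets VALUES (1, '123-45-6789')")

# Correcting the parent PK fails while child rows still point at it.
try:
    conn.execute("UPDATE Employees SET ssn = '987-65-4321'")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True
print("update blocked:", blocked)   # -> update blocked: True
```

Declaring the FK with ON UPDATE CASCADE would let the DBMS propagate the change automatically, and disabling FK enforcement is the “turn it off and back on” route mentioned above - both are workarounds, not cures.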

I have seen this in action (use of a “meaningful” primary key for a DB table) and I have seen the pain it can cause. Sure, you may get away with it for a while, maybe even forever, but what really is the downside to taking out simple insurance against the pain?

Here’s a simple 2x2 boolean grid that should make things clear:

Use Meaningful Primary Key? [Y/N]
(Where “N” = instead using a unique constraint and clustered index on the intended primary value)
Will This Key Eventually Change In Nature/Value Set? [Y/N]

If you use a meaningful primary key [Y]…
…and the key never changes its nature or value set [N]…
…you come out OK.

If you use a meaningful primary key [Y]…
…and the key unexpectedly changes its nature or value set [Y]…
…you come out Hosed Up The Wazoo.

If you use a meaningless primary key [N]…
…and the key never changes its nature or value set [N]…
…you come out OK.

If you use a meaningless primary key [N]…
…and the key unexpectedly changes its nature or value set [Y]…
…you come out OK.

What conclusion do you think is the prudent one?

I don’t see how this is a legitimate exception. What is your rationale?

We almost always use “meaningless” numbers. I can’t tell you the number of times I’ve had to tell a user, “It’s just a number, it doesn’t mean anything. We use numbers because we’ve got lots of numbers.”

Same as for data warehousing. Sacrificing flexibility for speed.

Actually, I’ve never worked with a database large enough to justify that, but there are some out there.

Interesting. Is there an actual performance gain from using a “native primary key” field versus a “clustered unique index” built on a field when querying a table?

Does it depend on the DB vendor?

Referential integrity.

Okay, what the fuck? Is this topic really not covered in school? I always assumed it was, but I wouldn’t know. I’d think it would be one of the very first things covered.

Probably. Re-read the entire quote I posted in [post=8899720]post 32[/post].

I spoke to a coworker of mine who graduated from the same university in the same semester as me. He doesn’t recall a discussion about this either. I don’t think it is brought up as its own topic. I think, if you have a good professor, they probably bring it up as an aside. Unfortunately, the prof we both had for the database design class didn’t really bring it up.

I hope I am not forgetting the discussion from school, but I think I remember the discussions about choosing a PK, and I don’t remember this…consideration being mentioned.

As a data point: one of our customers called us to report a problem with slow performance of our software. They had a single (non-partitioned) table in an Oracle database that was 2.5 TB in size! This was 4 or 5 years ago. I hate to think how big that table is now.

From the little I know about cluster keys, they are a proprietary feature of MS SQL Server. The basic idea is that you can declare a cluster key on a table, and the table will be physically ordered on that key. The benefit is that if you use the cluster key in a query, the amount of disk head movement is reduced, thus giving a faster response.

I would imagine, however, that there is significant overhead involved by the DBMS in keeping the physical order as the table grows.

As it was explained to me, the data storage strategy is similar to a library sort, where gaps are left in the table so that each insertion doesn’t require repositioning every subsequent record.
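A toy sketch of that gap-leaving idea, in plain Python (my own illustration of a library sort, not any vendor’s actual storage engine):

```python
def insert(table, rec):
    """Insert rec in key order; shift records only as far as the nearest gap.

    `table` is a sorted list in which None marks an empty slot left for growth.
    """
    # Position of the first occupied slot that sorts after rec.
    i = next((j for j, v in enumerate(table) if v is not None and v > rec),
             len(table))
    # Nearest gap at or after that position; grow the table if there is none.
    g = next((j for j in range(i, len(table)) if table[j] is None), None)
    if g is None:
        table.append(None)
        g = len(table) - 1
    # Shift only the short run [i, g) right by one, then drop rec into place.
    for j in range(g, i, -1):
        table[j] = table[j - 1]
    table[i] = rec

table = [10, None, 30, None, 50, None]   # sorted records with gaps
insert(table, 20)
print([v for v in table if v is not None])  # -> [10, 20, 30, 50]
```

Without the gaps, that insert would have shifted every record after position 1; the overhead mentioned above is the cost of re-spreading the gaps as they fill up.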

I’ve primarily used Sybase as a DBMS and it certainly supports use of a clustered index. In fact I thought it was a standard feature of a DBMS.

Well, Sybase and MS SQL Server were originally the same product; they just ran on different platforms, so you would expect that there would still be similarities.

So far as I know clustered indexes are not a standard DBMS feature.

Oracle has similar features re clustered tables, including “Multi-table Index Cluster Tables” where parent-child records (ie from two or more related tables) have their records stored physically as close as possible so that the parent and children are picked up with one disk read (or as few as possible).

Funny - this is exactly how IBM built their hierarchical databases in the ’70s.

No, it’s just that in this case locating a cited answer was relatively easy to do. My apologies for any misunderstanding.