Database design: Meaningful primary keys?

Epimetheus · August 20, 2007, 11:35pm

Ah, I see what you mean now, though I would have never considered them “meaningless.” UserID and or EmplID are often used as primary keys, but I never thought of them as meaningless. I suppose though, the use of them as “meaningless” ID numbers preceded their usefulness for say, logging on the system or some other identification method other than using SSN.

I would personally never use SSN, phone numbers, a person’s last name or anything like that as a Primary key, but I have used many of the ones mentioned in the post above. (#19)

Mangetout · August 21, 2007, 12:00am

I’m in the middle of trying to sort out a complete mess caused by a choice to use a meaningful primary key (ISBN) - the application is somewhat mature and has an established user base, who depend on it entirely for their productivity. Now ISBNs have changed from 9 to 13 digits and I have to find a way not only to convert the data, but preserve all the referential integrity, as well as modifying all instances of reports and interfaces that use it. If the products had been identified by a plain ID (simply kept invisible from the user), I could have just added a new field and phased out the old one, or even converted the old one to the new, without having to worry that it will fuck everything up entirely.

Meaningful keys are bad right up there past multiple values in a single field.

KneadToKnow · August 21, 2007, 12:04am

Nitpick: ISBNs have changed from 10 to 13 digits

EllisDee · August 21, 2007, 12:05am

Sorry for the confusion. In the context of this thread, meaningless refers to a piece of data that has no intrinsic business meaning.

Most primary keys are auto-generated, which is as close to meaningless as you get. (If sequential, they’ll tell you the order in which the records were entered, but that’s a bit of a nitpick and isn’t even a piece of business data anyway.)

Arnold_Winkelried · August 21, 2007, 12:07am

They are meaningless because the customer id can be 20 or 28239807432 or -15 or 6.7 or ABC128342F3E, they are totally independent of the information about the customer, they carry no meaning about the customer.

A meaningful customer id would carry some information about the customer:
e.g. my customer id is always the first three letters of the customer name followed by a sequential number.
For Cingular (they buy cell phones from me) the customer ID is CIN2893243
Now Cingular changes its name to AT&T. Do I go through my whole database, changing CIN2893243 to ATT2893243 ? Or do I leave it as CIN2893243 and explain to every new employee “that’s an old name, back from when Cingular wasn’t bought by AT&T”?

If my customer id is meaningless, 32473298, then when Cingular changes to AT&T I don’t have to change the customer id. The customer id is very important from an application point of view. Internally, my application has to know that Cingular is 32473298. But a user of my database (the person entering orders) does not have to know that Cingular is 32473298. They search for a company with the name Cingular, find one or more companies with similar names, use other information to pick the right company, and then see all orders for that particular Cingular company. Your application is looking for 32473298 but the user doesn’t have to know that Cingular is 32473298.

Arnold_Winkelried · August 21, 2007, 12:09am

I’m old enough to remember the good old days of COBOL programs with record REDEFINES and flat files.

If the first two bytes of the record are 01, then the record is a person name. Characters 1-20 are first name, 21 - 50 last name.
If the first two bytes of the record are 02, then the record is a company name. Characters 1-40 are the company name, characters 41-50 are the state/city abbreviation.

Good times!

Mangetout · August 21, 2007, 12:11am

:smack: I have no idea why I typed 9. I honestly don’t.

Arnold_Winkelried · August 21, 2007, 12:14am

because isbn[0] is the fist character of the ISBN and isbn[9] is the last character of the ISBN.

Mangetout · August 21, 2007, 12:18am

The value of meaningless keys is that they need never change - almost any other piece of information in common types of table (such as customer or product data) is potentially subject to change - people move address, streets get renumbered, people change their names or tell you that you’ve entered their birth date incorrectly, SKUs are subject to renumbering schemes when companies merge, or are re-used when obsolete, barcode identification schemes change, and so on.
Using any piece of real information as a primary key, no matter how unlikely you think it is to change in the future, is still able to come back and bite you hard on the arse at the worst possible time (i.e. when you’re done designing the database and the whole application is built over the top of it).

-On top of all that, there’s simply no good reason to do it.

minor7flat5 · August 21, 2007, 1:13am

Totally off topic: I had a real eye-opener when I got a glimpse at how they were designing our new data warehouse. Apparently all of these things we take for granted, such as foreign keys, seem to go out the window in warehouse land. Even our nice surrogate keys from the source systems cause indigestion in the warehouse.

I’m heading off to Informatica school next month to be indoctrinated into the faith; it should be interesting.

To the OP: It may be religion, but I can dictate to a certain degree in our shop, so surrogate keys are the rule. The system I designed with natural keys twelve years ago taunts me every time I look at its rigid unchangeable data. The fact that the chemical stockroom in our pharma company cannot change its room number without updating forty zillion order/request/receiving/dispensation records is convincing enough to me.

tomndebb · August 21, 2007, 1:35am

Just an FYI regarding SSNs: we have not (yet) ever re-used one.

(It is still a much smarter idea to use an invented key for ID purposes. What happens when some variant of Bush’s now dormant immigration reform gets implemented in a few years with a rule to exclude the temporary foreign workers from Social Security and gives them a 10-position ID? What happens if the system survives long enough that SSNs do get re-used?)

EllisDee · August 21, 2007, 2:06am

I don’t fully understand what the deal is here, so forgive me for copying somebody else’s post from another messageboard. Can anyone explain what the thinking is here:

We are also big supporters of compound primary keys. Since we use MS SQL SERVER and take advantage of “clustered” primary keys, having them be compound is almost a requirement.

For example our student class table for a school district has a primary key of:

Student Id, Year, Bldg, Course, Section (for example)

Our Course table is:

Year, Bldg, Course

Section table is

Year, Bldg, Course, Section

By having keys like this you get the benefit of the data being visually meaningful (that’s important). Also by clustering the data on this index it’s likely that all of the data for a single student is in a “single physical block on disk” - meaning a single I/O will retrieve that data. The same with the Course and Section tables - getting all the courses for a year and building might be just one or two I/O operations.

When you are dealing with millions of rows in tables things like this make a big difference.

minor7flat5 · August 21, 2007, 2:45am

By doing this, they can query a table and get back lots of useful stuff from the index alone, without ever having to look at the rows themselves.
I imagine that if the PK is compound then any FK references would also embed the useful stuff in the child tables, thereby making it much easier to get all the data in one shot rather than have the DB engine hunting all over the disk for bits of a row or joined in tables.

As the person said, this becomes an issue when you get into millions of rows, particularly when you end up performing complex joins between multiple huge tables. This kind of tweaking is what DBAs get paid $$$ for.

In the aforementioned data warehouse we are struggling with precisely these kinds of issues, where joining in extra tables to get the important stuff just isn’t an option, with close to a hundred million rows in one central table alone. Besides the traditional denormalization of a warehouse, the DBAs are adding all kinds of special compound indexes that address certain classes of queries. These indexes allow the DB engine to perform many queries on these huge tables without ever actually looking at the table data.

Duckster · August 21, 2007, 2:50am

You “heard” wrong.

Source: Social Security History

EllisDee · August 21, 2007, 3:54am

Nice cite, thanks.

Why the irony quotes around “heard”? Are you implying that I did not, in fact, hear that?

rainwalker78 · August 21, 2007, 1:55pm

I am in my 3rd year out of college, and I landed as a database developer. I find this thread fascinating, and I wish this kind of topic came up more often.

I come down on the meaningless PK side, for pretty much the same reasons that everyone else mentioned. If I remember correctly, this goes against what my textbooks said, but they don’t have to maintain these databases - I do.

zoid · August 21, 2007, 2:24pm

Sorry for the hijack but what’s “frex”?

Noone_Special · August 21, 2007, 2:26pm

Slightly lazy and slangy contraction of “For example.” Nothing important.

Sorry about that :o

LionelHutz405 · August 21, 2007, 2:41pm

Another vote for meaningless primary keys.

In my opinion, you can never say data will never change. Even something like SS numbers that you are assured are never re-used. This is because you never know if it will be entered into the system correctly the first time. If a SSN is entered with a typo, and fixed later, to the system the data has changed. Even if in real-life the SSN didn’t actually change. With a generated PK, the fix is trivial, otherwise you have problems.

To say it will ‘never change’ assumes no one, ever, will make a data-entry mistake.

ZipperJJ · August 21, 2007, 2:48pm

Do you still have that textbook? I’d be interested in seeing why a textbook would be for meaningful PKs and what the reasoning would be.

I can see, like in minorflat’s post, where hundreds of millions of rows are involved and you’ve got to do everything in your power to account for every second of processing time…but in general, for tables with under a million rows, I can’t see the reasoning. I would hope that your textbook doesn’t say “in EVERY case always use meaningful primary keys.” That would be horrible.

In my short career in working with databases (about 5 years), I’ve already had to help other companies re-configure and back out of their original DB design and add meaningless PKs because their original scheme is not scalable at all.