Question about database best practice design. Number IDs or Names?

Ok, for example I have a patient database which has about 20 tables or so. In order to link these I should use a unique I.D. number correct? Assuming that is correct, what I assume I should do is have one table with an auto incrementing ID tag, and one field with the most unique piece of data I have. For example let’s use names (pretend I don’t have SSNs). My ID table would then have entries like 1. Fred Smith 2. Joe Blow etc.

What is the best practice for differentiating these when I have two people with the same name? I can obviously do things like 1. Fred Jones, DOB 3/4/45, blond hair, blue eyes, 6 ft tall. Which seems fine, but the problem that I see is that I will have another table for descriptive information. That violates one of the best practices I’ve read for DBs, and that being one piece of information lives in one place only. Is then the best practice to do what I described before? Just have the ID and one piece of information, and then query the other tables when I need more info to figure out who is who?

I’m not quite understanding your question. The purpose of having a unique ID is so that you can uniquely identify each record in a table. We will assume that using people’s names is not a good idea since two people can have the same name. Likewise, some people might not have a SSN. So a numeric ID should be fine.

Why do you want to put “descriptive information” in a separate table? If these items have a one-to-one correspondence then they should be in the same table. Can you describe in more detail what your relationships look like?

Sorry for not being clear enough, I am trying to link different tables in the same database. To continue my example, I might want to store information about sexual history or mammograms. I think I should use a separate table for those. My question is how should I keep track of who’s who in those separate tables.

Is it ok to simply have one table with patient information, and then just use that ID number as my global ID?

Your table containing the personal details of the patients needs the unique ID, then all of your other tables that contain records that want to link to a patient only neerd store a copy of the patient’s ID.
When you want to see appointments for Joe Bloggs, your application finds him in the patients table and uses his ID to find his records in the appointments table.

Then if Joe Bloggs changes any of his personal details, you can change them, but the ID remains the same.

Other entities in the database might also need to have unique IDs, for example, treatments - these could be listed in a table with unique treatmentIDs, and when Joe Bloggs is prescribed a medication, this is represented by a record containing his ID along with the treatment ID in a prescriptions history table.

Hope that makes sense.

You need to have a “foreign key” relationship between your tables. Let’s say for a simple example that you have 2 tables, called “Patient” and “History”. The primary key for your Patient table will be the auto-incrementing field and would be named “PatientID”. Other fields in the Patient table might be things like LastName, FirstName, Birthdate, Sex, etc.

Your “History” table should also have a unique ID, called HistoryID, and will contain historical information about the patient. One of the fields in the History table should also be called PatientID, where you would store the ID from the Patient table which corresponds to this history record.

To get information from the History table regarding a specific patient, you would write a query like

SELECT * From History where History.PatientID = 37

Yep that makes a lot of sense, thanks. To be clear, I should have:

A table called Patient information, with a Patient ID field.

A table called History, with a HistoryID field, and the PatientID as the identifying information.

I should not have:

a table called History, with a PatientID field and fields for things like name etc.

Correct. What you want to do is take it to as high a normal form as is practicable, understanding that as you go higher, the benefits of further normalization become more and more rarefied.

I would recommend actually using the SSN as your global ID.

There are distinct advantages to using an existing, unique identifier that is known by the patient. Like their SSN.

You appear to be dealing with a health care application, so SSNs will be widely available, and needed anyway. Very few patients will not have one, and you can easily accommodate those rare exceptions – for example, just assign each of them an unused SSN (like 000-00-0001 and increment for each new one (like you would for a unique id field).

Programmer here, sometime database designer.

The point of a unique key/id is that it is unique. Names are really lousy for this. Using a combination of name and DOB is called an “aggregate key”, and can sometimes be useful, but in this case, don’t do it.

I read a while back that the Social Security office of the US Government used to use Name, DOB, and City of Birth as an aggregate key to assign Social Security Numbers, which caused problems because major cities occasionally have two babies born the same day with the same name - and the second baby’s application for a SSN caused a key collision and was recharacterized as a request for a reissuance of a prior Social Security Card - FAIL.

SSN’s are a common choice for unique ID’s, but then not everyone may have one. An aggregate key of ID Type and Number might work (so someone without a US Social Security Number can have a key of ID Type = [British Passport] Number = <whatever their passport number is>, but then what many people do is create a “surrogate key” which is meaningless except for it being a key - basically an ID number for the person just for the purposes of the database. An auto column can be a way to implement this in the database. Guids are also a possibility.

I assume lots of DB developers learned their lesson on this just like I did.

I designed a database with natural keys, just like I learned in college DB classes. It was a chemical inventory application. Then came the day someone wanted to change the product codes to use a different notation—couldn’t do it without loads of work because those visible codes were used for a primary key, so there were dozens of associated tables that would be affected by a single update.

I have used numeric keys (surrogate keys) ever since.

These days I am really taking a shine to GUIDs—globally unique IDs. These are particularly useful when you are working in a huge corporate environment with data warehouses and marts all over the place. The GUIDs keep your data unique regardless of who’s database they are in.

Concur - it’s useful to have these as alternate indexes into the “patient” table, since the patient won’t usually know their ID value. Ditto social security number - I wouldn’t make it the official key because there might be times where you don’t have it. Say you get a patient who’s unconscious, or doesn’t know it, or whatever.

On the History table, you can define its key one of two ways:

Combination of patient ID + a sequential number within patient - e.g. Joe Blow’s procedures might be 1234 and 1, 1234 and 2, 1234 and 3, etc while Jane Smith might be 5432 and 1, 5432 and 2, 5432 and 3.

Alternately you’d have a sequential history ID, with the patient ID simply a foreign key. So 1 might have 1234 as the patient, 2 might have 5432, 3 might have 5432, 4 has 1234, etc.

To me, the combination key is more intuitive because you’re rarely going to do the lookup by history ID - basically you’re usually going to want to pull up all of Joe Blow’s history when you’re doing anything. Different DBMSs might handle one approach vs. the other more efficiently.

As others have said, I would advise against that.
For one, not everybody has a SSN.

Two, as minorflat describes, auto-creating a unique ID by the Database is much, much easier if you have to change things later than using personal IDs like SSN.

Three, there have been several lenghty discussions in the past on this board about how the general preference for using the SSN as unique ID (either because of laziness or bad schooling of database programmers) means that the SSN is both publically known and your key to a bunch of databases with sensitive information, which is a problem not only with identity theft, but also that you want to make a medical database secure.

As for how to differentiate two John Smiths, if added Date of Birth and City of Birth is not enough, I’d add eye-color and blood group(converted into some value); those will be easily known in a medical setting and combining all factors, makes the result unique (even identical twins should have different first names! If not, go and clobber the parents for being idiots).

A past project I worked on had a legacy system that was chock full of this. The legacy system used names as the primary key for the people table – and the list of people contained many recurrences of the most popular names. (Essentially, 500 people named “John Smith.”) Our system used unique IDs, and one of our major issues converting from the old system to the new system was taking information about “John Smith” and figuring out WHICH John Smith it referred to.

Basic principle of good database design: Always use unique IDs that contain no information as primary keys. Auto-incrementing numbers and GUIDs are great for this.

That had to have been painful!!! We once had to do some name matching on a bank customer list.

The bank primarily served the Chinese-American population in San Francisco.

“All the Chins in Chinatown” indeed!!

Using “information” as primary key is (as you said) very rarely a good idea. You just never know when something is going to change. OK, maybe a list of US States by abbreviation, at least I think it’s reasonably unlikely that Pennsylvania will change its abbrevation from PA to PE or something like that… but it’s just safer to avoid it all around.

That way lies its own danger.

Ex: “Person IDs start with the USPS state abbreviation of the person’s residence.”

I just moved to a different state. Does my ID have to change? Some places, the business rule says yes.

And as to painful – sometimes we had an entry that referred to “John Smith” and “James Jones,” with some indicators as to which John Smith but no indicators as to which James Jones. Imagine someone converting your high school Permanent File and its list of all the kids you hung out with, and trying to link them up to their Permanent Files.

shivers Client business started out in US. This is okay.

Client business expanded into Canada. This was also okay. Except for the part where the State field was set to 2 characters. Sorry, Prince Edward Island (PEI)! That was fixed reasonably easily, however.

Client business expanded into Europe. Well crap.

Assign your own, unique number to each patient. This number may be used only internally, but you must ensure it is unique, and a computer program can do that. This is the absolute link between records in different files for the same patient.

SSN sounds like a really good idea until you run across someone who gives you a wrong number (or someone writes it down wrong), and you have duplicates, and…well, hilarity doesn’t ensue and the IT dept will have some ‘spainin’ to do. Don’t let this happen.

And a word to the wise, especially if you use the Patient IDs anywhere else than internally…construct them with checksums and/or gaps (don’t assign sequentially, but jump over a few numbers, like with a spread of 6). This will reduce the chances of a wrong number being accepted.

And it might be more ergonomic to include part of the person’s last name, for example 83948-SM for a patient named Smith, but as Scuba_Ben points out if an address changes, this method will lead to disaster if someone gets married or changes their name.

Take it from a database front-lines veteran.

Use unique patient ID’s. Make them readable for the people who have to diagnose and fix your bugs. Don’t include any non-unique information, it wastes CPU in maintaining/searching indexes. The non-unique data will be in the row somewhere anyway. Use ordered numbers for the history IDs or any other sequentially built tables so you don’t need a complex query to access them in order. If the history table as a sub-table relationship, the actual history ID will be a combination of the patient Id and history Id anyway.

Or enact and enforce a law that requires all people to have unique names. Get back to me when that’s done.

This, by the way, is a misunderstanding. This might be the best way, but there again it might not. It all depends on the requirements of the specific application. There is a trade-off between data integrity and performance. You can increase one at the expense of the other. Having each piece of information in one place only gives maximum data integrity, but can give lousy performance. It can perform a lot better if the data is not fully normalized.

I’m a FileMaker geek.

There’s nothing intrinsically wrong with using non-numerical strings as the unique identifier of records (their “primary keys”) but I’m constantly telling newbie FileMaker developers to get over the idea of using an identifier that “means something”. The ideal primary key is something that has absolutely zero meaning outside the confines of the database, and therefore cannot at any time be said to “be wrong”. A social security number, for example, would uniquely identify students, patients, employees, etc. But if at some point it turns out that this bit of info was entered wrong, if SocSecNo was used as the primary key value for the person and you then go and change it to correct it, you break the relationships to related tables.

Most databases (SQL, FileMaker, etc) are intrinsically designed to auto-enter a serial number, incrementing it up one for each record created; the same serial will never be reissued and it makes a perfect primary key. It happens to be numerical (although I suppose in theory a field could have been created by the folks who engineer the environments such that it starts out with “A” goes to “Z” then goes to AA thru AZ then BA thru BZ etc). There’s no reason to not use numbers and technically integers probably take up less storage space (although that’s a totally trivial concern) so everyone uses numbers.