I have inherited a database as I started a new job and have only a small amount of actual experience with databases as a whole. I have read elsewhere on this forum that you should typically not use actual data as your primary key. I would like to better understand this concept.
In my use case I manage a database of products, so on the surface it would seem that making the item number the primary key would eliminate duplicate products being uploaded to your point of sale website. However, item numbers can sometimes change on the manufacturer's end, so that is not a perfect system.
However I am not sure how else you would organize a database of unique products and also make sure you are not adding the same item multiple times. My database has more than 50,000 items and so it is not feasible to just spot check or sort by item number, but I am sure you realize this :).
Anyways, if anyone could enlighten me on this I would appreciate it.
What you want is for the primary key to be a unique and constant identifier for the row; the thing you want to avoid is having two rows with the same primary key. If using the item number gets you that, then using the item number is OK.
The problem is that, as you note, the product number can change. If you use item number as your PK, and then the item number for something changes, what happens to your database? Do you have another table of “orders” that now might refer to an obsolete key in another table? Can your system tolerate that?
It is fairly cheap to add an extra column whose sole purpose is to be the PK and that eliminates even the possibility of that happening.
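To make that concrete, here's a rough sketch in generic SQL (every table and column name here is just a placeholder, not from your system):

```sql
-- Minimal sketch; names are made up for illustration.
CREATE TABLE Products (
    ProductID   INTEGER PRIMARY KEY,   -- surrogate key, exists only to identify the row
    ItemNumber  VARCHAR(30),           -- the manufacturer's number, free to change
    Description VARCHAR(255)
);

CREATE TABLE Orders (
    OrderID   INTEGER PRIMARY KEY,
    ProductID INTEGER REFERENCES Products (ProductID)   -- points at the row, not the item number
);
```

If the manufacturer renumbers an item, the Orders table never notices, because it only ever stored the surrogate key.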
Following up on what leahcim said, if the item numbers can change, then it is much better to have a separate field as the PK (typically an AutoNumber datatype in Access).
To keep the item number unique, in the table design view for that field, set the Indexed property to Yes (No Duplicates).
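If you'd rather do it in SQL than in the design view, that setting corresponds, roughly, to a unique index (table and field names here are just examples):

```sql
-- Roughly what "Indexed: Yes (No Duplicates)" creates behind the scenes:
-- the database refuses any second row with the same ItemNumber.
CREATE UNIQUE INDEX idxItemNumber ON Products (ItemNumber);
```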
You could start a database holy war with that question on some forums. It’s the old Natural vs Artificial Primary Key debate.
The purpose of a Primary Key is to uniquely identify a record in a table. There cannot be a duplicate value in the PK field, and the PK field must have data in each record.
Artificial primary keys, such as the Autonumber data type in Access, are guaranteed to be unique within a table. They make great primary keys for that reason. However, they have no meaning the way a Natural PK does. That makes looking up a record by its meaning a little difficult.
A Natural PK has ‘meaning’. It could be someone’s name or an Item Number. The trouble happens when you have more than one person with the same name or different manufacturers use the same item number as you pointed out. What then? You may be tempted to split out a Name field into First Name, MI, Last Name and make a multi-field PK. That may buy you some time, but eventually you’ll probably run into more than one Jim J. Jones. Same goes with the Item Number. You could add a Manufacturer field to the PK. But then what happens when that Manufacturer is bought by another company or changes its name?
BTW multi-field PKs can lead to madness. :eek:
I think that you can see where I’m going here. I’m usually for Artificial PKs but there are times when a Natural PK is OK — IF — the data is chiseled in stone. How about US State Names for a PK? Do you know that California once tried to split into 3 states and North Dakota almost changed its name to just Dakota? Have fun fixing all those records!
What you can do is make the PK artificial then have a Unique constraint on the Item Number field. If you try to enter the same Item Number in more than one record, the database will complain. Then you’ll have to figure out how to handle that situation.
BTW: Access enforces PK and Unique constraints by indexes.
If you have other Access questions, I suggest that you post them in this excellent Access forum: http://www.utteraccess.com/
I might add that "Item Number" (and in general the whole practice of assigning numbers to things to make "part numbers" or "order numbers" or "account numbers" or "ISBN numbers") is itself meant to be something like an artificial PK already. I.e. it is meant to do for regular recordkeeping what an artificial PK is meant to do for databases – provide a way to refer to a thing unambiguously.
That is why when I make an order from a store, the invoice doesn’t say, “The order you made on September 30th of three lime green widgets that you paid for with a cheque on your Chase bank account”. It says “Order #817772635 of three (3) of item #188273661253 (Lime Green Widgets) paid for with cheque #2566 drawn on account #662771772 at the bank with routing number #442525 (JP Morgan Chase)”.
Artificial primary keys are just an extension of that same "assign an arbitrary number" habit that already exists in human bureaucracy.
The term for what you’re grappling with is “database normalization”, which is a primary topic in semester-long college-level database courses. As you might imagine, there’s some subtlety here. There are also tons of websites and books on the subject you should probably seek out.
The underlying idea is to avoid repeating yourself: Only store any given piece of data once, and refer to it every other time you need it. That way, you can change that single piece of data once, atomically, and be certain that no copies of the older version of that piece of data can exist. That ensures database consistency in a simple, comprehensible fashion.
From this perspective, a signal advantage of artificial primary keys is that they never change, so you can replicate them across multiple tables because they’ll never need to be updated.
Therefore, this is pretty much a slam dunk for using artificial primary keys for your application: Product numbers can change, and if you have to update a product number which is replicated across multiple tables and you miss one, your database goes insane and so do you.
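A rough sketch of what that buys you (table, column, and value names are all invented):

```sql
-- The manufacturer renumbers the item. Because every order row stores the
-- surrogate ProductID rather than the item number, exactly one row in the
-- whole database has to change.
UPDATE Products
SET    ItemNumber = 'B-2044'
WHERE  ProductID = 12345;
```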
The primary key is about the uniqueness of the database record. Other fields in the table can be about the uniqueness of the product you're describing (and you can enforce no duplicates on some of those fields if you want).
I’ve been in a couple of real-world situations where a system was designed using meaningful data as primary key - allow me to explain how well that went…
The first one was a book distribution warehouse. The guy who wrote the system made the following assumptions:
[ol]
[li]All books have an ISBN[/li]
[li]All ISBNs are, and always will be, 9 digits[/li]
[li]We'll never need to have two records describing substantially the same product[/li]
[li]Books will be the only products ever handled by this system.[/li]
[li]The ISBN of a specific product will never change[/li][/ol]
…Therefore, ISBN is an ideal primary key for the products database
What actually happened in real life:
[ol]
[li]Some books don't have an ISBN (some just had an EAN barcode). We were asked to stock some esoteric stuff; someone decided we'd work around that by giving them dummy ISBNs composed of the last 9 digits of the EAN barcode, but this raised two other problems - firstly, the last 9 digits of some EAN barcodes just happen to collide with real ISBNs for other products; secondly, both EAN and ISBN numbers have check digits, and the last 9 digits of an EAN don't usually happen to end in a valid ISBN check digit. We had to override check digit validation - and errors began to creep into ISBNs elsewhere.[/li]
[li]In the middle of the life of the system, ISBNs changed from 9 to 13 digits. If ISBN had just been a descriptive field in the DB, it would have been really simple to change, but because it was a primary key, we had to change it not only in the product table, but in all of the other places it was used (sales transaction history, etc). There were lots of other mundane changes that had to be done to UI, reports, etc at the same time, but the fact that ISBN was a primary key escalated a minor change into a crisis.[/li]
[li]It turned out we did want to have more than one record for the same product - to control differently-owned stockholdings (we had two customers in our warehouse, both stocking the same book). The first time this happened, their stock, sales history and despatching just exploded into a horrible mess.[/li]
[li]We expanded into non-book products, but everything had to have an ISBN (a similar problem to number 1 above). Because ISBN was baked into the system and DB at such a fundamental level, it was really hard to change the way the system looked to our non-book customers; we ended up looking stupid.[/li]
[li]Some ISBNs did need to change - either because they were entered wrong in the first place, or because the customer wanted a new edition to inherit the sales history of the previous edition. But changing the ISBN for a product means not only changing it in the products table, but also rooting out every record in every other table where it is used as a foreign key. Nightmare.[/li][/ol]
I’m actually supporting a different system right now that is based on the same dumb assumptions, and again, it’s a bloody nightmare. I won’t go into details, but it’s more of the above.
The key skill I learned from all of this was to habitually challenge any ‘absolute’ statements in design proposals - that is, it is best to assume people are just wrong when they say things like:
“This thing will always be true…”
“We’ll never need more than X of these…”
“This scenario is the only one we need consider…”
“This attribute of the data will never need to change…”
“Best practice doesn’t apply here because…”
Hmm, so many replies and I don’t think anybody answered the question yet (about how to avoid duplication of items).
Use an autonumber column for the primary key. I have years of experience designing and working with databases and I don’t think the “natural” primary key is really a thing. Maybe for amateurs (i.e. those who would use MS Access as a database).
Put a unique constraint (or whatever Access has that’s similar) on the item number field. That will go a long way to prevent duplication of items.
However, since you said the item number can occasionally change, there will still exist the possibility that someone could enter the same item twice (with two different item numbers). For example, they have the item in front of them, and it has "item #666" stamped on it. They see there is no item #666 in the database, so they enter it as a new record. Little do they know the item number for this exact same item used to be #13, and that still exists in the database. I'm not sure exactly what you're going to do about that. It's a tricky problem if there is no real-life constant that uniquely describes the item. Maybe you have to handle it at the application level somehow, by making sure users can easily search on an item description or something before entering a new item. Maybe you have some kind of post-process that knows which items have changed item numbers and looks for possible duplicates that way. Maybe you hire a "data manager" who is on top of these things. It all depends on how serious you want to get about it, I guess.
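For instance, one hedged sketch of that kind of post-process (all names invented), which flags items whose descriptions match but which sit in different rows, for a human to review:

```sql
-- List pairs of products whose descriptions match but which are separate rows.
-- This only narrows the search; someone still has to eyeball the results.
SELECT a.ItemNumber, b.ItemNumber, a.Description
FROM   Products AS a
INNER JOIN Products AS b
        ON  a.Description = b.Description
        AND a.ProductID   < b.ProductID;   -- avoids listing each pair twice
```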
As has been suggested, using a unique field that is generated for this purpose probably is the best way to go.
It can be auto-incremented, or you can use some other methodology (auto-increment is the easiest, most direct way).
The process to ensure values are not entered multiple times is more difficult.
The reason, of course, is that if changes are not handled correctly when they are made, it can be maddening.
If a product number changes, how do you know? Do you get a list of changed numbers, or is this something you have to figure out yourself?
Do you have data in the table that signifies the manufacturer? This will be needed if the part number ever happens to be the same across different manufacturers.
Assuming you know the part number and manufacturer and know of changes to the number, you could have your part lookup table have the auto increment field, plus part number, plus manufacturer, plus description (and any other values that are at the part level). Then periodically when you get new data, update any product numbers necessary and any other values that may have changed.
Then if you do a left join on the part number and the manufacturer number and only insert those that are not yet in the table, that helps you avoid dup entries. But you have to update your changed part/manufacturer numbers first. Once this table is updated, then you should be good going forward.
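Something like this rough sketch for the "insert only what's new" step (all names here are placeholders, not your actual tables):

```sql
-- Insert only the incoming parts that aren't already in the lookup table,
-- matching on part number + manufacturer.
INSERT INTO Parts (PartNumber, ManufacturerID, Description)
SELECT  n.PartNumber, n.ManufacturerID, n.Description
FROM    NewParts AS n
LEFT JOIN Parts AS p
        ON  n.PartNumber     = p.PartNumber
        AND n.ManufacturerID = p.ManufacturerID
WHERE   p.PartNumber IS NULL;   -- no match found, so it's genuinely new
```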
Using Derleth’s excellent post as a start, I’d like to elaborate based upon many years of database programming and handling.
A primary, unique key, no matter how generated, should be the pivotal data for each product. One line in the master, primary file, should have the description and specs for each product. This line should always describe only one product, with one manufacturer, one set of dimensions, etc. If any of these change, only some fields in this line will change.
If the same product is produced by more than one manufacturer, you need a manufacturer file, keyed on the primary key. The manufacturer field should be removed from the primary database.
If the same product is produced with different dimensions, you need a dimension file, keyed on the primary key. The dimension field(s) should be removed from the primary database.
If the same product can have more than one of any parameters, you need a separate file for that as well.
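In other words, something along these lines (a sketch only; every name here is invented):

```sql
-- One row per product in the master table; anything that can repeat
-- (manufacturers, dimensions) lives in its own child table keyed on ProductID.
CREATE TABLE Product (
    ProductID   INTEGER PRIMARY KEY,
    Description VARCHAR(255)
);

CREATE TABLE ProductManufacturer (
    ProductID      INTEGER REFERENCES Product (ProductID),
    ManufacturerID INTEGER,
    PRIMARY KEY (ProductID, ManufacturerID)
);

CREATE TABLE ProductDimension (
    ProductID INTEGER REFERENCES Product (ProductID),
    Length    DECIMAL(10,2),
    Width     DECIMAL(10,2),
    Height    DECIMAL(10,2)
);
```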
If you need to link to old orders, invoices, etc. for this product, you might consider storing all of the product data in each order, where it will not change. If you don't, you may find yourself generating incorrect archive invoices by linking to the current spec file. Alternatively, you will have to prevent the specs from ever being altered in the current file, and if they do change, generate a new, unique product number, possibly with a reference to the old one somewhere, maybe in a comment line.
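As a sketch of the "copy it into the order" option (again, invented names):

```sql
-- Each order line keeps its own copy of the description and specs as they were
-- at the time of the order, so later edits to the product can't rewrite history.
CREATE TABLE OrderLine (
    OrderID     INTEGER,
    LineNumber  INTEGER,
    ProductID   INTEGER,        -- still linked to the product for reporting
    Description VARCHAR(255),   -- snapshot copied from the product at order time
    Specs       VARCHAR(255),   -- snapshot
    Quantity    INTEGER,
    PRIMARY KEY (OrderID, LineNumber)
);
```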
I’ve had to confront all of these situations at some point in time, and I wish I knew then what I’m telling you here – it would have saved an immense amount of headaches.
Yeah, I worked at a place where originally everything was coded in COBOL and things were always done the same way. The COBOL sequential files were translated into progressively more advanced databases. Then the management re-org fad hit in the late 80's and everything got complicated.
So then IT was moved to report to the Technical division instead of the Accounting VP.
The technical Process Control groups in each plant were moved to report to the technical division instead of the local plant.
So in accounting, payroll, etc., department number had been the primary key, or plant/department-subdepartment. But all the technical divisions now used the same subdepartment codes in each department/plant. If we change the departments, what happens to historical records: employee records, payroll, accounting? Do we change all of these? If we don't, then historical reports are incorrect, because the department/subdepartment key that spawned them no longer exists on the master record. If we change everything to match the new subdepartment numbers, the historical reports (accounting, headcounts) won't match previous data. Management was no help… their answer was "it depends what we want to see on the report at the time."
(worse yet, they merged two plants to be under one manager…)
So the simple rule of thumb: "if you think a real-world identifier will be unique and permanent, it won't."
The corollary is that this applies to any relationship between items too.
We had one lady in our payroll programming group who spent some time in the mid-to-late 1980s fiddling with Social Insurance Numbers (the Canadian equivalent of the Social Security number, the central tax ID). You only get one, and keep it all your life? No, Maritime province people were notorious for obtaining several back in the day, so they could work under one number while collecting Unemployment Insurance under another. As the government tidied up its systems, they were sending us notices - this employee will now use this SIN, not that one.
But there's no doubt re-orgs make tracking historical data difficult, thus disguising bad trends.
30 years of database programming has taught me one thing: Primary keys should be dumb. If you're fresh out of college, you might be inclined to think that Social Security Numbers would make good primary keys. They do, until someone needs to change theirs. When I started at JCPenney, I was shocked to learn that we received something like 30 SSN change transactions per week. You'd be amazed at how many people don't know theirs or write them down incorrectly. And then you're counting on someone entering them correctly.
The system I’m supporting (well, trying to) right now takes natural keys to the extreme.
Products are keyed by their natural product code.
Customers are keyed by their customer account number (manually entered now, but it used to be their postcode; after all, we'll never have two customers in the same street!)
Orders are keyed by their order number (system generated now but it used to be the order reference given by the customer - I mean, the chances of two customers thinking of the same order number are pretty slim, right?)
Lines on the order are keyed by a compound key composed of order number, an underscore, then product code. It’s impossible to have two lines of the same product on the same order (in fact, it’s enforced by validation now. It used to just overwrite earlier entries with later ones)
Transaction history records are keyed by order number, product code, date and time.
And so on. It’s almost like someone read the book on normalisation, then consciously decided to do the exact opposite.
More about SSNs: Yes, my great uncle has at least two. I don’t think this is any evidence of criminality on his part, but the fact remains that he has multiple SSNs.
The government reserves the right to reissue SSNs they issued to dead people. So far, they haven’t, but they could.
The use of SSNs as primary keys could survive the same person having multiple SSNs, but it cannot survive the same SSN being issued to multiple people. Reissued SSNs would break the fundamental assumption all primary key relationships are built on.
As a dev I can’t imagine a better use for government power than to vex DBAs. They so richly deserve it. Besides, it seems everybody else on earth is doing it what with their input errors, and constants that aren’t, and business rules that are more like business suggestions, etc. Why shouldn’t the gov’t get in on the fun?
More examples from my COBOL days -
Employee number should be unique, right? Well, in the 1960s the HR people who designed the employee numbering did not have anything to do with computers. To avoid conflicts with taxes, etc., they would not re-issue a number until the next fiscal year. (Our company at times had a lot of overtime.) Worse yet, we had summer students, and since by the time they returned their last year's number might have been given out, they got a different one. So an employee might have worked for us for several years, each year with a different employee number.
(My cynical self thought that when upper management had college-age kids, there was a strong program to hire college students. As the workforce aged and recession hit, it was more typically the bottom layer of the workforce looking for jobs for their college-age kids. Then, sorry, no money for that…)
Another fun fact was the time the supervisor payroll crashed - the first time someone exceeded $99,999 as a salary in the 80’s.
My favorite was the guy in the hiring office stuck in the job with minimal training. He had no idea what the "reason for leaving" codes were, so he made up his own table without telling anyone, and used his numbers. It wasn't caught until someone happened to actually read a report months later and say, "Wait, that isn't why he left!"