SQL - keys that are longer than the data they reperesent... why?

Where I work we have a SQL Server relational database.

There’s a table with just two records - it’s a ‘brand’ table and contains an id, a brand name, and the website of that brand.

In another table there are many thousands of records that are customers. Each one of these records has a brand field which contains the id of the brand that customer belongs to.

Every customer belongs to a brand.

The thing that puzzles me are the brand ids. they are freaking long!. Example “9117FC47-51B2-47B6-A79F-F6FB9988C550”

This is actually longer than the entire rest of the record, and for every single customer there is a string of this length on the record.

I was led to believe that one of the main points of relational databases was to reduce disk space overhead by replacing repeated data with short ids to reference that data (stored once) in a seperate table.

Oh gawd. Your database has been designed by a numbnuts who has been inducted into the GUID cult.

(There’s nothing wrong with GUID’s per se, they have their uses, but using them as database primary keys is fucking retarded.)

Anyway, there’s absolutely no point to this design, except that some people seem to think an integer sequence is somehow not sufficient for an artificial primary key. Those people should be thrown off the nearest cliff.

I’m glad you said that because I’ve been thinking along those lines. “Surely this isn’t right… or isn’t the best design”. But I am way less experienced (or so I assume) than the person (our developer) that designed it.
I might innocently bring it up with him one day (i.e. innocently curious and try not to imply that I think it’s bad design).

SQL Server DBA/Database Designer chiming in here. We all know there are two sides to every argument! There are a few perfectly valid reasons to use the row GUID as your primary key … such as … um …

Never mind, you’re absolutely right. But there never seems to be a cliff handy when you need one.

SQL Server itself will assign a GUID key (And in fact has an internal one for each row anyhow) if you are doing things like merge replication.

Having such a large key on a relatively narow table can massively impact performance. You could increase your throughput by 100% (or more) by switching to an integer based ID.

I like them.

And yes, those 8-4-4-4-12 character groups are a traditional GUID.

They have some useful attributes. They are particularly useful for Master Data, which “Brand” might just be. More on that below.

The “G” in GUID is really “Global”. In other words, a GUID entry for a specific ActiveX component in the Windows registry can be used anywhere in the world without worrying about namespace collisions. Likewise, a primary key from a Person table can be copied and swapped everywhere without any concern for mixing it up with another different person.

The Wikipedia article discusses some of the algorithms in use to guarantee they are global. One common one used is to mix in some machine-specific identifier (though there are privacy concerns to consider with that).

Because they are global, you don’t need to retain type information along with them in your workflow. Of course, it does help if you know you are moving Employee records around or Document records around, but it isn’t necessary.
In a traditional database you might have identifiers like “655321” and “102314”, but in order for those numbers to be useful you need to know that “655321” is a PERSON_ID and “102314” is a DOCUMENT_ID.

Consequently, if you had some kind of table, say OBJECT_ANNOTATIONS, which had text entries annotating any kind of data in your database, you would need both OBJ_KEY and OBJ_TYPE columns to uniquely identify the table and key for the annotation (if you have Document 101 and Person 101, you need to include the type to uniquely identify the correct annotation). With a GUID, you have no need to indicate OBJ_TYPE. Your annotation table could consist of OBJ_GUID,ANNOTATION. And for any object in your database you could look up its GUID in the annotations table to see if any are present. This is a contrived example, but it shows some of the use of global uniqueness.

Global uniqueness is useful in data warehousing.

In a very large corporate database environment, it is not unusual to have many layers of databases, with data originating in transactional systems, moving to databases of Master Data that then feed a data warehouse that then feeds data marts that then feed other transactional systems. Such master data concepts as “Employee” cut across the corporation and these records find themselves being passed around like baseball cards in dozens of different databases.

Business keys (the unique key formed by the data elements themselves) are not sufficient for identifying duplicates because they change. For example, employees get married and change their names. An illustration of this weakness, especially in disconnected data warehousing environments: Imagine if I sent you two letters, the first saying “Jane Smith is now Jane Doe”, followed by “Jane Doe is now Jane Johnson” and you missed the first… you would not be able to make the second change.

Technical keys, or surrogate keys, are sometimes troublesome because these are usually different for each database (especially so across Dev/Test/Prod environments)

Here is were GUIDs can make a difference. As long as the architecture is configured correctly, a GUID can be used to identify master data in different parts of the organization.

In your example, I could have a database right now where I enter BRAND_GUID=“9117FC47-51B2-47B6-A79F-F6FB9988C550”
Regardless of what I called it, someone could copy this GUID from my post and look in your company’s database and unambiguously identify the precise brand you copied that from. And, as I said earlier, “Brand” sounds like a pretty good candidate for Master Data that would be proliferated throughout the corporation.

Then throw me off the cliff.

There’s a lot of very good reasons to use GUIDs as primary keys, including:

  • They are unique. In situations where you need to create records for multiple related tables on a client machine, you avoid a lot of back and forth to the DB as you create records then have to read the same record to get the key back to use as a foreign key in a child table. You create the GUID, and use it as the primary/foreign key in all your tables, and do your commit/update in one transaction.

  • as minor7flat5 said, they are very useful in data warehousing, as you never have to worry about duplicate keys across multiple installations.

  • depending on how you’re using the DB, incremental keys can cause hard drive lag as the same physical sector is hammered as records are added to the DB in numerical order. Using guids as a primary key avoids this. I think this argument is losing ground as hard drives become faster, but it’s still worth mentioning.

The only real argument against GUIDS as primary keys is that they are not easily human readable. Some people argue that there’s a performance issue with guids (as they are larger) but I think it’s been sufficiently proven in SQL Server at least that this is not the case.

:: toss ::

There are better ways to do this, such as asking a sequence for the next value before you insert the parent table record. In some databases this is even transaction-safe. (I don’t know anything about SQL Server, though.) Then you can insert the parent record and all its related records at once.

The way you handle duplicate keys across multiple installations is to add a source identifier to the imported data. (You’re going to store where it came from anyway, right? So why not make that part of the pkey?) I find that some GUID-PK proponents have an irrational fear of multi-column primary keys, but they are a very useful tool. The source column of the primary key can even itself be a foreign key into a sources table. This is a much more granular and useful structure than just giving everything a semantically meaningless global identifier.

Dubious. That depends entirely on the indexing structure and storage-engine parameters, and even the OS filesystem configuration, all of which are configurable by a knowledgeable admin.

I don’t think the size matters too much. (Storage is cheap these days.) I just think there are more robust solutions that give you more usefully structured data than just sticking a fancy random number on stuff.

The performance hit can be dramatic depending onthe width of the table. Imagine a table with a few simple fields, like Lobsang’s, such that the GUID effectively doubles the length of the row. SQL Server reads data in 8K pages, and so it will have to do twice as many read operations as it normally would. Depending on the volume of daa required for reading, this can have a massive impact.

But we’re still talking two hits to the DB in your scenario - one to get the next identity, one to do your insert. Sometimes this isn’t ideal. And in your scenario, the DB is going to need to place some sort of lock on the table to ensure that another client doesn’t come along and use your ID. Again, sometimes this isn’t ideal.

Maybe I have a source identifier. Maybe I don’t. Again, depends on what I’m doing.

And isn’t one of your big arguments about using guids as keys based on them being too long? Is an identity column + some sort of source ID all that much shorter?

I don’t think it is all that dubious - an identity column typically sorts the physical data based on the value of the ID.
All that said, this is a religious debate, and though I myself lean slightly towards using guids as primary keys, I personally don’t care much one way or the other unless there’s a really good reason. I’ve designed & used plenty of DBs that use either type. Most arguments about which is better don’t make a whit of difference at all until the DB gets to be huge, and most DBs simply aren’t all that large, so the discussion really comes down to who likes what.

Despite what others on this thread want to say, there is no clear conclusion as to which kind of key is truly “better” in all situations, and it’s perfectly acceptable to use guids. If the OP really wants to know the pros and cons, go Google “guid primary key” and you’ll get reams of info on both sides of the debate, and all the reasons why you’d want to use one or the other.

It’s just another tool in your tool kit.

Just like so many other aspects of database work, it’s all about moderation and knowing what to use where.

These things often cause us to think, “I have a shiny new hammer and now everything is a nail.” We have all seen the result of someone’s shiny new hammer in a database. Anyone seen a DB with indexes on every column?

Other DB tools that are best used in moderate amounts in the right places: stored procedures, indexes, synonyms, database links, Java stored procedures, SQL hints, and the list goes on.

GUIDs are one more item in that list of tools that have a place and time. Understanding their pros and cons is a good thing.

Not in SQL Server, which is presumably what we’re talking about here. You have to specifically choose for an identity column to be a clustered index …

Performance hits are going to be very variable–it depends a lot on how the various tables are indexed. I guess the main reason it annoys me is that it is a generally decent/acceptable option, but very rarely either optimal or terrible one, which makes it extremely popular amongst people who want to design tables for their project without having to think and make intelligent decisions. Which usually means that, even if the use of the GUID itself isn’t causing any problems, there’s some other mindless default decision they made that is. I suppose that unfairly tarnishes (in my eyes) the reputation of the people who have evaluated the different design options and selected this as the best for their particular situation. But …

A GUID is stored as a 128-bit integer, not as a string of characters; 16 bytes rather than the 32 character convention used to display them. There are many very good reasons for using GUIDs as described above, not the least of which is performance. In addition to removing the necessity of multiple trips to the database environment, it also enables lazy writes to the database from a cached scheme (common where memory-based object models are used for processing), and lazy synchronization between some types of load-balanced databases. They can make a dramatic difference in transaction recording systems, for instance.