I work with a lot of data for marketing purposes and perform statistical and business intelligence analysis, so I have to work with IT folks frequently to understand what the data in a database really means. One of the phrases that I haven't really understood is when they talk about data being normalized and denormalized. So what is the deal here?
A proper answer to this would fill at least a chapter in a textbook. Normalization is a discipline of database design which strives to eliminate redundancy and simplify retrieval. To understand an explanation of normalization, you would first need to be familiar with database terminology and concepts (columns, rows, keys). There's some info on this here and here.
What Tim Staab said. And it pertains mostly to hierarchy.
Picture the SDMB as a database (and somewhere in Chicago I’m sure it is). You have a Forums table and you have a Registered Users table and you have a Threads table and you have a Posts table.
Record # 6459038 in the Posts table contains the OP of this thread (the actual text of it) along with a UserID field containing the value 2385 and a ThreadID field containing the value 329562. You’d also have Post Date and Post Time fields, or perhaps a single TimeStamp field, to indicate when the post was made. If you exported the Posts record, that’s all you’d get. Other relevant information, such as the UserName string “cards”, the UserName signature string (if in use), the Forum Name string “General Questions”, the Thread Name string “Database question: difference between normalized and denormalized data?”, the User Status string “Charter Member”, and so on are available to the database software through the relationships between the relevant tables and are not repeated in the Posts table.
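To make that concrete, a normalized schema along those lines might look something like this in SQL (I'm inventing the exact names; the board's real schema will surely differ):

    CREATE TABLE Users (
        UserID     INTEGER PRIMARY KEY,   -- e.g. 2385
        UserName   VARCHAR(50),           -- e.g. 'cards'
        UserStatus VARCHAR(30)            -- e.g. 'Charter Member'
    );

    CREATE TABLE Forums (
        ForumID   INTEGER PRIMARY KEY,
        ForumName VARCHAR(100)            -- e.g. 'General Questions'
    );

    CREATE TABLE Threads (
        ThreadID   INTEGER PRIMARY KEY,   -- e.g. 329562
        ForumID    INTEGER REFERENCES Forums(ForumID),
        ThreadName VARCHAR(255)
    );

    CREATE TABLE Posts (
        PostID    INTEGER PRIMARY KEY,    -- e.g. 6459038
        UserID    INTEGER REFERENCES Users(UserID),
        ThreadID  INTEGER REFERENCES Threads(ThreadID),
        PostStamp TIMESTAMP,
        PostText  TEXT
    );

    -- Each post stores only the IDs; the names, titles, and statuses
    -- live in one place each and are reached through the relationships.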
I’m not accustomed to seeing the adjective “denormalized” so much as “you need to normalize your data structure” or variations on that, but I suppose one could deliberately “denormalize” a data set — in fact I guess I do so whenever I export data for a given record and reference values in other related tables because I know the person reading it wants all that information. In this example, if I were exporting the Post record to a plain text file, I would probably export the other fields mentioned above as if they were part of the Posts table.
When you end up with your "normalized" database you may find that retrieving data from it requires too many table accesses. By "denormalizing" it (or a small part of it) you may be able to speed up access. For example, your normalisation process says to move repeated elements to another table, but this obviously adds another table access to a search. If the repeated element is needed most of the time you access a record, leaving it on the original record makes your search/retrieval process quicker. This is especially important if you are searching/retrieving a lot of records.
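Using the board example from earlier (names invented again), the difference looks like this:

    -- Normalized: the poster's name lives only in Users, so showing a
    -- post with its author requires a join.
    SELECT p.PostText, u.UserName
    FROM   Posts p
    JOIN   Users u ON u.UserID = p.UserID
    WHERE  p.PostID = 6459038;

    -- Denormalized: UserName has been copied onto every Posts row, so
    -- no join is needed, at the price of storing the name redundantly.
    SELECT PostText, UserName
    FROM   Posts
    WHERE  PostID = 6459038;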
Like Mr Staab says, a full explanation of normalisation would be lengthy. How much do you already know about databases? It also takes a while to get your head round; explained in formal terms it's pretty incomprehensible (to me, anyway). It is much easier to grasp in terms of splitting data across tables, then joining tables to re-assemble it.
De-normalised means that data that is/was normalised has been copied to a single table so that a query does not have to access (join) several tables to assemble a result. This is done for performance (a query will be quicker with fewer joins), but it is bad because you are throwing away all the good things about normalisation: you will be storing redundant data, you may get out-of-date or invalid results, and you have to have batch or trigger routines to keep it up to date.
Dammit, too slow.
If your IT people want to de-normalise stuff you might want to ask them about their indexing strategy (or just use the phrase at random and watch their faces).
I have seen the term “denormalized” used as my fellow posters mentioned, but also in another way. Databases which have been designed with too much emphasis on strict normalization and not enough attention paid to real-world performance are said to be “over-normalized”. I have seen instances of this where database administrators eliminated some tables, adding the columns from those tables to other existing tables, and eliminating some joins. They referred to this as “denormalizing” as well. I think the use of the term is more appropriate when referring to unloaded, flat-file data as noted by the others.
It’s true that a complete discussion of normalizing tables could fill a textbook, but you don’t need to know nearly that much.
A “normalized” database is just a fancy way of saying that your data is elegantly organized.
Most of the rules of normalization are nothing but good database design techniques that people had been using intuitively for years before Codd et al. formalized the rules.
Ultimately, though, good database design is part science and part art. Sometimes - not often, but sometimes - it’s OK to break the rules for the sake of performance or simplicity.
Here is Wikipedia's take on it. Frankly, I haven't read it so I can't tell you if it's any good or not, but I can make it quite a bit easier for you. The article lists 6 normal forms (design rules) plus two offshoots: Boyce-Codd normal form and Domain/Key normal form (which I'd never heard of before).
Beyond the first three it becomes pretty academic. In fact you have to come up with some pretty contrived situations to even find examples for some of these. Most real world database designers concern themselves with the first three only.
IIRC, unless you have a many-to-many relationship, 3NF is equivalent to the higher ones.
The relevant answer to what you are doing is much shorter. Essentially you design your database structure so that any data element only has one instance that represents it. Here’s an example:
I have a database table where "Company_Name" is a column. In a normalized database, I do not populate the Company_Name column with the actual company name. I populate it with a unique identifier that refers to another Company table of unique company names and their IDs.
Why do I do this? One reason is so I don't get 50 different versions of:
IBM
I.B.M.
International Business Machines
Intl Business Machines
International Business Machin
and so on
Another reason is that I can reduce redundancy. There is data I can associate with “IBM” like CEO, headquarters location, employees, that I can place in the normalized table instead of populating every row of my main table.
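In SQL the two tables might look like this (a sketch only; I'm making up the exact names):

    -- Lookup table: each company exists exactly once, with its
    -- associated data stored once.
    CREATE TABLE Company (
        CompanyID   INTEGER PRIMARY KEY,
        CompanyName VARCHAR(100),   -- 'International Business Machines', once
        CEO         VARCHAR(100),
        HQLocation  VARCHAR(100),
        Employees   INTEGER
    );

    -- Main table stores only the identifier, never the name itself.
    CREATE TABLE Sales (
        SaleID    INTEGER PRIMARY KEY,
        CompanyID INTEGER REFERENCES Company(CompanyID),
        Amount    DECIMAL(12,2)
    );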
All of the preceding posts vary between very good and excellent. However, I will now try to describe it outside IT jargon. (Wish me luck.)
One goal of data base design is efficiency, and one aspect of efficiency is to not have the same piece of information (data element) showing up in hundreds of places on the data bases. Normalization is an attempt to reduce the number of places any data element is stored.
For example, we need a postal code (Zip Code in the U.S.) to place on a mailing address to expedite swift delivery of bills by the Post Office. We also need a Zip Code on the shipping address, because UPS, Fed-Ex, and Yellow Freight use the Zip Code to deliver packages and calculate their charges. We often need to examine the Zip Code to identify taxing location in order to calculate sales tax. Marketing may need Zip Codes to identify sales trends by region. There are probably a few other places where we need Zip Codes, as well.
In a poorly designed system, the Billing department would have a file that carried the Zip Code, the Shipping department would have a file that carried the Zip Code, and the tax compliance department would have a file that carried the Zip Code. (In a really badly designed IT operation, they would all have separate customer master files that would have to be separately updated and reconciled on an ongoing basis.) Creating a design that gets the Zip Code onto a single database in a specific location where all the systems can use it is the act of normalizing.
However, let's go back and look at that marketing application. Suppose we want to run dozens of "what-if" scenarios against our sales by location. We can turn our query loose against the entire customer/sales database, extracting and summarizing all the sales, dividing them among different regions based on Zip Code for each query. On a large customer database, that is going to use a lot of query power. Alternatively, we can run an extract on a single occasion (hourly, daily, monthly), summarizing the sales dollars by Zip Code, then run our "what-if" queries against that extracted data using a lot less horsepower. However, we now have a separate table that has repeated all those Zip Codes. They have been denormalized.
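In rough SQL terms (table and column names invented for the sketch), the periodic extract and the cheap follow-up query would be something like:

    -- Periodic extract: summarize sales dollars by Zip Code into a
    -- separate table. The Zip Codes are now stored redundantly
    -- (denormalized), but the table is small.
    CREATE TABLE SalesByZip AS
    SELECT c.ZipCode,
           SUM(s.Amount) AS TotalSales
    FROM   Customers c
    JOIN   Sales s ON s.CustomerID = c.CustomerID
    GROUP  BY c.ZipCode;

    -- The "what-if" queries then run against the small extract
    -- instead of the whole customer/sales database.
    SELECT ZipCode, TotalSales
    FROM   SalesByZip
    WHERE  ZipCode LIKE '606%';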
This has been presented as an extremely oversimplified example and I can see several objections, myself, but it is intended for those with the least experience of data bases.
Depending on the database, there is definitely a point of overkill in strict normalization. In large corporate databases, often you have different folks that “own” different tables of data. In such cases, I think it makes sense to denormalize a bit just so that you don’t have to join to tables with which you are unfamiliar. If every query and every report requires you to make joins to tables that you don’t deal with often, it’s a major pain. Better in my opinion to store redundantly, preferably with triggers to keep the tables in sync.
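A sketch of the trigger idea, in PostgreSQL syntax with invented names: say Sales carries a redundant copy of CompanyName for convenience, and a trigger refreshes that copy whenever the master Company row is renamed.

    CREATE OR REPLACE FUNCTION sync_company_name() RETURNS trigger AS $$
    BEGIN
        -- Push the new name out to every redundant copy.
        UPDATE Sales
        SET    CompanyName = NEW.CompanyName
        WHERE  CompanyID = NEW.CompanyID;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER company_name_sync
    AFTER UPDATE OF CompanyName ON Company
    FOR EACH ROW
    EXECUTE FUNCTION sync_company_name();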
Records in a normalized table contain data relating to the Key, the Whole Key, and nothing but the Key. So help me Codd.
Har har. DBA humor.
Thanks for the great insights; it helps put several things in context. I did understand about using tables to organize data to keep redundancy out of a database, and have used tools like SQL and Microstrategy to join tables I was familiar with. The problem comes when I need a lot of different types of data about customers brought together to do predictive modeling and segmentation from large corporate databases. I now get why one of the IT guys rolls his eyes at me and says "you want all this data denormalized?"
He then explains what it will take to do that in technical terms. That’s OK though, I explain to him in my technical terms that I need the data this way to do Logit modeling, which is a non-linear regression methodology that blah blah blah.
I can get anyone's eyes to glaze over.
Hmmm… I don't think so. Many-to-many relationships are quite common and require a junction table (aka associative table, intersect table) to connect the two tables.
But without one of those tables, I don’t think there’s anything you can do beyond 3NF. I’m of course open to correction.
Sounds like that person needs reminding that the entire purpose of a database is to be able to get data back out of it. If the data only went in we really wouldn’t even need to turn the computer on.
Maybe I don’t understand what you’re saying.
An example of a many-to-many relationship might be students and classes. Each student might take many classes and each class will have many students.
Classes Table:
ClassId (primary key)
Class Name
etc…
Students Table:
StudentID (primary key)
LastName
FirstName
etc…
StudentsClasses Junction Table
StudentID
ClassID
(And that’s it. The primary key is the composite ClassID and StudentId)
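In SQL that junction table might be declared like so (a sketch):

    CREATE TABLE StudentsClasses (
        StudentID INTEGER REFERENCES Students(StudentID),
        ClassID   INTEGER REFERENCES Classes(ClassID),
        PRIMARY KEY (StudentID, ClassID)  -- composite key: one row per enrollment
    );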
If you're saying that, because the StudentsClasses table consists of nothing but a key, there's no point in normalizing it, well, I suppose so.
On the other hand, junction tables can have their own attributes, so I would think they would fall under the same rules of normalization.
I’m not sure how I can make this any clearer: If a database has no many-to-many relationships, then there is no further normalization possible beyond third normal form.
Oh! If you don’t have M2M’s then 3NF is as far as you can go anyway. :smack: :smack: :smack:
I’m more on the transaction side, but I’ve found quite often that data warehouses (which you might use for analysis purposes) frequently have “rolled-up” summary data and have been intentionally denormalized for ease of ad-hoc analysis and reporting (someone using Access or Crystal Reports isn’t typically eager to join a dozen tables for a basic report). It isn’t an unusual circumstance.