What is a relational database?

Over here, some resident geeks got heated about just what constitutes a relational database. And it made me realize that I don’t have any real idea what separates relational from “flat” databases.

I’ve done a google search, but they turn up a lot of jargon that doesn’t begin to answer my questions. Could someone please explain the difference (dumbed down to my apparently low-level of comprehension of Database terminology) to me?

Thanks.

The easiest explanation is to think of an invoice, which consists of a header and details, or lineitems.

A flat file would have to have the header information, such as customer name and address for every single lineitem, while a relational database would have two tables, one for the header information and one for the detail (lineitems). For each invoice, the header would have a single row, including a unique identifier of some sort, while the detail table would have a row for each lineitem, along with the identifier pointing back to the header row.

Basically, you eliminate the need to store a lot of duplicate data.

Let’s say you have a flat-file database that lists teachers and the school they teach at:

Teacher School
Smith Clinton
Jones Reagan
Johnson Clinton
Woods Clinton
Baker Washington

and so on.

Now, if we get a conservative school board, and they decide to change the name of Clinton School to Bush school, you will need to go into the file and change all the “Clinton” entries to Bush. If you miss one, you’re database is bad. If you misspell “Washington” once, your database is bad. (Bad, in this case, meaning if you ran a list of all teachers at “Washington”, and four of them were in the DB as at “Washngton”, they wouldn’t be on your report.)

Instead, in a relational DB, you’ll have two tables for this:
one, called, say TEACHERS
teacher school num
Smith 3
Jones 2
Johnson 3
Woods 3
Baker 1

and then another, called say SCHOOLS, for a lookup
school num school
1 Washington
2 Reagan
3 Clinton

Now, to move all the teachers from Clinton to Bush, you just change the name of school 3 to “Bush”. Then the realationship that links school to teacher is updated automatically.

It gets more complicated with Normal Forms and ER diagrams and whatnots, but this is the simple version of the benefits and idea.

I just realized that an actual example might help illustrate it better.



Flat File:

Name    City	 Zip	InvAmt	SKU      Desc	  Qty	Retail
Smith   Atlanta	 30339	127.98  23956    Socks    4     10.00
Smith   Atlanta  30339	127.98  79836    Shirt    2     15.00
Smith   Atlanta  30339	127.98  97652    Slacks   2     28.99
Relational Model:

Header table

Identifier  Name    City      Zip    InvAmt
1           Smith   Atlanta   30339  127.98

Detail

HdrID  SKU     Desc    Qty    Retail
1      23956   Socks   4      10.00
1      79836   Shirt   2      15.00
1      97652   Slacks  2      28.99



Notes for purists:

Yes, the item info would actually normally be in another table, but we’re not trying to reach 3rd or 4th normal form here. It’s just an illustration.

The detail table would ideally have its own unique identifier, but we’re still just illustrating.

Stop being such a purist. I do this for a living. :slight_smile:

I hate to pick nits but I’ve been doing this for years too so I’ll just expand a little. In aktep’s example it isn’t just a relational database that solves the problem but also the use of a surrogate key, that is a unique identifier that is artificially created rather than using a unique attribute of the school. Surrogate keys are needed if there is no truly unique attribute of the items in the table or if there is a risk the natural unique identifier could change at sometime. It’s also handy when there is no truly unique identifier. Social security numbers are a good example so big brother can identify two john smiths who were born on the same day from each other.

Each table should have a column or group of columns that identify each row uniquely and this is called a primary key. Relationships are made when the primary key of a parent table is a non-key atrribute of a child table. In aktep’s example there is a one (schools) to many (teachers) relationship. Relational databaases may have one to one relationships and sometimes many to many relationships.

Rather than beat the definition of a relational database to death, you might want to do some research on other database models to see how they differ.

http://unixspace.com/context/databases.html

Foreign keys, joins, normal forms, and all that jazz are far above the level of relational and non-relational. What makes the relational model different from earlier database models is that it was designed on paper first. Axioms and definitions were stated, theorems were proved, and it wasn’t implemented until the theoreticians were satisfied with it. Earlier models were built first and described later.

According to chapter one of Chris Date’s “An Introduction To Database Systems”, the key principles of the relational model are as follows:

Of course, a more precise definition is given later.

Nowadays, those two properties seem obvious because very few of us have had any exposure to a non-relational database. You’d have to do some digging to find info on such things, but if you’re interested, the major types were hierarchical, network, and inverted list databases.

I’ve also seen some authors distinguish relational databases as databases that contain information about their own structure, but I can’t find anything in Date’s book.

As a pratical example of why I established and use relational database.

I have a set of tables that I use to define my music collection. Each CD that I purchase is assigned a five character code that is based on the source (CD, album, DVD, etc.) and the order in which it was purchased.

The first table defines the code, album title, artist, record label & number, number of songs, and any special attributes (5.1 sound, HDCD, etc.). The second table lists all of the songs or selections on the source by the code, track number, title, writer(s), date recorded (I have an extensive jazz collection where this is helpful to distinguish between different performances) and time. The third table lists the code and the performers on each source by code, track, and performer name.

The three tables are then linked by the “key field”, which is the album code, in a form that I can print or view for each album that gives general source info, songs on the source and performers.

If I want to query the database to determine the individual songs containing Miles Davis recorded between 1958 and 1960 and also including Red Garland and John Coltrane I can easily accomplish this with my linked tables.

If it were a “flat-file” database I would have to keep repeating the same entries for fields relating to album title, for example and song as I try to enter each performer name. In short, I would have an incredible amount of duplication in fields with no real addition of information.

Great! I think I get it!

Thanks a lot.

In DP/MIS/IS/IT 30+ years:

You are hitting on 3 basic file concepts:

  1. “Flat” or Sequential files:

Think of a box of 3x5 cards - what is written on each may vary, but without an index (in this case, little tabs), you will have to read each card in sequence to find the one you want (when I was an operator (public sector) we had a little job that looked for the 9th file on each of five tapes - it made the tapes spin as a background for elected types wanting to make some “important announcement”).
IBM Mainframespeak: QSAM (hi, geeks!)

  1. Indexed or Keyed file:

A file on which one (advanced topic: or more) fields has been indexed: you give it a key, it gets the record(s) with that key - no need to do a sequeltial search.
IBM Mainframespeak: ISAM (yes, I’m that old), VSAM.

  1. Database:

An indexed file with some internal linkage of related records. The 2 main types are:

Hierarchical - One record is the root key (as example, a customer, keyed by customer number, secondary index of name). The related records are linked to either the root, or a higher level record (segment).
IBM Mainframespeak: IMS

Relational - Each record type (customer, orders, salesrep, etc.) is on a physically separate file, but logically, they are all the same record - think “file with index on every field”, and you’re real close.
IBM Mainframespeak: DB2 - the parent of SQL - yes, SQL was originally a subset of DB2 - that is where it came from.

Relational databases are (I’ve heard) based on relational algebra - I never got that far in math, so do not know if it is or is not.

p.s. - physically, your disc drive does sequential files, nothing else - if it looks like something elsa, it is because of software (which has its own internal, hidden (to most folks) files.

Been a while since I studied but:

Relational databases use both relational algebra and relational calculus since they are considered equivalent to each other (see Chris Date’s book as cited earlier )

The algebra are simply operations on the relations, e.g. join, union, intersection, select

The calculus gives a notation for relational queries. To understand SQL (as opposed to just being able to use it) we were taught relational calculus and more specifically tuple relational calculus.

I tended to think of relational algebra as being the basis for the relational model and the calculus as the basis for SQL. However, that may just be the way that I was taught.

No, you have much more to learn. We DBAs are relentless and always make people asking a simple question. We’ll give you a comfortable seat… did you ever see the aversion therapy scene in A Clockwork Orange? There will be a quiz later. You will learn to normalize.

There are two different notions of what a relational database is floating around in this thread, so let me take a minute to compare and contrast.

The first is the theoretical notion that I offered, which views a relational database as a set of tables and a set of operators for producing new tables. In this view, all of the flat-file databases discussed so far are relational databases.

The second is the DBA’s notion of a set of tables with relations defined on them. In the theoretical view, this corresponds to a relational database in second or third normal form.

Which view is right? Well, are you a theoretician or a DBA? The theorems that are proved about relational databases hold even if a given database isn’t in first normal form, so there’s no reason there to exclude such things. On the other hand, any database that’s going to be used for an appreciable amount of data should definitely at least be in second normal form, if not third.

At the very least, you need to know your audience when you’re writing about such things.

Are we just complicating this and re-duplicating the other thread again?

I go with the traditional view. A relational database is data stored in one or more than one relation(table) with tuples(rows) and attriubutes(columns). A speadsheet is a relational database if a simplistic and not particularly robust solution. However if you do have more than relation, in order to be relational it must follow the rules of relational algebra(I’ve always found relational Alegebra and relational Calculus to be equivalent. just different approaches using different notational structure to do the same thing exactly, an SQL user can be easily see in what SQL is doing in either of them. The Alegrba is just a little more easily to follow along recipe-book style until you understand it)_

However normal form is a different beat entirely. It is one system of rules to implement a relational database, and nearly universal for a time, but doesn’t even need to be known to create a true relational database. Essentially Codd created a set of rules for the express purpose of eliminating redundant data, and therefore using disk(or DASD) resourses most efficiently. Much like using a two digit date instead of a 4 digit date for the year was good, because the biggest expense, bottleneck was storage resources. However it is losing popularity becuase storage isn’t the biggest cost. The cost of the people to do the work is a much bigger percentage of the cost, as well as the performance cost as tables start to exceed Tara-bytes. Most Architects and desingers followed Codd to the letter in 3rd normal form and created a huge maze of lookup tables, and data seperation. Then the actual DBAs took a look and put alot of it back in the main tables(anybody still call it 2.8 normal form?) because that many extra joins killed performance.

Look at oracle 9i. WIth the new complex data structures like LOV and VARRAY,Even 1st normal form has been thrown out the window now for many design efforts. And Object-relational databases and relational data warehouses/data marts are a billion dollar effort to create or at least simulate massive data redundancy, completely against the p[oint of the normal forms in the first place.

And screw Chris Date. :wink:
Fundamentals of Database Systems, by Ramez Elmasri, Shamkant B. Navathe is * the* bible for learning about databases. Yeah it’s a textbook, so what. It was actually created help people learn about databases, rather than push certain views and gain followers, as so many computer books seem to do these days

It’s hard to call a spreadsheet a relational database because there’s no equivalent to SQL/relational algebra, which is (IMO) foundational to the relational model.

Yet, IME, if you fully eliminate redundancy performance suffers, because you end up using too many multiple-table lookups. In a commercial application, a customer’s name will probably be stored only once, but such attributes as account type and status might appear in multiple tables.

Thanks for reminding me of one of the Things I Used To Know. I was so proud one day when reading a book by Codd that had NOT been filtered through Date, his explainer, that I actually knew what he was talking about. Then I donated the book by mistake. :frowning: The brain cells that knew that stuff are long disused, I fear, so I just ordered up a copy of “Fundamentals of Database Systems.” Amazon had it for $3.50 since I don’t mind it being used and the previous edition. Starting in about a week’s time I ought to be sleeping like a baby! "Oh, let’s read about normalization tonight. “There are three forms…’ ZZZZZZZZZZZZ”

TARA-bytes? :smack:.

And also if anybody has trouble finding a comma or parenthesi it’s because I used them all in my post. When my inner geek takes over things can get ugly.

I said it was definitive, not that it was a good read. :slight_smile:

It’s easier to describe what it’s not.

A non-relational, flat-file database is essentially a spreadsheet. (Not a workbook but a single sheet). The columns each contain a specified type of data and each row is a record, or instance. You can have very pretty flat-file databases that do not visually look like a spreadsheet. Microsoft Works comes with one, as does AppleWorks. You can view your data one record at a time (i.e., one row at a time) with the fields (columns) arranged any way you want. You can format some of the fields as checkboxes showing acceptable values, and some of them as drop-down fields with value lists and so on. You can arrange them in designs that do not remotely resemble a grid. But the data itself could still be represented as a simple grid of rows and columns, regardless of what this screen looks like.

I’m a professional database geek using a database system widely regarded as “nonstandard” by relational database geeks: FileMaker Pro. Many of the elements that others might describe as necessary for a relational database do not (necessarily) apply to FmPro. It is, nevertheless, relational, and has been relational since FileMaker 3.0 (it was a flat-file system before that point).

If the Straight Dope Message Board was a flat-file nonrelational database, the record constituting this particular post would be a row, and it would have to have a column for Forum (“General Questions”), a column for User Name (“AHunter3”), a column for thread title (“What is a relational database?”), a flag for thread status (“not locked”), a column for this post itself (<the text you’re reading right now>), a column for my signature, a column for my join date, a column for the date of this post, and so on. As a relational database, the record for this post could “know” (reference) all this data with just the column for this post itself (<this text>), the thread identification number (322384), post date, a flag for whether or not to display the signature, and my userID number (66).

A different table, a table of threads, contains a column with thread title (“What is a relational database?”), thread identification number (322384), forumID number (3), and thread status code (<whatever represents “not locked”), and so the post record, which shares the same thread identification number as the thread record in that other table, knows what its thread title is and whether or not the reply-to button oughta work when you click it.

Yet a third table, the table of forums, contains a column for forumID number (3), forum name (“General Questions”), forum visibility status (“visible”, or a code that means “visibile”), and so the post record, which as we’ve alread seen is connected to and “knows” the values in the thread record, indirectly is connected to the forum table via the thread record and therefore doesn’t need any column to indicate that it is a post within “General Questions” and that it is a post within a thread of a visible forum so you should be allowed to view it if given a link to it.

Yet another table, the table of members, contains a column for user ID number, a column for screen name, and a column for status. So since this post record contains the value 66 for userid, it knows that I am AHunter3, “Charter Member”, that I have the right to post here, and so on,
Performance and scalability issues aside, any relational database would theoretically be able to house the Straight Dope Message Board using an essentially identical storage structure. Some systems would have more specialized requirements than others but the ability to house this kind of data in a set of related tables would indicate that it’s a relational database. If Jerry the Tech God were to burn a DVD of the tables that constitute the Dope in tab-delimited or other common format, it should be a relatively (ahem, sorry) simple matter to define fields and import the entire message board and run it in that database environment using an identical structure.