DB Index on unique field

I am working through the rails tutorial and I have a question about database indexes. From the guide:

http://ruby.railstutorial.org/chapters/modeling-users#sidebar-database_indices

The analogy to a book index makes sense for cases where a given item can be found in multiple rows. However, in this case each user is required to have a unique e-mail, so the index is going to be the exact same length as the full table. Is the db software doing other magic things here like creating a table of contents saying “Bs start at 267”, “Cs start at 313” etc. to speed up sorting?

My understanding of indices is that they essentially create a copy of your table with just two items per record: the indexed field (in this case, “email address”), and a pointer to the full record on the original table. The idea is that, in a full-table scan, the query has to first look at each field, check whether that field is the “email address” field, and if so, look at the value therein. So a index scan allows the query to just look at the value in the “email address” field of each record.

Check out this explanation:

In particular:

There could be some kind of hashing function akin to what you’re describing.

Where this is most useful for unique indices are the situations where you want the proverbial needle in a haystack. You tell the DB you want treis@straightdope.com specifically. The DB takes that value, performs some kind of pre-determined mathematical calculation on it, and the result is the address in your table corresponding to the value you requested. You don’t even really search; it knows exactly where that value was stored.

If you’re talking about getting the entire table, but just sorting it, an index of any kind is not going to help you much. Since you need every row anyway, it’s just as about as inefficient as a full table scan. In fact, a good DB optimizer will fish for the point at which using your index is no longer theoretically a good idea and will probably discard it.

In cases where you need everything, it may be a better strategy to simply get the entire contents of the table en masse and let your presentation-layer software do the crunching to sort the results. For example, using Oracle’s BI/XML Publisher, you could get an unsorted XML file from the source database and force the Publisher front end to sort the XML before formatting it for you.

It cracks me up when the users ask for a data pull that must be sorted by a certain criteria, but then specify that they want the output in Microsoft Excel. If you know how to open an Excel spreadsheet, chances are you also know how to sort it nine ways from Sunday.

Yes that is basically what is going on.

Typically it is built with some sort of tree structure where it does what you describe multiple times until it finds the element. Here’s a made up example just to give you the idea:

Index first splits the alphabet in half (approx)
If the name<“M” then go to the first half of the alphabet, otherwise check the 2nd half

First half of alphabet (split again in half):
If the name<“F” then go to the first quarter of alphabet, otherwise check the 2nd quarter of the alphabet

etc.

Short answer: Yes, indexes are magic.

The very simplest thing you could do with an index is store a copy of that column in alphabetical order (just that column, not the whole of each row), along with a pointer to the place in the full file where the rest of the row may be found. That is just how the index in a book works: A list of items, alphabetized so you can quickly find what you’re looking for, with page numbers pointing to the full text for each item.

Note that you can find items in the alphabetized index quickly and easily. A HUGE example would be a phone book, with everybody’s names alphabetized. If phone books were listed in phone number order, and you wanted to find Abraham Pleisczkowicz, good luck. You’d have to read the list item by item until you find it. That’s the “full scan” method. For big tables: SLOW.

But note how quickly you can find the name when the list is in alpha order. You don’t need to do a full scan. In computer databases, the equivalent thing can be done to find records quickly with an index.

And that’s just the easy way. There are more complex algorithms, like “hashing” and “binary trees” (B-tree), and others. This gets heavy. We’re talking about heavy-duty research by Computer Science graduate students, professors, and other PhD’s here. Major software corporations like Microsoft or Oracle hire squadrons of PhD’s just to spend their lives developing better and better sorting, searching, and indexing algorithms. It’s a whole field of endeavor.

Not only do you want the index column to be easily searchable, but for really BIG databases, you also want to do it with the fewest number of physical disk sector reads. So there’s whole areas of research into ways to compress indexes so you can get the most data in each sector, while still preserving whatever organization you need to search fast. People write their doctoral theses on these things, and then there are whole books full of mathematical algorithms that rival Quantum Physics in their complexity.

Note that you will almost always end up with as many pointers as you have rows in your table. In the rare case that your data is sorted in the correct order to begin with, no extra index is needed, but the database engine needs to do work to keep the rows sorted.
This special case (data sorted) goes out the window as soon as you want to index different things (for example if you index by last name and then index by telephone number for reverse lookups). Since the data couldn’t be sorted by both at once, you end up having a list of all table rowids with some kind of index values.

The index values themselves will definitely be compressed as you said (e.g. “Letter A begins here”), but the rowids still need to be there.

As an aside, the B-tree algorithm mentioned by others is a cousin of the simple binary search. The binary search is where you pick a value in the center and see if it is low or high and then repeat with half of the data and a quarter and so-on.
B-trees are an adaptation of binary searching that take into consideration the file block size to perform the searches as quickly as possible with as few disk accesses as possible.

Instead of having only two options at each step, a B-tree might split the whole table into, say, 10 chunks and have a small index into the top level with ten entries in it. Each chunk would then be broken down into 10 smaller chunks with an index into those chunks. This would go on until the chunks represented individual rows. The number of chunks would likely be more than 10, and is related to how many index keys fit in a single file system block.

This might be compared to having a top-level index that says “for A-D go to list 1; E-H go to list 2; I-M go to list 3” and so on, with the A-D list saying “For Alexander to Andrews go to list 1a” and so on.

One of the most interesting projects I was assigned in college was to write a database engine in C on Unix. We had to implement transactions, logging (to rebuild the DB after a crash), and B-tree indexes. Our database engine was primitive when compared with any modern database, but it was kind of cool to know that even if someone pulled the Unix server’s plug out of the wall our database would correctly recover to a consistent state.

Thanks for the responses everyone.