In one of my CS classes, we’re discussing hashing techniques. A popular technique, apparently, is to use the hashing function h(k) = k mod tableSize and then deal with collisions accordingly.
One of our assignments is to set up a hash table for a fake employer, using employee IDs as the keys. The IDs we were provided are basically 1 through 1000, with accompanying user data. My question is… is there anything fundamentally wrong with using the hashing function h(k) = k, since the keys aren’t repeated and are linear?
Aside from having a large table, is there any reason I should use another method?
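To make the question concrete, here’s roughly what I mean (Python just for illustration, not the actual assignment code):

```python
TABLE_SIZE = 1000

def h_mod(k):
    # the technique from class: key mod table size (1000 wraps around to 0)
    return k % TABLE_SIZE

def h_trivial(k):
    # the function I'm asking about: the key just maps to itself
    return k

# With IDs 1 through 1000, both functions give every ID its own slot and
# never collide; h_trivial simply needs the table indexed 1..1000 instead
# of 0..999.
```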
If you’ve really been given keys ranging from 1 to N sequentially, then it’s pointless to use hashing. Or, as you put it, you might as well use the trivial hash function of mapping a key to itself. Perhaps I’ve misunderstood the nature of the homework assignment.
(Aside: Hashing is normally used in computer algorithms to map values from a large set — such as strings of characters — to a much smaller set of small integers, in such a way that identical mappings (the “collisions”) are rare or nonexistent. If these conditions can be met, then hashing is a much faster solution for some problems than other data structures, like binary trees. In particular, the insertion and search operations are quite speedy, relatively speaking.)
It seems silly to me to map the IDs from 1 to 1000 to a smaller table index, say from 1 to 100, because it just means every slot will end up holding about ten IDs, i.e. nine collisions per slot. The advantages of hashing are much reduced with that kind of collision rate.
Maybe the instructor just wants to give you experience with writing a hashing algorithm — though he should have given the sample problem a little more realism. Maybe you’re supposed to pretend you don’t know in advance how many employees there are, or that the set of IDs will be a full sequential range. If this is what he’s looking for, then I suppose you have to play along for the sake of a good grade.
You say “basically 1 to 1000.” Does that mean they are not really 1 through 1000? If so, the obvious path is to hash them in a way that maps them onto exactly 1 to 1000, instead of “basically” 1 to 1000.
Did you actually get 1000 data pieces? If you only have 50 or so, and it’s just the number space that’s 1-1000, then that’s a perfect place to use hashing. Are you sure you really have 1000 data points?
Bytegeist has pretty much nailed your question, particularly when he says:
Bingo. Hashing is useful when your data space permits a very large number of values, but is only sparsely populated. For example, credit card numbers have 10[sup]16[/sup] possible values, but a list of valid credit card numbers will be (relatively speaking) much smaller. If there are 10[sup]9[/sup] credit cards out there, then only one ten-millionth of the data space is actually occupied. It should be pretty obvious that a very big 10[sup]9[/sup] element hash table is a better solution than an absolutely gargantuan 10[sup]16[/sup] element table to hold every possible credit card number.
You say the IDs provided range between 1 and 1000. Are there actually about 1000 IDs, or is that just the range for the given data? If you have only a few IDs in that range, hashing may make sense, at least as an academic exercise.
In this case, I’d use the trivial hash function. If there are 1000 unique IDs, then there’s no reason to use more than 1000 slots, and taking an ID mod 1000 gives the ID right back anyway (except that 1000 wraps around to 0).
I’d say your professor has written a bad assignment.
Play along (as Bytegeist suggests) and then revel in the fact that your friends on the SDMB are smarter than your teacher. To actually learn something, you may want to use a hash table that is smaller than 1000 so you have to deal with collisions. Looking at how to handle deletions may get you extra credit.
The table probably has to store all the elements, so having fewer than 1000 entries might not be so hot. If you want to force some collisions to practice on, use (x[sup]2[/sup] + 2x) % 1000 as the hash function instead.
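If you go that route, a quick way to see just how collision-prone that function actually is (Python sketch, assuming the IDs really are 1 through 1000):

```python
from collections import Counter

def h(x):
    # the collision-prone hash suggested above: (x^2 + 2x) mod 1000
    return (x * x + 2 * x) % 1000

# Count how many of the 1000 IDs end up sharing a slot under this function.
slot_counts = Counter(h(k) for k in range(1, 1001))
collided = sum(c for c in slot_counts.values() if c > 1)
print(f"{len(slot_counts)} distinct slots used; {collided} IDs involved in collisions")
```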
If this was a real-world problem, then you would want to account for the possibility that more employees might be added in the future, in which case you would use h(k) = k mod 1000. However, if the assignment doesn’t require you to prepare for that possibility, then why bother?
There is a technique called separate chaining where each cell in the hash table is actually a list of elements rather than one single element, so the total number of elements can actually exceed the size of the table. In reality, it’s only used in a few select cases.
And here was me going into this thread thinking it would be about expatriate running clubs. Can’t you guys even give the rest of the human race an “in” to what the hell you are talking about.
That’s generally what you’re supposed to do if you have a collision. It’s faster to simply append the item to a linked list than it is to re-arrange the whole table. When it comes time to look something up, it’s easy to sequentially search a few items in the list, since they don’t get very big. Naturally, it’s important to properly tune your hashing parameters so you don’t end up with so many collisions as to make your table inefficient.
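For what it’s worth, a bare-bones separate-chaining table looks something like this (Python sketch; the class and method names are just for illustration):

```python
class ChainedHashTable:
    def __init__(self, size=100):
        self.size = size
        self.buckets = [[] for _ in range(size)]   # one chain (list) per slot

    def _hash(self, key):
        return key % self.size

    def insert(self, key, value):
        bucket = self.buckets[self._hash(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))        # a collision just appends to the chain

    def find(self, key):
        for k, v in self.buckets[self._hash(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._hash(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]
```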
Separate chaining is good, but it’s also worth mentioning dynamic tables.
Basically, you have an array, and resolve collisions by whatever method feels appropriate. When the table is getting full (say, when 7/8 of the slots are filled), you double the length of the array, rehash everything into it, and proceed as normal. If you support deletes, it’s also worth halving the size of the table when it’s mostly empty (say, when 1/8 of the slots are full).
Although any given insert or delete can take linear time (when it triggers a rehash), these occur rarely enough that the amortized cost per operation stays constant.
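In code, the doubling/halving logic looks roughly like this (Python sketch, using chaining for the collisions just to keep it short; the 7/8 and 1/8 thresholds are the ones mentioned above):

```python
class ResizingHashTable:
    def __init__(self):
        self.size = 8
        self.count = 0
        self.buckets = [[] for _ in range(self.size)]

    def _hash(self, key):
        return key % self.size

    def _resize(self, new_size):
        old_items = [(k, v) for bucket in self.buckets for k, v in bucket]
        self.size = max(8, new_size)
        self.buckets = [[] for _ in range(self.size)]
        self.count = 0
        for k, v in old_items:                 # rehash everything into the new array
            self.insert(k, v)

    def insert(self, key, value):
        bucket = self.buckets[self._hash(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count * 8 > self.size * 7:     # more than 7/8 full: double
            self._resize(self.size * 2)

    def delete(self, key):
        bucket = self.buckets[self._hash(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.count -= 1
                if self.count * 8 < self.size: # less than 1/8 full: halve
                    self._resize(self.size // 2)
                return
```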
Jeez, that’s an easy assignment for hash tables. When I took data structures, the teacher assigned us to write a program that would go through text files, store every chunk of N consecutive words, and then, given a seed sequence, spit back out randomly generated text by repeatedly picking one of the words that followed that sequence in the examined text. Now that’s a freakin’ assignment.
The point being: to really learn the uses of data structures, you have to use them in actual programs. I haven’t done any programming in a while, but I still remember when to use what data structure.
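For anyone curious, that assignment boils down to something like this (Python sketch; the choice of N, the input format, and the seed handling are my guesses at the details):

```python
import random
from collections import defaultdict

def build_table(text, n=2):
    """Map every sequence of n consecutive words to the words that follow it."""
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - n):
        table[tuple(words[i:i + n])].append(words[i + n])
    return table

def generate(table, seed, length=50):
    """Starting from a seed sequence of n words, repeatedly pick a random follower."""
    out = list(seed)
    for _ in range(length):
        followers = table.get(tuple(out[-len(seed):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)
```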
Hashing to a computer programmer is an algorithm (a well-defined procedure that a computer can implement) for storing records of data that are indexed by a “key”. The keys can be anything — a name, a serial number, a date and time — as long as they have an ordering. That is, given any set of these keys, we can unambiguously arrange them into sorted order. Putting it another way, given any two arbitrary keys, we can always declare one of them to be “earlier” in the ordering, and the other “later”.
Hashing is only one of many such algorithms for storing organized records of data, all of which have various strengths and weaknesses. Generally you will guide your choice of algorithm by the nature of the problem that needs solving. Hashing, in particular, is fast at insertion (adding a new record) and searching (detecting whether a given record exists, or retrieving its content). Or I should say: it’s fast at these things if the “hash function” does not generate many “collisions” — by which I mean, if the function that maps keys to internal storage locations does not yield too many repeats. Ideally, none at all.
On the flip side, hashing is lousy at operations that might be desirable in other contexts. For example, hashing is a poor choice if you’ll be requesting sets of records whose keys range over some interval. (“Give me all records between this date and that date, if any.”) In this case, if you are expecting such requests to be common, it would be better to choose a different algorithm entirely. Fiddling with the hash function won’t save you.
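To make that concrete, here is what a range query looks like against a plain hash table versus a sorted key list (Python sketch with made-up data):

```python
import bisect

records = {1042: "Alice", 877: "Bob", 1500: "Carol", 990: "Dave"}  # keyed by ID

# Against a hash table, "give me all keys between lo and hi" has no shortcut:
# you have to scan every stored key, no matter how narrow the range is.
def range_query_hash(table, lo, hi):
    return [table[k] for k in table if lo <= k <= hi]

# Against a sorted list of keys, the same query is just a pair of binary searches.
sorted_keys = sorted(records)
def range_query_sorted(keys, lo, hi):
    i, j = bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)
    return [records[k] for k in keys[i:j]]
```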
Come to think of it, a hash data structure doesn’t care whether the keys can be ordered, though this will normally be true anyway. So scratch most of that paragraph I wrote. (Sorry.)
However, it is generally assumed that the keys are going to be unique, since the main purpose of hashing is to store a set of records that are identified primarily by their keys.
It depends; there are several alternatives to chaining that don’t take much longer on insert/find and don’t have the memory overhead of chaining (important in embedded applications).
First, you can simply keep walking up the table until you find an empty slot and store your value there. You can make find a little more efficient by keeping a global holding the biggest spread between a key’s hashed slot and where the entry actually landed; otherwise find has to keep probing until it hits an empty slot before it can declare that something isn’t in the table. (There’s a rough sketch of this below.)
Secondly, you can have second (and third, and fourth) hashing functions. If you can’t insert at the result of the first hash, insert at the result of the second instead; if not the second, then the third, and so on. Again, keep a global of the “biggest” hash function you had to use, to make find more efficient.
These add complexity to delete (you have to check whether any entries past the one you’re deleting would rather be in the slot you’re emptying), but that’s not a big deal in situations where entries are “aged out” at regular intervals at a lower priority than insert/delete.
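Here’s a rough sketch of the first variant in Python (linear probing plus the “biggest spread” shortcut described above; the double-hashing version would replace the +1 probe step with a second hash function):

```python
class ProbingHashTable:
    EMPTY = object()                         # sentinel marking an unused slot

    def __init__(self, size=1009):           # a prime size spreads keys a bit better
        self.size = size
        self.slots = [self.EMPTY] * size
        self.max_spread = 0                  # biggest distance any entry sits from its home slot

    def _home(self, key):
        return key % self.size

    def insert(self, key, value):
        home = self._home(key)
        for spread in range(self.size):
            j = (home + spread) % self.size  # keep walking up the table
            if self.slots[j] is self.EMPTY or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                self.max_spread = max(self.max_spread, spread)
                return
        raise RuntimeError("table full")

    def find(self, key):
        home = self._home(key)
        for spread in range(self.max_spread + 1):   # never probe further than the worst insert did
            entry = self.slots[(home + spread) % self.size]
            if entry is not self.EMPTY and entry[0] == key:
                return entry[1]
        return None
```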
My contention is that converting the employee ID from ASCII to binary, then indexing into the table, IS a hashing function. If you weren’t allowed to do this, how else would you approach the problem?