So is hashing like a substitution cipher?
One more question, guys. Thanks for the great responses!
Say I had about 50,000 entries. The hash table being used uses Separate Chaining to avoid collisions. What would be a good table size to
a) maximize performance on insertions?
b) maximize performance on finds?
I’m using guess and check, but the best results are coming at a table size of 50,000. Does this make sense?
Actually, I think I just realized that this depends largely on the hash function itself. Right? Someone stop me if I’m wrong.
That should depend on the hash function and the data structure used as a chain.
This is a misleading question. Even if you use chaining, you maximize performance on insertions and finds by minimizing collisions.
While “it depends” is the proper answer, if you assume that the combination of the hash function and the hash keys used yields a relatively even distribution of data points, the most efficient hash table is usually around twice as big as the number of entries.
That said, hash functions usually work best if the table size is a power of 2, so a lot of times you just round the expected number of entries up to the next power of 2.
If you’re still using the same assumptions, 50,000 entries, each with a unique key from 1-50,000, then 50,000 would indeed be the most efficient table size. I still think that that’s a bogus scenario, you’re better off just using a big array.
BTW, I’m also getting the distinct impression that we’re doing your homework for you.
-lv
In fact, you’re not; but, you are helping me learn. karma++.
Thanks for the replies, everyone!
When designing a database, if you are stuck using a relatively long string (first and last name for instance) as a key, usually what is done is to define a “seek key” on that string. When you get a representative string key to look-up, insert or delete in the database, you calculate its seek key using a hashing algorithm. The seek key is numeric and is used as the primary key for your database. The space efficiency difference of using a seek key in a database rather than a character string is enormous, the time efficiency difference is even more significant.
This is just one good example of how hashing is used in the real world.
Would that be … only if the keys are less than OR EQUAL the table size? I was just wonder if the “equal” case would be technically considered a hash function? Or does a hash function “require” that some collisions have a less than zero probability? Maybe the “equal” case is just an index calculation?
“less than zero probability”?
I’m working off a 0-based array here. So for a table of n elements, the indices are 0, 1, 2, …, n - 1. If you work off a 1-based array, the indices are 1, 2, 3, …, n.
Either way, if you just take the value of the employee ID, you could conceivably run off the end of the array, and that’s bad. At that point, you start overwriting random data, which could be really important.
Random data is hardly ever important. And it’s still pretty random after you’ve overwritten it, so what’s the problem?
Just kidding.
One man’s trash is another man’s list of function pointers.
Sorry, I wasn’t very clear in that last post.
What I meant was, if I have a table that has 1000 entries, and my Emp IDs run from 1 to 1000, would the function that coverts the Emp ID from ascii to binary be considered a hash function?
Just a quick post of appreciation - so you know I was actually interested in knowing more. Your explanation was very good for an uninitiated MS Office user. I have always felt I should know more about databases and their management/interogation, or at least know enough to ask sensible questions…
Carry on.
Well that’s ok then.
ccwaterback, a hashing function has to be able to take any input of the desired type (string, in your case) and return a number between 0 and the size of the hash table-1. It can’t depend on restrictions to it’s inputs, else, in this case, you’d end up writing past the end of the table as soon as you add another employee.
Which is why you have to put in the mod(tablesize) calculation. The mod operator (% in c-style languages) basically gives you the “remainder” from integer division. So 1000 % 1000 would be 0, 1000 & 1001 would be 1, etc. And, since arrays are 0-based in c-style languages, a 1000 you can map employees 1-999 to array entries 1-999 if you want, but that means that employee 1000 has to go in slot 0, not slot 1000.
-lv