Hashing

erislover · March 31, 2004, 7:57pm

So is hashing like a substitution cipher?

caphis · March 31, 2004, 8:20pm

One more question, guys. Thanks for the great responses!

Say I had about 50,000 entries. The hash table being used uses Separate Chaining to avoid collisions. What would be a good table size to

a) maximize performance on insertions?
b) maximize performance on finds?

I’m using guess and check, but the best results are coming at a table size of 50,000. Does this make sense?

caphis · March 31, 2004, 8:28pm

Actually, I think I just realized that this depends largely on the hash function itself. Right? Someone stop me if I’m wrong.

ultrafilter · March 31, 2004, 8:33pm

That should depend on the hash function and the data structure used as a chain.

LordVor · March 31, 2004, 9:15pm

This is a misleading question. Even if you use chaining, you maximize performance on insertions and finds by minimizing collisions.

While “it depends” is the proper answer, if you assume that the combination of the hash function and the hash keys used yields a relatively even distribution of data points, the most efficient hash table is usually around twice as big as the number of entries.

That said, hash functions usually work best if the table size is a power of 2, so a lot of times you just round the expected number of entries up to the next power of 2.

If you’re still using the same assumptions, 50,000 entries, each with a unique key from 1-50,000, then 50,000 would indeed be the most efficient table size. I still think that that’s a bogus scenario, you’re better off just using a big array.

BTW, I’m also getting the distinct impression that we’re doing your homework for you.

-lv

caphis · March 31, 2004, 9:26pm

In fact, you’re not; but, you are helping me learn. karma++.

Thanks for the replies, everyone!

ccwaterback · March 31, 2004, 10:06pm

When designing a database, if you are stuck using a relatively long string (first and last name for instance) as a key, usually what is done is to define a “seek key” on that string. When you get a representative string key to look-up, insert or delete in the database, you calculate its seek key using a hashing algorithm. The seek key is numeric and is used as the primary key for your database. The space efficiency difference of using a seek key in a database rather than a character string is enormous, the time efficiency difference is even more significant.

This is just one good example of how hashing is used in the real world.

ccwaterback · March 31, 2004, 10:13pm

Would that be … only if the keys are less than OR EQUAL the table size? I was just wonder if the “equal” case would be technically considered a hash function? Or does a hash function “require” that some collisions have a less than zero probability? Maybe the “equal” case is just an index calculation?

ultrafilter · March 31, 2004, 10:17pm

“less than zero probability”?

I’m working off a 0-based array here. So for a table of n elements, the indices are 0, 1, 2, …, n - 1. If you work off a 1-based array, the indices are 1, 2, 3, …, n.

Either way, if you just take the value of the employee ID, you could conceivably run off the end of the array, and that’s bad. At that point, you start overwriting random data, which could be really important.

Newton_meter · March 31, 2004, 10:30pm

Random data is hardly ever important. And it’s still pretty random after you’ve overwritten it, so what’s the problem?

Just kidding.

ultrafilter · March 31, 2004, 10:31pm

One man’s trash is another man’s list of function pointers.

ccwaterback · April 1, 2004, 12:42am

Sorry, I wasn’t very clear in that last post.

What I meant was, if I have a table that has 1000 entries, and my Emp IDs run from 1 to 1000, would the function that coverts the Emp ID from ascii to binary be considered a hash function?

notquitekarpov · April 1, 2004, 4:43pm

Just a quick post of appreciation - so you know I was actually interested in knowing more. Your explanation was very good for an uninitiated MS Office user. I have always felt I should know more about databases and their management/interogation, or at least know enough to ask sensible questions…

Carry on.

LordVor · April 1, 2004, 5:00pm

Well that’s ok then.

ccwaterback, a hashing function has to be able to take any input of the desired type (string, in your case) and return a number between 0 and the size of the hash table-1. It can’t depend on restrictions to it’s inputs, else, in this case, you’d end up writing past the end of the table as soon as you add another employee.

Which is why you have to put in the mod(tablesize) calculation. The mod operator (% in c-style languages) basically gives you the “remainder” from integer division. So 1000 % 1000 would be 0, 1000 & 1001 would be 1, etc. And, since arrays are 0-based in c-style languages, a 1000 you can map employees 1-999 to array entries 1-999 if you want, but that means that employee 1000 has to go in slot 0, not slot 1000.

-lv

Topic		Replies	Views
Reversible hashing function Factual Questions	9	9605	August 9, 2010
Hashing website addresses? Factual Questions	7	881	May 2, 2006
The identity hash Factual Questions	12	1421	September 5, 2008
Efficient implementation of RDB seek keys Factual Questions	6	756	January 24, 2005
Cryptographic Zero Knowledge Proof Question Factual Questions	79	2760	July 25, 2008

Hashing

Related topics