Reversible hashing function

MaxTheVool · August 7, 2010, 12:50am

Anyone have C code for a good reversible 32-bit to 32-bit hashing function? Googling around resulted in tons of references to such things, but no actual examples.
Thanks

(Note that it’s perfectly fine if it’s only “decent” hashing…)

The_Hamster_King · August 7, 2010, 12:59am

It’s been a few years since I was a programmer, but why do you need to hash if you’re maintaining the same number of bits? Just use the original number.

I also thought that hash functions were, by definition, irreversible, since they map many values onto a single output.

TriPolar · August 7, 2010, 1:16am

I’ve seen the kind of thing you are looking for used for secondary hashing, if there is a collision with an existing hash, transform the number and see if there is an open spot there. I don’t like hashing solutions myself, they don’t scale well, but from what I can recall, the idea is to use a minimal function, (ultra simplisticly, add 1) because your original hash should have given you a distribution across the range, and nearby values should be unused.

The other major reason for hashing I know of is for data validation, to eliminate transposition of bits or digits, such as the hash used on credit card numbers. I don’t think these are reversible though, and require additional information be added to the data, essentially a ‘checksum’.

So why do you need this? As The Hamster King has alluded, it really is just the same value. Why wouldn’t +1, a complement, or a rotation do the job for you?

friedo · August 7, 2010, 1:44am

The traditional solution for collisions in a hash table is to build a linked-list of the would-be collided values. When linear search of that list becomes burdensome, it’s time to retune the hash.

Another technique I’ve seen is to flag hashes that have collided, and make sub-hashes, with different tuning than the main hash. But that requires running your hash algorithm twice for collided keys, which may be inefficient for a read-heavy application, which most hash tables are.

septimus · August 7, 2010, 5:23am

If a 31-bit to 32-bit function (or 32-bit to 33-bit) is good enough for you, you can use the technique I use, based on modular multiplication. (It could be 32-bit to 32-bit if you know some of the 2^32 different inputs won’t occur.)

The expansion arises because the modulus must be prime (in fact, not all primes work) and 2^32 isn’t prime (even though it has no odd prime factors whatsoever ).

(I’ll check if 2^32 - 1 is one of the workable primes. Then you could apply my method, details depending on your purpose.)

TriPolar · August 7, 2010, 7:09am

friedo:

TriPolar:

I’ve seen the kind of thing you are looking for used for secondary hashing, if there is a collision with an existing hash, transform the number and see if there is an open spot there. I don’t like hashing solutions myself, they don’t scale well, but from what I can recall, the idea is to use a minimal function, (ultra simplisticly, add 1) because your original hash should have given you a distribution across the range, and nearby values should be unused.

The traditional solution for collisions in a hash table is to build a linked-list of the would-be collided values. When linear search of that list becomes burdensome, it’s time to retune the hash.

Another technique I’ve seen is to flag hashes that have collided, and make sub-hashes, with different tuning than the main hash. But that requires running your hash algorithm twice for collided keys, which may be inefficient for a read-heavy application, which most hash tables are.

The reason for rehashing (secondary hashing) is that you don’t need the overhead of maintaining the linked lists and the sub-hashes to take a quick second shot at a rapid lookup. This works well when there are many insertions and deletions of entries.
The advantages of hashing are tied to the consistency of the range and distribution of the source data, while tree techniques become more efficient as the range and distribution vary. In recent years the availability of memory has increased the size of the source data to the point where the hashing algorithm and comparison times dwarf the overhead of the maintenance process for either approach, as well as throwing a lot of memory overhead considerations of the organization methods out the window as well. I’m wondering why intelligent memory isn’t more common considering how cheap and dense the underlying storage is.

septimus · August 7, 2010, 1:45pm

Revising earlier post (red-faced :mad: ):
All primes are “workable” … but 2^32-1 isn’t prime!
But here’s a hash and its inverse which might work for you:

#define HM 4294967291ull
#define HASH1(x) ((x) >= HM ? (x) : (x)*2654435764ull % HM)
#define HASH2(x) ((x) >= HM ? (x) : (x)*1403066617ull % HM)

I’m also curious what OP is doing, but there are reasons for such functions. I use them as search-table hash-functions because they allow the storage-reduction technique sometimes called “quotienting.” (And no, the answer isn’t just to buy more memory instead of making table space-efficient. In chess-playing software you want to take best advantage of your search cache no matter how big it is.)

TriPolar · August 7, 2010, 5:55pm

I’ve never been that interested in chess, but chess-playing software sounds interesting. I’m waiting for intelligent memory, but you probably want quantum memory. Out of curiousity, do you store whole board states, and how many bits do you need to represent them?

septimus · August 9, 2010, 4:09am

My aspirations for this thread aren’t so deep. I’d be satisifed just to learn whether the 32-bit to 32-bit reversible hash function I displayed satisfies OP’s need.

MaxTheVool · August 9, 2010, 5:34pm

The initial impetus for this was that we are going to have ID numbers for internal things that we will be exposing to the end users. We want to convert them to 8-character strings or something, with some amount of checksumming built in. But if we do things the naive way, the very first string we send out will be AAAAAAA, then the next one will be AAAAAAAAB, etc. So I wanted to put a step in which takes in 1, 2, 3 and spits out 1395234,8132437, 134033 etc, but which perfectly maps all U32s onto all U32s and is quickly reversible, then insert that between the numbers and the strings, so 1, 2, 3 will map to AB9EFCC, 24QQZYX, etc.
There are various other approaches to that, but once I started thinking about it I also just got curious.

Fortunately, performance isn’t hugely important within reason.

One thing I did come up with works like this: I have a pre-determined fixed “seed” array of 64 random U32s. Then I take my input number and look at bit 0. If it’s 0, I xor my number with array[0], otherwise I xor it with array[1]. BUT, when doing this xor-ing I mask out bit 0 so it remainds untouched. Then I look at bit 1. If it’s 0, I xor with array[2], otherwise xor with array[3], this time making sure to leave bit 1 untouched. Repeat for all bits. Because my seed ints are constant, and because I always leave the determining bit untouched, this is trivially reversible.

This may or may not fit some mathematical definition of good hashing, but it certainly should map all U32s to all U32s in such a way that nearby starting numbers end up scattered all over the place.

Topic		Replies	Views
Hashing Factual Questions	33	1425	April 1, 2004
The identity hash Factual Questions	12	1421	September 5, 2008
Hash Functions and Dups Factual Questions	7	1610	January 12, 2016
Hashing website addresses? Factual Questions	7	881	May 2, 2006
Math Translation? Factual Questions	4	1485	December 20, 2009

Reversible hashing function

Related topics