Summary of question:
How are hash functions like SHA1 evaluated so that we can be confident collisions are as infrequent as the probabilities seem to indicate?
Details:
Using Git (source control) triggered my interest in reading about the details of hash functions like SHA1 and the possibility of dups.
I understand that the number of possible hash values is very large, and that, analyzed purely from that perspective, the possibility of dups is extremely low.
But just having a large set of values isn't enough to trust that dups would be very infrequent. The space of possible inputs is so much larger than the space of hash values that, viewed from that perspective, we know dups must exist. And even though the number of actual inputs will be smaller than the number of possible inputs, and also smaller than the number of possible outputs, that alone still doesn't seem like enough to be comfortable with no dups.
To be comfortable that in real-life usage the actual inputs will not cause collisions, there must be some mathematical tests performed on the algorithm. So the question is: what are those tests? How do they approach this problem? How can they be sure they didn't pick the one algorithm that just happens to have a large number of collisions with the type of input that will actually get passed to the function?
One method they use is to try and make the output as sensitive as possible to small changes of the input. This is called the “avalanche effect”: Ideally, if you change one bit of the input, every bit of the output should flip or not flip with 50% probability. One common way to achieve this is to build your hash function around a product cipher, which amounts to iterating operations multiple times such that every bit of the input has a good chance of influencing every bit of the output (this is called “mixing”).
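As a rough illustration (my own sketch, not part of any standard test suite), here is how one might measure the avalanche effect for SHA-1 using Python's hashlib: flip a single input bit and count what fraction of the 160 output bits change. For a well-mixed hash the fraction should hover around 50%.

```python
import hashlib

def bits_changed(data: bytes, byte_index: int, bit_index: int) -> float:
    """Flip one bit of `data` and return the fraction of SHA-1 output bits that change."""
    flipped = bytearray(data)
    flipped[byte_index] ^= 1 << bit_index
    h1 = int.from_bytes(hashlib.sha1(data).digest(), "big")
    h2 = int.from_bytes(hashlib.sha1(bytes(flipped)).digest(), "big")
    return bin(h1 ^ h2).count("1") / 160  # SHA-1 digests are 160 bits long

if __name__ == "__main__":
    msg = b"The quick brown fox jumps over the lazy dog"
    for bit in range(8):
        print(f"flip bit {bit} of byte 0 -> {bits_changed(msg, 0, bit):.1%} of output bits changed")
```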
There’s precious little in the way of mathematical proof regarding how good of a hash function you end up with; entropy is good, as is the avalanche criterion I elucidated above, but beyond that… well… that’s why experts attempt to break these things, and why smart people don’t rely on new or proprietary hash functions for cryptographic work. (Encryption functions are much the same in that regard.)
I won’t pretend to understand the math here, but there’s an open source hash test suite meant to compare algorithms against one another (both speed and quality). It includes some of the tests that Derleth talks about:
A few tests of a hash that I can come up with are:
1. If you modify a single bit of the input, it should affect all output bits with equal probability.
2. Across all possible inputs, the distribution of 1 bits to 0 bits in the output should follow a bell curve (a rough sketch of checking this follows the list).
3. Similarly, any equal-length subset of the output should follow the same distribution.
4. The output for any input remains consistent from trial to trial.
5. Ideally, one would be able to mathematically prove that all information about the input was destroyed by the hashing algorithm.
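A minimal sketch (my own, not taken from the test suite mentioned above) of what checking test 2 might look like, assuming SHA-1 and random 32-byte inputs: hash many inputs, count the set bits in each 160-bit digest, and check that the counts cluster around 80 in a roughly bell-shaped curve.

```python
import hashlib
import os
from collections import Counter

def set_bits(data: bytes) -> int:
    """Number of 1 bits in the 160-bit SHA-1 digest of `data`."""
    return bin(int.from_bytes(hashlib.sha1(data).digest(), "big")).count("1")

if __name__ == "__main__":
    trials = 100_000
    counts = Counter(set_bits(os.urandom(32)) for _ in range(trials))
    mean = sum(k * v for k, v in counts.items()) / trials
    print(f"mean set bits: {mean:.2f} (ideal: 80 of 160)")
    # Crude text histogram around the centre; it should look roughly bell-shaped.
    for k in range(70, 91):
        print(f"{k:3d} | {'#' * (counts[k] // 100)}")
```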
#5 is actually fairly simple. For example, if I have a hashing algorithm where I take an integer value n and divide it by 2n, I know that the result is 0.5 regardless of input. Knowing that every input n results in 0.5, I can confidently state that it is impossible to determine from the output what the input was. The output was created solely using operations that result in complete loss of all information about the input. And since all of them lose all information for all data, we can confidently say that the hashing algorithm is irreversible.
The problem with #5 is that all the other rules require that the information not be lost - it must affect every single bit of the output - and that the input information not be replaced by random data, since consistent results are required. So basically #5 has to be replaced with #6:
6. It should be mathematically proven that the information added by the algorithm dilutes the input information, regardless of input length, to a small enough share of all the information used to compute each output bit that a device with n processing power could not analyze a sufficient number of sample outputs to separate the input information from the algorithm's information within a target time of m units.
Ideally, across the space of all possible inputs, all results in the space of possible outputs would be equally represented. Anything that meets that criterion destroys as much information as possible.
Note that destroying as much information as possible doesn’t guarantee a good hash function: Truncation to n bits will accomplish that, for instance, but is still a lousy hash function. But that’s what the other criteria are for.
The problem is that the real criterion for a good hash function is that, given any hash, it should be very difficult to find an input that produces that hash. But it’s tough to rigorously define “very difficult”. Or rather, you can define it (number of computer operations needed to find one, using the optimal algorithm, would be a good standard), but it’s then very difficult to prove difficulty (since you don’t know if your algorithm is optimal).
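To make the truncation point concrete, here is a toy sketch (my own; trunc_hash is a made-up example, not a real function): truncation spreads outputs evenly over random inputs, yet finding a preimage for it takes no work at all, whereas for SHA-1 the only generic approach is trial and error over an infeasibly large search space.

```python
import hashlib
import os

def trunc_hash(data: bytes) -> bytes:
    """Toy 32-bit hash: keep only the first 4 bytes of the input."""
    return data[:4].ljust(4, b"\x00")

# Over random inputs the truncation hash is perfectly uniform, yet finding a
# preimage for any target is trivial: the target itself is a preimage.
target = bytes.fromhex("deadbeef")
assert trunc_hash(target) == target

# For SHA-1 there is no known shortcut: hitting a given digest means trying
# inputs until one works, which for a 160-bit output is hopeless in practice.
wanted = hashlib.sha1(b"some secret input").digest()
attempts = 100_000
found = any(hashlib.sha1(os.urandom(16)).digest() == wanted for _ in range(attempts))
print(f"preimage found by random search after {attempts} tries: {found}")  # almost surely False
```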
Not all uses of hash functions destroy information. In particular, consider the case where the input is shorter than the output (either in raw length or in total entropy). This is frequently the case when hashing passwords - the input might have only a few dozen bits of entropy, while the output could be 256 bits.
A hash function with good statistical properties will have (in practice) no collisions in cases like this, and therefore we can always recover the inputs perfectly. However, ideally it should take 2^(n-1) operations on average (where n is the input length in bits) to determine this input - that is to say, it requires brute forcing.
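A small sketch of that brute-forcing point (my own; the 4-digit PIN and the use of SHA-1 are illustrative assumptions, not from the thread): when the input space is tiny, exhausting it recovers the input from its hash almost instantly.

```python
import hashlib

def sha1_hex(s: str) -> str:
    """Hex SHA-1 digest of a string."""
    return hashlib.sha1(s.encode()).hexdigest()

# A "leaked" hash of a 4-digit PIN: the whole input space is only 10,000 values.
leaked = sha1_hex("7291")

recovered = next(pin for pin in (f"{i:04d}" for i in range(10_000))
                 if sha1_hex(pin) == leaked)
print(f"recovered PIN: {recovered}")  # prints 7291 after at most 10,000 hash operations
```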