Summary of question:

How are hash functions like SHA1 evaluated so that their designers know collisions are really as infrequent as the probabilities seem to indicate?

Details:

Using Git (source control) triggered my interest in the details of hash functions like SHA1 and the possibility of duplicates (collisions).

I understand that the number of possible hash values is very large and that, when analyzed purely from that perspective, the probability of duplicates is extremely low.
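
For reference, here is a rough sketch of the calculation I have in mind (the birthday bound). It assumes the hash outputs behave like uniformly random values, which is exactly the assumption I am asking about. The function name and the 160-bit default (SHA1's output size) are my own choices for illustration.

```python
import math

def collision_probability(n_items: int, hash_bits: int = 160) -> float:
    """Approximate probability of at least one collision among n_items
    hash values, ASSUMING each value is uniformly random over 2**hash_bits."""
    space = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))

if __name__ == "__main__":
    # A billion objects in a 160-bit space: vanishingly small.
    print(collision_probability(10**9))
    # Around 2**80 objects the probability becomes substantial.
    print(collision_probability(2**80))
```

Under that uniformity assumption, collisions only become likely once the number of inputs approaches roughly 2^80, far beyond what any repository holds.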

But just having a large set of output values isn’t enough to trust that duplicates would be very infrequent. The space of possible inputs is so much larger than the space of hash values that duplicates must exist when viewed from that perspective. We know that the number of actual inputs will be far smaller than the number of possible inputs, and also smaller than the number of possible outputs, but that alone still doesn’t seem like enough to be comfortable that there will be no duplicates.
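
To illustrate the point that duplicates must exist whenever there are more inputs than outputs, here is a small sketch that shrinks the output space by keeping only the first 16 bits of each SHA1 digest. The truncation, the function name, and the "input-N" strings are assumptions made purely for the demonstration; nothing here says anything about full 160-bit SHA1.

```python
import hashlib

def find_truncated_collision(bits: int = 16):
    """Return two distinct inputs whose SHA1 digests share the first `bits` bits.

    With only 2**bits possible prefixes, a repeat is guaranteed after at most
    2**bits + 1 distinct inputs (pigeonhole principle).
    """
    seen = {}  # maps truncated digest -> input that produced it
    i = 0
    while True:
        data = f"input-{i}".encode()
        digest = int.from_bytes(hashlib.sha1(data).digest(), "big")
        prefix = digest >> (160 - bits)  # keep only the leading bits
        if prefix in seen:
            return seen[prefix], data
        seen[prefix] = data
        i += 1

if __name__ == "__main__":
    a, b = find_truncated_collision(16)
    print(a, hashlib.sha1(a).hexdigest())
    print(b, hashlib.sha1(b).hexdigest())
```

The two printed digests start with the same four hex characters even though the inputs differ, which is the small-scale version of the concern above.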

To be comfortable that in real-life usage the actual inputs will not cause collisions, there must be some mathematical tests that are performed on the algorithm. So the question is: what are those tests? How do the designers approach this problem? How can they be sure they didn’t pick the one algorithm that just happens to have a large number of collisions for the kind of input that will actually be passed to the function?