I’ve got an idea, so I’m investigating whether it’s possible to find a hash algorithm (like MD5 or SHA-1, for example) that generates a hash of fewer than 16 characters. Anyone have any ideas?
Umm… I’m not sure I understand the question. Any function that is deterministic and one-way is usable as a hash algorithm. (That is, every time you feed it the same value, the same website address in your case I guess, it’ll return the same value.) Of course, not all hash functions are equally useful. You want an output range that’s large enough to be useful, and a function that ‘flattens out’ the distribution so that values in the range are hit roughly equally.
Any idea how big a range of hash values you want??
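If it helps, here’s a toy sketch in Python (the URL is just a placeholder) showing those two properties: feed it the same address and you get the same number back, and you get to pick how big the output range is:

```python
import hashlib

def toy_hash(url, buckets):
    """Deterministically map a URL to an integer in [0, buckets)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % buckets

# Same input, same output, every time; 'buckets' controls the range.
print(toy_hash("http://example.com/some/long/path", 10000))
print(toy_hash("http://example.com/some/long/path", 10000))
```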
No, this is actually pretty well over my head. I’m just knowledgeable enough to be dangerous.
I had this idea on a very boring, multi-hour drive through the rain yesterday. There are websites out there that let you shorten website addresses, so instead of passing around a 98-character address you can shorten it to something like 12 or 18 characters.
But eventually, with enough years of use, those sites’ counters will get longer and longer. I know that’s a long way off, since they usually use alphanumerics and such.
So the idea was to apply MD5 or SHA-1 to domains and thus obtain a unique key that could be passed around. But the long MD5 and SHA-1 keys make that less desirable, so I was wondering if there was some way to generate a shorter hash of an address (probably limiting the input to 255 characters).
And during the drive yesterday I naively thought it could be done in 8 characters, but I’m now beginning to think it would definitely require more length.
Does that clear it up any?
You can have a hash algorithm output a value of any length you want. But of course, the smaller the hash, the fewer items you can encode before a collision occurs. If you want to do something like TinyURL, perhaps an expiration system would work better; then you could re-use each code after a certain period of time.
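To put a very rough number on “fewer items before a collision”: by the birthday problem, a collision becomes more likely than not once you’ve hashed roughly the square root of the output space. A quick back-of-the-envelope sketch in Python (the 1.177 factor is just the standard 50-percent approximation):

```python
import math

def items_before_likely_collision(hex_chars):
    """Approximate item count at which a collision becomes more likely
    than not, for a hash truncated to hex_chars hexadecimal characters."""
    space = 16 ** hex_chars          # number of distinct hash values
    return 1.177 * math.sqrt(space)  # birthday-problem 50% estimate

for n in (6, 8, 12):
    print(n, "hex chars ->", round(items_before_likely_collision(n)), "items")
```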
If you want a shorter hash, just generate an MD5/SHA hash and keep only the first n characters. But hashes are a one-way function, so you can’t use them to pass around addresses the way you’re envisioning.
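For what it’s worth, that truncation is a one-liner in something like Python (hashlib is in the standard library; the address below is just a placeholder):

```python
import hashlib

def short_hash(url, n=8):
    """Keep only the first n hex characters of the MD5 digest.
    One-way: the original URL cannot be recovered from this, and the
    shorter you cut it, the sooner two URLs will share the same value."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:n]

print(short_hash("http://example.com/a/very/long/address"))  # an 8-character hex string
```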
There are several ways to do this:
- Use a hash, as you suggested. The problem with this is that hashes are innately one-way and you cannot recover the original URL from a hash. You lose the full address in the process and you just have to hope that you don’t have two addresses that reduce to the same hash, or there’d be no way to distinguish between the original two. It’s like reducing people’s names to initials: both John Dalton and Joe Deman will become J.D., and there’d be no way to tell which J.D. the initials refer to without more information.
You will need a way to resolve the hashes back to the original addresses, which might be done via something like the next method…
- Use an ID system in conjunction with a lookup table or database. Instead of hashing addresses, simply assign a new ID# to each new address. The ID number can be:
- Numeric (not so good since there isn’t enough variability)
- Alphabetic (OK. This is what TinyURL uses. It might be easier to remember than numbers, but it still offers limited variability)
- Alphanumeric (More variability, but harder for humans to remember)
- Some other expanded set of characters: 254 ASCII, for example (Excellent variability, terrible usability)
They all work. The more characters you use, the shorter the URL, but the harder it is for humans to use. TinyURL’s simple alphabetic system can accommodate 208,827,064,576 unique addresses in 8 characters (26^8). If you wanted to, I’m sure you could use a greater set of characters in your URLs, something like http://nanourl.com/5~&N=, but humans might not be able to remember or use that address very well.
Computers, on the other hand, could certainly take advantage of shorter URLs. But if you’re dealing exclusively with computers, some sort of compression might make more sense…
- Losslessly compress the URL. This is almost like hashing, except you don’t lose any data in the process. With a good compression algorithm tailor-made for URLs, I suspect that you can get halfway decent compression.
You can use any compression algorithm for this, but you’ll get better results with a tailor-made one because there are certain strings in URLs that are often repeated: com, net, org, html, php, thread, etc.
This is much safer than hashing/IDing because the process is reversible and not dependent on a database. A quick Google Scholar search reveals some hits.
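To make the tailor-made idea concrete, here’s a rough Python sketch using zlib’s preset-dictionary support. The dictionary contents and the example URL are just placeholders; a real scheme would use a fixed, versioned dictionary built from common URL fragments, and for very short strings the zlib header overhead eats into the gain (you’d also still need to encode the bytes, base64-style, to put them back into a URL):

```python
import zlib

# Fragments that appear in lots of URLs; the compressor and the
# decompressor must share the exact same dictionary.
URL_DICT = b"http://https://www.index.html.php?id=.com/.net/.org/thread"

def compress_url(url):
    c = zlib.compressobj(level=9, zdict=URL_DICT)
    return c.compress(url.encode("utf-8")) + c.flush()

def decompress_url(data):
    d = zlib.decompressobj(zdict=URL_DICT)
    return (d.decompress(data) + d.flush()).decode("utf-8")

url = "http://www.example.com/forum/thread.php?id=12345"
packed = compress_url(url)
assert decompress_url(packed) == url      # fully reversible, no database needed
print(len(url), "chars ->", len(packed), "bytes")
```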
And just to correct myself: TinyURL uses alphanumeric, not alphabetic, addresses.
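And in case it’s useful, those alphanumeric codes are basically just the sequential ID number written in base 62. A rough Python sketch (the character order is arbitrary, and the lookup table/database part is left out):

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters

def encode_id(n):
    """Render a non-negative ID number as a short alphanumeric code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode_id(code):
    """Turn the code back into the ID for the database lookup."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

# The billionth address still fits in a 6-character code.
assert decode_id(encode_id(1000000000)) == 1000000000
print(encode_id(1000000000))
```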
Thanks for the replies, folks. When I said hashing I meant compressing, because I want to be able to use the results as keys for the addresses… I’ll play around with it.