Making a 6-digit numeric hash from a string?

ETA: Ack, I can’t change the title. I didn’t mean to make it Excel-specific… could a mod please change the thread title to what the post title currently says? :smack:

So… first the question, then I’ll provide an explanation (in case there’s a better way to do this).

I have a long list of book titles that I need to turn into 6-digit unique ID numbers. Is this easy to do? I’m essentially hoping for a one-way hash function with minimal collisions.

For example:


A Story About A Dog --> 531281
A Story About A Dog Vol 2. --> 320812
The Republic --> 005135
Walden --> 000429


Basically, I just need to make a unique ID for each title, but it has to be 6 digits or shorter and all numbers.

I’m trying to digitize a library where some books have no innate ISBN and I have to make up a fake one for them. The ISBN format is such that:


978-0-01-######-?

gives me 6 digits of freedom to make fake IDs with, and the ? at the end is a checksum that I’ll generate. The publisher code (01) is not assigned, so there shouldn’t be any collisions with real-world books. The idea is to be able to print ISBN-compatible barcodes for every book, using fake ISBNs for the books missing real ones, so that they can be easily scanned and checked in/out with our software (Readerware).

There are lots of hash functions out there, and most of them would work for your purposes, where there are no cryptographic/security concerns, and where you understand that collisions are a possibility.

What does your flow look like? I.e., how are the titles stored? Do you know any scripting languages like Perl? Will this be running under Windows? Etc.

You could, for instance, take a program like this:
http://www.fourmilab.ch/md5/

Run like this:


md5 -d"Hello, world!"

And get this result:


6CD3556DEB0DA54BCA060B4C39479839

Of course, this is hex, not decimal. So you take the last 5 hex digits and convert to decimal. Unfortunately, the number you get may be as large as 1048575: slightly more than your limit. In this case, you can modify your string in some known way, such as adding spaces at the end until you get a value in range.

You could also just strip off the last 8 hex digits and then take the number mod 1000000, but that won’t give quite as nice a distribution.

Anyway, there are zillions of options, depending on your needs and data flow.

It doesn’t need to be cryptographically secure per se, but there has to be enough randomness such that a collision would be unlikely in everyday use. I don’t know how to even begin to calculate that, mathematically, for book titles of arbitrary length. I understand that MD5 is fine for normal use, but when I’m only taking the last 6 characters and converting to decimal, well, don’t the collisions go way up?

As for language/platform, it’s more or less agnostic. I am re-learning programming after a decade or so; VB or C# or a PHP script would be my preferred way to do this, but that’s not absolutely essential… originally I was hoping for an Excel formula (hence the wrong title), but that seems unlikely.

ETA: And as for the workflow, the library software (Readerware) stores its database in a proprietary format as far as I can tell, but allows easy translation to and from CSVs, so that should be easy to work with.

I got it working using this code.

Now to learn how to manipulate CSVs with C#… sigh :frowning: I can’t even remember how to do a for loop.

Perhaps this is a stupid question, but why not just use 000001, 000002, 000003 etc? That is trivial to generate in Excel and guaranteed to be unique.

Basically, because in this workflow I don’t see any way to keep track of numbers that have already been used.

This is how it works right now:
[ol]
[li]Export list of books from Readerware into CSV[/li][li]Run fakebarcode utility, which assigns ISBNs to books that don’t have one[/li][li]Reimport CSV into Readerware, replacing missing ISBNs with fake ones[/li][/ol]

But anywhere from a few days to a few months later, when we repeat this process again with 40 new titles and maybe some removed, we have to be able to do this again. And if we lose a copy of a book, gain one later and re-add it to the database, it shouldn’t have a separate ISBN from the first copy. And this whole process has to survive different computers running both the library program and the utility, which would mean that the sequential IDs would have to be saved somewhere online and I’d have to worry about sync issues and all that… just seems like a nightmarish scenario versus tying it to the book titles, which we can assume to be unique and stable.

So after butchering some snippets, I now have usable C# code which can turn any arbitrary string into a valid (but fake) ISBN-13 number:



  public partial class MainWindow : Window
    {
        const int MUST_BE_LESS_THAN = 1000000; // 8 decimal digits

           public int GetStableHash(string s)
           {
               uint hash = 0;
               // if you care this can be done much faster with unsafe 
               // using fixed char* reinterpreted as a byte*
               foreach (byte b in System.Text.Encoding.Unicode.GetBytes(s))
               {   
                   hash += b;
                   hash += (hash << 10);
                   hash ^= (hash >> 6);    
               }
               // final avalanche
               hash += (hash << 3);
               hash ^= (hash >> 11);
               hash += (hash << 15);
               // helpfully we only want positive integer < MUST_BE_LESS_THAN
               // so simple truncate cast is ok if not perfect
               return (int)(hash % MUST_BE_LESS_THAN);
           }

           public int CalculateChecksumDigit(ulong n)
           {
               string sTemp = n.ToString();
               int iSum = 0;
               int iDigit = 0;

               // Calculate the checksum digit here.
               for (int i = sTemp.Length; i >= 1; i--)
               {
                   iDigit = Convert.ToInt32(sTemp.Substring(i - 1, 1));
                   // This appears to be backwards but the 
                   // EAN-13 checksum must be calculated
                   // this way to be compatible with UPC-A.
                   if (i % 2 == 0)
                   { // odd  
                       iSum += iDigit * 3;
                   }
                   else
                   { // even
                       iSum += iDigit * 1;
                   }
               }
               return (10 - (iSum % 10)) % 10;
           }
        public MainWindow()
        {
            InitializeComponent();
        }
        private void Button_Click_1(object sender, RoutedEventArgs e)
        {
            int id = GetStableHash(BookTitle.Text);
            string fakeisbn = "978001" + id.ToString();
            SixDigitID.Text =  fakeisbn + CalculateChecksumDigit(Convert.ToUInt64(fakeisbn)).ToString();
        }

    }


It ain’t pretty (at all), but it works. Now I need to figure out how to apply to a whole list of titles in a CSV.

That hash looks perfectly fine for your use.

But yes, with only 6 digits the odds of a collision are fairly high. For just two books, it’s just 1 in 1000000. But for 1000 books, it’s about 39%. This is a general form of the birthday “paradox”, where you need surprisingly few people in a room (23!) to have >50% odds of one pair sharing a birthday. Here’s some math, if you’re interested.

Note that the length of the title is (almost) irrelevant. A good hash will distribute bits well no matter what the input. Of course, if the titles themselves collide there’s nothing you can do.

There was a bug in that code; IDs < 6 digits weren’t being properly padded. Here’s the fixed version.



           public int GetStableHash(string s)
           {
               uint hash = 0;
               // if you care this can be done much faster with unsafe 
               // using fixed char* reinterpreted as a byte*
               foreach (byte b in System.Text.Encoding.Unicode.GetBytes(s))
               {   
                   hash += b;
                   hash += (hash << 10);
                   hash ^= (hash >> 6);    
               }
               // final avalanche
               hash += (hash << 3);
               hash ^= (hash >> 11);
               hash += (hash << 15);
               // helpfully we only want positive integer < MUST_BE_LESS_THAN
               // so simple truncate cast is ok if not perfect
               return (int)(hash % MUST_BE_LESS_THAN);
           }

           public int CalculateChecksumDigit(ulong n)
           {
               string sTemp = n.ToString();
               int iSum = 0;
               int iDigit = 0;

               // Calculate the checksum digit here.
               for (int i = sTemp.Length; i >= 1; i--)
               {
                   iDigit = Convert.ToInt32(sTemp.Substring(i - 1, 1));
                   // This appears to be backwards but the 
                   // EAN-13 checksum must be calculated
                   // this way to be compatible with UPC-A.
                   if (i % 2 == 0)
                   { // odd  
                       iSum += iDigit * 3;
                   }
                   else
                   { // even
                       iSum += iDigit * 1;
                   }
               }
               return (10 - (iSum % 10)) % 10;
           }
           private void generateISBN()
           {
               string titlehash = GetStableHash(BookTitle.Text).ToString("D6");
               string fakeisbn = "978001" + titlehash;
               string check = CalculateChecksumDigit(Convert.ToUInt64(fakeisbn)).ToString();

                SixDigitID.Text = fakeisbn + check;
           }


Hmmmmmm. So far we have only 40 books without an ISBN, which according to this calculator means a <1% chance of collision. Supposing we end up with 10x that amount – 400 books with fake barcodes – the chance goes up to 8%, which is actually a source of some concern.

Nonetheless, if the alternative is some sort of external tracking of sequential IDs, I just don’t think that’s going to survive very long in our organization. What to do, what to do…

Stupidly, I just realized there’s no reason to limit the barcodes to ISBN format.

I can assign a user ID# to each book, which can either be the ISBN (if it already exists) or a 13-digit number hashed from the title. The chances of collision will be much smaller then.

Yep, so after more investigation, I’m going to assign hashed GTINs to all the books without ISBNs. GTINs are the superset to which ISBN-13s belong, and both can be printed as EAN-13 barcodes. Whereas ISBNs start with 978 or 979, it seems like I can use the 020 prefix for internal use for exactly this purpose.

In practice, hopefully this’ll mean less end-user confusion – scanning either my barcode or the book’s own ISBN barcode (where it exists) would work.

So scannability, typability, and fairly minimal chances of collision… maybe this is the best of all worlds.

Perhaps this is a stupid question too: re the OP, what exactly would the difference be between a hash function and a random number in this case?

It’s the same answer as the one I gave to sequential IDs. Basically, with hashing:

  1. There would be no need to track and manage assigned numbers
  2. The same book will always have the same ID#, even if it’s removed and re-added to the library

(In a way I suppose you could think of the hashed IDs as length-limited pseudorandom numbers with the book title (and maybe author) as the only seed)

Fair point. I hadn’t realised that this needed to be reused with a changing set of books, and still generate a stable number for each title when reused. The sync issues would indeed be a nightmare in that scenario; your approach makes much more sense.

That is actually what I was going to recommend if one didn’t have access to proper hash functions. It’s not ideal, but it’ll work for simple purposes.

Hashing won’t work if you can’t keep track of the existing numbers already used. It’s meant for rapid lookup and has to account for collisions. In that case you might as well use sequential numbers starting at 0,1,2…

The odds of a collision occurring increase with the number of problems the collision would cause.

I really don’t see the problem with this solution. Generate a bunch of numbers, then the next time you run the program, instead of starting at zero, start at the last number you generated +1. Either hand code that into the program, or make the program ask for the seed first, or better yet, have the program store the last number generated, then use that number for the next, +1.

Unlike any random or hash value, this guarantees no duplication. From a programming standpoint, it’s trivial. However, I’m not an Excel programmer, so I can’t suggest the exact code.

Hashes have lots of uses; rapid, collision-aware lookup is just one of them. Although collisions are always a possibility, with enough bits you don’t have to care–128 is pretty much enough to guarantee that the universe will end before you have a collision (real-world secure hashes use more bits to give them some headroom against malicious attacks).

The OP appears to be generating data for testing purposes. A collision isn’t the end of the world; it likely just means that someone, somewhere will notice that a book scanned back to the wrong title. The problem will be obvious when it’s actually investigated. And with 13 decimal digits and only a few hundred books, the odds of even that happening are very slim.

I agree. I was speaking very generally there. However, you ignored the second part of my post which explains that no number of bits can be used to avoid a collision if the consequences are serious enough. (No number of bits less than the number needed to uniquely identify the original string, since I wouldn’t call it hashing if that were exceeded, just some odd encoding).

I didn’t ignore it; I pointed out that for this application, the consequences aren’t actually that serious.

At any rate, I don’t agree. I would literally stake my life on a 256-bit hash (of my choosing) not colliding for two ordinary texts (i.e., not chosen maliciously). That’s a pretty serious consequence… but I have trust in math and statistics.

The odds of avoiding collisions go up exponentially as you increase the number of bits linearly, so for many applications it’s easiest to just pick a reasonably long hash.

I am being facetious. It’s just an application of Murphy’s law.

However, I’d never use a hash without a collision detection system. I don’t particularly like hashing anyway because it doesn’t scale. I’m old enough to have seen memory grow from Kbytes to Tbytes so I’ve seen the hashing algorithms get reworked many times. I’m a Btree and Ptrie fan myself. But that’s a whole other holy war. The major feature of hashing is the time efficiency factor and I don’t see that in the OP’s application.