Classification problem

I have a large number of integers, in the tens of thousands, and I want to map all of them to one of seven possible values.

e.g.
1 maps to 6
2 maps to 1

12,328 maps to 3

This mapping is known to me, and I want to write a C program that implements it.

I can think of several ways of doing this, but I wonder whether it is a well-known problem with a nice, clean, efficient solution. For example, I would like to avoid creating an array of size 20,000 in which I store the value for each case.

If there is some structure in the mapping, I can see how it can be exploited to make a more intelligent mapping. But, if there is no structure to the mapping, is it the case that we cannot do better than simply storing all the mappings in an array of size 20,000?

[The actual problem is slightly more complex, but I can map the actual problem to the above. The original problem involves mapping n-tuples to one of seven values, but I can easily convert each n-tuple to a unique integer representing that n-tuple]

Information theory says you’re going to need at least 56,148 bits to hold an arbitrary mapping of 20,000 things to 7 values. Unless there is some computable rationale behind the mapping, that’s the best you can do.
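(For reference, that figure is just 20,000 × log2(7) ≈ 20,000 × 2.807 ≈ 56,148 bits, or a shade over 7,000 bytes.)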

If there really is “no structure” to the mapping (you’re given a table of key-value pairs you have to use), then an array of values is (almost by definition, see “Kolmogorov complexity”) as good as you can do.

Practically speaking, if this is for a PC application, why are you worried about 20K values? This is probably 0.005% of your computer’s RAM, even if you’re spendthrift and use 8 bits per value. If you feel like it, you can pack them into 3 bits each (eight values per three 8-bit bytes) and save 62% of even that. (You can save a little more still, but it’s probably not worth it.)

If the integers are consecutive (or mostly consecutive), a linear array is best for storage and access; if they are sparse, you probably want a hash or balanced-tree-based map (e.g., an STL map) to do the key-value lookup.
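In plain C (no STL), one simple stand-in for that map when the key set is fixed is a sorted key array searched with the standard bsearch; a rough sketch, with a few made-up entries based on the examples in the OP:

    #include <stdlib.h>

    /* Keys sorted ascending; vals[i] holds the 0..6 output for keys[i].
       These few entries are made up, following the examples in the OP. */
    static const int keys[] = { 1, 2, 12328 };
    static const unsigned char vals[] = { 6, 1, 3 };
    #define NKEYS (sizeof keys / sizeof keys[0])

    static int cmp_int(const void *a, const void *b)
    {
        int ka = *(const int *)a, kb = *(const int *)b;
        return (ka > kb) - (ka < kb);
    }

    /* Returns the mapped value, or -1 if the key isn't in the table. */
    int lookup(int key)
    {
        const int *p = bsearch(&key, keys, NKEYS, sizeof keys[0], cmp_int);
        return p ? vals[p - keys] : -1;
    }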

But if all you want is a repeatable pseudorandom mapping, you can hash the keys down to as few bits as you want.
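A minimal sketch of such a hash in C (the multiplier is just a common multiplicative-hash constant, nothing specific to this problem):

    #include <stdint.h>

    /* Deterministic "random-looking" assignment of a key to one of 7 classes. */
    unsigned map7(uint32_t key)
    {
        uint32_t x = key;
        x ^= x >> 16;
        x *= 2654435761u;   /* Knuth-style multiplicative constant */
        x ^= x >> 16;
        return x % 7;       /* fold down to 0..6 */
    }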

You could also build the map the other way around, keyed by the output value, so that all elements mapped to a given value can be accessed at once.
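In C, that inverse view could simply be seven arrays of keys, one per output value. A minimal sketch, assuming the keys are just 0..19,999 and the forward table already exists as an array called mapping:

    #include <stdlib.h>

    #define NKEYS   20000   /* number of integers, per the OP's estimate */
    #define NVALUES 7

    /* Hypothetical forward table: mapping[k] is the 0..6 value for key k. */
    extern const unsigned char mapping[NKEYS];

    /* by_value[v] holds every key that maps to v; counts[v] is its length. */
    static int   *by_value[NVALUES];
    static size_t counts[NVALUES];

    void build_inverse(void)
    {
        size_t k, v;
        for (v = 0; v < NVALUES; v++)
            by_value[v] = malloc(NKEYS * sizeof *by_value[v]);  /* worst-case size */
        for (k = 0; k < NKEYS; k++) {
            v = mapping[k];
            by_value[v][counts[v]++] = (int)k;
        }
    }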

You might be able to get away with using neural nets, but there’s some possibility of error, and it’s significantly easier to write the table. What’s the issue with doing it that way?

Well, if there’s no structure to the mapping, how is the table going to be created other than by hand or by reading in a file (if one exists) detailing the mapping?

Since the mapping is known, after training I could just cycle through all input values and make sure the neural net gets them all correct. If it doesn’t I could train it again and/or try a different neural net configuration.

I just checked and the actual number is 93,000 entries, so even if we store 3 bits per entry, that’s 10 entries per 32-bit word, which makes the table about 37 KB.
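For what it’s worth, indexing that 10-entries-per-32-bit-word layout takes only a couple of lines of C; a sketch, assuming the keys index the table directly (0..92,999) and the packed table is generated offline:

    #include <stdint.h>

    #define NENTRIES 93000
    /* Packed table: entry i lives in bits 3*(i%10) .. 3*(i%10)+2 of word i/10. */
    extern const uint32_t packed[(NENTRIES + 9) / 10];

    /* Returns the 3-bit value (0..6) stored for entry i. */
    unsigned packed_get(uint32_t i)
    {
        return (packed[i / 10] >> (3 * (i % 10))) & 7u;
    }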

This is not for a PC application; it will go onto a device with somewhat limited memory, so 37 KB is a bit large. Also, there are time constraints on how soon the answer has to be computed, so any non-table solution can’t be too complex.

From the responses so far, it seems that if there is no structure, there is not much that can be done.

So, besides neural nets, is there another way to discover possible structure in the mapping?

If neural nets are the best bet, are there any open-source neural net programs that you know of and would recommend?

I assume that if I want to uncover any possible structure, it would be best to use the original n-tuple values (mentioned in the OP) as n inputs to the neural net (or other program), instead of the unique integer representation of each n-tuple.

If the list of 20,000 things is fixed, you could just map six of the possible values and leave the 7th as the default (if the key is not in the array, then the mapped value is thus-and-so).

Make the default, of course, the value with the largest set of keys, for best efficiency.
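Combined with any of the lookups sketched above (restricted to the six smaller classes), that’s just a fallback; a sketch, where DEFAULT_VALUE and lookup() are placeholders:

    #define DEFAULT_VALUE 6   /* hypothetical: whichever output covers the most keys */

    /* Any search over the six smaller classes only (e.g. the bsearch sketch above);
       returns -1 for keys it doesn't hold. */
    extern int lookup(int key);

    int map_value(int key)
    {
        int v = lookup(key);
        return (v < 0) ? DEFAULT_VALUE : v;
    }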

You can look into various classification algorithms, but unless there are significant differences among the sets of inputs that map to each output, none of them are going to work. And the point made above bears repeating: if there really is no structure, you can’t do better than storing the whole table.

Still, if you can get a simple piece of code that classifies most of the data correctly, you’ll be able to get away with a smaller table lookup for the rest of the values.
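That hybrid needs very little glue code: a cheap rule handles the bulk of the keys, and a small exception table overrides it. A sketch, where simple_rule and exception_lookup are purely hypothetical placeholders for whatever the data turns out to support:

    /* Hypothetical cheap rule that classifies most keys correctly. */
    extern int simple_rule(int key);

    /* Small exception table (sorted array, hash, whatever) for the keys the
       rule gets wrong; returns -1 when the key isn't an exception. */
    extern int exception_lookup(int key);

    int classify(int key)
    {
        int v = exception_lookup(key);
        return (v >= 0) ? v : simple_rule(key);
    }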