I have a large set, where each element has a value. I’d like to sample subsets from it, where each subset contains a specific number of elements and the sum of those elements falls within a specific range. I do not need to find all such subsets.
For instance, let’s say I have the set (1,2,3,4,5,6,7,8,9,10) and I would like to sample subsets of size 3 with 10 < sum < 15; then (3,4,5) would be such a sample.
Hmm, tricky – I can’t see how you could guarantee a fair random sample without the relatively brute-force approach of enumerating all valid subsets and then picking a random item from that list. That can be done fairly straightforwardly by sorting the original set and generating all valid subsets that start at the first item, then all valid subsets that start at the second item, and so on. But if the original set is very large, it could still be an expensive job in terms of CPU time and memory.
How large is the original set? And what would be a typical value for the desired subset size?
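For what it’s worth, here’s a minimal sketch of that enumerate-then-pick idea in Python – I’m letting itertools do the enumeration rather than the sorted, start-at-each-item scheme I described, and the set, subset size and bounds are just the numbers from your example:

```python
import itertools
import random

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # the OP's example set
k = 3                                      # desired subset size
low, high = 10, 15                         # the sum must fall strictly between these

# Brute force: enumerate every size-k combination and keep those whose sum fits.
valid = [c for c in itertools.combinations(values, k) if low < sum(c) < high]

# A uniform pick from the enumerated list is a guaranteed fair sample.
print(random.choice(valid))
```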
There are numerous ways to do this; my suggestion would be a Monte Carlo approach, only because it is the simplest to code. In the graph, I would weight the links so that each weight is proportional to the ratio of the candidate value to the average value of the set, so as to limit the number of bad sets picked. Weights could also be set dynamically so that a pick that would put you over the limit cannot be taken.
So for example, the first layer of the graph contains 1…10; assume the size of the desired set is 3 and the desired sum is between 5 and 15. This means we want to pick values that average less than 5 per pick but more than 5/3. So if we pick a small value we want to pick a large value next to compensate, and similarly if we pick a large value we want to pick a small one next. The second layer contains 2…10; 1, 3…10; 1…2, 4…10; and so on. The weight from 1->10 would be very high, as would 1->9. Similarly the weight from 10->1 would be high. And so on.
As you get deeper into the graph some weights would drop to zero if picking it would put you over / under the desired sums.
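Roughly, in code it might look like the sketch below. This is only an illustration: I collapse the graph layers into a per-pick weight based on distance from the target per-pick average, the restart limit and weighting formula are arbitrary, and it still doesn’t guarantee a uniform sample over all valid subsets.

```python
import random

def feasible(total, pool, picks_left, low, high):
    """Can the running total still land strictly inside (low, high) after
    taking picks_left more distinct values from pool?"""
    if picks_left == 0:
        return low < total < high
    pool = sorted(pool)
    return (total + sum(pool[:picks_left]) < high and
            total + sum(pool[-picks_left:]) > low)

def weighted_sample(values, k, low, high, max_restarts=10_000):
    """Weighted Monte Carlo sketch: candidates that would make the bounds
    unreachable get weight 0; the rest are weighted to favour picks that pull
    the running sum toward the target per-pick average."""
    target = (low + high) / 2 / k                    # ideal contribution per pick
    for _ in range(max_restarts):
        pool, chosen, total = list(values), [], 0
        for picks_left in range(k, 0, -1):
            weights = []
            for v in pool:
                rest = [x for x in pool if x != v]   # assumes distinct values
                if feasible(total + v, rest, picks_left - 1, low, high):
                    weights.append(1.0 / (1.0 + abs(v - target)))  # closer -> heavier
                else:
                    weights.append(0.0)              # weight drops to zero
            if sum(weights) == 0:
                break                                # dead end, start over
            v = random.choices(pool, weights=weights, k=1)[0]
            chosen.append(v)
            total += v
            pool.remove(v)
        else:
            return chosen
    return None

print(weighted_sample(range(1, 11), 3, 10, 15))
```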
If you want to do a more sophisticated approach, you could use just about any optimizer to solve this. I’ll use a genetic algorithm as an example. The genome would consist of a single sequence with N genes, where N is the size of the subset you want. Each gene would have min/max values equal to the min/max of your set (if your set has irregular values, then it would be 1…N, corresponding to a position in the set). The GA would halt when it found a sufficient solution and then execute again (you could store previous solutions and give them a fitness of 0). The fitness function would be proximity to the min/max desired values for the sums. This is actually the way I would go, but only because I have all the libraries written to do this. This approach would be much less CPU- and memory-intensive, but likely more coding if you don’t have a GA lying around.
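A very rough sketch of that idea (not the libraries I actually use): mutation-only to keep it short, no crossover, genomes encoded as positions into the set, and the population size, generation cap and mutation rate are arbitrary choices.

```python
import random

def ga_sample(values, k, low, high, pop_size=50, generations=500, mut_rate=0.3):
    """Bare-bones GA sketch: each genome is k distinct positions into `values`
    (the 1…N-style encoding mentioned above); fitness is how far the genome's
    sum falls outside (low, high).  Mutation-only; parameters are arbitrary."""
    n = len(values)

    def fitness(genome):                    # 0 means the sum is inside the range
        s = sum(values[i] for i in genome)
        if s <= low:
            return (low - s) + 1
        if s >= high:
            return (s - high) + 1
        return 0

    def mutate(genome):
        child = list(genome)
        if random.random() < mut_rate:      # swap one position for an unused one
            spot = random.randrange(k)
            unused = [i for i in range(n) if i not in child]
            if unused:
                child[spot] = random.choice(unused)
        return child

    pop = [random.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        if fitness(pop[0]) == 0:            # halt on a sufficient solution
            return [values[i] for i in pop[0]]
        survivors = pop[:pop_size // 2]     # keep the better half...
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return None

print(ga_sample(list(range(1, 11)), 3, 10, 15))
```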
I’m finding this a fun problem to play with until I have to start getting ready to go to a wedding. Something to keep in mind, though: depending on the range of values, how narrow the gap between the upper and lower bounds is, and where the center of that range sits relative to the average value in your set multiplied by the size of your subset, just picking totally random sets and checking whether they fit your criteria can be very fast.
For example, using my years-old laptop, if I take a random set of 40,000 numbers between 1 and 1,000,000 and request a subset of 300 values that sum to between 145,000,000 and 150,000,000, I can get a result in less than a second.
If I make it harder and require the sum to be between 125,000,000 and 130,000,000, it’ll usually take less than a minute but occasionally much longer.
So depending on exactly what your input looks like and what you’re going for on the output, you might not require anything more sophisticated than total randomness.
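In code, the totally-random approach is about as short as it gets; the demo numbers below mirror the experiment above, though of course timings will depend on your hardware.

```python
import random

def rejection_sample(values, k, low, high, max_tries=10_000_000):
    """Keep drawing random size-k subsets until one's sum lands in the range.
    Fast whenever the target range is easy to hit by chance; can run a very
    long time (or give up) when it isn't."""
    for _ in range(max_tries):
        pick = random.sample(values, k)
        if low < sum(pick) < high:
            return pick
    return None

# Roughly the experiment described above: 40,000 values in 1..1,000,000,
# subsets of 300 summing to between 145,000,000 and 150,000,000.
values = [random.randint(1, 1_000_000) for _ in range(40_000)]
print(sum(rejection_sample(values, 300, 145_000_000, 150_000_000)))
```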
Good answer, Mithras. My answer assumes that the search criteria would be hard to satisfy by random guessing, but you’re correct: if the answer is easy to find, then for example the GA would likely find a sufficient solution in one generation (which is equivalent to a random guess) and you’d be doing extra coding, setup, etc. for nothing.
The same is true with the weighted Monte Carlo approach. No weighting is necessary when the end condition is easy to find. At that point the Monte Carlo approach reduces to random guessing as well.
To choose a set:
First number = X = choose any value from the set.
Second number = Y = choose any value from the set such that 10 < X + Y + 1 < 15 (the +1 leaves at least minimum room for Z).
Note: you may need to add a condition based on the max value in the set, meaning the minimum Y might have to be >= some other number based on the max and X; if the set has values large enough, that condition is not needed.
Third number = Z = choose any value from the set such that 10 < X + Y + Z < 15.
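Here is a rough translation of those steps into code for the example set – the feasibility conditions generalize the “+1” and the note about the max value, and it simply restarts if a step leaves no legal choice:

```python
import random

values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}   # the OP's example set
low, high = 10, 15

def pick_three(max_restarts=1000):
    """Sequential pick: X freely, then Y and Z restricted so the final sum
    can still land strictly between low and high."""
    for _ in range(max_restarts):
        x = random.choice(list(values))
        # Y must leave room for at least the smallest remaining value (the "+1"),
        # and (per the note above) must not leave the sum unreachably low either.
        y_candidates = [y for y in values - {x}
                        if x + y + min(values - {x, y}) < high
                        and x + y + max(values - {x, y}) > low]
        if not y_candidates:
            continue
        y = random.choice(y_candidates)
        z_candidates = [z for z in values - {x, y} if low < x + y + z < high]
        if not z_candidates:
            continue
        return x, y, random.choice(z_candidates)
    return None

print(pick_three())
```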
I was going to suggest something similar.
I think we lose a little randomness by forcing some of the values. If the sets are such that some values will have very few options when they are successfully included in a subset, then those subsets may be overweighted by picking that value for X.
Either that, or I need more caffeine.
This may or may not be acceptable for the OP’s definition of random.
Problem is, the knapsack problem is well-known exactly for being a problem that is really hard (or at least computationally expensive) to solve.
Depending on the details of the search space, even finding a single solution can be difficult unless you are willing to throw enormous amounts of computing power at it. There are standard approaches indeed, but all of them are at least one of: slow, complex, not guaranteed to give the optimal answer. (And I’m afraid that if you say “OK, I’ll take ‘complex’ then” you’re still out of luck.)
Clearly then, finding an answer which is guaranteed to return a fair random pick from the set of all possible solutions, is not going to be easier than finding a single solution…
Never mind, you’re correct, Digital is the new Analog: it will be weighted towards the lower numbers, and the algorithm needs a weighting factor to end up with randomly distributed sets.
If you want the sum to be as close as possible to a specific value, as opposed to within a certain range, then you’re going to need some kind of search. The Monte Carlo approach won’t really work well for that. You could use Monte Carlo with an exploration-and-exploitation technique where it learns which paths are good and weights them accordingly.
Just for fun, I tried it with a GA for a range of values, and it found solutions even within a very narrow range very quickly, as expected. Generally fewer than 150 generations, with a total processing time per solution of 10 seconds or so, although I didn’t time it very closely. It would be trivial to change the fitness function to target a single value (since a single value is a special case of the narrowest of ranges), although I have no doubt it would take much longer to run.
Ultimately, it comes down to what you need or are willing to accept.
Out of curiosity, are there any other details that are missing, and do you need the source code?
If there are no other details, you don’t need the source code, and it is a one-time execution, I’d be willing to run it for you; it wouldn’t take me very long. I just don’t want to get myself into a quagmire of “Oh, well, there’s this condition and that condition.” But if it really is this straightforward, I’ll help.
If it’s your first pick, or the average of the numbers you’ve picked so far is within .001% of your target sum divided by the number of elements you’re shooting for and it’s not your last pick, make a totally random pick. In all other cases, pick the number that brings the average of all the numbers you’ve picked closest to your target sum divided by the number of elements you desire.
This is very fast, leads to a sum that is rarely more than 200 from the target if the inputs are randomly distributed between 1 and 1,000,000, and results in a list of numbers that looks random.
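Here is a sketch of that rule, reading the condition as the running average of the numbers picked so far; the .001% tolerance and the demo numbers come from the posts above, while the 147,500,000 midpoint target is my own choice for the example:

```python
import random

def greedy_average_pick(values, k, target_sum):
    """Sketch of the rule above.  Pick at random on the first pick, or while
    the running average is already essentially on target (except on the last
    pick); otherwise take the value that pulls the running average closest
    to target_sum / k."""
    target_avg = target_sum / k
    pool = list(values)
    chosen = []
    for i in range(k):
        is_first, is_last = (i == 0), (i == k - 1)
        current_avg = sum(chosen) / len(chosen) if chosen else 0.0
        on_target = abs(current_avg - target_avg) <= target_avg * 1e-5   # .001%
        if is_first or (on_target and not is_last):
            pick = random.choice(pool)
        else:
            # value whose inclusion leaves the new running average closest to target
            s, n = sum(chosen), len(chosen)
            pick = min(pool, key=lambda v: abs((s + v) / (n + 1) - target_avg))
        chosen.append(pick)
        pool.remove(pick)
    return chosen

# e.g. 40,000 values in 1..1,000,000, 300 picks aiming at a 147,500,000 sum
vals = [random.randint(1, 1_000_000) for _ in range(40_000)]
print(sum(greedy_average_pick(vals, 300, 147_500_000)))
```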