Algorithm question - generating subsets

If you’re going to go through all the trouble of assessing the numbers, why not just assign each an appropriate weight and pick randomly in proportion to the weights as originally suggested? I don’t see the advantage to this approach.

It’s not necessarily the optimum way to do things. But it can be written in a couple dozen lines, is fast enough, and produces results that are probably good enough. And like I said, I’m just playing around until I have to get ready for a wedding.

  1. Sort the set.
  2. Determine the number of desired elements (numElements).
  3. Copy the boundary into a local variable (currBoundary).
  4. Use a binary search to find the first value that is equal to or greater than currBoundary. Set a variable to the offset of that value in the set (right).
  5. Choose a random index between 0 and right-1 (index).
  6. Add set[index] to your output (outSet).
  7. If outSet has numElements, return outSet.
  8. Subtract the chosen value from currBoundary (currBoundary -= set[index]).
  9. GOTO 4.
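Roughly, in Python (a quick, hedged sketch of those steps rather than production code; it assumes picks are made without replacement and bails out with None when nothing fits):

```python
import bisect
import random

def pick_subset(values, num_elements, boundary):
    # Step 1: sort the set.
    vals = sorted(values)
    # Step 3: copy the boundary into a local variable.
    curr_boundary = boundary
    out_set = []
    while True:
        # Step 4: binary search for the first value >= curr_boundary.
        # If every remaining value is smaller, this is one past the last
        # element (the case flagged later in the thread).
        right = bisect.bisect_left(vals, curr_boundary)
        if right == 0:
            return None  # nothing left fits under the boundary (not covered by the steps)
        # Step 5: random index in [0, right - 1].
        index = random.randrange(right)
        # Step 6: add the chosen value to the output.
        out_set.append(vals[index])
        # Step 7: done?
        if len(out_set) == num_elements:
            return out_set
        # Step 8: shrink the boundary; also drop the element so it
        # isn't picked twice (assuming picks are without replacement).
        curr_boundary -= vals.pop(index)
```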

I might not be understanding your algorithm, but won’t step 4 fail when none of the values is greater than the boundary? In the example, the set values are 1…10 and the boundary values are 10 and 15.

I didn’t really understand what you said in the second paragraph.

I assign new weights after each selection as max(1 - abs(average - value) / average, 0). Not sure if this is what you suggested. Anyway, this results in the overall value of the sample set being somewhat too low, since there are a lot of elements with a small value in the full set. So I up-adjusted the weights of all elements with a value higher than the average. And I made it so that the final element is chosen to bring the value of the set as close as possible to the desired value.
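In code form the weighting is roughly the following; the boost factor for above-average elements is just a placeholder, and the special handling of the final element isn’t shown:

```python
def weight(value, average, boost=1.5):
    # Closer to the running average scores higher, floored at zero.
    w = max(1 - abs(average - value) / average, 0)
    # Up-adjust elements above the average; the actual factor is a placeholder.
    if value > average:
        w *= boost
    return w
```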

I think you’re understanding it. Pure Monte Carlo would be just a random guess (as suggested by Mithras), so for example:

(1) Pick from {1…10}. Say 2.
(2) Pick from {1,3…10} and assign each an even chance.
etc.

The problem with that approach, simple as it is, is that you can end up with many invalid sets. This is compensated for by the fact that it is very fast, so you can cycle through candidate sets very quickly.
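Concretely, pure random guessing is just rejection sampling, something like the sketch below (low and high stand in for whatever tolerance around the target sum is acceptable):

```python
import random

def random_guess(values, num_elements, low, high, max_tries=1_000_000):
    # Draw a random subset of the right size and keep it only if its
    # sum lands in [low, high]; otherwise throw it away and try again.
    for _ in range(max_tries):
        candidate = random.sample(values, num_elements)
        if low <= sum(candidate) <= high:
            return candidate
    return None  # gave up: most draws were invalid
```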

Weighting the values tries to give the algorithm some guidance so that it makes better picks. You can do this a few ways. One is learning by exploration and exploitation: after completing a set you weight your picks based on how good they were (exploration); then, as more and more weights get adjusted, the algorithm increasingly finds good sets (exploitation). This is very useful in certain circumstances and would work here.

However, since we know the desired outcome we can pre-weight everything. Say the desired sum is 15 with 3 values. This means we’d really like to pick things that give us an average value of 5. We don’t want to be deterministic about it, because then we’d get the same set every time, so instead we weight each candidate based on how close it gets us to that average. For example:

(1) Pick from {1…10}. Say 2.
(2) Pick from {1,3…10}. We don’t give everything an equal chance, though. 10 is a great pick: the average of 2 and 10 is 6! 3 is a terrible pick, as the average is 2.5. So, just as a rough estimate, I would assign each value a weight of:

1 / (1 + |desired average - average if picked|^2)

1+2 = 3, average 1.5 => 1 / (1 + |5 - 1.5|^2)
3+2 = 5, average 2.5 => 1 / (1 + |5 - 2.5|^2)
4+2 = 6, average 3.0 => 1 / (1 + |5 - 3.0|^2)
…
10+2 = 12, average 6.0 => 1 / (1 + |5 - 6.0|^2)

Add up all the weights, pick a random number from 0 to sum(weights), and find the winner. It is very likely (in this case) to be 10, but not absolutely so.

(3) Say we do end up with 10. So we now have 2,10. Pick from {1,3…9}. Using a similar process we say:

2+10+1 = 13, average 4.33 => 1 / (1 + |5 - 4.33|^2)
2+10+3 = 15, average 5.00 => 1 / (1 + |5 - 5.00|^2)
2+10+4 = 16, average 5.33 => 1 / (1 + |5 - 5.33|^2)

etc.
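Putting the whole thing in code, the build might look something like this (a rough sketch of the scheme above, not tuned in any way):

```python
import random

def weighted_build(values, num_elements, desired_sum):
    desired_avg = desired_sum / num_elements
    pool = list(values)
    chosen = []
    total = 0
    for _ in range(num_elements):
        # Weight each candidate by 1 / (1 + |desired avg - avg if picked|^2).
        # (Unlike the worked example, the first pick is weighted too.)
        weights = [
            1 / (1 + abs(desired_avg - (total + v) / (len(chosen) + 1)) ** 2)
            for v in pool
        ]
        # Add up all the weights, pick a random number from 0 to sum(weights),
        # and find the winner.
        pick = random.uniform(0, sum(weights))
        cumulative = 0.0
        winner = len(pool) - 1  # fall back to the last element on float edge cases
        for i, w in enumerate(weights):
            cumulative += w
            if pick <= cumulative:
                winner = i
                break
        value = pool.pop(winner)
        chosen.append(value)
        total += value
    return chosen
```

For the example, weighted_build(list(range(1, 11)), 3, 15) usually lands on or very near a sum of 15, but not always, so you’d still check the result and retry if it misses the acceptable range.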

You can use just about any weighting formula that works; the point is just to give the algorithm some guidance. If you really struggle with it, don’t bother: go pure random. It’ll take longer to compute, but you’ll get answers. :slight_smile:

Hope that helps.

Given that the problem is NP-hard …

One standard approach to such problems is to use a “First Fit” type solution. In this case, I would have started with First Fit Decreasing. I.e., sort the elements, find the largest that “fits”, find the next largest that fits the remainder, etc.

But the rule about having a fixed number of elements, rather than any number, really messes this up. You’re not trying to fit as many elements as possible into a given range, but a specific number of them. So sort order may actually slow things down rather than help.
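For illustration, a First Fit Decreasing-style pass constrained to a fixed element count might look something like the sketch below (target plays the role of the upper end of the acceptable range); as noted, it is a greedy heuristic and can dead-end even when a valid subset exists:

```python
def first_fit_fixed_count(values, num_elements, target):
    remaining = sorted(values, reverse=True)
    smallest = remaining[-1]
    chosen, budget = [], target
    while len(chosen) < num_elements:
        slots_left = num_elements - len(chosen) - 1
        # Largest value that "fits": leaves at least the minimum value's
        # worth of room for each remaining slot.
        for i, v in enumerate(remaining):
            if v <= budget - slots_left * smallest:
                chosen.append(remaining.pop(i))
                budget -= v
                break
        else:
            return None  # greedy dead end: nothing fits
    return chosen
```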

Basically, if some sort of First Fit type method doesn’t give you solutions in reasonable time for your data, then you’re stuck just plain running through all subsets of the right size.

I have no reason to believe that any randomization approach would help.

Or set to 1 past the last element.

Small bug in the writeup.

@ftg: Both Mithras and I have implemented solutions to the problem, so it is quite solvable without resorting to a brute-force technique.

I think you need to recheck how you think you are “solving” the problem efficiently.

You think that using a genetic algorithm to search for solutions is inefficient? The computational complexity is quite reasonable. I’d like to hear your explanation for why you believe this is inefficient.

The explanation is presumably that it is NP-hard, and thus while, for all I know, it may be easy in practice to solve most instances, a solution which operated in worst-case polynomial-time [in the bitlength of the inputs] over all instances would be a Holy Grail breakthrough in computer science.

One way to see that it is NP-hard is by trivial reduction of the subset-sum problem to this one.
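Spelled out, the reduction is just this: if you had an efficient solver for the thread’s fixed-size problem, you could answer subset-sum by trying every possible subset size (solve_fixed_size below is a hypothetical stand-in for such a solver):

```python
def subset_sum(values, target, solve_fixed_size):
    # Subset-sum has a solution iff, for some size k, there is a k-element
    # subset summing to exactly target, i.e. within the range [target, target].
    return any(
        solve_fixed_size(values, k, target, target) is not None
        for k in range(1, len(values) + 1)
    )
```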

Yes, but fortunately we don’t need something which solves it for all instances. We only need something that works for this particular problem with its very specific set of parameters. To say that for this specific problem the only way to solve it is by brute force is, in my view, not correct.

By “all instances”, I mean all instances of the problem the OP outlined; i.e., every choice of list of values to draw from, number of values to draw, and range in which the drawn values should sum to. I don’t know what distinction between “all instances” and “this specific problem” you are invoking.

Mithras proposed two approaches, as far as I can see. The first was “Just look around completely at random, and see if you happen to have found a solution. If not, try again.” This is the very definition of (randomized) brute force.

The second was almost deterministic: Almost always just pick the next value which brings your average so far closest to the desired average for the output set. However, pick randomly in a few cases: For your first pick, and for any pick other than the last if the input set happens to have an average within 0.001% of the desired output set’s average.

Note that if the input set happens to have an average within 0.001% of the desired output set’s average… this reduces to randomly picking everything except the very last element, and then checking whether the last element can be made to work out. Again, naught but randomized brute force.

If we ignored that last condition, this would be deterministic (except for the first pick, if we left that condition in…) and we could readily construct examples on which this would fail to find solutions even though they exist.
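To make that concrete, the deterministic core of the second approach looks something like this sketch (my reading of the description above, with the 0.001% special case left out):

```python
import random

def greedy_closest_average(values, num_elements, desired_sum):
    desired_avg = desired_sum / num_elements
    pool = list(values)
    # Random first pick, deterministic thereafter.
    chosen = [pool.pop(random.randrange(len(pool)))]
    while len(chosen) < num_elements:
        total, n = sum(chosen), len(chosen) + 1
        # Always take the value whose inclusion lands nearest the desired average.
        i = min(range(len(pool)),
                key=lambda j: abs(desired_avg - (total + pool[j]) / n))
        chosen.append(pool.pop(i))
    return chosen
```

Since every pick after the first is forced, a bad first pick can lock it out of every valid subset.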

Exactly. We don’t need it to solve for every choice for each of the parameters. It only needs to work for:

Set size - 40,000
Target set size - “couple hundred”
Target sum range - unspecified, but ideally close to a specific value

For these parameters, brute force is not needed. In fact, brute force is not needed for values much larger than these, although the larger the problem gets, the longer it will take, until eventually it becomes intractable.
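For a sense of scale on why enumerating every subset of the right size is out of the question here (200 below is a stand-in for “couple hundred”):

```python
import math

# Number of 200-element subsets of a 40,000-element set: a number with
# more than 500 digits, so exhaustive enumeration is hopeless.
print(math.comb(40_000, 200))
```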

Perhaps we are speaking past each other.

Well, I’m speaking more to my proposed solutions, not Mithras’.