Algorithm question - generating subsets

If you’re going to go through all the trouble of assessing the numbers, why not just assign each an appropriate weight and pick randomly in proportion to the weights as originally suggested? I don’t see the advantage to this approach.

It’s not necessarily the optimum way to do things. But it can be written in a couple dozen lines, is fast enough, and produces results that are probably good enough. And like I said, I’m just playing around until I have to get ready for a wedding.

  1. Sort the set.
  2. Determine the number of desired elements (numElements).
  3. Copy the boundary into a local variable (currBoundary).
  4. Use a binary search to find the first value that is equal to or greater than currBoundary. Set a variable to the offset of that value in the set (right).
  5. Choose a random index between 0 and right-1 (index).
  6. Add set[index] to your output (outSet).
  7. If outSet has numElements, return outSet.
  8. Subtract the chosen value from currBoundary (currBoundary -= set[index]).
  9. GOTO 4.
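Roughly, in Python (a quick, hedged sketch of those steps rather than production code; it assumes picks are made without replacement and bails out with None when nothing fits):

```python
import bisect
import random

def pick_subset(values, num_elements, boundary):
    # Step 1: sort the set.
    vals = sorted(values)
    # Step 3: copy the boundary into a local variable.
    curr_boundary = boundary
    out_set = []
    while True:
        # Step 4: binary search for the first value >= curr_boundary.
        # If every remaining value is smaller, this is one past the last
        # element (the case flagged later in the thread).
        right = bisect.bisect_left(vals, curr_boundary)
        if right == 0:
            return None  # nothing left fits under the boundary (not covered by the steps)
        # Step 5: random index in [0, right - 1].
        index = random.randrange(right)
        # Step 6: add the chosen value to the output.
        out_set.append(vals[index])
        # Step 7: done?
        if len(out_set) == num_elements:
            return out_set
        # Step 8: shrink the boundary; also drop the element so it
        # isn't picked twice (assuming picks are without replacement).
        curr_boundary -= vals.pop(index)
```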

I might not be understanding your algorithm, but won’t step 4 fail when none of the values is greater than the boundary? In the example, the set values are 1…10 and the boundary values are 10 and 15.

I didn’t really understand what you said in the second paragraph.

I assign new weights after each selection as max(1 - abs(average - value) / average, 0). Not sure if this is what you suggested. Anyway, this results in the overall value of the sample set being somewhat too low, since there are a lot of elements with a small value in the full set. So I up-adjusted the weights of all elements with a value higher than the average. And I made it so that the final element is chosen to bring the value of the set as close as possible to the desired value.
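In code form the weighting is roughly the following; the boost factor for above-average elements is just a placeholder, and the special handling of the final element isn’t shown:

```python
def weight(value, average, boost=1.5):
    # Closer to the running average scores higher, floored at zero.
    w = max(1 - abs(average - value) / average, 0)
    # Up-adjust elements above the average; the actual factor is a placeholder.
    if value > average:
        w *= boost
    return w
```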

I think you’re understanding it. Pure Monte Carlo would be just a random guess (as suggested by Mithras), so for example:

(1) Pick from {1…10}. Say 2.
(2) Pick from {1,3…10} and assign each an even chance.
etc.

The problem with that approach, simple as it is, is that you can end up with many invalid sets. This is compensated for by the fact that it is very fast, so you can cycle through candidate sets very quickly.
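Concretely, pure random guessing is just rejection sampling, something like the sketch below (low and high stand in for whatever tolerance around the target sum is acceptable):

```python
import random

def random_guess(values, num_elements, low, high, max_tries=1_000_000):
    # Draw a random subset of the right size and keep it only if its
    # sum lands in [low, high]; otherwise throw it away and try again.
    for _ in range(max_tries):
        candidate = random.sample(values, num_elements)
        if low <= sum(candidate) <= high:
            return candidate
    return None  # gave up: most draws were invalid
```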

Weighting the values tries to give the algorithm some guidance so that it makes better picks. You can do this a few ways. One is learning by exploration and exploitation: after completing a set you weight your picks based on how good they were (exploration); then, as more and more weights get adjusted, the algorithm increasingly finds good sets (exploitation). This is very useful in certain circumstances and would work here.

However, since we know the desired outcome we can pre-weight everything. Say the desired sum is 15 with 3 values. This means we’d really like to pick things that give us an average value of 5. We don’t want to be deterministic about it, because then we’d get the same set every time, so instead we weight each candidate based on how close it gets us to that average. For example:

(1) Pick from {1…10}. Say 2.
(2) Pick from {1,3…10}. We don’t give everything an equal chance, though. 10 is a great pick: the average of 2 and 10 is 6! 3 is a terrible pick, as the average is 2.5. So, just as a rough estimate, I would assign each value a weight of:

1 / (1 + |desired average - average if picked|^2)

1+2 = 3, average 1.5 => 1 / (1 + |5 - 1.5|^2)
3+2 = 5, average 2.5 => 1 / (1 + |5 - 2.5|^2)
4+2 = 6, average 3.0 => 1 / (1 + |5 - 3.0|^2)
…
10+2 = 12, average 6.0 => 1 / (1 + |5 - 6.0|^2)

Add up all the weights, pick a random number from 0 to sum(weights), and find the winner. It is very likely (in this case) to be 10, but not absolutely so.

(3) Say we do end up with 10. So we now have 2,10. Pick from {1,3…9}. Using a similar process we say:

2+10+1 = 13, average 4.33 => 1 / (1 + |5 - 4.33|^2)
2+10+3 = 15, average 5.00 => 1 / (1 + |5 - 5.00|^2)
2+10+4 = 16, average 5.33 => 1 / (1 + |5 - 5.33|^2)

etc.
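Putting the whole thing in code, the build might look something like this (a rough sketch of the scheme above, not tuned in any way):

```python
import random

def weighted_build(values, num_elements, desired_sum):
    desired_avg = desired_sum / num_elements
    pool = list(values)
    chosen = []
    total = 0
    for _ in range(num_elements):
        # Weight each candidate by 1 / (1 + |desired avg - avg if picked|^2).
        # (Unlike the worked example, the first pick is weighted too.)
        weights = [
            1 / (1 + abs(desired_avg - (total + v) / (len(chosen) + 1)) ** 2)
            for v in pool
        ]
        # Add up all the weights, pick a random number from 0 to sum(weights),
        # and find the winner.
        pick = random.uniform(0, sum(weights))
        cumulative = 0.0
        winner = len(pool) - 1  # fall back to the last element on float edge cases
        for i, w in enumerate(weights):
            cumulative += w
            if pick <= cumulative:
                winner = i
                break
        value = pool.pop(winner)
        chosen.append(value)
        total += value
    return chosen
```

For the example, weighted_build(list(range(1, 11)), 3, 15) usually lands on or very near a sum of 15, but not always, so you’d still check the result and retry if it misses the acceptable range.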

You can use just about any weighting formula that works; the point is just to give the algorithm some guidance. If you really struggle with it, don’t bother: go pure random. It’ll take longer to compute, but you’ll get answers. :slight_smile:

Hope that helps.

Given that the problem is NP-hard …

One standard approach to such problems is to use a “First Fit” type solution. In this case, I would have started with First Fit Decreasing. I.e., sort the elements, find the largest that “fits”, find the next largest that fits the remainder, etc.

But the rule about having a fixed number of elements, rather than any number, really messes this up. You’re not trying to fit as many elements as possible into a given range, but a specific number of them. So sort order may actually slow things down rather than help.
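For illustration, a First Fit Decreasing-style pass constrained to a fixed element count might look something like the sketch below (target plays the role of the upper end of the acceptable range); as noted, it is a greedy heuristic and can dead-end even when a valid subset exists:

```python
def first_fit_fixed_count(values, num_elements, target):
    remaining = sorted(values, reverse=True)
    smallest = remaining[-1]
    chosen, budget = [], target
    while len(chosen) < num_elements:
        slots_left = num_elements - len(chosen) - 1
        # Largest value that "fits": leaves at least the minimum value's
        # worth of room for each remaining slot.
        for i, v in enumerate(remaining):
            if v <= budget - slots_left * smallest:
                chosen.append(remaining.pop(i))
                budget -= v
                break
        else:
            return None  # greedy dead end: nothing fits
    return chosen
```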

Basically, if some sort of First Fit type method doesn’t give you solutions in reasonable time for your data, then you’re stuck just plain running through all subsets of the right size.

I have no reason to believe that any randomization approach would help.

Or set to 1 past the last element.

Small bug in the writeup.

@ftg: Both Mithras and I have implemented solutions to the problem, so it is quite solvable without resorting to a brute-force technique.

I think you need to recheck how you think you are “solving” the problem efficiently.

You think that using a genetic algorithm to search for solutions is inefficient? The computational complexity is quite reasonable. I’d like to hear your explanation for why you believe this is inefficient.

The explanation is presumably that it is NP-hard, and thus while, for all I know, it may be easy in practice to solve most instances, a solution which operated in worst-case polynomial-time [in the bitlength of the inputs] over all instances would be a Holy Grail breakthrough in computer science.

One way to see that it is NP-hard is by trivial reduction of the subset-sum problem to this one.
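Spelled out, the reduction is just this: if you had an efficient solver for the thread’s fixed-size problem, you could answer subset-sum by trying every possible subset size (solve_fixed_size below is a hypothetical stand-in for such a solver):

```python
def subset_sum(values, target, solve_fixed_size):
    # Subset-sum has a solution iff, for some size k, there is a k-element
    # subset summing to exactly target, i.e. within the range [target, target].
    return any(
        solve_fixed_size(values, k, target, target) is not None
        for k in range(1, len(values) + 1)
    )
```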

Yes, but fortunately we don’t need something which solves it for all instances. We only need something that works for this particular problem with its very specific set of parameters. To say that for this specific problem the only way to solve it is by brute force is, in my view, not correct.

By “all instances”, I mean all instances of the problem the OP outlined; i.e., every choice of list of values to draw from, number of values to draw, and range in which the drawn values should sum to. I don’t know what distinction between “all instances” and “this specific problem” you are invoking.

Mithras proposed two approaches, as far as I can see. The first was “Just look around completely at random, and see if you happen to have found a solution. If not, try again.” This is the very definition of (randomized) brute force.

The second was almost deterministic: Almost always just pick the next value which brings your average so far closest to the desired average for the output set. However, pick randomly in a few cases: For your first pick, and for any pick other than the last if the input set happens to have an average within 0.001% of the desired output set’s average.

Note that if the input set happens to have an average within 0.001% of the desired output set’s average… this reduces to randomly picking everything except the very last element, and then checking whether the last element can be made to work out. Again, naught but randomized brute force.

If we ignored that last condition, this would be deterministic (except for the first pick, if we left that condition in…) and we could readily construct examples on which this would fail to find solutions even though they exist.
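To make that concrete, the deterministic core of the second approach looks something like this sketch (my reading of the description above, with the 0.001% special case left out):

```python
import random

def greedy_closest_average(values, num_elements, desired_sum):
    desired_avg = desired_sum / num_elements
    pool = list(values)
    # Random first pick, deterministic thereafter.
    chosen = [pool.pop(random.randrange(len(pool)))]
    while len(chosen) < num_elements:
        total, n = sum(chosen), len(chosen) + 1
        # Always take the value whose inclusion lands nearest the desired average.
        i = min(range(len(pool)),
                key=lambda j: abs(desired_avg - (total + pool[j]) / n))
        chosen.append(pool.pop(i))
    return chosen
```

Since every pick after the first is forced, a bad first pick can lock it out of every valid subset.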

Exactly. We don’t need it to solve for every choice for each of the parameters. It only needs to work for:

Set size - 40,000
Target set size - “couple hundred”
Target sum range - unspecified, but ideally close to a specific value

For these parameters, brute force is not needed. In fact, brute force is not needed for values much larger than these, although the larger the problem gets, the longer it will take, until eventually it becomes intractable.
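For a sense of scale on why enumerating every subset of the right size is out of the question here (200 below is a stand-in for “couple hundred”):

```python
import math

# Number of 200-element subsets of a 40,000-element set: a number with
# more than 500 digits, so exhaustive enumeration is hopeless.
print(math.comb(40_000, 200))
```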

Perhaps we are speaking past each other.

Well, I’m speaking more to my proposed solutions, not Mithras’.