I have a large set, where each element has a value. I’d like to sample subsets from it, where each subset contains a specific number of elements and the sum of those elements falls within a specific range. I do not need to find all such subsets.
For instance, let’s say I have the set (1,2,3,4,5,6,7,8,9,10) and I would like to sample subsets of size 3 with 10 < sum < 15; then (3,4,5) would be such a sample.
Hmm, tricky – I can’t see how you could guarantee a fair random sample without the relatively brute-force approach of enumerating all valid subsets and then picking a random item from that list. That can be done fairly straightforwardly by sorting the original set and generating all valid subsets that start at the first item, then all valid subsets that start at the second item, and so on. But if the original set is very large, it could still be an expensive job in terms of CPU time and memory.
How large is the original set? And what would be a typical value for the desired subset size?
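For what it’s worth, here’s a minimal sketch of that enumerate-then-pick idea in Python – I’m letting itertools do the enumeration rather than the sorted, start-at-each-item scheme I described, and the set, subset size and bounds are just the numbers from your example:

```python
import itertools
import random

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # the OP's example set
k = 3                                      # desired subset size
low, high = 10, 15                         # the sum must fall strictly between these

# Brute force: enumerate every size-k combination and keep those whose sum fits.
valid = [c for c in itertools.combinations(values, k) if low < sum(c) < high]

# A uniform pick from the enumerated list is a guaranteed fair sample.
print(random.choice(valid))
```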
There are numerous ways to do this; my suggestion would be a Monte Carlo approach, only because it is the simplest to code. In the graph, I would weight the links so that each weight is proportional to the ratio of the candidate value to the average value of the set, so as to limit the number of bad sets picked. Weights could also be set dynamically so that a pick that would put you over the limit cannot be taken.
So for example, the first layer of the graph contains 1…10; assume the size of the desired set is 3 and the desired sum is between 5 and 15. This means we want to pick values that average less than 5 per pick but more than 5/3. So if we pick a small value we want to pick a large value next to compensate, and similarly if we pick a large value we want to pick a small one next. The second layer contains 2…10; 1, 3…10; 1…2, 4…10; and so on. The weight from 1->10 would be very high, as would 1->9. Similarly the weight from 10->1 would be high. And so on.
As you get deeper into the graph some weights would drop to zero if picking it would put you over / under the desired sums.
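Roughly, in code it might look like the sketch below. This is only an illustration: I collapse the graph layers into a per-pick weight based on distance from the target per-pick average, the restart limit and weighting formula are arbitrary, and it still doesn’t guarantee a uniform sample over all valid subsets.

```python
import random

def feasible(total, pool, picks_left, low, high):
    """Can the running total still land strictly inside (low, high) after
    taking picks_left more distinct values from pool?"""
    if picks_left == 0:
        return low < total < high
    pool = sorted(pool)
    return (total + sum(pool[:picks_left]) < high and
            total + sum(pool[-picks_left:]) > low)

def weighted_sample(values, k, low, high, max_restarts=10_000):
    """Weighted Monte Carlo sketch: candidates that would make the bounds
    unreachable get weight 0; the rest are weighted to favour picks that pull
    the running sum toward the target per-pick average."""
    target = (low + high) / 2 / k                    # ideal contribution per pick
    for _ in range(max_restarts):
        pool, chosen, total = list(values), [], 0
        for picks_left in range(k, 0, -1):
            weights = []
            for v in pool:
                rest = [x for x in pool if x != v]   # assumes distinct values
                if feasible(total + v, rest, picks_left - 1, low, high):
                    weights.append(1.0 / (1.0 + abs(v - target)))  # closer -> heavier
                else:
                    weights.append(0.0)              # weight drops to zero
            if sum(weights) == 0:
                break                                # dead end, start over
            v = random.choices(pool, weights=weights, k=1)[0]
            chosen.append(v)
            total += v
            pool.remove(v)
        else:
            return chosen
    return None

print(weighted_sample(range(1, 11), 3, 10, 15))
```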
If you want to do a more sophisticated approach, you could use just about any optimizer to solve this. I’ll use a genetic algorithm as an example. The genome would consist of a single sequence with N genes, where N is the size of the subset you want. Each gene would have min/max values equal to the min/max of your set (if your set has irregular values, then it would be 1…N, corresponding to a position in the set). The GA would halt when it found a sufficient solution and then execute again (you could store previous solutions and give them a fitness of 0). The fitness function would be proximity to the min/max desired values for the sums. This is actually the way I would go, but only because I have all the libraries written to do this. This approach would be much less CPU- and memory-intensive, but likely more coding if you don’t have a GA lying around.
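A very rough sketch of that idea (not the libraries I actually use): mutation-only to keep it short, no crossover, genomes encoded as positions into the set, and the population size, generation cap and mutation rate are arbitrary choices.

```python
import random

def ga_sample(values, k, low, high, pop_size=50, generations=500, mut_rate=0.3):
    """Bare-bones GA sketch: each genome is k distinct positions into `values`
    (the 1…N-style encoding mentioned above); fitness is how far the genome's
    sum falls outside (low, high).  Mutation-only; parameters are arbitrary."""
    n = len(values)

    def fitness(genome):                    # 0 means the sum is inside the range
        s = sum(values[i] for i in genome)
        if s <= low:
            return (low - s) + 1
        if s >= high:
            return (s - high) + 1
        return 0

    def mutate(genome):
        child = list(genome)
        if random.random() < mut_rate:      # swap one position for an unused one
            spot = random.randrange(k)
            unused = [i for i in range(n) if i not in child]
            if unused:
                child[spot] = random.choice(unused)
        return child

    pop = [random.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        if fitness(pop[0]) == 0:            # halt on a sufficient solution
            return [values[i] for i in pop[0]]
        survivors = pop[:pop_size // 2]     # keep the better half...
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return None

print(ga_sample(list(range(1, 11)), 3, 10, 15))
```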
I’m finding this a fun problem to play with until I have to start getting ready to go to a wedding. Something to keep in mind, though: depending on the range of values, how narrow the gap between the upper and lower bounds is, and where the center of that range sits relative to the average value in your set multiplied by the size of your subset, just picking totally random sets and checking whether they fit your criteria can be very fast.
For example, using my years-old laptop, if I take a random set of 40,000 numbers between 1 and 1,000,000 and request a subset of 300 values that sum to between 145,000,000 and 150,000,000, I can get a result in less than a second.
If I make it harder and require the sum to be between 125,000,000 and 130,000,000, it’ll usually take less than a minute but occasionally much longer.
So depending on exactly what your input looks like and what you’re going for on the output, you might not require anything more sophisticated than total randomness.
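In code, the totally-random approach is about as short as it gets; the demo numbers below mirror the experiment above, though of course timings will depend on your hardware.

```python
import random

def rejection_sample(values, k, low, high, max_tries=10_000_000):
    """Keep drawing random size-k subsets until one's sum lands in the range.
    Fast whenever the target range is easy to hit by chance; can run a very
    long time (or give up) when it isn't."""
    for _ in range(max_tries):
        pick = random.sample(values, k)
        if low < sum(pick) < high:
            return pick
    return None

# Roughly the experiment described above: 40,000 values in 1..1,000,000,
# subsets of 300 summing to between 145,000,000 and 150,000,000.
values = [random.randint(1, 1_000_000) for _ in range(40_000)]
print(sum(rejection_sample(values, 300, 145_000_000, 150_000_000)))
```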
Good answer, Mithras. My answer assumes that the search criteria would be hard to satisfy by random guessing, but you’re correct: if the answer is easy to find, then for example the GA would likely find a sufficient solution in one generation (which is equivalent to a random guess) and you’d be doing extra coding, setup, etc. for nothing.
The same is true with the weighted Monte Carlo approach. No weighting is necessary when the end condition is easy to find. At that point the Monte Carlo approach reduces to random guessing as well.
To choose a set:
First number = X = choose any value from the set.
Second number = Y = choose any value from the set such that 10 < X + Y + 1 < 15 (the +1 leaves at least minimum room for Z).
Note: you may need to add a condition based on the max value in the set, meaning the minimum Y might have to be >= some other number based on the max and X; if the set has values large enough, that condition is not needed.
Third number = Z = choose any value from the set such that 10 < X + Y + Z < 15.
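Here is a rough translation of those steps into code for the example set – the feasibility conditions generalize the “+1” and the note about the max value, and it simply restarts if a step leaves no legal choice:

```python
import random

values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}   # the OP's example set
low, high = 10, 15

def pick_three(max_restarts=1000):
    """Sequential pick: X freely, then Y and Z restricted so the final sum
    can still land strictly between low and high."""
    for _ in range(max_restarts):
        x = random.choice(list(values))
        # Y must leave room for at least the smallest remaining value (the "+1"),
        # and (per the note above) must not leave the sum unreachably low either.
        y_candidates = [y for y in values - {x}
                        if x + y + min(values - {x, y}) < high
                        and x + y + max(values - {x, y}) > low]
        if not y_candidates:
            continue
        y = random.choice(y_candidates)
        z_candidates = [z for z in values - {x, y} if low < x + y + z < high]
        if not z_candidates:
            continue
        return x, y, random.choice(z_candidates)
    return None

print(pick_three())
```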
I was going to suggest something similar.
I think we lose a little randomness by forcing some of the values. If the sets are such that some values will have very few options when they are successfully included in a subset, then those subsets may be overweighted by picking that value for X.
Either that, or I need more caffeine.
This may or may not be acceptable for the OP’s definition of random.
Problem is, the knapsack problem is well-known exactly for being a problem that is really hard (or at least computationally expensive) to solve.
Depending on the details of the search space, even finding a single solution can be difficult unless you are willing to throw enormous amounts of computing power at it. There are standard approaches indeed, but all of them are at least one of: slow, complex, not guaranteed to give the optimal answer. (And I’m afraid that if you say “OK, I’ll take ‘complex’ then” you’re still out of luck.)
Clearly then, finding an answer which is guaranteed to return a fair random pick from the set of all possible solutions, is not going to be easier than finding a single solution…
Never mind, you’re correct, Digital is the new Analog: it will be weighted towards the lower numbers, and the algorithm needs a weighting factor to end up with randomly distributed sets.
If you want the sum to be as close as possible to a specific value, as opposed to within a certain range, then you’re going to need some kind of search. The Monte Carlo approach won’t really work well for that. You could use Monte Carlo with an exploration-and-exploitation technique where it learns which paths are good and weights them accordingly.
Just for fun, I tried it with a GA for a range of values, and it found solutions even within a very narrow range very quickly, as expected. Generally fewer than 150 generations, with a total processing time per solution of 10 seconds or so, although I didn’t time it very closely. It would be trivial to change the fitness function to target a single value (since a single value is a special case of the narrowest of ranges), although I have no doubt it would take much longer to run.
Ultimately, it comes down to what you need or are willing to accept.
Out of curiosity, are there any other details that are missing, and do you need the source code?
If there are no other details, you don’t need the source code, and it is a one-time execution, I’d be willing to run it for you; it wouldn’t take me very long. I just don’t want to get myself into a quagmire of “Oh, well, there’s this condition and that condition.” But if it really is this straightforward, I’ll help.
If it’s your first pick, or the average of the numbers you’ve picked so far is within .001% of your target sum divided by the number of elements you’re shooting for and it’s not your last pick, make a totally random pick. In all other cases, pick the number that brings the average of all the numbers you’ve picked closest to your target sum divided by the number of elements you desire.
This is very fast, leads to a sum that is rarely more than 200 from the target if the inputs are randomly distributed between 1 and 1,000,000, and results in a list of numbers that looks random.
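Here is a sketch of that rule, reading the condition as the running average of the numbers picked so far; the .001% tolerance and the demo numbers come from the posts above, while the 147,500,000 midpoint target is my own choice for the example:

```python
import random

def greedy_average_pick(values, k, target_sum):
    """Sketch of the rule above.  Pick at random on the first pick, or while
    the running average is already essentially on target (except on the last
    pick); otherwise take the value that pulls the running average closest
    to target_sum / k."""
    target_avg = target_sum / k
    pool = list(values)
    chosen = []
    for i in range(k):
        is_first, is_last = (i == 0), (i == k - 1)
        current_avg = sum(chosen) / len(chosen) if chosen else 0.0
        on_target = abs(current_avg - target_avg) <= target_avg * 1e-5   # .001%
        if is_first or (on_target and not is_last):
            pick = random.choice(pool)
        else:
            # value whose inclusion leaves the new running average closest to target
            s, n = sum(chosen), len(chosen)
            pick = min(pool, key=lambda v: abs((s + v) / (n + 1) - target_avg))
        chosen.append(pick)
        pool.remove(pick)
    return chosen

# e.g. 40,000 values in 1..1,000,000, 300 picks aiming at a 147,500,000 sum
vals = [random.randint(1, 1_000_000) for _ in range(40_000)]
print(sum(greedy_average_pick(vals, 300, 147_500_000)))
```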