Sorting algorithm question

Suppose you have a sequence of positive numbers, denoted by x_k, k = 1 to N

Without loss of generality, we can assume that the x_k are sorted from smallest to largest.

I want to sort
sum x_k, where k belongs to a subset of {1,…,N}
over all subsets of {1,…,N}

For example, if we sort the sums we could get


x_1,  
x_2, 
x_3,  
x_1+x_2, 
x_4,
x_5 ...

or


x_1, 
x_2,
x_1+x_2, 
x_3, 
x_4,
x_1+x_2+x_3 ...

or some other sequence, depending on the relative values of the x_k

This seems like a generic enough problem that there is an algorithm to handle it.

Is there one? Or, can you guys think of a numerically efficient way of doing it?

When you’re given a bunch of random numbers that need to be sorted finding a sorting algorithm can be complex. But when you’re building the list the list of numbers yourself, all you need to do is make sure the list is ordered at each step of your program.

If, as your examples somewhat suggest, x1+x2+x3 is a valid sum, but x1+x3 (skipping the x2) is not, the problem isn’t all that hard. What you need to do is make a copy of the list, which is assumed to be ordered, and start inserting x1+x2 (and then x1+x2+x3, etc) at the right place.

Any insert shouldn’t take any longer than 2n (the lists maximal length) steps, and with n inserts this is a O(n^2) solution.

On the other hand, if you need to take into account sums like x1+x3 the problem becomes harder. This is not because of the sorting requirement but because it seems like you need to generate all possible combinations of x’s. Each xi can either be in or out, so you get 2^n possible combinations. The thing to note, however, is that it’s unlikely that there’ll also be 2^n different outcomes. As an example, consider the inputlist consisting of just n copies of the number one. The outcome list will just be [0,1,2,3,…,n].
That’s something to take advantage of. Make sure your insert procedure handles duplicates correctly… Here’s your algorithm:
[ul]
[li] Start out with a result list consisting of just the number 0.[/li][li] To each result in the result list add x1 and insert it into the result list, which is now [0,x1][/li][li] To each result in the result list add x2 and insert it into the result list, which is now [0,x1, x2, x1+x2][/li][li] To each result in the result list add x3 and insert it into the result list, which is now [0,x1, x2, x3, x1+x2, x1+x3, x2+x3, x1+x2+x3] (Note that i haven’t ordered anything, nor removed duplicates. Your program should do both these things)[/li][li] …[/li][/ul]
(note that at each step the list of results doubles in length, so if there aren’t any duplicates you’ll end up with a list consisting of 2^n entries, as expected)
If there are m possible outcomes, you’ve now made n steps that consist of at most m inserts into a list of length m, which makes the algorithm O(n*m^2).

Can you clarify the problem a little?

You’re starting with a pre-sorted array x[k] where k=1…N. Right?

You’re trying to generate an array y where y[l]= Sum (x[a],x**) and 1<=a <b <=N

You then need to merge arrays x and y to create a new array z.

You then want to sort z.

Right or wrong? If I’m right, I’ve just listed the steps you need to take.

If the second array, y[k] consists of all possible sums of the items in the x[k] array, the length of the y[k] array is 2[sup]k[/sup], i.e.:
x[k] : 1, 2, 3, 4

y[k]:
0 (none of the items in x, optionally ignorable)
1 (x[1])
2 (x[2])
3 (x[1] + x[2])
3 (x[3])
4 (x[4])
4 (x[1] + x[3])
5 (x[1] + x[4])
5 (x[2] + x[3])
6 (x[1] + x[2] + x[3])
6 (x[2] + x[4])
7 (x[1] + x[2] + x[4])
7 (x[3] + x[4])
8 (x[1] + x[3] + x[4])
9 (x[2] + x[3] + x[4])
10 (x[1] + x[2] + x[3] +x[4])

You could eliminate duplicated sums, I guess, but even so, an x-array of any real size is going to make your y-array huge.

One thing you should be aware of is that deciding whether there’s a subset of a given set S whose elements sum to k is NP-complete (in short, you’re not going to find an efficient solution). That really constrains any attempts at a clever solution.

Am I missing something here? Why not construct a list with every sum and every number(takes 2n time) and then sort the list with a O(n log n) algorithm?

If we must do something similar to your solution, use a self-balancing binary tree to hold the list, and then the run-time is still O(n log n). Of course, this is just applying my suggestion above to the binary tree sort algorithm.

Agreed.

One thing I should clarify is that N is on the order of 10,000, so constructing y[k] is out of the question.

I am interested in an algorithm that comes up with the first M elements of y[k], where M will be on the order of 100 to 1,000.

Maybe, as ultrafilter mentions, this is not an easy problem to solve.

The first two entries are easy to come up with
x[1]
x[2]

The third entry is either
x[3]
or
x[1]+x[2]

if the third entry is x[3], the fourth entry is
x[4]
or
x[1]+x[2]

if the third entry is x[1]+x[2], the fourth entry is
x[3]

etc
The problem is that this gets quite complex very quickly, and I don’t see an easy way of extending it to obtain later elements of y[k].

I though this problem may be a variation of a well-known problem, for which an algorithm exists, and which I could adapt for my case.

I guess that, since I only want the first M elements of y[k], I will only be interested in the first M elements of x[k], since x[M+1] will never be in any of the first M sums, since we already know M elements smaller than it.

So, the problem reduces to finding the first M elements of y[k] using the first M elements of x[k].

If you only want the first M elements of Y, there might be a reasonable approach. Stick the elements of X into Y, and then generate all the two-value sums, and all the three-value sums, and so on and so forth. At some point the smallest possible sum of k elements (x[1] + … + x[k]) will be larger than the largest value in Y, so you can stop then.

Could you clarify the thing asked above, about whether, say, x1 + x3 is a valid sum?

If you only care about sums of contiguous subsequences, then you can feasibly construct the entire array y, and Rysto’s solution is a fairly efficient one. [Although it seems to me the array y will have size on the order of n^2, and so the running time will be on the order of n^2 log(n) rather than n log(n). Where are people getting that the list will be of length 2n? Won’t there be n*(n-1)/2 possible contiguous subsequences?]

So does the y-array represent the sorted smallest sums of items in the x-array?

For example, if x[k] is {1,3,5,8,19}

y[k] is {1, 3, 4, 5, 6, 8, 9, 11, 12, 13, 14, 16, 17, 19… 36} ?

Is a sort necessary? If you wanted to know if, say, 25 was a possible y-value, a recursive algorithm could tell you quickly enough. 25 in fact is valid, though 26 is not.

With arbitrary subsets, it could? As ultrafilter points out, the subset-sum problem is NP-complete.

Well, I suppose what you’re saying is that, for small values like 25, you could find out fast enough.

Er, edit window lost.

Well, I suppose what you’re saying is that, for small values like 25, you could find out fast enough. But he wants the first 100 to 1000 values of y, for which I don’t think things will be so fast.

I’m sorry if I’m being dense but, for clarification, are you (effectively) talking about generating the power set of {1,…,N}? And that the sums under consideration will always begin with 1 (i.e., not an arbitrary starting point)?

I used the word “effectively” because you’d be adding the elements of each produced subset rather than simply enumerating them. I do realize there’s an additional constraint of producing an ordered list of the sums (which might be useful in predetermining a cutoff point), but I just wanted to be clear.

Anyone want to take a crack at determining the running time of a divide-and-conquer algorithm using lazy lists/streams? I’m interested in how long this takes to generate the first k elements of output.

Specifically, you take the original list, split it into two roughly equal parts, and recursively obtain (duplicate-free, ordered) streams for the sums from those two parts (call these the a-stream and the b-stream). Then, one takes each element from the a-stream and maps it with addition over the b-stream, giving a, uh, stream of streams, each of which is duplicate-free and ordered. One can merge these in any classical fashion, and the result is the desired stream of sums from the powerset of the original list.

If it’s not clear what I’m talking about, here’s some Haskell code:



--Splits a list [x1, x2, x3, x4, ...] into ([x1, x3, ...], [x2, x4, ...])
split [] = ([], [])
split (x:xs) = let (a, b) = split xs
               in (x:b, a)

--Standard function for merging two ordered duplicate-free lists
merge [] yl = yl
merge xl [] = xl
merge xl@(x:xs) yl@(y:ys) | x <  y = x:(merge xs yl)
                          | x == y = x:(merge xs ys)
                          | x >  y = y:(merge xl ys)

--This function merges many ordered duplicate-free lists at once
--I know merging of many lists can be done more efficiently than this
-- (by keeping the lists in a priority queue ordered by their first element)
mergemany = foldl merge []

--This function produces an ordered duplicate-free list of all the sums
-- of the powerset of a given list
sums []  = [0]
sums [0] = [0]
sums [x] = [0, x]
sums xs  = let (a, b) = split xs
               (asums, bsums) = (sums a, sums b)
               sumtable = [map (+x) bsums | x <- asums]
           in mergemany sumtable


Hm, I suppose the above code isn’t sufficiently lazy to do what I wanted with the efficiency I wanted it to. Feel free to ignore it while I try to improve it, if such improvement is possible.

Ok, working from the assumption that x[1] + x[3] is not a valid sum(ie only sums from 1 to k, where 1 <= k <= N, are valid), here’s an algorithm to do it in O(M) time:

x[1…N] is our original(sorted) sequence.

To construct the sequence we want, a, perform the following algorithm:


j = 1;
k = 3;
nextSum = x[1] + x[2]

for i from 1 to M do
    if(x[j] <= nextSum) then
        a* = x[j];
        j = j + 1;
    else
        a* = nextSum;
        nextSum = nextSum + x[k];
        k = k + 1;
    end if
end for

Things get a bit trickier if we need to allow negative numbers, but it’s still doable.

So you’re saying, Rysto, that you’re considering both A) singletons from the original list (i.e., x_k for any k) and B) sums over initial segments of the original list (i.e., x_1 + x_2 + … + x_k for any k)?

It’s a bit odd if that’s what the OP wants, but all his examples were of that form.

So, again, I call upon the OP to clarify: what kinds of sums do you want? Sums over arbitrary subsets of the original list (as seems indicated by your wording)? Subsets over contiguous subsets of the original list? Sums over both singletons and initial segments of the original list (as indicated by your examples, on one interpretation)? The answer to your question is, clearly, highly dependent upon, well, what your question is about, exactly.

Yes.

No.

I’m not sure how the “contiguous subset” constraint suddenly surfaced, since I don’t mention it in the wording of the problem

Sorry if my examples were misleading. They may have been misleading because they are examples about the first few terms of the sequence

Later on, the sequence might look like
x[4]+x[8]
x[3]+x[4]+x[7]
x[4]+x[9]

Just take all the subsets of {1,…,N}, calculate sum x[k], and sort the result, i.e. rank the subsets in order of increasing sum x[k].

The problem is coming up with an algorithm that finds the first M subsets, without having to calculate sum x[k] for all 2^N subsets

Since your values x[1],…,x[N] are all positive, the formal subset sums immediately generate a poset lattice with bottom 0 and top x[1]+…+x[N]. You can think of this lattice as the vertices of an N-dimensional hypercube {0,1}[sup]N[/sup], with each dimension representing the absence (0) or presence (1) of one of the values (so vertex <01101> represents the sum x[2]+x[3]+x[5]). The edges of the hypercube represent transitions with Hamming weight 1: i.e., changing a vertex sum by adding or removing one value. The “upward neighbors” (those at the other end of an upward edge) are those which result from changing a single “0” to a “1”.

You want to generate sums upward from the bottom. Maintain two data structures: a list minimal-sums (which will become your desired output, and can just be a linked list) and a set possible-next-sums (which should be able to efficiently maintain a sorted state on random-access insertion and deletion). Begin with both of these structures empty. Start at the bottom vertex <0…0>=0.

Now repeat, M+1 times (the first time through just gives you 0; the next M times give you the lowest M sums):[ol][li]Append <v>=s to minimal-sums.[/li][li]Insert all upward neighbors of <v>=s into possible-next-sums (unless already present).[/li][li]Find a minimal sum <v>=s in possible-next-sums.[/li][li]Remove <v>=s from possible-next-sums.[/ol]You can end the final iteration after step 1 if you want.[/li]
The set possible-next-sums will usually get large, eventually. But since each vertex has only O(N) upward neighbors, in M steps it will only get to size O(MN). Efficient access, insertion, and deletion into such a set should be O(log MN). So each iteration should take[ol][li]O(1)[/li][li]O(N log MN)[/li][li]O(log MN)[/li][li]O(log MN)[/ol]and the overall algorithm should run in O(MN log MN).[/li]
For example, with values {2,3,5,7,15} and M=5 (showing the last element of minimal-sums and the set possible-next-sums at the end of step 2 each time through the loop):


0=<00000>
   2: <10000>
   3: <01000>
   5: <00100>
   7: <00010>
  15: <00001>
2=<10000>
   3: <01000>
   5: <00100> <11000>
   7: <00010> <10100>
   9: <10010>
  15: <00001>
  17: <10001>
3=<01000>
   5: <00100> <11000>
   7: <00010> <10100>
   8: <01100>
   9: <10010>
  10: <01010>
  15: <00001>
  17: <10001>
  18: <01001>
5=<11000>
   5: <00100>
   7: <00010> <10100>
   8: <01100>
   9: <10010>
  10: <01010> <11100>
  12: <11010>
  15: <00001>
  17: <10001>
  18: <01001>
  20: <11001>
5=<00100>
   7: <00010> <10100>
   8: <01100>
   9: <10010>
  10: <01010> <11100>
  12: <11010> <00110>
  15: <00001>
  17: <10001>
  18: <01001>
  20: <11001> <00101>
7=<10100>
   7: <00010>
   8: <01100>
   9: <10010>
  10: <01010> <11100>
  12: <11010> <00110>
  14: <10110>
  15: <00001>
  17: <10001>
  18: <01001>
  20: <11001> <00101>
  22: <10101>

(the next sum is 7=<00010>).