Need an Algorithm

I wonder if anyone can point me to an algorithm for the following problem:

We have a graph G of N nodes. Edges between nodes define whether the two nodes are compatible, that is, whether they can be merged. Compatibility for this problem is not transitive.
Each node also has a cost. The interesting thing about this problem is that when two nodes are merged their costs do not necessarily add: it might be the case that cost(ab) < cost(a) + cost(b). I don’t think that cost(ab) is ever > cost(a) + cost(b), but I’m not absolutely positive.

Though they cannot be merged, the cost of two incompatible sets of merged nodes A and B is cost(A) + cost(B).

The goal is to find the minimal cost merging of all nodes.

This is not quite bin packing, since we do not have fixed-cost bins. I have found a bin-packing algorithm that claims to handle conflicts, but its conflicts are transitive. It is not quite graph coloring either, because of the lack of transitivity.

I’d be happy with a heuristic - I’m not looking for optimality. My first thought is this:

  1. Label each edge with a weight equal to the increase in cost if the two nodes a and b it connects are merged (measured against the greater-cost node).
  2. Find the edges with minimal weight. If there is a choice, merge the two nodes with the maximum number of edges.
  3. Recompute the edges of the merged node, removing those to nodes with which there is now a conflict. Also recompute the edge counts for all connected nodes, and the weight metric for all edges still incident on the merged node.
  4. Repeat until there are no more edges. The total cost is the sum of the costs of the remaining nodes.
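The steps above can be sketched roughly like this. merged_cost is a placeholder for the real cost function (here it just keeps the larger of the two costs), and the edge bookkeeping assumes only fully compatible sets (cliques) may merge:

```python
# Greedy merge heuristic sketch. merged_cost is a hypothetical stand-in
# for the actual cost function, which is not specified in the problem.

def merged_cost(cost_a, cost_b):
    """Hypothetical cost of a merged node; here, just the larger cost."""
    return max(cost_a, cost_b)

def greedy_merge(costs, edges):
    """costs: {node: cost}; edges: iterable of (a, b) compatible pairs.
    Greedily merges the cheapest edge until no edges remain.
    Returns (groups, total_cost)."""
    costs = dict(costs)
    edges = {frozenset(e) for e in edges}
    groups = {n: {n} for n in costs}       # node -> set of original nodes
    while edges:
        # Edge weight = increase in cost if its endpoints are merged,
        # measured against the greater-cost endpoint.
        def weight(e):
            a, b = sorted(e)
            return merged_cost(costs[a], costs[b]) - max(costs[a], costs[b])
        best = min(edges, key=lambda e: (weight(e), tuple(sorted(e))))
        a, b = sorted(best)
        costs[a] = merged_cost(costs[a], costs[b])   # merge b into a
        groups[a] |= groups.pop(b)
        del costs[b]
        # Keep an edge to the merged node only if the neighbour was
        # compatible with BOTH endpoints (clique condition); drop the rest.
        new_edges = set()
        for e in edges:
            if e == best:
                continue
            if b in e:
                (o,) = e - {b}
                if o != a and frozenset({a, o}) in edges:
                    new_edges.add(frozenset({a, o}))
            elif a in e:
                (o,) = e - {a}
                if frozenset({b, o}) in edges:
                    new_edges.add(e)
            else:
                new_edges.add(e)
        edges = new_edges
    return list(groups.values()), sum(costs.values())
```

With a real cost function you would also want to stop merging once every remaining edge has positive weight, rather than merging until no edges remain.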

This seems fairly simplistic though - anyone have anything better?

If you get no bites, you may want to try stackoverflow.com or mathoverflow.net

There are hard core guys over there that answer these kinds of questions all the time.

The optimal solution will probably be much simpler if you have some kind of regularity condition on the cost of merged nodes. Can you bound it above by some function of the original costs?

I probably could, but it would be a very loose bound. I just came up with an example where cost(a) + cost(b) < cost(ab). Clearly I won’t merge two nodes where that is the case, because the cost of the two unmerged nodes would be less.

I’m not sure of the technical sense of regularity, but if cost(a) = cost(c) and cost(b) = cost(d), it does not follow that cost(ab) = cost(cd) in all cases (or probably even in many).

This isn’t an algorithm - we’d need a lot more information about the cost structure to give a useful heuristic. But it might be another way of thinking about the problem.

So if I understand the problem correctly, you are allowed to merge a set of nodes only if all edges between them are present, right? That is, you can only merge cliques from the original graph.

So you can think of the allowed node mergers as a partially-ordered set, with an upward link between two candidate mergers if the upper one is formed by a single merger from the lower. For example, for the graph


B *---* A
  |  /|
  | / |
  |/  |
C *   * D

you can order all allowed mergers as


         ABC,D      BC,AD
      /    |    \    /  \
AB,C,D  AC,B,D  BC,A,D  AD,B,C
      \    |      |     /
           A,B,C,D

(sorry for the bad ASCII art; hope the idea is clear). This poset has a single bottom element (the original unmerged graph) but may have many maximal elements (here it has two, {ABC,D} and {BC,AD}; further mergers from maximal elements are not allowed).

Your proposed heuristic is just a greedy upward traversal of this partial ordering (if it’s better than where we are now, pick the best upward route and repeat). This will work reasonably well if the merger costs are nice enough; ideally you want there to be a decreasing path of costs as you work your way upward from the original graph to the global minimum solution, and this path should always look better than the paths to other local minima.

Obviously I have no idea whether this is true in your case. There are cases where this approach won’t work at all: for example, you can easily write down a cost function which allows you to solve MAX-CLIQUE, a known NP-hard problem (more precisely, you can solve the clique decision problem k-CLIQUE, which is NP-complete, if you can find the global minimum cost).

If you know that the optimum solutions are likely to be the ones with the fewest total components, or the components formed from the most original nodes, then you might be better off by trying to find maximal elements in this ordering and evaluating them. There are in general lots of these maximal elements, so finding them all is not easy (MAX-CLIQUE again), but you could try using heuristics for the clique problems to generate approximately-maximal elements.
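A minimal sketch of maximal-clique enumeration (plain Bron-Kerbosch, no pivoting), assuming the compatibility graph is held as adjacency sets. At ~140 nodes this is often feasible as-is; denser graphs may need the pivoting variant:

```python
# Bron-Kerbosch: enumerate the maximal cliques of an undirected graph,
# i.e. the maximal sets of mutually compatible nodes (maximal mergers).

def bron_kerbosch(adj, r=None, p=None, x=None):
    """adj: {node: set of neighbours}. Yields each maximal clique as a set."""
    if r is None:
        r, p, x = set(), set(adj), set()
    if not p and not x:
        yield set(r)     # r cannot be extended: it is a maximal clique
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)      # v handled: exclude it from further branches
        x.add(v)
```

On the four-node example graph earlier in the thread this yields exactly the two maximal mergers, {A, B, C} and {A, D}.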

Thanks, you have the problem basically right.
The cost of the merged nodes (there is no cost for the act of merging; each node carries its own cost) depends strongly on the inner structure of the nodes. (And I can’t explain how - it is proprietary.) On Friday I implemented the function to compute the cost of two merged nodes versus the original ones, and it is all over the place. In some cases the merged cost is the cost of the larger of the two; in others it is greater than the sum of the two.

Maximal cliques would be a good way to start - I’ll look up some algorithms for that. Finding all possible mergers is impractical - one example has 140 nodes. I plotted a 140 x 140 table of compatibility, and there are some cliques. Throwing out nodes whose contribution increases cost too much will help.

I think I will find the cliques, and then experiment with a few starting nodes and some heuristics. When combining nodes A and B, I dump the cost relative to maxcost(A,B) and relative to cost(A) + cost(B). I’m not sure which will work best. I’m also not sure whether the relation (does not increase the cost beyond cost(A) + cost(B)) will define new sub-cliques inside the ones you have shown. However, I can either start with a node not in the set created by combining useful ones and see, or start over with a fresh node and see.

I have a small example whose compatibility graph is almost complete. BTW, as it turns out, if you compute the costs of combining all pairs of nodes, the lowest costs come from pairs which are actually incompatible. It is actually very reasonable given the constraints of the actual problem.

I don’t know what your requirements are regarding optimality vs processing time, but it’s kind of fun to throw a simple solution at problems like these and get reasonable results.

For similar (similar means anything like traveling salesman) problems, I’ve used the following (some form of hill climbing):
Create N (1,000?) random solutions

  1. Keep top X% (10?) solutions
  2. Tweak to varying degrees next Y% (40?) solutions
  3. Randomly generate Z% new solutions
  4. Re-calc and repeat
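The keep/tweak/regenerate loop above can be sketched like this. The bit-vector problem, the parameter values, and all the names here are placeholders for illustration, not part of the original problem:

```python
# Simple population-based hill climbing: keep the best solutions, tweak
# the next tier, refill the rest randomly, and repeat.
import random

def evolve(cost, new_solution, tweak, n=100, keep=0.10, vary=0.40,
           rounds=50, rng=random.Random(0)):
    pop = [new_solution(rng) for _ in range(n)]
    for _ in range(rounds):
        pop.sort(key=cost)
        k, t = int(n * keep), int(n * vary)
        kept = pop[:k]                                        # 1. keep top X%
        tweaked = [tweak(s, rng) for s in pop[k:k + t]]       # 2. tweak next Y%
        fresh = [new_solution(rng) for _ in range(n - k - t)] # 3. new Z%
        pop = kept + tweaked + fresh                          # 4. re-calc, repeat
    return min(pop, key=cost)

# Toy stand-in problem: minimise the number of 1 bits in a 20-bit vector.
def new_solution(rng):
    return [rng.randint(0, 1) for _ in range(20)]

def tweak(s, rng):
    s = list(s)
    s[rng.randrange(len(s))] ^= 1    # flip one random bit
    return s

best = evolve(sum, new_solution, tweak)
```

Because the top X% are carried over unchanged, the best cost never gets worse from round to round; the tweak and random-refill tiers are what let it escape plateaus.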

I published several papers in grad school comparing a number of heuristics for another problem against the optimal solution. The tough part was finding a small example for which the heuristics were non-optimal. This just has to beat a solution done by hand, and be faster than doing it manually. The small example may or may not do it; the 140-node one will almost certainly be better.

It also only gets done once in a great while, so I like the idea of lots of solutions with different “seeds” - considering this analogous to pseudo-random generation with LFSRs.

I use this type of thing for evolving neural networks in a simulation and it works great. I also used it in a distribution center (the problem was exactly traveling salesman) and it was fun to watch the total miles walked for the day drop from 500 miles (50 people * 10 miles per day) to around 250. At first it would drop rapidly, then with slower and slower gains. Thanks to today’s processing power it was pretty quick too: in this case there were about 10,000 location visits per day distributed amongst the 50 people, and it would start bottoming out after a couple of minutes (Java).

Bumping to give the results. I implemented a fairly simple greedy heuristic. It turns out that the cases I need to worry about have a lot of node to node compatibility, but in most cases merging two nodes is more expensive than dealing with them one after the other. So it runs very quickly - a few minutes for a 140 node case.

However, when comparing to the manual solution, I found that I made use of some features considered too hard to use manually - so my solution is 10X better than the known one. Thus I’m not going to try to find better heuristics.

Heuristic?
I would have simply taken the “gain” from each merger:
Merge A,B - what is cost(A) + cost(B) - cost(A&B)?
Repeat for every pair.
Merge any for which the gain is positive, the higher the better.
Repeat until you are down to one giant node.
If the gains are all over the map, then set a threshold - say at least 10% better for a merge.
Try this for different cut-off values.
If you find that (A&B)&C costs more than A&(B&C) but A&B gained more than B&C (or A&C) then the calculations are so weird that there’s no predicting the results.
However, heuristically, a pair with a good profit from merging is always going to be a good choice to merge.
(Unless you can find a factor that predicts merger profits.)
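A sketch of this gain-based heuristic, with merged_cost as a placeholder for the real (proprietary) cost function. An edge to a merged node is kept only if the neighbour was compatible with both halves, so only cliques can merge:

```python
# Gain-based merging: repeatedly merge the compatible pair with the
# largest gain, gain(A, B) = cost(A) + cost(B) - cost(A&B), as long as
# some pair's gain exceeds `threshold`.

def gain_merge(costs, edges, merged_cost, threshold=0.0):
    """costs: {node: cost}; edges: (a, b) compatible pairs.
    Returns the remaining {node: cost} after all profitable merges."""
    costs = dict(costs)
    edges = {frozenset(e) for e in edges}
    while True:
        best, best_gain = None, threshold
        for e in sorted(edges, key=lambda e: tuple(sorted(e))):
            a, b = sorted(e)
            g = costs[a] + costs[b] - merged_cost(costs[a], costs[b])
            if g > best_gain:
                best, best_gain = e, g
        if best is None:
            break                        # no merge clears the threshold
        a, b = sorted(best)
        costs[a] = merged_cost(costs[a], costs[b])   # merge b into a
        del costs[b]
        # An edge to the merged node survives only if the neighbour was
        # compatible with both halves (clique condition).
        new_edges = set()
        for e in edges:
            if e == best:
                continue
            if b in e:
                (o,) = e - {b}
                if o != a and frozenset({a, o}) in edges:
                    new_edges.add(frozenset({a, o}))
            elif a in e:
                (o,) = e - {a}
                if frozenset({b, o}) in edges:
                    new_edges.add(e)
            else:
                new_edges.add(e)
        edges = new_edges
    return costs
```

Raising `threshold` implements the cut-off idea above: fewer merges, but each one has to pay off more clearly.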