You might want to read about k-means clustering. It seems like a variation on that might work, if you have the distance metric take your same-state restriction into account.
If anything, this is an anti-clustering problem, because the most similar object to x is x itself.
Yes - well, clustering where certain objects repel rather than attract. Just thought it was worth a mention.
I don’t know what you mean by this.
You can break heuristics up into two groups, construction and improvement, and people have been focusing on construction without really thinking about improvement. A construction heuristic generates a “good” solution from scratch, where how good you need it to be depends on your needs. An improvement heuristic is what you use when you already have a solution and want to make it better. The original post had an improvement heuristic which destroys part of the solution and calls a construction heuristic on the part it destroyed, which I think is a bit wasteful. Why not simply try swapping cities between subgroups? If there is an overlap inside a subgroup, look at the other subgroups until you find one that doesn’t contain the overlapping criterion at all, and see whether you can swap one of the cities carrying that criterion for a city in the other subgroup without producing a new overlap. If you can, do it. For the continuous criteria, create a measure of spread and see whether a swap improves it.
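Here’s a minimal sketch of that swap idea in Python, assuming each item is a (city, state) pair and “overlap” means two cities sharing a state within a subgroup. The function names are just illustrative, and you’d repeat the pass until no useful swap is found:

```python
def overlapping_states(group):
    """Return the set of states that appear more than once in a subgroup."""
    seen, dupes = set(), set()
    for _, state in group:
        if state in seen:
            dupes.add(state)
        seen.add(state)
    return dupes

def try_repair_by_swapping(groups):
    """One pass of the improvement heuristic: for each duplicated state,
    look for another subgroup we can swap with without creating a new overlap."""
    for gi, group in enumerate(groups):
        for state in overlapping_states(group):
            # one of the cities carrying the duplicated state
            idx = next(i for i, (_, s) in enumerate(group) if s == state)
            for gj, other in enumerate(groups):
                if gj == gi or state in {s for _, s in other}:
                    continue  # that subgroup already has this state
                for jdx, (_, os) in enumerate(other):
                    # the incoming city must not create a new overlap in our group
                    if os not in {s for k, (_, s) in enumerate(group) if k != idx}:
                        group[idx], other[jdx] = other[jdx], group[idx]
                        break
                else:
                    continue
                break
    return groups
```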
That is an oversimplification of NP-complete, and easily disproven for this problem. We know that the best answer has no overlaps, so simply producing an answer with no overlaps will suffice; if a randomly produced answer happens to have none, you’re done and there’s nothing left to check. Secondly, it might be possible to prove relatively quickly that no overlap-free answer exists (in the original question, for instance, if there are more than 200 cities in a single state), in which case you can revise the bound on the best achievable answer upwards. What NP-complete means is that you can’t guarantee it will be done in polynomial time, and in the worst case you might end up having to do something akin to a full enumeration. Even then, though, the worst case might be handled by a dynamic programming algorithm rather than a complete enumeration.
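That infeasibility check is just the pigeonhole principle. A sketch, assuming the items are (city, state) pairs and a fixed number of subgroups:

```python
from collections import Counter

def overlap_free_possible(cities, num_groups):
    """Pigeonhole check: if any state has more cities than there are
    subgroups, some subgroup must contain two cities from that state."""
    state_counts = Counter(state for _, state in cities)
    return max(state_counts.values(), default=0) <= num_groups
```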
Create a binary tree where each branch in each node has a double which is the quotient of the number of items down that branch over the number of unique properties on those items. When you insert, you travel down the branch with the better ratio. Once everything has been inserted, iterate through and place in groups up to whatever your desired size is.
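If I’m reading that right, it looks something like the sketch below, where each branch tracks how many items went down it and how many distinct property values those items carry, and “better ratio” means fewer items per unique property. The depth, names, and tie-breaking are my own guesses, not a faithful implementation:

```python
class Node:
    def __init__(self, depth=0, max_depth=4):
        self.items = []          # only used at the leaves
        self.children = [Node(depth + 1, max_depth),
                         Node(depth + 1, max_depth)] if depth < max_depth else None
        self.counts = [0, 0]          # items sent down each branch
        self.props = [set(), set()]   # unique property values seen down each branch

    def insert(self, item, properties):
        if self.children is None:
            self.items.append(item)
            return
        # items-per-unique-property ratio, looking ahead to include this item
        def ratio(i):
            return self.counts[i] / max(1, len(self.props[i] | set(properties)))
        branch = 0 if ratio(0) <= ratio(1) else 1
        self.counts[branch] += 1
        self.props[branch].update(properties)
        self.children[branch].insert(item, properties)

    def walk(self):
        if self.children is None:
            yield from self.items
        else:
            for child in self.children:
                yield from child.walk()

def groups_from_tree(root, group_size):
    # iterate through the tree and chunk into groups of the desired size
    ordered = list(root.walk())
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
```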
I think Sage Rat’s tree will work great for properties constrained to a finite list of values where the only metric between two values is = vs. !=. As soon as you introduce continuous values, or list values where the measure is more complex (e.g. CA is more like AZ, “closer” to AZ in some sense, than it is to SC), that approach falls apart completely.
The full-bore rigorous treatment of the OP’s restated problem would require defining a concept of distance between any two values of any single parameter.
e.g. for continuous values like AvgLifespan the distance could just be the difference, or an RMS difference, between the two values. For discrete values it could be as simple as = vs. !=, or it could be based on an arbitrarily complex metric: e.g. the distance between two GivenNames might be the difference in length, times a factor for whether the names differ in grammatical gender or source language, times the number of non-shared letters.
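A sketch of what such per-parameter distance functions might look like; the parameter names and the GivenName formula here are just illustrative:

```python
def lifespan_distance(a, b):
    # continuous parameter: plain absolute difference
    return abs(a - b)

def state_distance(a, b):
    # discrete parameter with a simple equal / not-equal metric
    return 0.0 if a == b else 1.0

def name_distance(a, b):
    # discrete parameter with a deliberately arbitrary, more elaborate metric:
    # difference in length plus the number of letters the names don't share
    a_letters, b_letters = set(a.lower()), set(b.lower())
    return abs(len(a) - len(b)) + len(a_letters ^ b_letters)
```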
Many real world categories for real world objects of all types have a natural notion of distance even if it isn’t (yet) rigorously defined. And it’s almost always more complex / subtle than simple (in)equality.
Certainly if the problem is actually dealing with classifying animals, there are logical and fairly well-defined notions of distance across topics like geographic or climatic habitat or morphology or genetic relatedness, etc.
Once you have functions which can measure distance between any two possible points in each parameter separately, then you need to define some global notion of distance between any two points in the complete n-parameter space.
The typical common-sense notion is to treat the space as Euclidean in however many dimensions, so the distance between any two points P1 & P2 defined in n-space equals the nth root of the sum of each parameter distance to the nth power. i.e. the hyperspace extension of the familiar Pythagorean theorem in 2 dimensions.
But our n-space doesn’t *have* to treat each dimension equally like the above, and for most practical problems it should not. Some categories are more important than others, which would be represented as a difference in unit size along that dimension.
Some may be correlated with other(s) which would represent an idea of non-orthogonal dimensions. Like a flat world whose lat/long coordinates were done in east/west and northeast/southwest.
Once all these decision functions are defined, then you can compute the complete pairwise set of n-space distances between each point and all the others. This gives you a symmetric matrix (you only need to fill in one triangle of it).
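A sketch of that pairwise computation, assuming some item_distance(a, b) function that combines the per-parameter distances however you’ve decided to combine them:

```python
def distance_matrix(items, item_distance):
    """Symmetric matrix: d[i][j] is the n-space distance between items i and j."""
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = item_distance(items[i], items[j])
    return d
```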
Now the problem becomes a straightforward set-packing to maximize the total inter-item distance in each sub-group. Which is NP-complete. But also readily subjectable to Linear Programming or Dynamic Programming approaches.
I’ve been intuiting around on whether simple total distance between the items in a subgroup is the right metric for group “diversity” or whether that too can / should have some more complex collective behavior. Arguing by analogy from force fields in physics, maybe inverse square law and total RMS distance is a better measure of each subgroup’s total diversity score.
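To make the candidates concrete, here’s a sketch of three possible subgroup diversity scores: plain total pairwise distance, an RMS version (which weights the long pairwise distances more heavily), and one reading of the inverse-square analogy as a repulsion “energy” where lower is better. dist(a, b) is whatever item-level distance you’ve settled on:

```python
from itertools import combinations
import math

def total_pairwise_distance(group, dist):
    return sum(dist(a, b) for a, b in combinations(group, 2))

def rms_pairwise_distance(group, dist):
    pairs = list(combinations(group, 2))
    return math.sqrt(sum(dist(a, b) ** 2 for a, b in pairs) / len(pairs))

def inverse_square_energy(group, dist, eps=1e-9):
    # closely spaced pairs contribute a large "repulsion energy";
    # a diverse subgroup is one with a low total energy
    return sum(1.0 / (dist(a, b) ** 2 + eps) for a, b in combinations(group, 2))
```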
Finally, is there some idea about how diverse the subgroups are collectively which represents a metric of quality for the overall solution?
e.g. Imagine we have two different permutations of the items into subgroups where the diversity scores come out the same. e.g. In each permutation we end up with 4 subgroups with diversity scores of 12, 15, 19, and 24 respectively.
But in one permutation two subgroups end up nearby in subgroup space, while in the other arrangement the subgroups are spread farther apart.
Can this happen? Do we care? I suggest it depends on the actual problem we’re trying to solve.
Sorry to triple-post, but I said something stupid & want to retract it. (Before you all help me retract it.)
I fouled up the Pythagorean Theorem. Which kinda blows my credibility on the rest of what I wrote. Oops. Oh well … Say enough stuff in public and you’re bound to say something stupid now and again.
The correct n-dimensional generalization of dear old Pythagoras is the square root of the sum of each distance measure squared. D’oh!
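In code, a minimal version of that corrected combination, with optional per-parameter weights standing in for the “unit size” idea from earlier:

```python
import math

def combined_distance(param_distances, weights=None):
    """Square root of the sum of squared (optionally weighted) per-parameter distances."""
    weights = weights or [1.0] * len(param_distances)
    return math.sqrt(sum((w * d) ** 2 for w, d in zip(weights, param_distances)))
```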
Ooo, good point about an LP/DP approach. There are a bunch of good solvers out there, I’m sure.
nm
nm again. My method can’t be fixed.
Oh, very nice indeed! I like where this is going! It’s out of my usual line and I am going to have to chase after some of the bits to grasp them, but what I think I hear sounds spot on.
Most of the variables are to me either clearly class variables for which the only test is equality, or continuous variables for which the test is the distance or a function of the distance.

I think I realize that one of the continuous variables behaves in such a way that it would be better to have more nearly the same amount of spread in each group than it would be to have some groups more widely spread and others less widely, even given that the average spread would stay the same. That is, I want to fault points that are closely spaced more strongly than I favor points that are far apart. I think I might actually prefer something like the sum of square roots, just to force this issue.

One construction step that occurs to me for such a variable is to divide the large population into internally ranked quartiles, and make a subgroup of 4 members from the highest value in each quartile, then another group from the next highest value in each quartile, and so forth. In the (unlikely) event that the value was perfectly uniformly distributed to begin with, this gives me the same dispersion in every single group by any measure.

But this brings up a bunch of very particular and probing questions about what I want to happen in the downstream treatment of these subgroups.
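A sketch of that quartile construction, assuming the population size is a multiple of 4 and key extracts the continuous variable in question:

```python
def quartile_groups(items, key):
    ordered = sorted(items, key=key)
    q = len(ordered) // 4                      # size of each internally ranked quartile
    quartiles = [ordered[i * q:(i + 1) * q] for i in range(4)]
    # subgroup k takes the k-th highest value from each quartile
    return [[quartile[-(k + 1)] for quartile in quartiles] for k in range(q)]
```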
“Now the problem becomes a straightforward set-packing to maximize the total inter-item distance in each sub-group.” - OK!
“Which is NP-complete.” Sorry, what’s this mean? Or, wait, see if I find it myself…
“But also readily subjectable to Linear Programming or Dynamic Programming approaches.” I don’t understand. I’m gonna go off to Wikipedia or one of the Mathematica forums or some such place in search of this - any other key phrases I could look for?
Thanks! Most intriguing!
I agree it depends on the actual problem, and I can say that there will be negligible interaction between subgroups. For that matter, if there were interactions, I have no hint of whether it would be harmful or beneficial for subgroups to land in nearly identical locations in subgroup space. Overall I think it best to evaluate the fitness of each subgroup as a group and to ignore the subgroup space.
I also think it is more important that none of the subgroups have a poor internal figure of merit than it is that the average figure of merit over all subgroups be high. We can’t have any of the subgroups turn out too suboptimal.
I was thinking about another angle. There are methods of designing fractional factorial experiments for different kinds of optimality, for example “d-optimality”. I actually have a tool for this, and it can draw n (e.g. 4) parameter value vectors from a pool of size N > n. I think this is like my problem because fractional factorial experiments work best when the measurements are widely scattered in the multidimensional space. Perhaps I can randomly put, say, 3n of my list items into a small pool, then use this tool to pull n to create an optimal subgroup, then randomly move n more from the original list into the pool, then create another subgroup from the pool, and iterate through the whole mess this way. This still feels a bit weird, like a dopey cheat to apply the tool to something it wasn’t intended for, and I worry that the last groups will have poorer figures of merit. But what I like about the tool is that it has all kinds of optimization strategies, including swapping, reshuffling, and searching for the greatest distances.
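For what it’s worth, here’s a sketch of how that pooled iteration could be wired up. select_d_optimal(pool, n) is a purely hypothetical wrapper around the tool, standing in for whatever call actually pulls the n most widely scattered vectors from a pool:

```python
import random

def pooled_subgroups(items, n, pool_factor=3):
    remaining = items[:]
    random.shuffle(remaining)
    pool, groups = [], []
    while True:
        # top the pool up to roughly pool_factor * n candidates
        while remaining and len(pool) < pool_factor * n:
            pool.append(remaining.pop())
        if len(pool) < n:
            break                              # not enough left for a full subgroup
        group = select_d_optimal(pool, n)      # hypothetical call into the tool
        groups.append(group)
        pool = [x for x in pool if x not in group]
    if pool:
        groups.append(pool)   # leftovers; as feared, likely the weakest group
    return groups
```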
Following up on Punoqllads’s post. Such problems can be approached using a greedy algorithm. It won’t usually give you a great answer, but sometimes you get lucky.
I re-frame NP-complete problems into either SAT or graph coloring. And I like coloring for this.
Each node is a city (in the OP version); make edges between cities in the same state. Pure graph coloring would give you the fewest sets (one per color), but with no limit on the size of the sets (and probably fewer sets than the OP wants).
Note that one approach to approximate graph coloring is greedy coloring. Among the standard heuristics for it, I’d suggest starting with ordering the vertices by decreasing degree.
Now, the large number of small sets throws a whole 'nother wrinkle in it. The hard cases would be a small number of large sets, so this greedy method has a chance of doing fairly well. I.e., you get to create a new set/color far more often.
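A sketch of that greedy coloring with a cap on the size of each color class, assuming city names are unique and group_size is the desired subgroup size; the cap is what forces many small sets rather than a few large ones:

```python
from collections import defaultdict

def greedy_groups(cities, group_size):
    """cities: list of (name, state) pairs."""
    by_state = defaultdict(list)
    for name, state in cities:
        by_state[state].append(name)
    # neighbors in the conflict graph = other cities in the same state
    neighbors = {name: set(by_state[state]) - {name} for name, state in cities}

    # color vertices in order of decreasing degree
    order = sorted(neighbors, key=lambda c: len(neighbors[c]), reverse=True)
    color_of, members = {}, defaultdict(list)
    for city in order:
        blocked = {color_of[nb] for nb in neighbors[city] if nb in color_of}
        color = 0
        while color in blocked or len(members[color]) >= group_size:
            color += 1                         # open a new set/color when needed
        color_of[city] = color
        members[color].append(city)
    return [members[c] for c in sorted(members)]
```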