Algorithm to create groups with minimal internal duplicates

What are some algorithms for splitting a list into subgroups of a given size, such that subgroups are minimally likely to have duplicate values of a certain class variable?

Easier to say with an example, perhaps.

Suppose I have a list of 1000 US cities that includes CityName, State, Population, and Elevation. I want to break it into 200 subgroups of 5 cities each. I want NONE of the subgroups to contain multiple cities within the same State. Other than this constraint, I want everything random.

The algorithm I found was to break the big list into 200 subgroups and then test each subgroup for duplicate values of State. If a subgroup has any, I dissolve it and put its cities back into a recycle pile. Then I randomly create more subgroups from the recycle pile, and I iterate this whole sequence perhaps 20 times. This approach seems to work pretty well, though by nature the problem does not necessarily “work” at all: some starting lists can’t possibly be broken down as I want, such as a list containing more than 200 cities in California. Maybe I should be satisfied now. Nevertheless, I wonder if I am missing some tidier way of doing this.
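A minimal Python sketch of the recycle-pile approach described above (function and parameter names are my own, not from any particular library):

```python
import random

def recycle_partition(cities, state_of, group_size, rounds=20):
    """Chop a shuffled list into groups, dissolve any group with a
    repeated State back into a recycle pile, and retry on the pile."""
    pool = list(cities)
    kept = []
    for _ in range(rounds):
        random.shuffle(pool)
        groups = [pool[i:i + group_size]
                  for i in range(0, len(pool), group_size)]
        pool = []
        for g in groups:
            states = {state_of(c) for c in g}
            if len(g) == group_size and len(states) == group_size:
                kept.append(g)     # no duplicate states: keep this subgroup
            else:
                pool.extend(g)     # dissolve it into the recycle pile
        if not pool:
            break
    return kept, pool  # pool holds any leftovers that never fit
```

As noted above, the leftover pool may be non-empty for infeasible inputs no matter how many rounds you run.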

Is there a famous algorithm for this I can look up by name?

Or does anybody have any ideas for this?

Thanks!

Another idea would be to group the list by the no-dupe attribute & then sort the collection of groups by group size, decreasing: e.g. 150 CAs first, 130 NYs second, 123 TXs third, …, 2 WVs last.

Then just “deal” them out from the sorted list of groups into however many groups you want. Just like taking a deck of cards and dealing out 13 piles of 4 cards each for 13 players around a table.

If you get all the way around your “table” & try to deal an item onto a pile that already has one of the same kind, well, you’ve detected that your no-dupe condition can’t be met by your data. And, ignoring the sort effort, you’ve discovered that in O(SubGroupQty) time, unlike the random and potentially O(ItemQty!) approach in the OP.

You didn’t say whether you cared about randomness of distribution once the no-dupe criterion is met. If you do care about that, this algorithm can accommodate it to some degree by randomizing the sequence within each group, and, for groups of the same count, randomizing which group is “dealt” first. Finally, once you’ve completed the distribution into valid subgroups, you can randomize the order you consume them in.
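A Python sketch of this deal-from-sorted-groups idea, including the optional shuffles for randomness (names are my own). Because items sharing an attribute value occupy consecutive positions in the concatenated list, any value with a count no larger than the number of piles lands in distinct piles:

```python
import random
from collections import defaultdict

def deal_out(items, key, num_piles):
    """Group by the no-dupe attribute, order the groups largest-first,
    then deal the concatenated list round-robin into num_piles piles."""
    by_key = defaultdict(list)
    for item in items:
        by_key[key(item)].append(item)
    groups = sorted(by_key.values(), key=len, reverse=True)
    for g in groups:
        random.shuffle(g)            # randomize within each key-group
    flat = [item for g in groups for item in g]
    piles = [flat[i::num_piles] for i in range(num_piles)]
    for pile in piles:               # detect infeasible data
        keys = [key(item) for item in pile]
        if len(keys) != len(set(keys)):
            raise ValueError("no-dupe condition cannot be met")
    return piles
```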

And no, I don’t know of a name for this approach.

Building on LSLGuy, couldn’t you first randomly order the whole list, then sort by state into groups, then randomly choose the order of states and concatenate them into a list which you then deal sequentially? The order of operations isn’t that important.
This would give you a list that looks like:

WV2 WV1 NY1 CA1 CA3 CA2 IA3 IA1 IA2

Which would then be dealt into:

WV2 CA1 IA3
WV1 CA3 IA1
NY1 CA2 IA2

This is pretty simple and not much of an algorithm. Just start randomly assigning items to sub-groups and keep track of the states in each sub-group. Pick another sub-group if an item’s state already exists in the one chosen.

Is there more to this question? There isn’t much there to optimize.

When you write “random” do you mean exactly random in the sense that all admissible partitions have exactly equal likelihoods?

If so, the straightforward way is to start from scratch on failure, i.e. put everything back in the recycle bin. You can fix the placements for whichever state has the most cities, but if more than one state has “too many” cities, this approach will take too long. But I doubt if there’s a simple alternative with the exactly-equal likelihood property.
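The start-from-scratch-on-failure idea is just rejection sampling; a Python sketch (names are mine). Every uniform shuffle that survives the test is accepted, which is what gives all admissible partitions exactly equal likelihood:

```python
import random

def uniform_valid_partition(items, key, group_size, max_tries=1_000_000):
    """Shuffle uniformly, chop into consecutive groups, and accept only
    if no group repeats a key; otherwise throw everything back and retry."""
    for _ in range(max_tries):
        pool = list(items)
        random.shuffle(pool)
        groups = [pool[i:i + group_size]
                  for i in range(0, len(pool), group_size)]
        if all(len({key(x) for x in g}) == len(g) for g in groups):
            return groups
    raise RuntimeError("no admissible partition found")
```

When most shuffles are invalid (e.g. several states near the feasibility limit), expected running time blows up, which is the “will take too long” caveat above.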

This is the way to do this. Actually random and very simple.

Hmm, no, there isn’t much more to optimize. But the more I work with this the more I wish it had a certain complicating behavior.

Forget cities, now I want to do animals. One variable is the Name (Cupcake, Sport, Secretariat, Morris, Lassie); the others are Species, AvgLifespan, and Origin (e.g. Canada, Sahara, Antarctica, IndianOcean).

The new complicating behavior is that I want my subgroups to minimize duplicates in ANY of the variables, and especially in multiple or all variables at once. So it’s best if every variable describing an animal within one subgroup mismatches ALL the other animals in that subgroup. I can also have a continuous variable such as AvgLifespan, and it’s better if its values range widely within a subgroup: having 50 and 51 in the same group is not nearly as nice as having no two AvgLifespans within 10 years of one another. Note also that I now include class variables that don’t have an easily predictable number of possible values, as “State” did.

It’s almost as if I want things to drop into subgroups with some kind of attractive force pulling them towards the unpopulated possible states in subgroups.
Sorry about changing the rules - the goal of this little project is evolving (and partly being steered by what it occurs to me should be possible).
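One way to make that attractive force concrete is a per-item dissimilarity score against each subgroup, placing each item where its score is highest. This is only a sketch; the field names and the 10-year cap are my assumptions:

```python
def dissimilarity(animal, group):
    """Higher is better: count categorical mismatches and reward
    AvgLifespan spread, capped at 10 years per comparison."""
    if not group:
        return float("inf")   # an empty subgroup attracts most strongly
    score = 0.0
    for other in group:
        score += animal["Species"] != other["Species"]
        score += animal["Origin"] != other["Origin"]
        gap = abs(animal["AvgLifespan"] - other["AvgLifespan"])
        score += min(gap, 10) / 10.0
    return score / len(group)
```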

Strictly speaking, this isn’t an algorithm, because it isn’t guaranteed to terminate. It therefore has no upper bound on its completion time; the most you can say is that after a certain number of steps it will terminate with a certain probability. Not particularly rigorous, and a bad design from both a mathematical and software engineering point of view.

The simplest non-trivial case involves 3 subgroups on 4 states and 6 cities total. There are two possible outcomes, up to symmetries:

Outcome 1:[ul][li] A B[/li][li] A B[/li][li] C D[/li][/ul]

Outcome 2:[ul][li] A B[/li][li] A C[/li][li] B D[/li][/ul]
If equal likelihoods are desired, Outcome 2 should occur with probability 4/5.

Here, Outcome 2 occurs with probability 5/6.

OK, the new rules are more complex; now there is a behavior to optimize, because you want to avoid the comparisons on each variable (assuming a lot more items and variables). But I doubt there is a solution that is, in practice, any more efficient than extending the basic formula, unless you can guarantee a non-conflicting distribution that can actually be separated in this way.

I think your statement about attractive force is just a way of describing an inverse index on the variables to select a subgroup to drop an item in based on its lack of overlap with other items in the group, but that doesn’t do anything to optimize the algorithm, just the implementation.

Sorry guys, I work in the real world. My method was implemented and completed execution while you were still theorizing. I interpreted the requirements based on experience and came up with a result. You’ll have to do a lot more qualifying to show that you’ve attained something better. We don’t know what ‘random’ means in this sense, or the reason for the request in the first place, for instance he could need to minimize execution time, resource usage, or just to simplify the algorithm description.

If an algorithm isn’t guaranteed to terminate, then it has serious issues. There is a sorting algorithm called bogosort that works like this:

1. Permute the order of your dataset.
2. Check to see if it’s sorted.
3. If so, you’re done. If not, go to step 1.

The algorithm you describe is mathematically equivalent to bogosort. Also, it will not terminate if there are more cities with a given property than there are groups to put them in.
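For reference, bogosort in a few lines of Python; note that the loop has no a-priori bound on its iteration count, which is exactly the termination problem being described:

```python
import random

def bogosort(xs):
    """Shuffle the list in place until it happens to be sorted."""
    while any(a > b for a, b in zip(xs, xs[1:])):
        random.shuffle(xs)
    return xs
```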

I’ve worked in a realish world myself.
A real-world problem closely similar to OP’s is to calculate the odds in a bridge hand using Monte Carlo simulation, given constraints on suit lengths, etc. Guess which of the two approaches here will develop correct odds?

And if you are doing so in a professional capacity, then this fact truly scares me. I sincerely hope that the software you write is not responsible for anything important which I am likely to use, such as life support systems.

First of all, I believe that your question is an example of the packing problem, specifically, set packing. As the pages note, this class of problems is NP-complete, which means there is no known way of computing the best answer that is essentially better than iterating through all possible ways of putting elements in groups and checking each one. For a large enough dataset, this makes calculating the best solution intractable. However, there are polynomial heuristic algorithms, which can provide you a “pretty good” answer.

My recommendation is the following: first, make a “weight” function which takes an element and a group, and determines how “good” (or how “bad”) it would be to add the element to the group. Next, you would randomize the input list of elements. Then, proceeding through this element list, for each one, you would calculate the weight for that element across each group. You then pick the group with the best weight, and add the element to that group.

E.g., for your initial example, the weight function would take a city c and a group of cities g, and it could count up the number of cities in g which are in the same state as c and the total number of cities in g, returning a 2-value vector. In this example, “better” numbers have a lower number in the first value, or have a lower number in the second value if the first values are identical.
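A Python sketch of this greedy recommendation for the city example (names are mine). The weight tuple compares lexicographically, so fewer same-state members wins first, then smaller group size:

```python
import random

def greedy_assign(cities, state_of, num_groups, group_size):
    """Shuffle the input, then drop each city into the non-full group
    with the best (same_state_count, group_len) weight."""
    groups = [[] for _ in range(num_groups)]
    pool = list(cities)
    random.shuffle(pool)
    for city in pool:
        open_groups = [g for g in groups if len(g) < group_size]
        best = min(open_groups,
                   key=lambda g: (sum(state_of(x) == state_of(city)
                                      for x in g),
                                  len(g)))
        best.append(city)
    return groups
```

Note this variant always places every item, so on unlucky orderings it can end with a duplicate state in the last group rather than looping forever; it is a heuristic, not a guarantee.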

If the speed wasn’t fast enough, instead of calculating the weight of an element across each group, you could calculate the weight for only a random subset of groups.

The problem with that is, what happens when you’re down to the last group, and more than one of the remaining cities are in the same state? You’ll be stuck in an infinite loop.

Too late to edit my own post: my recommendation is of order O(n[sup]2[/sup]). There will exist faster solutions, but they will probably be more complex to implement and thus more prone to bugs.

Guys, read the specs in the OP. That’s what I responded to. I didn’t specify the method of random distribution.

But you’re absolutely right in carrying through to the derivatives and re-application of the underlying algorithm. No argument there.

Here’s a possibility: