You have 49 chips of different colors (red, blue, green, white, yellow, and so on) of varying quantity per color and need to place them in 7 by 7 grid, one chip per cell, such that all chips of a given color are horizontally contiguous, if possible. There may be more then 7 chips of one color and there may be more then 7 colors total. What algorithm can I use to accomplish this? And for bonus points what is this class of problem called?
It sounds suspiciously like a knapsack optimization to me.
It’s kind of like a knapsack problem, but not very close.
If you have 8 reds, are two rows of four each considered horizontally contiguous because each set of four is horizontally contiguous, or are they not because all 8 chips are not in a single horizontal contiguous row?
If there is a single yellow chip is it horizontally contiguous (with itself) any place you put it?
If the answers to both those questions are “yes” then is a row of seven and a separate “row” with one red each horizontally contiguous?
It seems to me that if you have more than 7 of one colour that the problem can’t be solved. Otherwise, I would look for combinations of colours such that the number of chips of that combination add up to 7. Then they could be placed in a single row. If you reach a point where you can’t come up with a combination that adds up to 7, then the problem can’t be solved.
Horizontally contiguous in this case means not spanning more then one row. I do not expect to never have to span a row, I just want to avoid doing it as much as possible.
So for 7 colors and 6 chips in each, this
|AAAAAAB|
|BBBBBCC|
|CCCCDDD|
|DDDEEEE|
|EEFFFFF|
|FGGGGGG|
|_______|
is a suboptimal solution, and this is the optimal one:
|AAAAAA_|
|BBBBBB_|
|CCCCCC_|
|DDDDDD_|
|EEEEEE_|
|FFFFFF_|
|GGGGGG_|
If a color has to span a row then I want as few rows as possible of that color, so for 15 chips of color A this is suboptimal:
|AAAAA__|
|AAAAA__|
|AAAAA__|
||
||
||
||
and this is optimal:
|AAAAAAA|
|AAAAAAA|
|A______|
||
||
||
||
It does feel a lot like a knapsack problem, but somehow a little different.
Can I move a chip once I’ve placed it?
Do I know ahead of time how many colors and how many chips of each color there are?
Yep, postprocess all you want.
Yes you know the contents of the chip bag before you begin.
I don’t understand your criteria for saying these are any different. You have 3 rows of As in both solutions.
See if you can come up with a well-defined scoring algorithm where you can take a solution and give it a score. Then we can talk about how to minimize the score.
In the latter case there are larger sections of “A” next to each other. Imagine the chips representing family members at a ball game, the latter layout makes it so that more of the chips have a chip to their left and to their right.
I don’t know of any ready-made algorithm that I could point you to, but here’s how I’d do it.
First, find all colors with 7 or more chips and place each color in X rows, where X is the number of chips for a color modulo 7.
Now what you have is for each color of chips left in your bag, you have less than 7 of each of those remaining color.
Now try to pair up the numbers of remaining chips with others that add to 7. So if you have 3 blue chips and 4 green chips left over, plus other stuff, put the 4 green chips in a row and then put the 3 blue ones in. Pair 1s with 6s, 2s with 5s, 3s with 4s.
Once you have the A + B = 7 combinations, look for the A + B + C = 7 combinations, ad nauseum. When you can’t make any more combinations that add to 7, try to place the remaining chips as best you can.
This will scale for any number of rows, and of course you could change the variable of 7 columns.
It’s a set partitioning problem, where you have a set of 49 people, where you need to create 7 subsets such that each person is in one and only 1 subset (each subset represents a row). The value of each row isn’t defined by this, but I’d guess that saying that the value of a row is the number of people of the largest family on that row will work, e.g.
|AAAAAAB| 6A, 1B = 6
|BBBBBCC| 5B, 2C = 5
|CCCCDDD| 4C, 3D = 4
|DDDEEEE| 4D, 4E = 4
|EEFFFFF| 2E,5F = 5
|FGGGGGG| 1F, 6G = 6
|_______|
|AAAAAA_| 6
|BBBBBB_| 6
|CCCCCC_| 6
|DDDDDD_| 6
|EEEEEE_| 6
|FFFFFF_| 6
|GGGGGG_| 6
Missed the edit window.
My idea would of course get you this kind of solution:
|AAAAAAA|
|AAAAAAA|
|A______|
||
||
||
||
But you never said if that was better or worse than 3 rows of 5. I’m going to assume that mine is better, because if you do 3 rows of 5 then you’re liable to not have good placements for the other slots in those rows.
It’s like a knapsack problem, but seemingly simpler because of fewer dimensions to the units. It still has the same basic nature of knapsack problem. The OP states the problem in simple terms, but it could apply to an unknown number of chips, a variable grid size, and the total number of chips not matching the number of grid entries. The specific rules for what makes an ideal row may require the same kind of analysis as in knapsack problem. Given sufficient complexity, a very large percentage of possible combinations of colors may need to be examined to determine the best fit.
Even simple constraints on the problem as stated in the OP make this very complex. Assume that colors must be contiguous, wrapping at the end of a row, and the goal is to obtain the most colors that do not cross rows. A color that has a chip count that equals the row length may not be ideally placed in a single row. By having it span 2 rows, it could allow 2 other colors to fit entirely within a single row.
This. So far you haven’t given us a precise statement of what we’re trying to compute, and there’s no way to invent an algorithm for a problem this vague.
Ok, so it’s not really how many rows there are, but about neighbors for each chip?
Any chip that doesn’t have the same color to the left and the right is -1 point. Maximise the score?
Anything else?
Is
AAAAA
BBBBB
AAAAA
BBBBB
better or worse or the same as
AAAAA
AAAAA
BBBBB
BBBBB
?
Another question: of the two below examples, which (if any) is better than the other, and why?
AAAAD
BBBBD
CCCCD
or
DDDCC
AAAAC
BBBBC
That’s the problem here. we have no idea how to evaluate the result.
Consider more difficult cases. What if all the color had a count greater than 1/2 row length, and less than 1 row length. You can always get at a number of colors equal to the number of rows to fit in 1 row, but how do you evaluate the many ways that the other colors can be distributed across multiple rows.
A 7X7 grid will easily succomb to a reduction algorithm which extracts single colors and combinations that total the length of a row. Fitting the rest of the colors (even if none are eliminated in the first step) can be evaluated for every combination in reasonable time. But if the grid size is 1000x1000 or larger, smarter algorithms have to be put in place to produce results in reasonable time.
I was posing my question to Cornelius Tuggerson; I apologize if that was unclear.
I’m going to make two observations to start with.
- The exact ordering within a seating row is not needed:
If I were to say “a row has 5 of A and 2 of B”, it is fairly trivial to construct a row that groups together the A and B by hand. Even if each row had 3 or 4 different colours, it’s still trivial (just start with one, put them all in, choose another, put them in etc.). So any optimization can treat a seating row as simply the number of each colour that are in it.
2)The ordering of the rows is (probably) not needed:
If the optimization always find rows that can be ordered by hand that seem to fit every criteria, then there is no need to include what rows are which in the optimization. For example in the example with 15A, it can spit out “2 rows with 7 A and 1 with 1 A”, and its easy to bunch the As together. It is plausible that it might get a bunch of rows that don’t seem to fit together that well, in which case you can alter the costs of each row or add constraints to the problem, or even extend the optimization to include where the rows go. However I’ve got a gut feeling that any bunch of rows that can’t be put together nicely are going to be sub-optimal anyway, since that can only happen if colours are spread everywhere.
Anyway, in my opinion Cornelius Tuggerson has told us enough. It just happens to be a multi-objective optimization, minimizing the “spread” of colours over different rows (so in his first sub-optimal example A has spread 1 and the others spread 2), and to maximize “big rows”, e.g. choose rows with some of maximum of colour A and 1 with very few, instead of evenly spread over the rows (practically speaking, this criterion helps). I’ll add in another that I saw when playing around with this, minimize the total number of rows used. I.e. if there is a gap of 3 in a row, and 3 of G sitting by themselves on an otherwise empty row, move G into the gap. Multi-objective is “hard”, so I’ll cheat in that I’ll define a function for every objective, weight them, and add them together. This may not find every efficient solution, but since the actual values of the objectives are fairly arbitrary and there’s an implicit ordering to these objectives it won’t really matter.
So, is it a knapsack problem? Well, not really. A knapsack is a really easy problem to formulate, this one is a bit harder. What it can be formulated as is an Integer Program. (By the way, a knapsack problem can be formulated as a very simple Integer Program). I’d point you to the Integer Programming article on Wikipedia, but the English version happens to be one of the most useless articles in existence. The simple explanation of an Integer Program is that it is a Linear Program, where some of the variables must be integers(whole numbers). That doesn’t explain much if you don’t know what a Linear Program is. Luckily that has a better article, but I’ll explain what it is briefly. It has variables, linear constraints, and a linear objective function. A variable is a number we or the solver can choose to fulfill our constraints and often corresponds to a choice we can make. For example I am going to use “how many times this row is used” as a variable, with 0 being not used, 1 being once and 2 being twice. Both the constraints and objective involve a linear expression of my variables. “A linear expression of my variables” simply means that each of the variables is multiplied by a constant, and added together. Each constraint is a linear expression being either equal to,bigger than or equal to or smaller than or equal to. For example, the number of colour A is fixed, lets say at 8. If a row has 4 As in it, its coefficient is 4 for the A constraint. Overall, (number of As in a row)*(number of times row is used) = (number of As). This is a linear constraint. The objective function is also simply a linear expression. Each variable has a cost/profit, which I’m either trying to minimize/maximize (doesn’t matter which paradigm choose, it works the same both ways.)
Now here is how I formulated the problem. It is using something a lot like a Set Partitioning model (technically Set Partitioning uses 0/1 variables and constraints whereas I’m using integers) . A Set Partitioning model is used a lot in the airline industry to figure out rostering and the like. It can also be used for vehicle routing and other things. Simply, I need to break up my Set of chips into rows, such that every chip is in a row. Often the hard part of Set Partitioning is creating the variables. Every possible row permutation is a variable. In real life applications the number of variables is astronomical, so they often do something called (by chance somewhat confusingly) “column generation”, where variables are created on the fly. Since this problem is small I’m fairly certain I can enumerate all of the variables to begin with, but it’s also one where column generation could work quite well.
Now, I’m using puLP with the free COIN-OR solver and python code I wrote myself.
Here’s the output of a solution I found:
A: 16 ,B: 3 ,C: 5 ,D: 0 ,E: 0 ,F: 0 ,G: 0
Status: Optimal ,Value: 1.78417010665 Time to setup: 0.332826099406 Time to solve: 0.143558189658
SRows_(0,_3,_0,_0,_0,_0,0) = 1.0
SRows(7,_0,_0,_0,_0,_0,0) = 2.0
SRows(2,_0,_5,_0,_0,_0,_0) = 1.0
CSpread_0 = 3.0
CSpread_1 = 1.0
CSpread_2 = 1.0
This corresponds to
AAAAAAA
AAAAAAA
AACCCCC
BBB
The variables in the model are:
SRow integer variables - Defined by the number of each colour in the row, can take on 0,1,2,3 etc. There are a LOT of these. This is the meat of the setup time, just making all of these variables.
CSpread - Variables that correspond to the number of different rows that a colour exists in. These variables exist so I can explicitly cost them and look at the spread easily. I want to minimize the sum of these.
The constraints are simply: that there are a fixed number of A,B,C,D,E,F,G , and that the spread of A,B,C,D,E,F,G is equal to the number of different rows each colour is in. If I wanted more colours, I need to add more variables and add 2 constraint for every new colour.
The objective is slightly more interesting. I am maximizing here. Each CSpread variable has a negative coefficient, the spread cost, which for now I have made the same for each, which I can change quite easily. If I wanted to say “family A doesn’t want to spread at all, whereas family C doesn’t care as much”, I can simply alter these cost coefficients. Each Row variable also has an objective coefficient. It has two parts, the “bigness” value, and the row cost value. The bigness value is something like I said before. It takes the most numerous colour on the row, call this n, and has a positive value of (n + n^1.5/100). This will, absent anything else, produce rows full of a colour and 1 small one instead of evenly spread. The row cost is something I added after seeing this:
SRows_(2,_0,_0,_0,_0,_0,0) = 1.0
SRows(0,_3,_0,_0,_0,_0,0) = 1.0
SRows(0,_0,_5,_0,_0,_0,0) = 1.0
SRows(7,_0,_0,_0,_0,_0,_0) = 2.0
i.e.
AAAAAAA
AAAAAAA
AA
CCCCC
BBB
Since the optimization gains value for the max of every row, it wants to take up as many rows it can, keeping each colour spread the same. So I added the row cost so that if it can bunch together rows without affecting the spread, it does. The row cost is simply subtracting a fixed cost from each row coefficient.
Now, here’s the practical part. I put in the example data used and see what will happen. 6 of each:
SRows_(6,_0,_0,_0,_0,_0,0) = 1.0
SRows(0,_6,_0,_0,_0,_0,0) = 1.0
SRows(0,_0,_6,_0,_0,_0,0) = 1.0
SRows(0,_0,_0,_6,_0,_0,0) = 1.0
SRows(0,_0,_0,_0,_6,_0,0) = 1.0
SRows(0,_0,_0,_0,_0,_6,0) = 1.0
SRows(0,_0,_0,_0,_0,_0,6) = 1.0
That’s the ideal optimal solution we wanted. Now, here is where the weighting I didn’t really talk about before matters. What if I make the row cost large? The optimization will want to reduce the number of rows. Putting in a large row cost we get this:
SRows(6,_0,_0,_0,_1,_0,0) = 1.0
SRows(0,_6,_0,_0,_1,_0,0) = 1.0
SRows(0,_0,_6,_0,_1,_0,0) = 1.0
SRows(0,_0,_0,_6,_1,_0,0) = 1.0
SRows(0,_0,_0,_0,_1,_6,0) = 1.0
SRows(0,_0,_0,_0,_1,_0,_6) = 1.0
This spreads one of the colours all over the place, because it didn’t want to use the last row. The best weighting for the problem will be 1)Large spread costs, 2)row costs that can overcome the positive value of the “big rows” I made, which by itself uses as many rows as it can with the same spread.
Using the other example (with a single other colour thrown in just for luck)
A: 15 ,B: 1 ,C: 0 ,D: 0 ,E: 0 ,F: 0 ,G: 0
Status: Optimal ,Value: -159.819594816 Time to setup: 0.194169091324 Time to solve: 0.0518225081678
SRows_(7,_0,_0,_0,_0,_0,0) = 2.0
SRows(1,_1,_0,_0,_0,_0,_0) = 1.0
AAAAAAA
AAAAAAA
AB
The power of this formulation is the ease at which we can introduce new constraints and costs. Lets say we don’t want to allow a colour to be spread too much, for example we don’t want D to spread over more than 3 rows. If our optimal solution keeps doing this, we can fix this! We can up the costs of spreading D relative to the others, we can constrain the D spread to be under 3, or even introduce a new variable that is “the spread of D over 2” and give it a much greater cost. Let’s say two families hate each other, and don’t want to be on the same row. Give each row with those families on the same row an extra cost. Let’s say it’s alright if they are on the same row if they aren’t sitting next to each other. Then you need to solve a small sub-problem (the best ordering of a row, which happens to be a small TSP) when those families are on the row, and the cost of the best ordering is the extra cost of the row. We have rows of different length? Easy, give each row variable a 0/1 (i.e. no yes) type coefficient for constraints saying “you can only have up 3 rows of length 7, and 5 rows of length 5”. Got anything else? Throw it in as a cost! Note that it takes longer for me to setup the problem than to solve it. This optimization model will only “break” if it gets stuck in a degenerate loop, if lots of variables have the same cost. So in fact the extra costs I just talked about actually help the model, not confuse it.
As I said at the start, I ignored where the rows went, and the ordering of the rows. The second part by itself doesn’t matter, just figure out the optimal row when you figure out the costs. However if the real costs depend not only on what rows are used, but also on more complex relationship between rows, I can’t use this model anymore. You could try a set partitioning model where you are instead assigning colours to individual seats (e.g. a variable could be “colour G at (1,1),(1,2),(3,5)”. There are lots lots lots more possible variables this way), and see if you can model all the costs that way. If you really can’t do it that way either… then good luck.
Now I must make a confession. I didn’t actually produce every variable. I only created possible rows in the setup stage that have up to 3 different colours in them. This cuts down on the setup time and the solver time. There are 3 possible outcomes to this:
- The optimal solution to the true problem won’t have more than 3 colours in any row - Then it doesn’t matter. I don’t need to worry.
- There are feasible solutions to the true problem has more than 3 colours in a row, and I got an infeasible solution with my rows - If I see infeasibility, I can go back to my variable creator and tell it to produce more possible rows. If I produce all possible rows and it is still infeasible, then the problem IS infeasible.
- The optimal solution has more than 3 colours in a row. - I’m going to get a sub-optimal solution. I need to go back and add more colours in a row to get this. Of course, if my solution has the same as the guaranteed best case spread then I might not really care about finding the optimization of big rows and least rows.
My gut feeling that 1) is far more likely than 3), unless lots more colours were added. It’s always better to group together colours than share them amongst rows. Unless there are lots of colours with 1 or 2 amounts, you’re going to have rows mostly full of a colour in the optimal solution.
I’ll end this with an example:
A: 11 ,B: 7 ,C: 6 ,D: 3 ,E: 6 ,F: 2 ,G: 8
SRows_(0,_0,_0,_0,_0,_0,7) = 1.0
SRows(7,_0,_0,_0,_0,_0,0) = 1.0
SRows(0,_7,_0,_0,_0,_0,0) = 1.0
SRows(4,_0,_0,_0,_0,_0,1) = 1.0
SRows(0,_0,_0,_0,_6,_0,0) = 1.0
SRows(0,_0,_6,_0,_0,_0,0) = 1.0
SRows(0,_0,_0,_3,_0,_2,_0) = 1.0
Time to setup: 2.404878553 Time to solve: 0.592516926034
|BBBBBBB|
|AAAAAAA|
|AAAAG…|
|GGGGGG|
|EEEEEE.|
|CCCCCC.|
|DDDFF…|
The second one is better, if chips are relatives at a ball game, having your family in the next row is better then having them one row over.
The first one is preferable, we want the largest color groups possible without splitting. So splitting 3 Ds is better then splitting 4 Cs.
Thanks for all the responses guys, I really appreciate it. I’ll talk to my boss on Monday and ask him if its ok to give you the actual problem we’re trying to solve so that we can do away with these silly analogies.