An algorithm for picking lists for pairing

Recently I got to thinking about a problem and I wondered what sort of algorithm could be used to solve it.

Here’s the basic problem :

You have two sets of items, call them P and Q. For each item in each set, you also know whether it ‘matches’ with a particular item in the other set (and a ‘match’ is reciprocal, so if P[sub]5[/sub] matches with Q[sub]37[/sub], then Q[sub]37[/sub] matches with P[sub]5[/sub]).

The object is to pick two smaller subsets of the lists which maximize the number of matches between the smaller sets.

Here’s an example. Suppose you are in charge of coming up with an automatic candy making machine. The machine will randomly pick a flavoring and a coloring to mix in each candy (so you can get red-nectarine, chartreuse-chocolate, etc.). Some of the flavorings and colorings, however, just don’t mix well chemically, and make the candy taste bad. Now, you have a long list of flavorings, and a long list of colorings, and you know how they combine, but the machine can only hold 20 of each. You need to pick the 20 flavors and colors that produce the fewest number of bad candies as possible (The number doesn’t have to be the same; you could have 25 colors and 31 flavors, for example).

My questions, especially for those who know a lot about algorithms are –

Does it seem like this problem has optimal solutions that can be found analytically?

I came up with a algorithm for finding a solution, and I think it’s optimal (the solution, not the algorithm) but I’m not entirely sure. I’ll post it later but I’m wondering what else is out there already.

Also, what about expanding this for when the matches are not perfect? (Like, marketing shows that purple-tomato is popular with 60% of people, and green-popcorn only 20%).

This problem seems like it’d be NP-complete if you specified it properly. What you want is MATCH(p, q, n), which asks if there’s a pair of sublists of size p and q with n matches. As you phrased it, the answer’s easy: pick P and Q. Each is a sublist of itself, and that maximizies the matches.

I’ll work with it. If it is NP-complete, then I highly doubt that you’ve found an algorithm that finds the optimal solution in polynomial time–nothing personal, it’s just that so many other people (myself included) have failed to do it.

Yeah, your problem is NP-complete, I’m pretty sure. I need to work on the proof before I post it, so maybe I’ll find out I’m wrong.

This is the Maximum Matching Problem for Bipartitie Graphs. (Not to be confused to Maximal Matching.) Poly time.

Google gave me this link among others.

Huh. Guess I better check my work.

I apologize – I screwed up the statement of my problem above while editing it for preview, and my example only confused the issue. What I meant to say is that the size of the reduced list is fixed and the idea is to pick the items from both lists that match up best in that space.

So, if A has m members and B has n members, find a subset of A, A’ with d <=m members and B’ with e<=n, with the total number of matches the most (ideally d*e if you use ‘1’ as a value for a match and ‘0’ for non-match).

Somehow I think that’s easier than the other problem. And I’m afraid I need to learn more graph theory or find another page to understand if ftg’s link applies. Though I think it does, or at least it’s in the right direction, so thanks for giving me some terms to search on.
I also realized my algorithm wasn’t very good in the worst case ( though I never thought it’s performance to be very good anyway ). A brief description of it, though :

Create a mxn matrix with 1’s to mark matches and 0’s otherwise.
For a starting point, rearrange so that the rows and columns with the most 1’s are placed at the upper left.

Record the value (that is, the sum of all entries) in every e-length row (& d-length col.) in the upper dxe portion with the other e-length rows ( & d columns, etc.), recording only those which have either the same or greater value (‘potential rows’ (&c.) for swapping).

Also create a list of the values at the intersections of potential rows and columns (i.e the result of swapping both the row and the column).

Pick the greatest value combination and perform any necessary swap(s) into the upper dxe segment. If no values are greater than the current value, exit.

Since this solves (if not determinedly optimally) the problem I was trying to describe, maybe that will help clarify things.

Oh, that’s different.

I keyed in too much on the phrase “maximize the number of matches”. In combinatoric problems like this “matching” means exclusive. If red goes with banana, then red can’t go with chocolate.

Also, you left off the size of the sets in the initial description. Since these are input parameters to the problem itself, this is quite important. So in your example, I took the number examples to mean merely the possible sizes of the maximum matches. (I.e., of the answer.)

So, before I go on, let me get things straight.

Given a rectangular binary matrix and two numbers e and d, swap the rows and columns so that the number of 1’s in the upper exd submatrix is maximum. Do I have it right now?

(Using this form, Bipartite Maximum Matching would correspond to swapping to get the longest continuous strings of 1’s on the diagonal.)

Oh. Note that the adjacency matrix for a clique is a solid block of 1’s. Hence, the problem version as I just gave trivially reduces from k-clique, which is NP-Complete. ultrafilter is avenged.

Well, that bipartite question would probably have been my next one. I knew that whatever words I used they were going to end up having unintended meanings.

The problem is as you described it last (although it’s a d x e submatrix using my method.)

The reason I thought my method would get the highest score is that when it terminates you fail to find any increase from swapping with the ‘upper right’ (the d x (n-e) columns) or ‘lower left’ ( (m-d) x e).
Since all values are 1 or 0, the most increase that can be found from a value in the lower right will be 1 – and only if there is a row and column that do not decrease the value to be swapped. So by checking all nondecreasing swaps in the upper right and lower left, and all their intersections, you check all sources of increase ( I think). Perhaps this amounts to checking all possible answers (with just a reduction in time for most cases) which isn’t properly a computational solution, so it could be optimal.

To clarify the previous paragraph : Suppose a potential permutation is row i (in the UL) with row k (in the LL) of our matrix A. Another one is col. j (in the UL) with col. l (in the UR).

Swap(i,k) followed by swap(j,l) will result in A[sub]ij[/sub] being swapped with A[sub]kl[/sub] and will at best increase the value by 1. This increase only occurs if swap(i,k) and swap (j,l) are nonnegative.

(If they are negative because they replace element A[sub]ij[/sub] with 0 then swapping it with A[sub]kl[/sub] will only keep A[sub]ij[/sub] the same (1) at best.)

Checking only those swaps which increase the value should mean there’s no other possible increase in value, or am I missing something there?

Good, cause I was having a hard time finding an error. My reduction went from 3SAT to a variant on 3SAT (trying to decide whether a formula has a satisfying assignment with at least n variables marked true) to this problem.

I’m a little confused on the difference between this and the problem you linked to earlier. Could you explain it in just a little more detail?

(Just a quick reply to panamajack’s post. It’ll take a longer post to answer ultrafilter.

panamajack: this is like “walking uphill” to find a maximum. You will eventually get to a local maximum but there could be a still larger maximum somewhere else. This is in fact the exact difference between Maximum Matching and Maximal Matching (cf. my first post). The first is the best overall, the latter is just a matching that cannot be made larger (while keeping the current matches). For matching, Maximal Matching is a somewhat easier problem than Maximum Matching.

If this is a real life problem you really need to solve, one of the approximation techniques that has been used involves “simulated annealing”. Google on that term for more.