I have twenty images which I have randomly numbered 1-20. They are random images that might correspond to a person’s likes and dislikes, such as a horse, an iPod, a person sleeping, food, etc.

I would like to develop a system whereby a person could choose 5 of their likes and 5 of their dislikes **and compare those to other individuals to see which pair of people have the most similarities.** Think of it as a dating survey.

Could I come up with a single number to express their likes and a single number for their dislikes that could be used for comparison?

There are plenty of ways you could do so, but they wouldn’t be particularly useful unless you were only interested in one aspect of personality. A better method would be to store the full set of likes and dislikes for each person, and then boil it down to a single number when you’re comparing two people. Such a number, for instance, might be the number of items they agree on minus the number they take the opposite view on. For even more detail, you might let the users rate the various objects instead of just approving or disapproving: a person might, for instance, very much like horses, but only mildly like the iPod, and mildly dislike sleep.
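As a rough sketch of that "agree minus opposite" number, assuming each person's picks are stored as two sets of image numbers (the function name `agreement_score` and the sample picks are made up for illustration):

```python
def agreement_score(a_likes, a_dislikes, b_likes, b_dislikes):
    """Items both people view the same way, minus items they view oppositely."""
    agree = len(a_likes & b_likes) + len(a_dislikes & b_dislikes)
    oppose = len(a_likes & b_dislikes) + len(a_dislikes & b_likes)
    return agree - oppose


# Hypothetical picks: (likes, dislikes) as sets of image numbers 1-20.
alice = ({1, 4, 7, 12, 19}, {2, 5, 8, 13, 20})
bob = ({1, 4, 8, 15, 19}, {2, 7, 9, 13, 16})

score = agreement_score(*alice, *bob)
print(score)  # 5 agreements - 2 opposite views = 3
```

With 5 likes and 5 dislikes per person, the result ranges from -10 (every pick opposed) to 10 (identical picks).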

What’s wrong with just using the number of exact matches? 5 = perfect match in likes (or dislikes), 0 = no matches in likes (or dislikes). Since you’re offering random images, there’s no such thing as “closeness” in the images.
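A minimal sketch of the exact-match count, assuming the picks are kept as sets of image numbers (`match_counts` and the sample data are hypothetical):

```python
def match_counts(a_likes, a_dislikes, b_likes, b_dislikes):
    """Return (shared likes, shared dislikes), each between 0 and 5."""
    return len(a_likes & b_likes), len(a_dislikes & b_dislikes)


# Hypothetical picks for two people.
like_matches, dislike_matches = match_counts(
    {1, 4, 7, 12, 19}, {2, 5, 8, 13, 20},
    {1, 4, 8, 15, 19}, {2, 7, 9, 13, 16},
)
print(like_matches, dislike_matches)  # 3 2
```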

Could I allow them to string the images from 1 to 20, where 1 is most liked and 20 is most disliked? If I have each image numbered and each box numbered, could I subtract the image number from the box number and add up the differences to get a single number?

I think all that would do would be to show how close to “average” each person is; it wouldn’t help you match people. It especially wouldn’t work on the tail ends: two people might be completely different, yet one liked 1, 3, 5, 7, and 9, while the other liked the even numbers. Wouldn’t their score be identical?
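One way to check that worry: assuming the proposal means signed differences (image number minus box number) added up over a full ranking of all 20 images, the total comes out to 0 for every possible ordering, so it cannot tell anyone apart. A quick sketch (`signed_rank_sum` is a name made up here):

```python
import random


def signed_rank_sum(ranking):
    """ranking[i] is the image number placed in box i + 1."""
    return sum(image - box for box, image in enumerate(ranking, start=1))


images = list(range(1, 21))
rng = random.Random(0)  # fixed seed so the run is repeatable

sums = set()
for _ in range(1000):
    ranking = images[:]
    rng.shuffle(ranking)
    sums.add(signed_rank_sum(ranking))

print(sums)  # {0}: image numbers and box numbers both total 210
```

Summing absolute differences instead would give varying totals, but those totals depend on the arbitrary image numbering rather than on which images two people share.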

Of course it is easy to come up with a number, though comparing numbers is less straightforward than you would like. E.g., take just six pictures with two likes and two dislikes. A dislike is 0, a neutral is 1 and a like is 2. So 102201 would mean neutral on picture 1, dislike picture 2, etc. It is trivial for a computer to compare any two numbers to come up with a similarity match, though a human would have trouble at first glance comparing 1002021102020111102 with 2011002021111000111022
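A sketch of how a computer might do that comparison, treating each code as a string of digits and counting the positions where the two people gave the same answer. The second code and the name `count_same` are invented for the example:

```python
def count_same(code_a, code_b):
    """Count positions where two like/neutral/dislike codes agree."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must cover the same pictures")
    return sum(x == y for x, y in zip(code_a, code_b))


a = "102201"  # the six-picture example: 0 = dislike, 1 = neutral, 2 = like
b = "201210"  # hypothetical second person

print(count_same(a, b))  # 2 (pictures 2 and 4)
```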

What I’d do is store each one in a “slot” and compare them with that slot and each adjacent slot (or maybe two on each side, depending). Then add one to an arbitrary “compatibility” value if they’re similar, or two if they’re in the same place.

For example (I’m going to use shorthand here):

compatAB = 0 (the compatibility rating between A and B, starts at 0 for obvious reasons)

personA[3] = horse

personB[4] = horse

4 - 3 = 1 (within range)

∴ compatAB + 1

personA[2] = iPod

personB[2] = iPod

2 - 2 =0 (same spot)

∴ compatAB + 2

(so up to here compatAB is 3)

personA[1] = food

personB[20] = food

20 -1 = 19 (on opposite sides of the chart!)

∴ compatAB - 1 (or two if you want things that extreme)

personA[4] = computers

personB[7] = computers

7 -4 = 3 (meh, not a disagreement, but not exactly a perfect match)

∴ compatAB +/- 0
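The shorthand above could be sketched like this. The `near` window of 1 matches the example; the `far` cutoff of 15 slots for "opposite sides of the chart" is purely an assumption, since no threshold was given:

```python
def slot_compat(rank_a, rank_b, near=1, far=15):
    """rank_x maps item -> slot number (1 = most liked, 20 = most disliked)."""
    compat = 0
    for item in rank_a.keys() & rank_b.keys():
        gap = abs(rank_a[item] - rank_b[item])
        if gap == 0:
            compat += 2      # same spot
        elif gap <= near:
            compat += 1      # within range
        elif gap >= far:
            compat -= 1      # opposite ends of the chart
        # anything in between leaves the score unchanged
    return compat


person_a = {"food": 1, "iPod": 2, "horse": 3, "computers": 4}
person_b = {"iPod": 2, "horse": 4, "computers": 7, "food": 20}

print(slot_compat(person_a, person_b))  # +1 (horse) +2 (iPod) -1 (food) +0 = 2
```

Widening or narrowing the windows is then just a matter of changing `near` and `far`.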

The good thing about this method is that if it’s not working well, you can easily expand or contract the windows in which things add or subtract values, which makes it really easy to adapt. This method, in hindsight, is pretty much an elaborated version of what Chronos suggested. If you really want to get out there, you could group “related” things too, i.e. computers and tinkering with gadgets might be related, but I can’t really elaborate unless you tell us how exact you want it.

If you want to consolidate this to a single number: for neutral vs. anything you wouldn’t add anything; like vs. like, add 1; like vs. dislike, subtract 1; dislike vs. dislike, add 1; and maybe neutral vs. neutral, add… one half? This is a nice method too and easily converts into a single number. In fact, a computer would probably use this method to do its “trivial comparison” anyway, and then filter compatibility and incompatibility based on the range the numbers are in. There’s no need to ever show the human the weird string of like/dislike fields.
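Those rules might look like this in code, reusing the 0 = dislike, 1 = neutral, 2 = like digits from the earlier post. The function name, the `neutral_bonus` parameter and the second code are hypothetical:

```python
def pair_score(code_a, code_b, neutral_bonus=0.5):
    """Score two equal-length codes of '0' (dislike), '1' (neutral), '2' (like)."""
    score = 0.0
    for x, y in zip(code_a, code_b):
        if x == "1" and y == "1":
            score += neutral_bonus   # neutral vs neutral
        elif x == "1" or y == "1":
            continue                 # neutral vs anything else adds nothing
        elif x == y:
            score += 1               # like vs like, or dislike vs dislike
        else:
            score -= 1               # like vs dislike
    return score


print(pair_score("102201", "100221"))  # 0.5 + 1 - 1 + 1 - 1 + 0.5 = 1.0
```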

To compare the closeness of colors, like RGB, people generally consider this to be a 3D coordinate representation. I.e. instead of having X, Y, and Z axes you have R, G, and B. Then you simply use the distance between any two points to determine their closeness.

I believe that expanding the number of dimensions preserves the rule for calculating the distance between points:

2D

dist = sqrt( x_diff^2 + y_diff^2 )

3D

dist = sqrt( x_diff^2 + y_diff^2 + z_diff^2 )

4D

dist = sqrt( x_diff^2 + y_diff^2 + z_diff^2 + w_diff^2 )

etc.

Note that I say “I believe”. I’m not absolutely sure that the method stays the same.
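For what it's worth, the pattern does carry over: this is the ordinary Euclidean distance, and the same sum-of-squared-differences rule holds for any number of dimensions. A small sketch (the function name `euclidean` is just for illustration):

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d2 = euclidean((0, 0), (3, 4))              # 2D: sqrt(9 + 16)
d4 = euclidean((1, 1, 1, 1), (2, 2, 2, 2))  # 4D: sqrt(1 + 1 + 1 + 1)
print(d2, d4)  # 5.0 2.0
```

On Python 3.8 or later, `math.dist(p, q)` computes the same thing.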

But so yeah, you’ll have to divide your single number back up into the 20 positional coordinates and run some math over it, but it will be a single identifier.

Yeah, it’s something like that. But the problem is it gets really clunky (and hard to conceptualize without some logical reduction) when you get fields larger than say, six elements to compare.

I suppose it’s probably the best method if you have time to set it up (well, it admittedly wouldn’t be that hard with a couple nested for loops or recursion or something, but let’s not go there, this seems to actually be a math/logic question, not one of programming), but with something this big and depending on how exact this needs to be it may be unnecessary.

If all you want to do is measure the distance between users, the Hamming distance seems like the way to go.
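The Hamming distance is simply the number of positions at which two equal-length sequences differ. A minimal sketch, assuming each user's answers are stored as a string with one symbol per picture (the sample codes are made up):

```python
def hamming(code_a, code_b):
    """Number of positions where two equal-length sequences differ."""
    if len(code_a) != len(code_b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(code_a, code_b))


# Hypothetical six-picture codes: 0 = dislike, 1 = neutral, 2 = like.
print(hamming("102201", "100221"))  # 2
```

A smaller distance means more similar answers; 0 means identical.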

It sounds to me like you want to assign a particular number to each person (27 or 151 or whatever), and subtract two numbers to determine similarities. For example, a person with “27” would be more similar to a person with “20” than to a person with “40” (the differences being 7 and 13, respectively). Is that correct?

If that’s the case, I don’t believe there’s any way to accomplish that. With 20 random pictures, “liking” or “disliking” any picture should be independent of liking or disliking any other picture, so you really have a 20-dimensional vector, which you can’t represent as a scalar.

You *could* represent the vector as a string of numbers like **scm1001** suggests, but it’s still a vector, and treating it like a scalar with simple subtraction won’t work.

Now, assuming I’ve correctly interpreted your question, *why* do you want to come up with a single number to express likes and dislikes? Why would storing the “likes” and “dislikes” in vector format and doing vector subtraction (or whatever) be less desirable?

What the most recent posters said. Vector distance between points in 20-space is the only meaningful way to do it.

To make the points in 20 space have much predictive value, you really need to have the people rate each picture on a scale from, say, -5 (strong dislike) to 0 to +5 (strong like). Many people have a hard time with negative numbers, so you might get better results with a rating scale of 1-10, which you normalize to [-5, +5] before doing any calcs.
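One possible way to do that normalization is a straight linear rescale, so that 1 maps to -5 and 10 maps to +5. This is only a sketch; the function name and default parameters are assumptions:

```python
def normalize(rating, lo=1, hi=10, out_lo=-5.0, out_hi=5.0):
    """Linearly map a rating in [lo, hi] onto [out_lo, out_hi]."""
    if not lo <= rating <= hi:
        raise ValueError("rating out of range")
    return out_lo + (rating - lo) * (out_hi - out_lo) / (hi - lo)


print(normalize(1), normalize(10))  # -5.0 5.0
```

Note that with this mapping no whole-number rating from 1 to 10 lands exactly on 0; the midpoint 5.5 does.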

You might also want to use a non-linear distance function, e.g. instead of simply using compatibility score = sqrt( sum of square( dimensional distance N ) ), use sqrt( sum of square( weightingfunction( dimensional distance N ) ) ).

The weighting function could either apply extra weight to larger differences, or if you wanted to be really powerful, attempt to scale the relative importance of the various pictures. E.g. for a dating questionnaire, people who react very differently to a picture of a baby are probably less compatible than people who react equally differently to a picture of an iPod.
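A sketch of the per-picture version, where each difference is multiplied by that picture's weight before squaring. The weights here (baby counts double) are invented purely for illustration:

```python
import math


def weighted_distance(ratings_a, ratings_b, weights):
    """Euclidean distance with each per-picture difference scaled by a weight."""
    if not (len(ratings_a) == len(ratings_b) == len(weights)):
        raise ValueError("all three sequences must be the same length")
    return math.sqrt(
        sum((w * (a - b)) ** 2 for a, b, w in zip(ratings_a, ratings_b, weights))
    )


# Pictures in order: baby, iPod. Hypothetical weights.
weights = [2.0, 1.0]

baby_gap = weighted_distance([4, 0], [1, 0], weights)  # differ by 3 on the baby
ipod_gap = weighted_distance([0, 4], [0, 1], weights)  # differ by 3 on the iPod
print(baby_gap, ipod_gap)  # 6.0 3.0
```

The same 3-point disagreement pushes the pair twice as far apart when it is about the baby picture.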

If you don’t have experience in creating surveys, note that how you choose the pictures and the weights and any surrounding verbiage will almost totally determine the results. The survey takers are almost superfluous. Hence the surveys taken by politicians & talk show hosts showing that 9 out of 10 people agree with them.

If you are actually trying to gather useful data, not just engage in polemics, then you need to spend a lot of time & effort & expertise getting the biases out of your process (or more accurately, understanding what the biases are and applying accurate corrections).

It seems the simplest way of doing this is to just compare the first person with the second.

Now go through the pictures; for each picture they both like or both dislike, add one. (So you have a number from 0 to 10 for how close those people are). If you wanted, you could also subtract one for each picture that one person likes and the other dislikes, giving a number from -10 to 10.

Now compare the first person with the rest of the people, one at a time. Then the second person with everyone else, etc.

At the end, you’ll have a number for each pair of people. You can find the two most similar (might be ties), or the most similar to any given person, or whatever you want.
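The whole procedure could be sketched with `itertools.combinations`, which visits each pair of people exactly once. The names and picks below are made up, and the score used is the -10 to 10 variant:

```python
from itertools import combinations


def similarity(a, b):
    """+1 per picture both like or both dislike, -1 per opposite view."""
    (a_likes, a_dislikes), (b_likes, b_dislikes) = a, b
    same = len(a_likes & b_likes) + len(a_dislikes & b_dislikes)
    opposite = len(a_likes & b_dislikes) + len(a_dislikes & b_likes)
    return same - opposite


# Hypothetical data: name -> (likes, dislikes) as sets of image numbers.
people = {
    "Ann": ({1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}),
    "Ben": ({1, 2, 3, 4, 11}, {6, 7, 8, 9, 12}),
    "Cal": ({6, 7, 8, 13, 14}, {1, 2, 3, 15, 16}),
}

scores = {
    (p, q): similarity(people[p], people[q])
    for p, q in combinations(sorted(people), 2)
}
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])  # ('Ann', 'Ben') 8
```

If there are ties, `max` returns only the first of the tied pairs, so you may want to collect every pair that shares the top score.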

Or, if you wanted, when you’re going through the pictures, you could also subtract one for each picture that one person likes and the other dislikes, giving a number from -10 to 10 [assuming that the only data you have is ‘like/dislike/neutral’, this gives you the same ranking order that the 10-dimensional distance method does, but it’s much simpler].
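A quick check of that equivalence, under the assumption that every person picks exactly 5 likes and 5 dislikes and that answers are coded +1 / 0 / -1 over all 20 images. In that case the squared distance works out to 20 minus twice the score, so sorting by either gives the same order. The names and data are hypothetical:

```python
from itertools import combinations


def to_vector(likes, dislikes, n_images=20):
    """+1 for a like, -1 for a dislike, 0 for images not picked."""
    return [1 if i in likes else -1 if i in dislikes else 0
            for i in range(1, n_images + 1)]


def score(u, v):
    """Agreements minus opposite views (a dot product on +1/0/-1 vectors)."""
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


vectors = {
    "Ann": to_vector({1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}),
    "Ben": to_vector({1, 2, 3, 4, 11}, {6, 7, 8, 9, 12}),
    "Cal": to_vector({6, 7, 8, 13, 14}, {1, 2, 3, 15, 16}),
}

checks = [
    squared_distance(vectors[p], vectors[q]) == 20 - 2 * score(vectors[p], vectors[q])
    for p, q in combinations(vectors, 2)
]
print(all(checks))  # True
```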

You should be able to do an equation like that in a database query. It will have to process all of the fields, but it will be simple to write.
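As one possible illustration, here is a sketch using Python's built-in `sqlite3` module. The table layout (`opinion` with one row per person per picked image, value +1 for like and -1 for dislike) is an assumption, as is the tiny sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opinion (person TEXT, image INTEGER, value INTEGER)")

# Hypothetical rows: value is +1 for a like, -1 for a dislike.
rows = [
    ("Ann", 1, 1), ("Ann", 2, 1), ("Ann", 3, -1),
    ("Ben", 1, 1), ("Ben", 2, -1), ("Ben", 3, -1),
    ("Cal", 1, -1), ("Cal", 2, -1), ("Cal", 4, 1),
]
conn.executemany("INSERT INTO opinion VALUES (?, ?, ?)", rows)

# Self-join on image; a.person < b.person keeps each pair once.
# The product of values is +1 for agreement and -1 for opposite views.
query = """
SELECT a.person, b.person, SUM(a.value * b.value) AS score
FROM opinion AS a
JOIN opinion AS b ON a.image = b.image AND a.person < b.person
GROUP BY a.person, b.person
ORDER BY score DESC, a.person, b.person
"""
results = conn.execute(query).fetchall()
print(results)  # [('Ann', 'Ben', 1), ('Ben', 'Cal', 0), ('Ann', 'Cal', -2)]
```

Pairs who picked no images in common produce no joined rows, so they would be missing from the result rather than showing a score of 0.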

Datamining might reveal similarities of data which would allow one to do a faster (though probably less accurate) comparison. But that will require a decent dataset first.

There are many different rules one can use, each of which will produce a space with slightly different properties. Such a rule is called a “metric”.

You could produce a number but it would have very little meaning. The numbers on a Likert scale do not have any set meaning relative to each other. A person who rates a picture as a 2 does not like it twice as much as a picture they rated 1. They liked the first picture more but that is all that can be said. How much more or less the values on a Likert scale represent is unknowable. If you had each person rank the images according to how much they liked them, the numbers would be more comparable. Maybe doing matches based on a small Likert scale could yield more useful data, but in general math with those kinds of numbers is meaningless.

Thanks everyone.

This was just a fun activity for my students. A “getting to know you” exercise. I thought it might be fun to compare the students.

I guess I will stick to +1 for similar likes and -1 for similar dislikes. There does not seem to be a way to assign a single number for comparison.

The Hamming distance that I mentioned in post #11 is the standard measurement for what you’re trying to do. Use it.