Scoring an ordering question

I can think of two ways of scoring a trivia question about ordering a number of things according to some quantity. (For example, a quiz may ask the participants to order a given list of liquids according to their freezing point.)

  1. Compare each pair of items in the list and see if the order given by the participant is the same as the actual case. Give them one point if they are the same, otherwise no points. Whoever gets the highest number wins.

  2. Compare each item in a participant’s list with where that item is in the actual list. Score is the absolute value of the difference. Lowest score wins in this case.

Are these two ways of scoring equivalent? It seems to me they are, but perhaps some mathematician has found they aren’t.

The number of comparisons in case #1 is the (N-1)th triangular number, whereas the number in case #2 is N. So for large numbers, #2 is obviously much easier to do.

Assuming I understand what you mean correctly, it seems to me that they are not equivalent:

IIUC, if the correct order is ABCDE, then for the first scoring method, EDCBA would give the worst possible score, since every pairwise comparison would be wrong. And that worst score is unique, since any other order would reverse at least some of the pairwise comparisons and make them right.

Under the second scoring method the score of EDCBA is 4 + 2 + 0 + 2 + 4 = 12. But DEABC would give 3 + 3 + 2 + 2 + 2 = 12 which is the same, so either the worst score has multiple realizations under the second scoring system, and not in the first, or EDCBA doesn’t get the worst score. Either of those make the scoring system non-equivalent.

In fact, simply BA gives different results. Method 1 gives a score of 0, since the only pair is out of order. But method 2 gives a score of 2, since each element is 1 step away from where it should be. Perhaps I too am misunderstanding the algorithm.

Very good leahcim. I guess my intuition was wrong. So is there a “best” way to score such questions? (for whatever anyone thinks “best” means) It seems to me this could be an entire subdivision of mathematics all by itself.

I mean equivalent in that the scores would order the participants the same, not that they would get the same scores.

The real order is ABCDE.

              Method 1 Method 2
ABCED            3         2
EABCD            0         8
CDABE            1         8
EDCBA            0        12
EDABC            0        12

Just checking on a few cases looks funny. Method 1 scores EABCD as 0 even though it has three items in the right order. EDCBA and EDABC both as 0 gets scored the same in both methods, but EDABC has three items in the right order.
I think you need to total the number of contiguous items in the correct order. So EDCBA has nothing in the correct order and a score of 0. EDABC has three items in the correct order for a score of 3. ABECD has two pairs of ordered items for a score of 4. I don’t think their position in the real order is significant at all since one item out of order throws off all the remaining items. The algorithm will be similar to a text diff.

ETA: How should DEABC be scored? Can’t be 5, that’s the same score as ABCDE. Oh well, back to the drawing board.

How do you define ‘best’? Suppose that, instead of summing the absolute values of the differences in rank, you summed the squares of the differences instead. Is that better or worse for your purposes?

ETA: your two methods remind me of Kendall’s coefficient versus Spearman’s coefficient

Tripolar, I don’t think you understand method 1 correctly. For 5 elements, you have 10 comparisons: A with B, C, D and E; B with C, D and E; C with D and E; and D with E. So EABCD would score 3 + 2 + 1 + 0 = 6.

The Wikipedia article on Kendall’s tau distance reminds us that even a simple algorithm involving sorting the rankings can calculate it in time O(n log n), so it’s not that much more than method #2.

I think method #1 is probably best, since it measures the participants understanding of the relative ordering of each pair. However, the number of comparisons grows to unwieldy amounts if one is grading this manually. For example for 10 items, the number of comparisons is 45. Yes, that’s easy for computers to do, but manually, not so easy. So I was looking for an equivalent way that only grows by O(N). Using squares, as you suggest, may be the way.

I’ll have to look those up, since I’m unfamilar with them.

Ok, got it. And it makes a lot more sense.


If I got it right this time then that’s pretty good scoring. At least based on some of many ways of comparing ordered lists.

Not quite right. CDABE = 2 + 1 + 2 + 1 = 6

It’s not quite O(N), but it’s O(N log N) using merge sort. FWIW here is the link from Wikipedia: arrays - calculating the number of “inversions” in a permutation - Stack Overflow

For a Computer Science nerd like me, the problem is most similar to edit distance. I.e., what’s the fewest number of operations to convert FACDEB to ABCDEF.

For the standard metric and standard operations, the Wagner-Fischer algorithm is the gold standard. (Boy, that takes me back a while.)

But that’s for that metric and set of operations. You need to specify those and there are likely quite efficient algorithms for computing them. (And given the small size of your lists, even a dumb quadratic algorithm or a super-dumb cubic algorithm is enough.)

One key aspect of all this is that working through a good algorithm may very well help you decide what choices you want to make. To paraphrase McLuhan: “The solution is the metric.”

(Then there’s the possibility of using simple cycle notation for permutations. But one has to define a metric based on total number-size of cycles.)

To my mind, the simplest score is - drop any item on the list out of place.

I.e. The question is order of boiling point, low to high. ABCDE is correct. Anything before the first correct relationship, drop. DABCE gives a score of 4 for ABCE, since the others are correctly in ascending order. DABEC -3 for ABC since D and E are incorrect. EDCBA - none of these are correct, but A is there. (Anything before A is incorrectly listed as having a lower BP than A) Scoring on the full sequence, not individual runs. DEACB - 2 AB

Alternatively, start with the first and score from there DACBE - D ok, drop A and C and B, not higher than D, so answer DE scores 2.

The logic goes like this - you are asking - “which substance has a lower boiling point than which other substances?” If the answer recognizes that nitrogen boils before water, this shows a certain level of knowledge - those two are in order, no matter the other three or where they fall in the answer. yes, the default worst score is 1 and the best score is 5. You are essentially querying relationships, not substances, and scoring on 4 relationships.

The problem is that someone who gives DABCE (4) as the answer is really only wrong on one substance knows about ABCE, so knows better than DACBE (3) or worse EDBCA (2) who knows correctly B and C.

Just pointing out that if you do not count all pairs, then, e.g., for DEACB you are not giving credit for knowing that A > C and D > E. That also resolves your DABCE, DACBE, EDBCA issue.

I did some samples based on using the square of the difference in location (method 3) and comparing that to method 1 and method 2 in the OP:

	method 1	method 2	method 3
ABCDE	4+3+2+1 = 10	0+0+0+0+0 = 0	0+0+0+0+0 = 0
ABCED	4+3+2+0 = 9	0+0+0+1+1 = 2	0+0+0+1+1 = 2
ACBDE	4+2+2+1 = 9	0+1+1+0+0 = 2	0+1+1+0+0 = 2
ACDBE	4+2+1+1 = 8	0+2+1+1+0 = 4	0+4+1+1+0 = 6
BCADE	2+3+2+1 = 8	2+1+1+0+0 = 4	4+1+1+0+0 = 6
DABCE	3+2+1+1 = 7	1+1+1+3+0 = 6	1+1+1+9+0 = 12
CADEB	3+0+2+1 = 6	1+3+2+1+1 = 8	1+9+4+1+1 = 15
CDABE	2+1+2+1 = 6	2+2+2+2+0 = 8	4+4+4+4+0 = 16
EABCD	3+2+1+0 = 6	1+1+1+1+4 = 8	1+1+1+1+16 = 20
BEADC	2+3+0+0 = 5	2+1+2+0+3 = 8	4+1+4+0+9 = 18
DACEB	3+0+1+1 = 5	1+3+0+3+1 = 8	1+9+0+9+1 = 20
DBEAC	1+2+0+1 = 4	3+0+2+3+2 = 10	9+0+4+9+4 = 26
DEABC	2+1+0+1 = 4	3+3+2+2+2 = 12	9+9+4+4+4 = 30
EADCB	3+0+0+0 = 3	1+3+1+1+4 = 10	1+9+1+1+16 = 28
EDABC	2+1+0+0 = 3	2+2+2+2+4 = 12	4+4+4+4+16 = 32
EDCAB	1+0+0+0 = 1	3+3+0+2+4 = 12	9+9+0+4+16 = 38
EDCBA	0+0+0+0 = 0	4+2+0+2+4 = 12	16+4+0+4+16 = 40

It seems to be an improvement, but there’s still room for more. But perhaps there’s none that are useful for my purposes. I plan to use whatever method in a small quiz with probably no more than about 8 participants and probably less. So I’m not going to computerize the scoring. I don’t think it’s worth it.

Any other ideas?