In a multiple choice test, is there a consensus among “test experts” on a percentage of wrong answers by the test-takers that would indicate that the question is invalid/poorly worded or that the material was not covered in class? For example, if half the class picks the wrong answer, I would say that 50% weren’t paying attention, misread the question, or whatever. If 90% pick the wrong answer, something else is going on.
Material may have also been poorly covered (as opposed to not covered at all). There may also be common mistakes.
I once took a multiple choice calculus test. The answers we had weren’t always simple numeric values - sometimes they were functions/algebraic expressions, e.g. “what is the first derivative of F(x) = 5x² + 3x + 2?” Every question had a “none of the above” option, and many included answer choices that were nearly correct, but not exactly correct. Sometimes the nearly-correct answers were ones that included commonly made mistakes.
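To make that concrete (my own illustration, not the actual test item), the correct answer here is

$$F(x) = 5x^2 + 3x + 2 \quad\Rightarrow\quad F'(x) = 10x + 3,$$

and a typical “almost right” distractor might be $F'(x) = 5x + 3$ (forgetting to multiply by the exponent) or $F'(x) = 10x + 5$ (failing to drop the constant).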
I would think that if a lot of folks pick the wrong answer - especially a particular wrong answer - a conscientious test designer would want to survey at least a few of the students to understand what went wrong.
There can of course be poorly written questions. But some questions should be easier and some more difficult, in order to differentiate between students of different abilities. So it should be a feature of a well designed test that the percentage of correct answers varies widely among questions. The most sensitive test will include some well written questions that nevertheless only 10% of the students get right.
With older kids, their input can be useful because they notice things. Sometimes, the mistake is that the correct answer isn’t one of the choices, and “none of these” is not a choice, either. Sometimes, the question itself is ambiguous and open to interpretation, which is obviously something you don’t want. In cases like that, you simply discount the question and recalculate grades.
I would think the strict answer for the OP is “no”.
The construction of multiple choice tests is something of an art, and one without a simple answer as to the right way to go about it. The advice above to ensure that there be a spread of difficulty is a cornerstone of any well designed test or exam.
Multiple choice tests can be constructed so that wrong answers score negative marks. That is a good incentive for students not to guess: depending on the penalty, blind guessing breaks even or loses marks in expectation, while leaving an answer blank at least avoids the risk.
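To put numbers on that - a minimal sketch with illustrative penalty values, not a prescription:

```python
# Expected score per question under pure random guessing, given a
# penalty for wrong answers. All numbers here are illustrative.
def expected_guess_score(n_choices: int, penalty: float, reward: float = 1.0) -> float:
    p_correct = 1.0 / n_choices
    return p_correct * reward - (1.0 - p_correct) * penalty

# With 4 choices, the classic penalty of 1/(k-1) = 1/3 makes blind
# guessing break even; anything harsher makes it a losing bet.
print(expected_guess_score(4, 1/3))  # 0.0
print(expected_guess_score(4, 0.5))  # -0.125
print(expected_guess_score(4, 1.0))  # -0.5
```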
Correlating test performance to teaching is always fraught. Students tend to perform to their abilities, and good, bad, or indifferent teaching can often be obscured by students generally getting much the same grades as they do in other subjects. I have seen utterly dreadful lecturers justify their teaching on the basis that the good students did well and the poor ones not so well - so clearly there was nothing wrong with the teaching.
When I took “Measurement and Evaluation” at the undergrad and master’s level, there was the following concept (and I don’t remember what it was called):
After the test you look at the breakdown of which students answered which questions incorrectly. If you see that the top percentile of test takers all missed question 7, and a number of low scorers got it right, then that question probably wasn’t very good.
In other words, the people who seem to understand the material well all missed question 7. So something is wrong there.
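For what it’s worth, this is usually called the item discrimination index. Here’s a minimal sketch of the classic upper-lower version, with made-up data (the 27% group split is just a common convention):

```python
# Upper-lower discrimination index for one question: compare how often
# the top-scoring and bottom-scoring groups of students answered it
# correctly. Hypothetical data below.
def discrimination_index(totals, item_correct, frac=0.27):
    """totals: total test scores; item_correct: parallel list of bools
    saying whether each student got this particular question right."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    k = max(1, int(len(totals) * frac))
    low, high = order[:k], order[-k:]
    p_high = sum(item_correct[i] for i in high) / k
    p_low = sum(item_correct[i] for i in low) / k
    return p_high - p_low  # near -1: strong students miss it - red flag

totals       = [95, 90, 88, 72, 65, 60, 55, 40, 35, 30]
item_correct = [False, False, True, True, False, True, True, True, True, True]
print(discrimination_index(totals, item_correct))  # -1.0 for this data
```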
That’s a decent metric. For a major exam, if you had national statistics and local breakdowns, you could also differentiate between the likelihood of a poorly written question (top percentile get it wrong everywhere) and local deficiencies in how well that part of the course material is covered in classes.
Right. There is nothing to conclude from the results of one question. If half the class fails the test, there are serious questions about its validity, but one question might be coincidence - or well formed in the sense of revealing a common mistaken belief or failure of understanding.
As someone who has worked in the testing industry for 20+ years, @Llama_Llogophile has hit the nail on the head. A big part of evaluating the quality of a test question is looking at the correlation between a student’s performance on that specific question and their performance on the test as a whole. If, as others said, 90% of students get a question correct, but the 10% that got it incorrect scored highly on the remainder of the test, that would indicate an issue with the question - maybe it was phrased in such a way as to be unclear to a student who was familiar with a more advanced concept, for example.
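A hedged sketch of that item-total correlation (the point-biserial), with made-up numbers; real item analyses usually correlate against the rest-score, i.e. the total minus the item itself:

```python
# Point-biserial correlation between one question (0/1) and total score.
from statistics import mean, pstdev

def point_biserial(item, totals):
    m1 = mean(t for t, x in zip(totals, item) if x == 1)  # mean total, got it right
    m0 = mean(t for t, x in zip(totals, item) if x == 0)  # mean total, got it wrong
    p = mean(item)  # proportion of students answering correctly
    return (m1 - m0) / pstdev(totals) * (p * (1 - p)) ** 0.5

item   = [1, 1, 0, 1, 0, 0, 1, 0]          # 1 = answered this question correctly
totals = [62, 70, 91, 58, 88, 95, 65, 90]  # the high scorers all missed it
print(point_biserial(item, totals))        # about -0.97: flag this question
```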
In norm-referenced tests, where the goal is to compare students’ performance against one another, the test should consist of questions across the spectrum of difficulty. A question which only 10% of students get correct is still useful because it helps distinguish amongst the higher-ability students. (And similarly, a question which 90% of students get correct helps distinguish amongst the lower-ability students.) Most norm-referenced tests should have a majority of their questions in the “mid-range” of difficulty, with just a small percentage of questions on the extremes. Most statewide and national assessments are norm-referenced. So, @Riemann is correct for norm-referenced tests.
Criterion-referenced tests, in which the goal is to assess whether students mastered material, have different standards, no pun intended. These are the types of tests commonly created and administered by teachers in the classroom - “Here’s your Chapter 6 math test.” In this case, we aren’t worried about distinguishing students from one another. If every student gets a 100%, the test is useless as a norm-referenced test but, as a criterion-referenced test, tells us that every student mastered the material. (Of course, this assumes that the test accurately reflects the content being tested.) Evaluating the correlation between a student’s performance on a single question and their performance on the entire test is still useful, for the same reasons as described above. And, there still should be a variety of difficulty levels represented by the questions on criterion-referenced tests so as to determine how much of the material was mastered (or at what level).
It’s actually difficult, because you have to assume a decent number of people are picking randomly, so the actual percentage would depend on the number of alternative choices.
Assuming four choices, 25% correct would indicate that the class as a whole did no better than picking at random.
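A quick way to check “no better than chance” on a single question - hypothetical numbers, and this assumes scipy is available:

```python
# Did the class beat chance on a 4-choice question? Say 6 of 60
# students answered correctly; chance level is 0.25.
from scipy.stats import binomtest

result = binomtest(k=6, n=60, p=0.25, alternative='less')
print(result.pvalue)  # ~0.003: the class scored *below* chance,
                      # hinting at a systematically tempting distractor
```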
Though even that does not take into account some factors. The question could be perfectly clear and valid, but have a tempting wrong answer that would be clearly wrong to the people who had actually studied, but not to those who didn’t.
e.g. which wall named for a Roman emperor was built in modern-day Scotland in AD 142? In the lecture you made a point of explaining this was not Hadrian’s Wall but the Antonine Wall. That’s a perfectly valid question (possibly a little unfair, but valid), but I could see it getting a percentage correct that is below the 25% threshold (depending on how many people actually went to that lecture and were awake).
[Moderating]
I think that this question is open-ended and nonspecific enough that there isn’t really a factual answer. Moving to IMHO.