That would work if you compare an individual question against some sort of baseline. As a made-up example, let's say that Star-Bellied Sneeches, on average, answer 75 questions correctly for every 100 that regular Sneeches answer correctly. If you have one particular question where only 25 Star-Bellies answer correctly for every 100 Regulars, then you can toss it out. Merely having a discrepancy isn't evidence of bias, but a question with a much larger discrepancy than the rest of the test is. I don't know if that's how they actually do it, but it would make sense.
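Here's a minimal sketch of that idea in Python, just to make the arithmetic concrete. All the names and the tolerance threshold are made up for illustration; real test analysts use more careful statistics than a raw ratio cutoff.

```python
def flag_suspect_questions(per_question_rates, baseline_ratio=0.75, tolerance=0.15):
    """per_question_rates: list of (rate_a, rate_b) pairs, where each rate is
    the fraction of that group answering the question correctly.
    baseline_ratio: group A's overall rate relative to group B's
    (0.75 in the Sneeches example: 75 correct per 100).
    Returns indices of questions whose ratio falls well below the baseline."""
    flagged = []
    for i, (rate_a, rate_b) in enumerate(per_question_rates):
        if rate_b == 0:
            continue  # can't form a ratio if nobody in group B got it right
        ratio = rate_a / rate_b
        # A question is suspect only if its discrepancy is notably larger
        # than the across-the-board discrepancy, not merely if one exists.
        if ratio < baseline_ratio - tolerance:
            flagged.append(i)
    return flagged

# The made-up example from above: question 2 shows a 25:100 ratio
# against a 75:100 baseline, so it gets tossed out.
rates = [(0.72, 0.95), (0.60, 0.80), (0.19, 0.76)]
print(flag_suspect_questions(rates))  # -> [2]
```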
Have you tried Umblats ‘R’ Us on xplfsnfo street?
Well, obviously that’s a solution.
However, an important part of reading comprehension is interpreting shades of meaning, discerning that one answer isn't quite as right as another. If you're going to test that, you'll need some questions that require students to choose among several plausible answers.
Also, and correct me if I am wrong here, I thought the SAT scoring system (and its supposed ability to distinguish students) depends on questions offering tempting almost-right answers. They only want a student to get a question right if they really know the answer; otherwise you couldn't tell students who genuinely got a hard question correct from students who made a lucky guess. At least that's how it worked back in the late 1980s; perhaps it has changed. Measurement is not really my field, I confess.
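If I remember the old formula scoring right (+1 for a correct answer, -1/4 for a wrong one on a five-choice question), a quick back-of-the-envelope sketch shows why the distractors have to be tempting: blind guessing gains you nothing on average, but being able to rule out implausible wrong answers makes guessing profitable. The function below is just my own illustration of that arithmetic, not anything official.

```python
def expected_guess_score(n_choices, n_eliminated=0):
    """Expected score from guessing at random among the choices left
    after eliminating n_eliminated options, under the old
    rights-minus-a-fraction-of-wrongs scoring."""
    remaining = n_choices - n_eliminated
    p_right = 1 / remaining
    penalty = 1 / (n_choices - 1)  # the classic guessing correction
    return p_right * 1 + (1 - p_right) * (-penalty)

print(expected_guess_score(5))     # 0.0    -- blind guessing gains nothing
print(expected_guess_score(5, 2))  # ~0.167 -- eliminating two distractors pays
```

That's why easily dismissed wrong answers weaken the test: they turn every question into the second case.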