When we read that a medical test is, say, 95% accurate, I take that to mean that 95% of the test results are either true positive or true negative, and 5% of the results are false positive or false negative.
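That accuracy definition is just arithmetic over the four possible outcomes. A minimal sketch, with counts that are entirely made up for illustration:

```python
# Hypothetical results from 1000 tests, split into the four outcome types.
tp, tn, fp, fn = 460, 490, 30, 20  # true pos, true neg, false pos, false neg

total = tp + tn + fp + fn
accuracy = (tp + tn) / total    # fraction of results that were correct
error_rate = (fp + fn) / total  # fraction that were wrong, either way

print(accuracy)    # 0.95
print(error_rate)  # 0.05
```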
How do we know whether test subjects “really” have or do not have the condition for which they are being tested? I would guess that a more accurate test is used, but then how is the accuracy of that test determined? Seems like some sort of infinite regression.
Time. If you take a test and at some future date you either do not develop the condition that you tested positive for or do develop the condition you tested negative for, then, after testing a sufficient number of subjects, ‘they’ know how good the test is.
It is, of course, possible that the earlier estimates are derived from tests on animals.
This is a good question, and I’m not the one to give a good answer.
But I will point out, in case it isn’t obvious, that the rate of false positives and the rate of false negatives are not necessarily the same for a particular test.
In fact they are inversely related: as the percentage of false positives (Type I error) increases, the percentage of false negatives (Type II error) decreases, and vice versa.
They can be inversely related. Often, what a test actually gives you will be a number, and you have to set a threshold to turn that number into a “yes” or “no”, and changing that threshold will change the false positive and false negative rates in an inverse way.
On the other hand, it’s also possible to have one test that’s just plain better than another, in both ways.
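The threshold effect described above is easy to demonstrate with simulated data. This sketch assumes a made-up test where diseased subjects simply tend to score higher than healthy ones; sweeping the cutoff trades false positives against false negatives:

```python
import random

random.seed(0)
# Hypothetical test scores: diseased subjects tend to score higher.
healthy = [random.gauss(40, 10) for _ in range(1000)]
diseased = [random.gauss(60, 10) for _ in range(1000)]

def rates(threshold):
    """False positive and false negative rates at a given cutoff."""
    fp = sum(s >= threshold for s in healthy) / len(healthy)
    fn = sum(s < threshold for s in diseased) / len(diseased)
    return fp, fn

for t in (40, 50, 60):
    fp, fn = rates(t)
    print(f"threshold {t}: false positives {fp:.2f}, false negatives {fn:.2f}")
```

Raising the threshold makes the test "stricter": fewer healthy people are flagged (false positives fall), but more diseased people are missed (false negatives rise).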
When you are researching a new test, you compare it to what you know. In the medical literature I used to read, there was often what they called a “gold standard” – mostly using quotation marks like that, to let you know that they were using a colloquial expression, not meaning an actual standard involving gold.
The “gold standard” would be some existing test that, in practice, was used to define the disease. So the “gold standard” might be a surgical procedure: if you did a lumpectomy, you might submit the lump for pathology examination, and the combination of surgery and pathology might be the “gold standard” for comparing your new imaging test or blood test.
And then there might be a discussion about known errors in the “gold standard”, and the risks and consequences of it and other tests.
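Comparing a new test against a "gold standard" usually comes down to two numbers: how many gold-standard positives the new test catches, and how many gold-standard negatives it correctly clears. A sketch with invented counts:

```python
# Hypothetical validation data: each pair is
# (gold_standard_says_diseased, new_test_says_positive).
results = ([(True, True)] * 90 + [(True, False)] * 10
           + [(False, False)] * 180 + [(False, True)] * 20)

tp = sum(g and t for g, t in results)           # both agree: diseased
fn = sum(g and not t for g, t in results)       # new test missed it
tn = sum(not g and not t for g, t in results)   # both agree: healthy
fp = sum(not g and t for g, t in results)       # new test false alarm

sensitivity = tp / (tp + fn)  # fraction of gold-standard positives caught
specificity = tn / (tn + fp)  # fraction of gold-standard negatives cleared
print(sensitivity, specificity)  # 0.9 0.9
```

Note these numbers treat the gold standard as truth; as the post says, the gold standard has its own known errors, which is why that follow-up discussion matters.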
Sometimes there isn’t a “gold standard”, and you compare your new test to some test that is known to be unsatisfactory, like the number of people who are dead within the year, for whatever reason.
But it’s mostly not an infinite regression. Normally you can find something to compare a new test to, or else it’s not an interesting test that anybody cares about.
Arguably, an IQ test doesn’t fit this pattern; does an IQ test result actually mean anything? In cases like that, you can still test for repeatability: random variation in your test results shows up and gives you a floor on the false positive and false negative rates.
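The repeatability idea can be sketched without any ground truth at all: run the same noisy test twice on the same subjects and count how often the yes/no answer flips. All the numbers here (scores, noise level, threshold) are made up for illustration:

```python
import random

random.seed(2)
# Hypothetical underlying scores for 1000 subjects.
true_scores = [random.gauss(50, 10) for _ in range(1000)]

def noisy_test(score, threshold=55):
    # Each call adds fresh measurement noise before applying the cutoff.
    return score + random.gauss(0, 5) >= threshold

# Compare two independent runs on the same subjects; a "flip" means the
# test disagreed with itself, which can only come from randomness.
flips = sum(noisy_test(s) != noisy_test(s) for s in true_scores)
print(flips / len(true_scores))  # self-disagreement rate
```

No one needed to know which subjects "really" pass; the flip rate alone bounds how much of the error is pure noise.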
In some cases, you can be fairly sure that a subject really does not have the condition being tested for. Like, for COVID-19 tests, they can test against samples gathered 6 months ago. No one had it then, so testing against enough pre-COVID samples will give you good information about the false positive rate.
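With samples known to be negative, estimating the false positive rate is just counting. A sketch with invented numbers:

```python
# Hypothetical run of a new test against 1000 samples known to be
# negative (e.g. collected before the disease existed).
known_negative_results = [False] * 970 + [True] * 30  # 30 positives fired

false_positive_rate = sum(known_negative_results) / len(known_negative_results)
print(false_positive_rate)  # 0.03
```

Every positive here is a false positive by construction, which is what makes pre-pandemic samples so useful.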
In other cases, you cross correlate with other indicators and do some probability analysis. You never get a perfect objective “truth”, you get a probability range. Do enough tests and carefully analyze enough data, and you can get a probability range that’s small enough that the costs of more analysis to narrow it further isn’t worth it.
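The "probability range" point can be made concrete with a simple confidence interval. This sketch uses the normal approximation for a proportion (a rough but standard method); the counts are made up:

```python
import math

def confidence_interval(errors, n, z=1.96):
    """Approximate 95% confidence interval for an error rate,
    using the normal approximation to the binomial."""
    p = errors / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Same estimated rate (3%), but more data narrows the range.
print(confidence_interval(30, 1000))
print(confidence_interval(300, 10000))
```

This is the trade-off the post describes: more testing shrinks the range, and at some point further shrinking isn’t worth the cost.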
There’s no guarantee that the rate of false positives and negatives are inversely related for a given test, either. That’s true for threshold effects, but not necessarily for systemic errors in the test.
You could have a test for some disease that’s “perfect” with no particular threshold effects, except that it always reports positive for people with a genetic marker that 1% of the population has, and always reports negative for people with, I don’t know, green eyes. You can tweak your thresholds all you want, but those error rates are going to be based on the prevalence of features in the tested population, not your test.
Essentially, what you really have is a “has disease X or gene Y and not green eyes” test, and when you try to use it to get information about disease X, you’re always going to have errors that aren’t amenable to adjustment.