AHunter3, I want to say thanks very much for attempting to support an assertion with an actual reference. I appreciate it, and it does a great deal more to further an actual discussion.
Previously, you asserted this:
You then provided three references for this, which you did appropriately qualify.
Your first link is to a paper by Kirk and Kutchins (1994). It is interesting, but it does not really support the notion that reliability “kind of sucks,” depending on what the definition of “sucks” is. More specifically, it suggests that reliability sucks for some disorders and is good for others. It also seems like something of a straw man, since your original argument was that interrater reliability is good for Alzheimer’s disease but not for all (other) psychiatric disorders (presumably excluding Alzheimer’s, since it is included in the DSM as a psychiatric disorder).
The intriguing critique of Kirk and Kutchins is that reliability has not meaningfully improved from the DSM-I days through DSM-III-R. That is an important question, but their method of answering it is not particularly good. They rely on work by Spitzer in 1975 and the DSM-III field trials around 1979, and what they suggest is that, considering all diagnoses together, the reliability kappas did not meaningfully improve. They make the same point about the transition from DSM-III to DSM-III-R by focusing on a paper detailing tests of the Structured Clinical Interview for DSM-III-R (Williams et al., 1992). These are important points to raise about claims of reliability for the entire DSM, and ideally we will someday get to a point where all the diagnoses are 100% reliable. Taken at face value, the Kirk and Kutchins analyses put the lie to the argument that Dseid (I am presuming) and I would put forth: that we are moving forward and improving our diagnostic tools.
However, there are two problems. First, consider your assertion that the “overlap kind of sucks” for psychiatric disorders. The kappa statistic is a measure of agreement between two raters that removes chance association: a kappa of 0 implies agreement no better than chance, a kappa of 1 implies perfect agreement, and so a kappa of .50 is midway between those ends. Conventionally, kappas of about .7 or higher are regarded as acceptable levels of agreement (see the illustrative calculation below). Kirk and Kutchins don’t report a numerical weighted mean kappa for their DSM-III data (although they do point out how bad the data they relied on were), but eyeballing their graph, the whole DSM-III (published in 1980 and replaced in 1987) averages out somewhere around a kappa of .60 to .65. That is no better than what they present for the DSM-II days, but it does mean that when you include all DSM-III defined disorders (even the shakiest ones), interrater reliability is markedly better than chance and only slightly below the conventional threshold for acceptable agreement. To the point, it is a far cry from 10 different opinions from 10 different raters. While that is somewhat surprising, I wouldn’t jump up and down about it, because the data they are using really aren’t all that good. To be fair, that is not exactly their fault; they are criticizing the “selling” of the DSM and using data that was apparently used by those doing the selling.
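To make the statistic concrete, here is a minimal sketch (in Python) of how Cohen’s kappa is computed. The numbers are entirely hypothetical, not taken from Kirk and Kutchins or any field trial; they are chosen only to show how 80% raw agreement between two raters can shrink to a kappa around .60 once chance agreement is removed.

# Illustrative only: Cohen's kappa from a hypothetical 2x2 agreement table
# (rows = rater A's calls, columns = rater B's calls). None of these numbers
# come from Kirk and Kutchins; they are made up to show the arithmetic.

def cohens_kappa(table):
    """Cohen's kappa for a square agreement table."""
    total = sum(sum(row) for row in table)
    # proportion of cases where the two raters gave the same diagnosis
    observed = sum(table[i][i] for i in range(len(table))) / total
    # agreement expected by chance, from each rater's marginal rates
    row_marginals = [sum(row) / total for row in table]
    col_marginals = [sum(col) / total for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return (observed - expected) / (1 - expected)

# Two raters classify 100 hypothetical cases as "disorder present" or "absent".
# They agree on 45 "present" and 35 "absent" cases: 80% raw agreement.
table = [[45, 10],
         [10, 35]]
print(round(cohens_kappa(table), 2))  # prints 0.6: well above chance, just shy of the ~.7 convention

The only point of the sketch is that kappa discounts the agreement two raters would reach just by guessing from base rates, which is why it is a stricter standard than raw percent agreement.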
This leads me to point number two: a fairer comparison, both for them and for our purposes here, would be to use good data and compare the improvement or worsening of interrater reliabilities for specific diagnoses. The questions should be: Which diagnoses show good reliability and which do not? How much have both the good and the bad improved over revisions of the DSM? What are some possible explanations for any lack of improvement? What research is being done, and what should be done, to address the areas of weakness?
The second paper appears to be a highly selective review that might have been written by a student at Duquesne University, just down the road. As such, it really only relies on the Kirk and Kutchins work and a few others, and it does not appear to add much empirically to the discussion of reliability. Admittedly, I only skimmed it for its discussion of reliability, so I could very well be missing something. The final paper appears only to discuss the diagnostic reliability of schizophrenia prior to about 1970 or 1975. Interesting history, but hard to apply to the current discussion.
In general, I find the Kirk and Kutchins paper mildly useful. In part, it suffers from the same problem as many of the criticisms voiced in this thread: it is overly broad, without the empirical support to really back it up. On the other hand, it is useful because it raises interesting questions (e.g., can we take an overall kappa across the DSM as an indicator of a lack of progress?) and points out deficits in the literature.