By definition, such tests measure only one’s ability to take the tests. That’s all that there’s any metric for. It’s like asking whether yardsticks measure only linear distance. Of course they do. But given a relatively pure sample of a material of known density, there’s an extremely high correlation between the measured volume of the sample and its weight – measure the sample’s height, length, and depth, and you can say with a fairly high degree of accuracy what it weighs, even though you’ve done nothing to measure that attribute.
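The yardstick analogy can be made concrete with a small calculation (the material and all numbers here are invented purely for illustration): measure a block's dimensions, and a known density lets you state its weight without ever putting it on a scale.

```python
# Hypothetical example: infer a sample's mass from measured dimensions
# and a known density, without directly measuring mass at all.
DENSITY_G_PER_CM3 = 2.70  # roughly the density of aluminum

height_cm, length_cm, depth_cm = 10.0, 4.0, 2.5
volume_cm3 = height_cm * length_cm * depth_cm   # the attribute we measured
mass_g = volume_cm3 * DENSITY_G_PER_CM3         # the attribute we inferred

print(volume_cm3, mass_g)
```

The inference only works because the density is known and uniform; the measurement itself tells you nothing about weight, which is exactly the point of the analogy.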
What you’re asking for, then, is some evidence that there’s little or no correlation between ability to take such tests and other, less directly measurable attributes of individuals. Obviously, there’s a high enough correlation between results of these tests and certain attributes that they’re considered to be useful in identifying and/or predicting the degree to which individuals who take the tests possess or will come to possess those attributes.
The reason some people (and I’ve been guilty of this myself) will say that tests generally measure only test-taking ability is that other people frequently make the mistake of assuming that correlation equals causation. The fact that people who possess a particular attribute (a large vocabulary, for example) tend to do well on these types of tests does not necessarily mean that they do well on the tests because they possess that attribute. If there is not an absolute correlation between test performance and the attribute in question (in other words, if people who demonstrably possess that attribute in varying degrees perform identically on the test, or if people who demonstrably possess that attribute in the same degree perform differently on the test), then it is an inescapable conclusion that some factor other than the correlated attribute is at work in determining how they fare on the test. Scoring very highly on a test of a particular set of skills or knowledge, when one demonstrably possesses very little of that set of skills or knowledge, is reasonable proof that there is not an absolute causal relationship – doing well on the test may be a result of having the quality being tested, but it also may be the result of some other factor (skill at taking tests) that is unrelated, and there’s nothing about the test results themselves that will allow you to determine which factor is at work in a given instance.
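The argument above can be sketched as a toy simulation (every name and number here is invented): give each simulated person an attribute (vocabulary) and an independent test-taking skill, let scores depend on both, and the attribute ends up well correlated with scores even though some high scorers barely possess it.

```python
# Toy model: test score depends on BOTH the nominally measured attribute
# (vocabulary) and an unrelated factor (skill at taking tests).
import random
import statistics

random.seed(0)

n = 1000
vocabulary = [random.gauss(0, 1) for _ in range(n)]
test_skill = [random.gauss(0, 1) for _ in range(n)]  # independent of vocabulary
score = [0.6 * v + 0.6 * s for v, s in zip(vocabulary, test_skill)]

def pearson(xs, ys):
    """Pearson correlation coefficient (population form)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (len(xs) * statistics.pstdev(xs) * statistics.pstdev(ys))

# The attribute correlates strongly with scores, but far from perfectly.
r = pearson(vocabulary, score)

# And some individuals score well above average despite a below-average
# vocabulary -- their scores come from the other factor entirely.
high_scorers_low_vocab = sum(
    1 for v, s in zip(vocabulary, score) if s > 1.0 and v < 0
)
print(round(r, 2), high_scorers_low_vocab)
```

Nothing in an individual's score distinguishes the two causes, which is the point: the correlation is real in the aggregate, yet uninformative about which factor produced any one result.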
This is particularly obvious to those of us who do have a facility for taking tests, without necessarily possessing the attributes nominally measured by the tests. For example, in high school, I invariably scored in the 90th percentile or above on the math portion of any standardized test I took (SAT, ACT, ASVAB, you name it). Yet I flunked Algebra the first time I took it, squeaked by with a C/D average the second time, and narrowly escaped flunking Algebra II, while in college I made a B in Elementary Functions only by virtue of a grading curve from hell (I had a 57 out of 100 and a 50 out of 150 on the last regular exam and the course final). Despite periods in which I studied my ass off and did everything I could think of to help me “get” Algebra, I never made much progress with it. Yet on standardized tests of math ability I consistently outscored classmates who were taking university calculus classes as juniors in high school. Likewise, I had numerous acquaintances who were demonstrably more successful academically than I was (not to mention smarter, by most subjective standards), but who consistently performed dramatically worse than I did on standardized tests.
I realize this veers onto turf you’ve already indicated you don’t want to explore, but the reason I mention it is that it’s critical to understanding the appropriate use of tests and why so many of us are contemptuous of them. Results on intelligence tests tend to correlate well with other measures of intelligence. As a statistical analysis of a sample of a particular population, they may even have important things to tell us about the attributes of that population in general. The problem is that for any given individual, with only the evidence of the test, it’s impossible to know whether the test results accurately reflect that individual’s possession or lack of that attribute, or whether the results are as they are because of one of the other factors that can influence them. In the aggregate, people who score a 145 on a standard IQ test may be statistically likely to be extremely intelligent, but there’s nothing about that score that can infallibly tell you anything about a specific individual who achieves it. Yet IQ tests, and other standardized tests, are routinely used to determine the attributes of individuals, and to control and/or shape the opportunities offered or denied to them. And that’s what I find objectionable about them, and why I’ve often attempted to minimize their importance.