Chapter 16: Validity and Reliability
Preview
The professional theoretical face of assessment discourse asks the question, is the test reliable? More ethically orientated assessors ask the additional question, is the assessment valid? The public wants to know, is it fair? And the more critical of them might add, are people being violated? In this chapter some of the more recent work on validity is discussed, and its positioning as advocacy demonstrated. Reliability is also discussed as a problematic, rather than as an obvious prerequisite to validity.
Validity
"Validity," states the first sentence of the APA Standards for educational and psychological testing (American Educational Research Association, 1985), "is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores" (p9). It goes on immediately to explain that: "Test validation is the process of accumulating evidence to support such inferences."
Which all sounds very scientific and objective and devoid of bias. But is it so? Let me, from my own particular concern with the test taker, rewrite the first sentence to dovetail more accurately with my concerns. "Invalidity," states the first sentence of the alternative tract, "is the most important consideration in test evaluation. The concept refers to the inappropriateness, meaninglessness, and uselessness of the specific inferences made from test scores. Invalidity or error estimation is the process of accumulating evidence to problematise and ultimately reject such inferences." It should be clear even from this small rewrite that a text that began with the second conceptualisation would be a very different text from one that began with the first.
Positioning
The main participants in the testing process, we are told, are the test developer, the test user, and the test taker. Also often involved are the test sponsor, the test administrator and the test reviewer. Sometimes, many of these participants may be parts of the same organisation, with the notable exception, of course, of the test taker.
As clearly stated in Chapter 1, my position of value, my backdrop when I seek information about events, concerns the violations perpetrated on the participants in those events. So in the matter of testing, my focus is on the test taker, and in what ways the taking of tests and the inferences and consequences flowing from such events constitute a violation - a diminishing of personhood, a misrepresentation of potential or action, a claim to unwarranted accuracy of description, and thus unwarranted control and construction of the living human person who is taking the test. The 1985 Standards acknowledge, with fine understatement, that "the interests of the various parties in the testing process are usually, but not always, congruent" (p1). This trivialisation of the traumatic effects, dislocations, and exclusions of millions of students based on test and examination results is quite remarkable. Perhaps it is just another example of the way social positioning can overwhelm interpersonal sensitivity and intellectual honesty. The concern of the test makers and users is, after all, with hundreds, thousands, or hundreds of thousands of test takers (not to mention their concern with their Board of Directors and shareholders). But their concern is with them, viewed as a group. Their interest is with groups, not individuals; in summaries, not raw data; with simplifying complexities, not with complexifying individuals; with objectifying human subjects, not with subjectifying human events.
For the test constructor, sponsor and user there are so many difficult questions; so many criteria to consider; so many factors to consider if the overt and covert claims of the test makers are to be defended. We shall deal with these in due course. Yet to the test taker there is only one question, a normative question which emerges from his or her very construction as an individual. Have I passed or have I failed? Am I satisfactory or unsatisfactory? Am I normal or a nut case? Additionally and ironically, it is precisely because they see the testing
event from this individualised perspective, rather than from a group perspective,
that they do not ask the more crucial, the more fundamental question: How
much error, ambiguity, uncertainty, does this attribution contain? Or is
it their powerlessness, and unheard voice, that makes these questions at
the best unspeakable, at the worst unthinkable?
Sources of evidence
The 1985 Guidelines describes an ideal validation as including
Is this an over-statement? Here is the first sentence of the next paragraph of the 1985 Standards: "Resources should be invested in obtaining a combination of evidence that optimally reflects the value of a test for an intended purpose" (p9). The word "optimally" says it all. So, validity is clearly an advocacy construct, based on the assumption that any assessment data is innocent until proved guilty. The discourse about validity presents the case for the defence. There is no advocate for the prosecution, so the prosecution case is never presented. More than this; the very idea of a prosecution case is denied by the definition of validity. Yet here we also see, in the very heartland of post-positivist empiricism, the embryo of a discursive construct; an appeal, not to numbers, but to discourse. Over the next ten years Cronbach (1988) and Messick (1989a, 1989b, 1994), doyens of psychometrics, in their born-again personas will enlarge the idea of construct validity to a point where Cherryholmes (1988) will nail it as fully discursive, and thus "linguistically, politically, economically, socially, culturally and professionally relative" (p450). Even so, the advocacy position remains essentially unchanged. Messick (1989b) asserts that:
Reliability
Even though validity has taken on a post-modernist hue of recent times, reliability has, until recently, remained untouched as a "foundational" cornerstone of educational measurement. Reliability was seen as the lower limit of validity. An assessment could not be more valid than it was reliable.
The assessment industry, whether local, corporate, government, or quango, has embraced the reliability concept both ideologically and empirically. In contrast to validity, estimates of reliability are often obtained and circulated.
There are two reasons for this: the reliability of the test can be measured using only data from the test scores; and often relatively high values (correlations of 0.7 - 0.9) can be obtained, if for no other reason than that tests are so constructed as to ensure that such high internal consistency occurs. Politically such reliability data can be used to "prove" the quality of the test, and maintain the illusion that reliability refers to "the degree to which test scores are free from errors of measurement," which is how reliability is described in the first sentence about Reliability in the 1985 Standards. In fact, the Standards emphatically insist that:
However, even reliability is now under threat. Is there nothing sacred? Moss (1994) has cogently argued that there can be validity without reliability. She points out that:
Let's first look at this expectation of high reliability, and the theorising that precedes it. The argument is essentially this - if one test or examination is reliable then another similar test or examination will give the same verdict, however that verdict is communicated - as marks, grades, pass-fail, selected, or whatever. It is logical to assume, therefore, that one half of the test would give the same verdict as the other half, because all of the bits of the test contribute to the final score and hence the final verdict; putting it another way, we are dealing with some linear dimension here, some unitary idea or construct; all of the questions measure it with considerable error, but the more interconnected questions we ask, and the more inter-correlated answers we get, the more the error is reduced, and the more the measurement is refined to approach the true measure of it. Of one thing we are sure. The "it" is out there, waiting to be measured. And "it" has a true value, that we can approach but never completely determine. This simplistic positivism is at the epistemological and ontological heart of educational measurement.
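The claim that more inter-correlated questions bring the score closer to a "true" value has a standard arithmetic form in classical test theory, the Spearman-Brown formula. The chapter does not name this formula; the sketch below is offered only to make the measurement argument concrete. The function names and the sample correlations are assumptions chosen for illustration, and the formula itself assumes that any added questions are parallel to the existing ones, which is precisely the unitary-construct assumption under discussion.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`.

    Assumes the added questions are parallel to the existing ones,
    i.e. that every question measures the same single construct.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def full_test_from_halves(half_correlation: float) -> float:
    """Step a split-half correlation up to an estimate for the whole test."""
    return spearman_brown(half_correlation, 2)


# Illustrative values only (not taken from any real examination).
for k in (1, 2, 4, 8):
    print(f"length x{k}: predicted reliability {spearman_brown(0.5, k):.2f}")
```

Under these assumptions the predicted reliability climbs towards 1 as questions are added, which is the sense in which test construction can guarantee high internal consistency without saying anything about what is being measured.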
Teachers and public examination boards do not believe that this is what they are doing, even though the latter have no hesitation in using measurement theory to manipulate their results and rationalise their processes. They do not necessarily believe there is some unitary trait or ability or skill that underlies the total score or grade. Indeed, as Willmott and Nuttall (1975) point out:
Perhaps one more very simple example of this may be pertinent. Imagine a course in electrical wiring which has only two objectives; one relates to the safety requirements, the other to the ability to problem solve in practical situations. An examination is devised to measure the attainment on the course; half of the marks in the examination relate to safety requirements, and half to problem solving. Two students each obtain fifty per cent of the marks. What do we know about their attainment of the objectives? Nothing! One student may have got all the safety questions correct, and the other all the problem solving questions correct. In this case between them they may be considered to know everything, or nothing! In regard to validity, to inferences about objectives made from test scores, the validity has to be zero, if we focus on these individual students. Note that the above argument is valid regardless of the correlations between the scores on the two parts of the paper for a group of students. It can be seen that the reliability of the test in this case is irrelevant, as is any estimate of inferences that may be made about the group of students. For the group we could indeed make inferences about the probability that they knew, on average, a certain proportion of the safety information, and could solve a certain proportion of the problems. But just as a total score loses all the information about individual questions, so does it lose in this case all the information about individual students.
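The wiring example can be set out as a small sketch. The class name, the 50/50 split of a 100-mark paper, and the two students' marks are assumptions made for illustration; they follow the extreme case described above, where each student answers one half of the paper perfectly and the other half not at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExamResult:
    """Marks on a hypothetical 100-mark wiring examination."""
    safety: int           # marks out of 50 on the safety questions
    problem_solving: int  # marks out of 50 on the problem-solving questions

    @property
    def total(self) -> int:
        # The reported score: two objectives collapsed into one number.
        return self.safety + self.problem_solving


student_a = ExamResult(safety=50, problem_solving=0)
student_b = ExamResult(safety=0, problem_solving=50)

for name, result in [("A", student_a), ("B", student_b)]:
    print(name, "total:", result.total,
          "| safety:", result.safety,
          "| problem solving:", result.problem_solving)
```

Both students report the same total, yet their attainment of the two objectives is opposite. Nothing about either objective can be recovered from the total alone.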
Incidentally, correlations across different subjects are often also of the order of 0.8. That is, the correlation between two tests of different subjects is about as high as the reliability of any one test (quoted by Nuttall & Willmott, 1975, p48). Perhaps there is a linear trait after all, but unrelated to the apparent construct being measured. What might this construct be? Traditionalists would be in no doubt that it was a general ability that they would label intelligence. Yet we know that the correlations between examination scores and other sorts of measures (eg, job performance) are very low, of the order of 0.3. So a more direct and sustainable interpretation is that "it" is the ability to perform in the events constructed around examinations. Examinations measure examination ability!
The second issue is rarely mentioned in the literature, and it relates to individual consistency of performance. An example might be taken from cricket. Batsmen vary in the consistency of their performance. Consider two batsmen who each have an average of about thirty runs over a large number of innings. One may score very consistently between 20 and 40 runs. Another may score the odd century, but may often make fewer than 5 runs. Test theory cannot account for this. It defines 30 as an approximation to their "true score," the score that best matches their "batting ability." But any deviation in a particular innings would be attributed to "random error," and be expected to assume a random rather than a consistent pattern. What becomes obvious from this example is that the average (true) score for these two batsmen has a very different meaning; while for one it may indeed indicate the "most likely" score, for the other it indicates a most unlikely score indeed.
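The cricket example can be made concrete with a short sketch. The two lists of innings below are invented for illustration and are not real batting records; the helper name share_near_mean and the band of ten runs either side of the average are likewise assumptions.

```python
from statistics import mean, pstdev

# Hypothetical scores over ten innings; both lists average exactly 30.
consistent = [20, 25, 30, 35, 40, 30, 28, 32, 30, 30]
erratic = [100, 2, 0, 4, 120, 1, 3, 60, 5, 5]


def share_near_mean(scores, band=10):
    """Fraction of innings falling within `band` runs of the batsman's average."""
    centre = mean(scores)
    near = [s for s in scores if abs(s - centre) <= band]
    return len(near) / len(scores)


for label, scores in [("consistent", consistent), ("erratic", erratic)]:
    print(f"{label}: average {mean(scores)}, "
          f"spread {pstdev(scores):.1f}, "
          f"innings near average {share_near_mean(scores):.0%}")
```

With these invented figures, every innings of the first batsman falls within ten runs of the shared average of 30, and none of the second batsman's do. The same "true score" is the most likely outcome for one player and a score never actually made by the other.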
A fundamental contradiction
Now this argument, if we take it a little further, leads to a very strange conclusion. Let's go back to the first line of the Willmott and Nuttall (1975) quote: "it is quite possible that any increase in the reliability would be to the detriment of validity" (p55). They show why this is so in the measurement of any multi-dimensional area, and Moss (1994) indicates why it is so for "hermeneutical alternatives." But increase in reliability from what point? From 0.8, or from 0.5? Or from zero? Is there an argument to be made that all reliability negates validity? This would lead us to the apparently absurd conclusion that the greater the reliability the lower the validity, and that maximum validity is ultimately to be obtained from zero reliability. In terms of measurement, this would mean, of course, that human "constructs" were essentially unmeasurable. We can talk about them, but we can't measure them. Which is what Cherryholmes (1988) is really saying when he says that construct validity is "fully discursive." Isn't he?
In the next chapter I list thirteen sources of error, thirteen sources of invalidity. Two of these, related to multi-dimensionality and values, are dealt with by Willmott and Nuttall, and by Moss. What of some of the others? Do they show the same pattern of an increase in reliability leading to a decrease in validity? Temporal errors are certainly increased by calculating reliability on the basis of one test at one time. As performance would be expected to vary with occasion and over time, one shot assessment certainly decreases validity as it increases reliability. Contextual errors are certainly increased by confining assessment to pencil and paper situations and producing a very singular and artificial environment in which the assessment occurs, to the extent of standardising format and time available to complete the tasks.
Again reliability is obtained at the expense of validity, which implies generalising to other contexts. Construct errors are likewise increased through the limitations of content, form, process and media that is determined and narrowed through the testing or examination procedures. Again the capacity to generalise, and thus the validity, is diminished by the psychometric strictures required for high reliability. The effect of high reliability on categorisation errors is complex. Where categorisation is defined in terms of percentiles of the group tested, categorisation errors are reduced as reliability increases, leading to an increase in validity. However, when one particular marking scheme (rather than another marking scheme) is used to increase the reliability, the reduction in categorisation error is illusory rather than real. And where comparability issues intrude, meaning fogs up as psychometric solutions compound the categorisation problems. So in these areas the effects of reliability on validity are moot. In similar vein, errors attributable to frame of reference shifts, to labelling and attachment confusions, to prediction inaccuracies, or to logical type confusions, are largely indifferent to reliability. And whilst consequential errors, the negative effects of testing, have certainly been exacerbated by the quest for higher reliability, it is the quest rather than the empirical value that is involved. Instrumental errors of course are
reduced as reliability increases; indeed, reliability may be defined as
the inverse of instrument error. So in this one area it is clear that increases
in validity are dependent on increases in reliability. Yet if, as we have
shown, the effect elsewhere is that such increase in reliability either
decreases validity or has an indeterminate effect on it, then the general
proposition holds, and we may say that in the empirical world, the procedures
used to increase reliability result in a decrease in validity.
Born again validity
Messick (1989a) has broadened the concept of validity to refer to "the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores" (p13), and this includes the way values "influence in more subtle and insidious ways the meanings and implications attributed to test scores" (p59), so that "test validation embraces all of the experimental, statistical and philosophical means by which hypotheses and scientific theories are evaluated" (p14).
Messick's position seems to be generally accepted. The sources of potential error actually referred to do cover the range and depth of epistemological, ontological, and value sources referred to in this thesis. Yet even with this multiplicity of error, this proliferation of possibility of miscategorization, Messick (1989) insists that validity is a unitary concept, a singular "degree of support":
Let's look at this in more detail; first it is apparent that appropriateness, meaningfulness, and usefulness are sometimes quite separable. Appropriateness applies very much to particular values. In my value system, any test which violates individual students is inappropriate. Yet it might be quite meaningful in that some inferences made from it can be understood and acted on by teachers and administrators, and it may be useful in that predictions made from it help selection processes. In another case a test of inverted neuroticism may be quite useful in predicting successful medical students, but may be considered inappropriate for that application. Its meaningfulness may be moot. Ultimately, of course, the very meanings of appropriate, meaningful and useful are deferred; they are partial synonyms for valid, the word they supposedly elucidate.
It becomes clear that the "unifying force" then is not created by the congruencies among appropriateness, meaningfulness and usefulness, but rather by the "trustworthiness" of the "interpretation." In other words, by the power that resides in the status of the "expert" who controls the discourse in which the judgement is embedded. And because the discourse of validity is in essence about all the ways in which the measurement cannot do all the things it claims to do, and explicitly about some of the ways it might be done better, an advocacy judgment would concentrate on some way or ways in which the test was better than it might have been had such improvements not been made. According to Messick, this is the unifying force that asserts, and thus proves, validity. Specifically, my analysis of Messick's (1989a) definitive paper in the third edition of Educational measurement indicates that he makes reference to over fifty sources of potential invalidity; for indeed, how can he describe how a test may be valid without focussing on all of the ways in which it might not be valid. I have indicated some of these references, and their relation to the error sources that I specify, in the next chapter. Finally, the very existence of validity is established, validity is indeed made manifest, through the denseness of the arguments used to refute such existence, together with the reassurance that the battle continues, and some gains have been made. Let me be specific: The definition of the construct of validity does not exclude the notion of invalidity. However, the discourse on validity, constructed as it is from the position of advocacy, excludes the notion of invalidity as an issue. 
More than this, the discourse itself becomes the arbiter of the proof of validity claims, independently of empirical data, which becomes irrelevant within the density and complexity of the discourse; as a result, empirical data to justify validity claims is rarely collected, and when it is it is inevitably construed as supporting the claim. Evidence rejecting the validity claim is never collected because such positioning is absent from the discourse. Madaus (1986) puts it nicely:
Validity and the predominant paradigm
When advocacy is positioned, aligned to the predominant paradigm, then advocacy is interpreted as truth. Truth not as the production of true utterances, but in Foucault's (1982) sense of "the establishment of domains in which the practice of true and false can be made at once ordered and pertinent" (p8). From the 1980s, when the prevailing societal metaphor is the discourse that surrounds economic rationalism, and in particular those myths connected with people competencies, the metaphor is rabidly post-positivist, and validity definitions (advocacies) based on those assumptions will be seen as self-evidently true. As Cherryholmes (1988) puts it from his post-modern perspective: "boundaries limiting construct-validity discourse have yet to be justified. They are policed nonetheless" (p154).
In contradistinction, advocacies for more post-modern descriptions (eg validity characteristics for qualitative research) are clearly not aligned to the prevailing world-view, and so will be interpreted as justifications. They advocate from a loser's position, so at the best their views are accepted as tentative, at the worst as unproven and hence unacceptable assumptions. This is inevitable because no abstraction can be proven to be correct, so acceptance is always a function of value, rather than of rational proof; and moral value is usually construed as stabilisation of the status-quo, as confirmation of the predominant paradigm.
Shepard (1991) gives an example: "measurement specialists asserted that performance assessments are less reliable and less valid than traditional tests and that they are potentially biased because they rely on fewer tasks." But then she adds: "Why are existing tests presumed to have the high ground in this dispute? What claim do traditional tests have to validity?" (p10). This is not to deny the acceptance
of such advocacy in localised communities (eg some faculties of some Universities)
where a paradigm shift has already occurred.
Qualitative assessment and qualitative research
Validity criteria in qualitative assessment have lagged behind validity in quantitative research. However, the two fields are closely aligned. In fact Messick (1989a) regards them as virtually synonymous in that
Summary
We have worked our way through some of the minefields of validity and reliability discourse. In particular I have indicated how the notion of advocacy built into the very definition of validity overwhelms scientific detachment, and effectively silences the logical inferences that derive from the voices of confusion and error that are the very basis of validity discourse.
The emphasis on reliability of assessment instruments is also shown to be a misplaced source of credibility for assessment, because measures to increase reliability are shown to decrease validity. Now the coin can be flipped. The underside of validity can be examined. The nastiness of error can be exposed. In the next chapter the sources of invalidity are spelt out in detail.