Part 5: Synthesis
Chapter 17: Error and the reconceptualising of validity

Preview

From the analysis so far, it is possible to produce a general definition of error as it applies to the field of educational measurement and/or categorisation. This is the flip side of validity which exposes that general nastiness called invalidity.

In this chapter the notion of invalidity is reconceptualized, having
both discursive and measurable components. Thirteen (overlapping) sources
of error are examined, all contributing to the essential invalidity of
categorisations of persons. For easy reference I have indicated the summary
theoretical and practical definitions of these error sources in bold print.
Definition of error

Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field.

All assessment statements about a person are statements about that person engaged in an event, or a potential event. They are descriptions or indicators or inferences about the person's performance in that event. As such they involve at the very least an event in which the person being assessed is an element, and an event in which the assessor engages directly in the first event, or with a product (element) of it. Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.

I want to list some of those sources of error, some of the conditions that change the measurement of a standard from a thin red line into a broad blue band. In doing so I will reject the notion of construct validity as a unitary concept, and dismember its dark side into disparate if sometimes overlapping categories.
Sources of error

I have named these sources of error:

1. Temporal errors
2. Contextual errors
3. Construction errors
4. Labelling errors
5. Attachment errors
6. Frame of reference errors
7. Instrument errors
8. Categorisation errors
9. Comparability errors
10. Prediction errors
11. Logical type errors
12. Value errors
13. Consequential errors

1. Temporal errors

We would hope our description of performance would have some substance; would be a stable quantity, invariant over time and space, rather than some ephemeral numerical butterfly attaching itself momentarily to the person assessed. If the person's performance is described differently if done at another time, in another place, with another group of people, then such difference as there is represents a source of error.

Or is it? Should we rather discount stability as being counterproductive in an educational situation? If stability is seen as the very antithesis of the educational enterprise, which we could define as being dedicated to change, then we would not wish any description to remain stable, as this would represent a nullification of the educational process. Contrarily, if we wish to maintain stability as a criterion for assessment accuracy, we must be certain that all learning pertaining to the performance ceases at the time of assessment. And that none occurs during the assessment process. As well as all forgetting for that matter. Otherwise the error of the description increases rapidly, as the permanency of the description becomes increasingly dismembered by the ravages of time.

Regardless of which side of the fence we want to sit on, or whether we want to sit on the fence, pretend it isn't there, and attribute the concomitant pain to other variables, stability must logically remain as a pertinent, or in conventional circles an impertinent, criterion to be considered in any estimate of error in assessment. My conclusion is that the logic of its contradictions makes most of the academic and psychometric definitions of reliability trivial.
So temporal errors have their genesis in changes that occur over time; persons change over time; tests change over time; the "same" event has different meanings over time. People are not computers, they react differently at different times; and they forget. So temporal errors increase over time. (Not to mention that different people make different meanings out of the same event; which makes it, of course, a different event.)

Temporal errors thus include all those confusions that constitute the dark side of stability, one aspect of reliability.

Practically, temporal errors are indicated by the differences in assessment description when the assessment occurs at different times.
2. Contextual errors

Contextual errors constitute the underside of claims to generality and generalisability.

Any performance is relatively specific and defined: it is a single instance of possible instances; it is an event chosen from a multitude of possible events; it is a particular designed to illustrate a generality. Yet the performance will invariably be described (labelled) in terms of the generality it aspires to, rather than the specifics that define it. This is true of almost any evaluation, any test that goes beyond the description of a single behavioural objective, and even that, one step back, will often be found to be illustrative of a class of objectives, rather than of particular significance in its own right.

In the old days (good or bad depending on our values), this would constitute an example of "transfer of training." The claim was that if you could think clearly in Latin, then this should transfer to dealing adequately with the complexities of life in the social world; or if you could think logically in mathematics, then you could do so in international affairs; not to mention playing Rugby being a necessary prerequisite to running an Empire. When empirical data showed that such transfer was tenuous, the notion was kept, but the name changed. Taxonomic terms such as application and analysis, or the more up-market process called problem solving, have latterly laid claim to this temporarily non-habitable area. As well, the notion of a "skill" has latterly become fashionable, and generalisable social, cognitive, emotional, spiritual, and psychomotor skills proliferate, securely untrammelled by prophylactic empirical data of any kind.

As soon as assessment descriptions are committed to paper, their material permanency is dramatically increased. Likewise, the span of their associations is spread and emphasised. No longer just a description of a particular performance, the assessment becomes interpreted as a measure of knowledge and ability, an indicator of achievement on a course of study, and a predictor of future success or failure. One source of error then is the magic transformation that occurs between numbers and categorisations, between specific acts and generalised descriptions.

Unless the assessment statement purports to be no more than a statement about a particular assessment event, then the differences between this statement and those obtained from all other possible contexts are error; these are the generality differences attributable to other equally relevant contexts, e.g. written, oral, cooperative, on-the-job; all those boundaries that possibly could contain the assessment event that are different to the boundaries of the particular assessment event. Context also includes those power relations that pervade it and the judgment processes embedded in it that affect the performance of the person assessed, and the judgment of the person assessing; and this includes those that the boundary localises, as well as those that invade its permeable surface.

Contextual errors contain all the ambiguities inherent in those relations and elements and discourses that impinge on the event, but get excluded from the label.

Practically, contextual errors include all those differences in performance and its assessment that occur when the context of the assessment event changes.
3. Construction errors

The performance that is described in an assessment is generally built up of a number of parts; a science test is built up from a number of questions; an electrical automotive practical test requires the identification and repair of a selection of common electrical faults; a social skills assessment requires gradings on a number of interactional criteria, or more likely a game constructed about such criteria in multiple choice form. Such constructions are designed to represent the course of study, or the skill requirements, or the criterion referenced framework, that the assessment is supposed to describe. Further back still, the course has itself been constructed to improve performance in some areas of living, in some role as citizen, home maker, academic, engineer, baker, or whatever.

Somewhere, sometime, someone must make a choice about how far back along the chain of constructions we go in order to estimate the error, the difference between the "perfect" description of performance and the actual one that our assessment produces. Let's take the electrical automotive test as an example. We could begin with a requirement to describe how well a student could identify and repair any electrical fault on any car brought into any garage (A). From this we construct a thirty hour course of study called Automotive Electrical Mechanics 2M, complete with course aims and objectives and assessment criteria (B). From this we construct a one hour pencil and paper test (C) and a two hour practical assessment (D).

Now how are we to describe the construction error in assessing a particular person? Is it the difference between the descriptions given in C and D? Or the difference between the matches of B and C on the one hand, and B and D on the other? Or should we look at the matching between C and D and A? Or is it all of these?

You mean put people who've done the course into a garage and see how they perform?
Yeah. Why don't you do that?
It would be very expensive to do it for everyone?
You don't have to do it for everyone. Just for enough people so you'd know if there was an error.
There's always going to be an error.
OK, so you find there's an error. If it was a small one then you could assume that the course, or the test at the end of the course, was well constructed because it did what it was supposed to do.
That would be nice.
And if there was a big discrepancy then you'd have to do it different.
Do what different?
I dunno. You're supposed to be the expert. Do the end-test different. Or do the course different. Or it might be easier to find another garage.
Bit dangerous. There could be a lot of people get upset if we did that. No telling what sort of litigation we might run into if we found that the course didn't do what we said it would do.
So ignorance is bliss, huh?
Certainly not. We just need to be very careful, in terms of spending time and money on obtaining information that at the best will be useless, and at worst will only erode confidence and create instability.
Like I said. Mum's the word!

Construction errors contain all of those errors in sampling, all the idiosyncrasies and biases that are contained in the construction of a specific test or set of demands that constitutes one element of the assessment event: these include not only the construction of the test content, of its elements, but also the construction of its form and style. Construction errors include all those generality errors attributable to the performance task itself, rather than to its timing or its context.

Practically, construction errors are indicated by all those differences in assessment description when the same construct is assessed independently by different people in different ways.
4. Labelling errors

Assessing is about describing some human performance. To give it a meaning the "some" must be specified: performance in typing; skill in mathematical problem solving; a dramatic presentation. So regardless of frame, it is necessary to specify in some way what it is that is being described. We must label the area of performance in some way, for otherwise it cannot be communicated.

The meaning of a communication is its reception, not its intention. In assessment the label is the message which is intended to describe a particular area of performance - involving particular knowledge, understandings, skills, processes, or whatever. The label has a particular meaning for the assessor growing out of this intention. Different meanings before the event will result in different assessment events being constructed to fit the label. What meaning the assessed, or any other person who has access to the label, gives to it, is moot. But of one thing we may be certain. The meaning will not be identical to the meaning intended. The difference may be slight, or immense, but regardless of the magnitude will represent, at a fundamental level, an error (Korzybski 1933). Different meanings after the event will result in different interpretations of the assessment label, different inferences about what it implies.

An assessment must be an indicator of something. It must have a name. Differences in the meaning of the name, both before and after the event, constitute confusion and hence error.

Labelling errors are defined by all the differences given to the meaning of the assessment (what it actually measures) by all the participants in the assessment event(s), and by the users of the assessment information.

Practically, labelling errors are indicated by the range of meanings given to the label by all those who use it before, during or after the assessment event.
5. Attachment errors

There is a further issue in regard to labelling. Once the label has been marked in some way, once the description is attached to it, where is it pinned? Does it belong to the person assessed? Is it more a description of the assessor? Does it represent some quantity or quality that might more appropriately hover somewhere in the space between, a relational field vector describing a complex interactional phenomenon involving task, performance, assessor and assessed?

Given my ontological stance that all information is information about events, it follows that any attempt to attribute such information to a particular element of the event involves a fundamental epistemological error. To the extent that all other elements and conditions are held constant and overtly included in the description, to that extent is the simplification of language involved in the specific attribution partially justified; but such specificity of the conditions of the event tends at the same time to increase contextual error.

Attachment errors are the ontological slides that occur when a description of a relational event is attached to one of the elements of that event; specifically, when a complex relational event involving the construction of a test, an interaction of the test with a person, and a judgment of an assessor, is described as a property of the assessed person, this is an error in attachment.

Practically, attachment errors are indicated by the specification of those elements and boundaries of the assessment event that have become lost in the assessment description.
6. Frame of reference errors

Within the assessment arena are four competing definitions of the true, the correct, the impeccable. It follows that there are four associated notions of error. To the extent that the definitions of assessment truth, or more specifically the assumptions underlying them, are contradictory, so will be our methods for reducing error in the different frames; further, to the extent that the frames are confused, to that extent is error compounded. (See Chapter 13.)

Frame of reference errors are defined by all those confusions and category differences that occur because of the different stable assumptions of the four frames of reference for assessment, as well as those contradictions and confusions that occur when shifts occur between frames during the assessment process.

Practically, frame of reference errors are indicated by specifying the frame in which the assessment is supposedly based, and indicating any slides or confusions that occur during the assessment events.
7. Instrument errors

Any measurement requires a measuring instrument. So any rank ordering, grading or scoring involves some measuring instrument; at the very least, such an instrument must attend to questions of calibration, which involves scale, replicability, and theory-practice bridging. Any claims to measurement must relate to some defined Standard scale. Whether the instrument is a test of some sort, or is assumed to have some material reality inside the mind of an examiner, all measuring instruments contain errors in mechanisms and hence in their readings. (See Chapter 9.)

When psychometric theories are used, instrument errors are fed by all of the discrepancies between the theory and the empirical data, and are intrinsic in all of the notions of probability that pervade such theories.

Instrument errors then contain all those uncertainties of calibration, all those anomalies of replicability, all those confusions and discrepancies and mis-matches in theory-practice bridging, that are involved in the determination of the rank order, in the making of the mark, in the determination of the measure.

Practically, many aspects of instrument error are covered by other category errors. To avoid unnecessary overlap, I will limit the practical indicator of instrumental error to those errors implicit in the construction of the measuring instrument itself; what is conventionally called standard error of the estimate.
8. Categorisation errors

Any categorisation involves a comparison between a standard of acceptability, and a particular measurement or judgment about adequacy or quality.

Categorisation errors derive from confusions about the definition of standard of acceptability, from differences in the meaning of what is being assessed and in the magnitude of its measurement, and in the variability of the judgment process in which the comparison with the standard is made. (See Chapter 11.)

Practically, categorisation errors are all those differences in assessment description that occur when particular data is compared with a particular standard to produce a categorisation of the assessed person.
9. Comparability errors

Comparability errors occur whenever assessment scores are added to produce a total score. Public examinations and grade point averages are examples of such summations, as are any qualitative assessments involving more than one criterion. What such additions mean, and who is privileged by such additions, are questions inherent in the process.

Comparability errors include all those confusions about meaning and privileging that inhabit the addition of test scores, grades or criteria related statements.

Practically, comparability errors are indicated by constructing different aggregates according to the competing models. The differences that these produce indicate the comparability error.
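
The practical indicator for comparability error can be sketched in a few lines of Python. Everything in the sketch is an assumption made for illustration: the three students, their marks, and the choice of two competing aggregation models (a raw sum of marks, and a sum of marks standardised within each subject). It is not drawn from any data in this study; it only shows how two defensible ways of adding the same marks can produce different rank orders.

```python
from statistics import mean, pstdev

# Hypothetical marks: maths has a wide spread, english a narrow one.
marks = {
    "A": {"maths": 90, "english": 60},
    "B": {"maths": 50, "english": 70},
    "C": {"maths": 70, "english": 62},
}

def raw_total(marks):
    """Model 1: add the raw marks for each student."""
    return {student: sum(m.values()) for student, m in marks.items()}

def standardised_total(marks):
    """Model 2: convert each subject to z-scores, then add."""
    subjects = list(next(iter(marks.values())).keys())
    totals = {student: 0.0 for student in marks}
    for subject in subjects:
        column = [m[subject] for m in marks.values()]
        mu, sd = mean(column), pstdev(column)
        for student, m in marks.items():
            totals[student] += (m[subject] - mu) / sd
    return totals

def rank_order(totals):
    """Students from highest to lowest aggregate."""
    return sorted(totals, key=totals.get, reverse=True)

raw_ranking = rank_order(raw_total(marks))
standardised_ranking = rank_order(standardised_total(marks))

# Students whose position depends on the aggregation model chosen.
disputed = [s for s in marks if raw_ranking.index(s) != standardised_ranking.index(s)]
```

With these invented marks the raw sum favours the student who is strong in the widely spread subject, while standardising gives the narrowly spread subject equal weight; the students listed in `disputed` are those whose rank is an artefact of the model rather than of the marks.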
10. Prediction errors

Implicit in most assessment, and explicit in some, is the notion of prediction. Whilst the idea of generality contains some element of logic in its derivation, prediction can be pure magic - correlation without connection is very possible, and is not predicated on causal relationship. It has been reported that the number of storks sighted over London is correlated with the number of births in that city, and thus may be used as a predictor. The causal relation here is moot.

More seriously, many assessment descriptions are overtly or covertly connected to expectations about future performance. High school grades are presumed to be related to success at College or University. School performance is expected to relate to job success. Trade courses are designed to improve quality of performance in the workplace. So assessments on those courses might be expected to correlate with later performance. Yet even if they do, this in no way proves there is any causal link.

The criterion measures themselves are often problematic; most practical criterion measures themselves involve an assessment, subject to all of the sources of invalidity and error that dogged the original assessment. High predictive correlations may occur because both assessments are measuring something other than what they are described as measuring; for example, the ability to perform in competitive, written events, independent of the content. And low predictive correlations may mask genuine positive relationships because of all the errors entailed in the assessments, though such "genuine relationships" must forever be hidden, relegated to fantasy because divorced from empirical sustenance. Alternatively low correlations may mask the reality of relative homogeneity of performance status, or of genuine multi-dimensionality of that performance.

So interpreting the meaning of high correlations can be quite tricky. For example, if the rank order of students on a university entry examination in Physics correlates 0.9 with their first year Physics results at University this could be interpreted as an enormously successful outcome in terms of educational prediction. It is also completely consistent with the implication that no new Physics has been learnt, or that the University course has been completely unsuccessful in compensating for initial inequities in knowledge and opportunity.

What becomes apparent is that this area of prediction, which on the surface seems very amenable to empirical verification, is fraught with errors of interpretation which are neither measurable nor resolvable. Positioning and power relations will largely determine the trend of the discourse, and whether such discourse becomes a verification of validity, or an explication of error.

Explicit or implicit in most assessments is the claim that they relate to some future performance, that they predict a particular product from some future event, a quality of some future action. Prediction error is the extent to which these predictions, and the subsequent events, are not identical.

Practically, prediction error is indicated by the differences between what is predicted by the assessment data, and what is later assessed as the case in the predicted event.
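
The ambiguity of a high predictive correlation, as in the Physics example, can be illustrated with a small sketch. The entry marks, the disturbance terms and the two scenarios are all invented for the purpose; the Pearson coefficient is written out by hand so that the sketch depends on nothing beyond the standard library. In one scenario nobody learns anything between the two assessments; in the other everyone gains the same thirty marks.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical entry examination marks and small disturbances.
entry = [40, 50, 55, 60, 70, 80]
disturbance = [3, -4, 5, -2, 4, -6]

# Scenario 1: no new Physics is learnt; later marks are the entry marks plus disturbance.
no_learning = [e + d for e, d in zip(entry, disturbance)]

# Scenario 2: every student gains the same 30 marks, so initial inequities are preserved.
uniform_gain = [e + 30 + d for e, d in zip(entry, disturbance)]

r_no_learning = pearson(entry, no_learning)
r_uniform_gain = pearson(entry, uniform_gain)
```

The two coefficients are the same, and both are high, because a correlation is blind to a constant shift. The number alone cannot distinguish a course in which nothing was learnt from one in which much was learnt but no initial inequity was redressed.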
11. Logical type errors

Test scores are often interpreted as giving specific information about what a student can or cannot do. Yet a score of 90 per cent on a spelling test, for example, gives no information about whether any individual item on the test was actually spelt correctly by a particular student. Any assumption to the contrary is a logical type error. Similarly, a score of 80 per cent on a mastery test gives no information about what information or skill has been mastered. Common inferences made from test scores are riddled with such logical type errors.

Logical type errors occur whenever there is confusion between statements about a class of events, and statements about individual items of that class.

Practically, logical type errors are exposed when the explicit and implicit truth claims of a particular assessment are examined for confusions between class and item. Such exposure may invalidate such claims.
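
The spelling example can be made concrete with a short sketch. The ten-item test is an assumption chosen for convenience; the code simply enumerates every pattern of correct items that yields a score of 90 per cent, and then asks, item by item, whether the score rules out that item having been spelt wrongly.

```python
from itertools import combinations

def patterns_for_score(n_items, n_correct):
    """Every set of item indices that could be the correctly answered ones."""
    return [set(c) for c in combinations(range(n_items), n_correct)]

# A hypothetical ten-item spelling test with nine items correct (90 per cent).
patterns = patterns_for_score(10, 9)

# An item is undetermined if some pattern consistent with the score has it wrong.
undetermined = [
    item for item in range(10)
    if any(item not in pattern for pattern in patterns)
]
```

Every one of the ten items appears in `undetermined`: the score is a statement about the class of items, and is consistent with any single item having been the one misspelt.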
12. Value errors

All tests and examinations involve the construction of questions and the interpretation and valuation of answers. As such they are explicit and implicit statements about value; these particular questions, and these favoured answers, are implicit statements about what knowledge, actions, processes and interpretations are valued. And by implication, which are not so valued. Such implications move well beyond content; style and form and medium are of equal or more importance.

To the extent that the values implicit in the assessment event are not explicit, are contested, or are contradictory, to that extent is the assessment event invalid with respect to value. To the extent that the assessment event(s) and the event about which inferences are made are incongruent in terms of their value assumptions and emphases, to that extent is the error component engorged.

Practically, value errors are indicated by making explicit the value positions explicit or implicit in the various phases of the assessment event, including its consequences, and specifying any contradiction or confusion (difference) that is evident.
13. Consequential errors

Messick (1989a) and Cronbach (1988) both accept that the effects of testing have to be taken into account when assessing the validity of testing. It follows that any distortion of learning through the assessment process constitutes a source of error.

To take this view, however, is to make an extension to the meaning of validity, or of invalidity. For we have to ask, in what way does such distortion of learning detract from the appropriateness, usefulness, or meaningfulness of the inferences made from the test scores? Are the test scores less useful because they have distorted the learning process? Certainly in such a situation the testing process has been counterproductive, which is a good reason for dismantling it, if learning is a major purpose of education. However, earlier chapters have shown this to be a naive proposition. Assessment has other more important if less salubrious social purposes.

Logically, distortion of learning increases error only if we take error to include not only the differences between what the test measures and what is or might be, but also between what the test measures and what might have been. This seems to take us into a rather transmogrified realm. Even so, any distortion of learning possibilities contributes to the violation of those persons whose learning, and possibility of growth, is thus diminished. And as that very learning is part of the event that the assessment presumes to measure, then it is legitimately included as inappropriate, and thus a source of error, a (retrospective) interactive interference effect.

Consequential errors involve all those negative effects on a student's learning and a teacher's teaching that are attributable to the assessment event. (To the extent that it produces inequity among sub-groups, positive effects on learning may also be involved.)

Practically, at a simplistic level, consequential errors are indicated by the differential positive and negative effects that individual teachers and students attribute to the assessment process. At a more profound level it involves an explication of the very construction of their individuality, and all of the potentially violating consequences of those constructions. (See Chapters 4 & 5.)
Invalidity according to Messick

Messick's (1989a) treatment of Validity in Educational Measurement is an excellent review of current (theoretical) state of the art, progressive in stance, and its implications vastly surpass current practice.

In this section Messick's work is looked at from the standpoint of invalidity, in order to indicate that the sources of invalidity indicated above are indeed well-established, if somewhat opaquely discerned, in the literature on validity.

Temporal error

Here are two passages from Messick that illustrate some of the temporal problems of validity. The first relates to the lack of necessary conjunction between construct meaning on the one hand, and stability of measure on the other:

Here is the second example. Messick argues that it is not necessary to assume that

Contextual error

Contextual errors receive a lot of attention from Messick. Here is one example:

Moreover, in terms of error in individual measures he misses the point; for even with knowledge of the relationships between test measure - group - context, we still have no knowledge about the specific error in an individual score. (In group terms it could be anywhere within plus or minus three standard errors from the estimate.)

Here is another example that raises the more fundamental issue of context as boundary condition:

Construction errors

Construction problems are often dealt with in terms of content validity. Messick comments that "the heart of the notion of so-called content validity is that the test items are samples of a behavioural domain or item universe about which inferences are to be drawn" (p36). He has some problems with this, for "to achieve representativeness . . . one must specify not only the domain boundaries but also the logical psychological subdivisions or facets of the behaviour or trait domain" (p39). Furthermore, "in point of fact, items are constructed, not sampled" (p40).
And finally, Messick's crunch point:

Labelling errors

Messick is adamant that "the meaning of the measure . . . must always be pursued - not only to support test interpretation but also to justify test use" (p17).

At least some of this meaning is carried by the construct label, and "constructs are broader conceptual categories than are test behaviours, and they carry with them into score interpretation . . . the evaluative overtones of the construct labels" (p59). One such problem with the label is how broad to make it. Messick spells out the dilemma:

It would seem from Messick's own example that the label must be individualised in meaning before it can validly be applied to an individual person.

Attachment errors

The idea that assessment data gives information about an event rather than about a person is contrary to the very conception of assessment in general, and to psychometrics in particular. However, there are glimmerings of light in Messick's work that are encouraging. Here are two examples:

. . . the important validity principle embodied by this term (trait validity) might be mistakenly limited to the measurement of personal attributes when it applies as well to the measurement of object, situation and group characteristics (p15).

Frame of reference errors

Messick does not mention frame of reference errors in the form that I have developed them in this dissertation. However he does talk of the various theoretical frameworks for intelligence, including the two well-known "geographic" models of intelligence as a single dimension, or as multiple discrete abilities. He then goes on to mention a computer model, an anthropological model, a sociological model and a political model.
He then comments:

Instrument errors

Instrument errors as such don't get much attention in this work, perhaps because, as I have defined them, they are an aspect of reliability rather than of validity, and so are dealt with in a different chapter in Educational measurement (Linn, 1989a).

However, he does note that "the very fact that one set of behaviours occurs in a test situation and the other outside the test situation introduces an instrument error" (p37), indicating that he is aware of a fundamental shift in context that pervades the use of tests for assessment.

Categorisation errors

About the validity of any particular categorisation Messick is remarkably silent. A short section on decision models of cost - benefits is all that scratches the surface of the chasm of silence (p78-80). This despite the fact that in practice the meaning of the categorisation assumes more importance than the meaning of the construct; to the individual student the distinction, or the failure, is more important than whether the assessment measured what it claimed to measure.

The substantiality of the standard is a necessary prerequisite to the allocation of a measure to a category. Or, for that matter, of the conversion of a category to a measure, as in a conversion of "better or worse" to "more or less." Are standards then irrelevant to construct validity, which in Messick's model is all validity? For surely the construct meaning given to a test score is submerged in the social world, in most cases, under the weight of its categorisation as a grade. To limit the definition of validity to test scores hardly affects the issue, because surely the categorisation then becomes the first interpretation, the first utility, the first action, and hence a crucial element in the validity discourse.

Should I really have been so surprised, as I most genuinely was when I realised for the first time, as I wrote the two preceding paragraphs, what had occurred?
Was it a conscious decision on Messick's part not to include the categorisation issue in his extremely comprehensive study? Or is the erosion of the problem of the standard from professional and public memory so complete? Certain it is that though I have been very familiar with Messick's chapter for four years, and standards are my major area of interest, I had not noticed the almost complete omission of any treatment of the issue in his definitive paper on validity till now. Whatever, categorisation errors remain a major source of invalidity in assessment, and without clear evidence to the contrary, must be assumed to be very large indeed, making most categorisations of individuals invalid.

Comparability errors

Now whilst Messick is certainly aware that "a single total score usually implies a unitary construct and vice versa" (p44), he does not develop many validity implications of this until he begins to discuss test-criterion relationships. He makes the point that "criterion measures must be evaluated like all measures in terms of their construct validity" (p70). He seems to accept that most criterion measures are "multiple and complex." He points out that it does not "make such sense logically to combine several relatively independent criterion measures . . . into a single composite as if they were all measuring different aspects of the same unitary phenomenon" (p74). He goes on to state that:

Prediction errors

Messick discusses prediction errors under the general rubric of test-criterion relations and decision making (p69-88). He points out that "the major threats to criterion measurement . . . are basically the same as the threats to construct validity in general" (p73). In other words, errors are compounded in prediction errors because the errors in the test are multiplied by the errors in the criterion measure.
In addition "other biasing factors include inequality of scale units on the criterion measure, which is a continual concern when ratings serve as criteria, and distortion due to improperly combining criterion elements into a composite" (p73). He talks of "inappropriate weights . . . applied to various elements in forming composites" (p73), yet who could say what an "appropriate" weight was?

So one source of confusion is whether the criterion domain "entails a single criterion or multiple criteria" (p74). He concludes that:

Value errors

In terms of the validity of tests, Messick is adamant that "the issue is no longer whether to take values into account, but how" (p58). It follows that "because validity and values go hand in hand, the value implications of score interpretation should be explicitly addressed as part of the validation process itself" (p59).

He is also clear that "data and values are intertwined in the concept of interpretation" (p16), and furthermore, "values . . . influence in more subtle and insidious ways the meanings and implications attributed to test scores with consequences not only for individuals but for institutions and society" (p59). So it is not only obvious biases expressed in interpretations that we are dealing with here, but "more subtle" mechanisms. For example, not only are "some traits . . . open to conflicting value interpretations" (p60), (shouldn't this read "all traits"?), but "the tenability of cause-effect implications is central, even if often tacitly, to the construct validation of a variety of educational and psychological measures such as those interpreted in terms of ability, intelligence, and motivation" (p58). So if cause-effect thinking is shown to be simplistic and epistemologically bankrupt in a more ecological world-view, where does that leave such "traits"? So Messick centred his attention

Consequential errors

Messick pays considerable attention to the consequential basis of test validity (p58-63).
By this he means "the often subtle systematic effects of recurrent or regularised testing on institutional or societal functioning" (p18). He is firm that "social consequences cannot be ignored in considerations of validity" (p19). He then spells it out in more detail:

Messick's fudged solution

As briefly indicated above, Messick's chapter on Validity is a chamber of horrors, a gruelling journey through deep and varied sources of invalidity that would surely deter any rational person from ever attempting to show that any test was valid. Yet again and again he slides back into psychometrics, into "multiple choice" tests, into technological fixes, into the fudged solution.

Here is one such: "Tests," explains Messick, "are imperfect measures of constructs because they either leave out something that should be included according to the construct theory or else include something that should be left out, or both" (p34). Not so. Messick has, conveniently, left out the fourth alternative, "or neither." And surely this is the alternative most congruent with his own analysis. By doing this he has assumed the very thing that is in doubt - that the construct can, in fact, be measured at all, in the light of epistemological issues, multi-dimensionality problems, value confusions, comparability errors, and so on.

Summary

To summarise, the notion of error is circumscribed by the construction of the event being described, just as it is boundaried by the epistemological assumptions of the judgment process.

Theoretically, error in assessment contains within its ambit all those ontological inadequacies, all those epistemological slides, all those logical contradictions, all those semantic obfuscations, all those definitional fudges, all those ideological camouflages, all those value variations, as well as all those potential empirical falsifications of implicit truth and accuracy claims, that characterise the field.

Practically, the description (measurement)
of error is not dependent on any notion of a single truth, but rather on
one of differences between multiple truths, all with some claim to legitimacy;
these are implicit in the production of the assessment event, in the interpretations
of the assessed and the assessor's experience of that event, including
categorisations, and in the particular intended and received meaning of
the communication of that judgment to others. The error becomes explicit
when all of these phases of the assessment event are pluralised; when genuinely
independent events are constructed; when independent categorisations are
produced by participants in the event; when the judgments, and the meanings
given to those judgments by involved persons, are compared.
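
The practical definition above can be illustrated with a minimal sketch. It assumes that several genuinely independent categorisations of the same person are already available, and it reports the band of categories they span. The grade scale, the function name `error_band` and the sample judgments are all hypothetical, introduced only for illustration; this is one possible way of describing the spread, not a procedure taken from Messick or from the analysis in earlier chapters.

```python
from typing import Dict, List

# Hypothetical ordered grade scale, lowest to highest (an assumption
# made for illustration; any ordered set of categories would do).
CATEGORIES = ["fail", "pass", "credit", "distinction"]


def error_band(judgments: List[str],
               categories: List[str] = CATEGORIES) -> Dict[str, object]:
    """Summarise the spread of independent categorisations of one person."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    # Convert each category label to its position on the ordered scale.
    ranks = sorted(categories.index(j) for j in judgments)
    lowest, highest = ranks[0], ranks[-1]
    return {
        "lowest": categories[lowest],
        "highest": categories[highest],
        # Number of category steps separating the extreme judgments.
        "span": highest - lowest,
        # True when the judgments cover the whole range of categories.
        "covers_whole_scale": lowest == 0 and highest == len(categories) - 1,
    }


if __name__ == "__main__":
    # Invented judgments of a single trainee from different frames of reference.
    trainee = {
        "assessor A": "pass",
        "assessor B": "distinction",
        "self-assessment": "credit",
        "retest": "fail",
    }
    print(error_band(list(trainee.values())))
```

A span of zero would mean that the independent categorisations agree; the larger the span, the wider the band within which the person is being categorised.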

Conclusion

Thus whilst the theoretical aspects of validity may indeed be fully discursive as Cherryholmes (1988) argues, the practical extent of invalidity is demonstrable as an empirical reality in the material world, partly as a result of that very discursiveness. For example, the analysis presented earlier of the electrical automotive test presented irresolvable complexities in determining what empirical meaning could be given to the validity of the assessment. As the notion of validity is currently constructed, it would be resolved, if it was attended to at all, by the validity advocate giving an expert and coherent case for the defence, which would be unchallenged. That is, it would be resolved by resort to the Judge's frame of reference, and ignoring the other frames.

From the standpoint of invalidity, there is no such confusion. All of the suggested measures are useful measures, and the range of estimates that they produce for any one trainee indicates the range of error within which that person is being categorised. And we should not be surprised if at times this range covers the whole range of categories available.

As indicated in earlier chapters, the categorisation of persons has enormous effects on people, both in terms of their conceptions of themselves, and in their subsequent implicit and explicit exclusion from occupational opportunities. Such exclusion is not a discursive practice, but a very practical reality, though doubtless language is a significant factor in the acceptance of the violation. Further, the immense uncertainties associated with such categorisations are both demonstrable and measurable. I have argued that validity discourse is currently constructed in such a way as to deny this demonstration. Invalidity discourse, based on the detailing of error components as presented here, is an advocacy for the defence of the examined rather than the examiner.
As such it tends to redress the power imbalance, and hence reduce structural violence and increase social justice.