Part 5: Synthesis
 

Chapter 17: Error and the reconceptualising of validity

Preview

From the analysis so far, it is possible to produce a general definition of error as it applies to the field of educational measurement and/or categorisation. This is the flip side of validity which exposes that general nastiness called invalidity.

In this chapter the notion of invalidity is reconceptualised as having both discursive and measurable components. Thirteen (overlapping) sources of error are examined, all contributing to the essential invalidity of categorisations of persons. For easy reference I have indicated the summary theoretical and practical definitions of these error sources in bold print.
 

Definition of error

Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field.

All assessment statements about a person are statements about that person engaged in an event, or a potential event. They are descriptions or indicators or inferences about the person's performance in that event. As such they involve at the very least an event in which the person being assessed is an element, and an event in which the assessor engages directly in the first event, or with a product (element) of it.

Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.

I want to list some of those sources of error, some of the conditions that change the measurement of a standard from a thin red line into a broad blue band. In doing so I will reject the notion of construct validity as a unitary concept, and dismember its dark side into disparate if sometimes overlapping categories.
 

Sources of error

I have named these sources of error:
 
    1. Temporal errors

    2. Contextual errors

    3. Construction errors

    4. Labelling errors

    5. Attachment errors

    6. Frame of reference errors

    7. Instrument errors

    8. Categorisation errors

    9. Comparability errors

    10. Prediction errors

    11. Logical type errors

    12. Value errors

    13. Consequential errors

 

1. Temporal errors

We would hope our description of performance would have some substance; would be a stable quantity, invariant over time and space, rather than some ephemeral numerical butterfly attaching itself momentarily to the person assessed. If the person's performance is described differently if done at another time, in another place, with another group of people, then such difference as there is represents a source of error.

Or is it? Should we rather discount stability as being counterproductive in an educational situation? If stability is seen as the very antithesis of the educational enterprise, which we could define as being dedicated to change, then we would not wish any description to remain stable, as this would represent a nullification of the educational process.

Contrarily, if we wish to maintain stability as a criterion for assessment accuracy, we must be certain that all learning pertaining to the performance ceases at the time of assessment, and that none occurs during the assessment process; as well as all forgetting, for that matter. Otherwise the error of the description increases rapidly, as the permanency of the description becomes increasingly dismembered by the ravages of time.

Regardless of which side of the fence we want to sit, or whether we want to sit on the fence, pretend it isn't there, and attribute the concomitant pain to other variables, stability must logically remain as a pertinent, or in conventional circles an impertinent criterion, to be considered in any estimate of error in assessment. My conclusion is that the logic of its contradictions makes most of the academic and psychometric definitions of reliability trivial.

So temporal errors have their genesis in changes that occur over time; persons change over time; tests change over time; the "same" event has different meanings over time. People are not computers, they react differently at different times; and they forget. So temporal errors increase over time. (Not to mention that different people make different meanings out of the same event; which makes it, of course, a different event.)

Temporal errors thus include all those confusions that constitute the dark side of stability, one aspect of reliability.

Practically, temporal errors are indicated by the differences in assessment description when the assessment occurs at different times.
 

2. Contextual errors

Contextual errors constitute the underside of claims to generality and generalisability.

Any performance is relatively specific and defined: it is a single instance of possible instances; it is an event chosen from a multitude of possible events; it is a particular designed to illustrate a generality. Yet the performance will invariably be described (labelled) in terms of the generality it aspires to, rather than the specifics that define it. This is true of almost any evaluation, any test that goes beyond the description of a single behavioural objective, and even that, one step back, will often be found to be illustrative of a class of objectives, rather than of particular significance in its own right.

In the old days (good or bad depending on our values), this would constitute an example of "transfer of training." The claim was that if you could think clearly in Latin, then this should transfer to dealing adequately with the complexities of life in the social world; or if you could think logically in mathematics, then you could do so in international affairs; not to mention playing Rugby being a necessary prerequisite to running an Empire. When empirical data showed that such transfer was tenuous, the notion was kept, but the name changed. Taxonomic terms such as application and analysis, or the more up-market process called problem solving, have latterly laid claim to this temporarily non-habitable area. As well, the notion of a "skill" has latterly become fashionable, and generalisable social, cognitive, emotional, spiritual, and psychomotor skills proliferate, securely untrammelled by prophylactic empirical data of any kind.

As soon as assessment descriptions are committed to paper, their material permanency is dramatically increased. Likewise, the span of their associations is spread and emphasised. No longer just a description of a particular performance, the assessment becomes interpreted as a measure of knowledge and ability, an indicator of achievement on a course of study, and a predictor of future success or failure.

One source of error then is the magic transformation that occurs between numbers and categorisations, between specific acts and generalised descriptions. Unless the assessment statement purports to be no more than a statement about a particular assessment event, then the differences between this statement, and those obtained from all other possible contexts, are error; these are the generality differences attributable to other equally relevant contexts, e.g. written, oral, cooperative, on-the-job; all those boundaries that possibly could contain the assessment event that are different to the boundaries of the particular assessment event. Context also includes those power relations that pervade it and the judgment processes embedded in it that affect the performance of the person assessed, and the judgment of the person assessing; and this includes those that the boundary localises, as well as those that invade its permeable surface.

Contextual errors contain all the ambiguities inherent in those relations and elements and discourses that impinge on the event, but get excluded from the label.

Practically, contextual errors include all those differences in performance and its assessment that occur when the context of the assessment event changes.
 

3. Construction errors

The performance that is described in an assessment is generally built up of a number of parts; a science test is built up from a number of questions; an electrical automotive practical test requires the identification and repair of a selection of common electrical faults; a social skills assessment requires gradings on a number of interactional criteria, or more likely a game constructed about such criteria in multiple choice form. Such constructions are designed to represent the course of study, or the skill requirements, or the criterion referenced framework, that the assessment is supposed to describe. Further back still, the course has itself been constructed to improve performance in some areas of living, in some role as citizen, home maker, academic, engineer, baker, or whatever.

Somewhere, sometime, someone must make a choice about how far back along the chain of constructions we go in order to estimate the error, the difference between the "perfect" description of performance and the actual one that our assessment produces.

Let's take the electrical automotive test as an example. We could begin with a requirement to describe how well a student could identify and repair any electrical fault on any car brought into any garage (A). From this we construct a thirty hour course of study called Automotive Electrical Mechanics 2M, complete with course aims and objectives and assessment criteria (B). From this we construct a one hour pencil and paper test (C) and a two hour practical assessment (D).

Now how are we to describe the construction error in assessing a particular person? Is it the difference between the descriptions given in C and D? Or the difference between the matches of B and C on the one hand, and B and D on the other? Or should we look at the matching between C and D and A? Or is it all of these?

Why don't you describe A directly?

You mean put people who've done the course into a garage and see how they perform?

Yeah. Why don't you do that?

It would be very expensive to do it for everyone.

You don't have to do it for everyone. Just for enough people so you'd know if there was an error.

There's always going to be an error.

OK, so you find there's an error. If it was a small one then you could assume that the course, or the test at the end of the course, was well constructed because it did what it was supposed to do.

That would be nice.

And if there was a big discrepancy then you'd have to do it different.

Do what different?

I dunno. You're supposed to be the expert. Do the end-test different. Or do the course different. Or it might be easier to find another garage.

Bit dangerous. There could be a lot of people get upset if we did that. No telling what sort of litigation we might run into if we found that the course didn't do what we said it would do.

So ignorance is bliss, huh?

Certainly not. We just need to be very careful, in terms of spending time and money on obtaining information that at the best will be useless, and at worst will only erode confidence and create instability.

Like I said. Mum's the word!

So error is immanent not only in the selection that determines the content and process of the assessment event, but also in the choice about what aspects will be elucidated in the assessment description.

Construction errors contain all of those errors in sampling, all the idiosyncrasies and biases that are contained in the construction of a specific test or set of demands that constitutes one element of the assessment event: these include not only the construction of the test content, of its elements, but also the construction of its form and style. Construction errors include all those generality errors attributable to the performance task itself, rather than to its timing or its context.

Practically, construction errors are indicated by all those differences in assessment description when the same construct is assessed independently by different people in different ways.
 

4. Labelling errors

Assessing is about describing some human performance. To give it a meaning the "some" must be specified: performance in typing; skill in mathematical problem solving; a dramatic presentation. So regardless of frame, it is necessary to specify in some way what it is that is being described. We must label the area of performance in some way, for otherwise it cannot be communicated.

The meaning of a communication is its reception, not its intention. In assessment the label is the message which is intended to describe a particular area of performance - involving particular knowledge, understandings, skills, processes, or whatever. The label has a particular meaning for the assessor growing out of this intention. Different meanings before the event will result in different assessment events being constructed to fit the label.

What meaning the assessed, or any other person who has access to the label, gives to it, is moot. But of one thing we may be certain. The meaning will not be identical to the meaning intended. The difference may be slight, or immense, but regardless of the magnitude will represent, at a fundamental level, an error (Korzybski 1933). Different meanings after the event will result in different interpretations of the assessment label, different inferences about what it implies.

An assessment must be an indicator of something. It must have a name. Differences in the meaning of the name, both before and after the event, constitute confusion and hence error. Labelling errors are defined by all the differences given to the meaning of the assessment (what it actually measures) by all the participants in the assessment event(s), and by the users of the assessment information.

Practically, labelling errors are indicated by the range of meanings given to the label by all those who use it before, during or after the assessment event.
 

5. Attachment errors

There is a further issue in regard to labelling. Once the label has been marked in some way, once the description is attached to it, where is it pinned? Does it belong to the person assessed? Is it more a description of the assessor? Does it represent some quantity or quality that might more appropriately hover somewhere in the space between, a relational field vector describing a complex interactional phenomenon involving task, performance, assessor and assessed?

Given my ontological stance that all information is information about events, it follows that any attempt to attribute such information to a particular element of the event involves a fundamental epistemological error. To the extent that all other elements and conditions are held constant and overtly included in the description, to that extent is the simplification of language involved in the specific attribution partially justified; but such specificity of the conditions of the event tends at the same time to increase contextual error.

Attachment errors are the ontological slides that occur when a description of a relational event is attached to one of the elements of that event; specifically, when a complex relational event involving the construction of a test, an interaction of the test with a person, and a judgment of an assessor, is described as a property of the assessed person, this is an error in attachment.

Practically, attachment errors are indicated by the specification of those elements and boundaries of the assessment event that have become lost in the assessment description.
 

6. Frame of reference errors

Within the assessment arena are four competing definitions of the true, the correct, the impeccable. It follows that there are four associated notions of error. To the extent that the definitions of assessment truth, or more specifically the assumptions underlying them, are contradictory, so will be our methods for reducing error in the different frames; further, to the extent that the frames are confused, to that extent is error compounded. (See Chapter 13).

Frame of reference errors are defined by all those confusions and category differences that occur because of the different stable assumptions of the four frames of reference for assessment, as well as those contradictions and confusions that occur when shifts occur between frames during the assessment process.

Practically, frame of reference errors are indicated by specifying the frame in which the assessment is supposedly based, and indicating any slides or confusions that occur during the assessment events.
 

7. Instrument errors

Any measurement requires a measuring instrument. So any rank ordering, grading or scoring involves some measuring instrument; at the very least, such an instrument must attend to questions of calibration, which involves scale, replicability, and theory-practice bridging. Any claims to measurement must relate to some defined Standard scale. Whether the instrument is a test of some sort, or is assumed to have some material reality inside the mind of an examiner, all measuring instruments contain errors in mechanisms and hence in their readings. (See Chapter 9)

When psychometric theories are used, instrument errors are fed by all of the discrepancies between the theory and the empirical data, and are intrinsic in all of the notions of probability that pervade such theories.

Instrument errors then contain all those uncertainties of calibration, all those anomalies of replicability, all those confusions and discrepancies and mis-matches in theory-practice bridging, that are involved in the determination of the rank order, in the making of the mark, in the determination of the measure.

Practically, many aspects of instrument error are covered by other category errors. To avoid unnecessary overlap, I will limit the practical indicator of instrumental error to those errors implicit in the construction of the measuring instrument itself; what is conventionally called standard error of the estimate.
 

8. Categorisation errors

Any categorisation involves a comparison between a standard of acceptability, and a particular measurement or judgment about adequacy or quality.

Categorisation errors derive from confusions about the definition of standard of acceptability, from differences in the meaning of what is being assessed and in the magnitude of its measurement, and in the variability of the judgment process in which the comparison with the standard is made. (See Chapter 11)

Practically, categorisation errors are all those differences in assessment description that occur when particular data is compared with a particular standard to produce a categorisation of the assessed person.
 

9. Comparability errors

Comparability errors occur whenever assessment scores are added to produce a total score. Public examinations and grade point averages are examples of such summations, as are any qualitative assessments involving more than one criterion. What such additions mean, and who is privileged by such additions, are questions inherent in the process.

Comparability errors include all those confusions about meaning and privileging that inhabit the addition of test scores, grades or criteria related statements.

Practically, comparability errors are indicated by constructing different aggregates according to the competing models. The differences that these produce indicate the comparability error.
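A minimal sketch may make this indicator concrete. It assumes three hypothetical students with marks on two subjects, and two competing aggregation models that are my own illustrative choices: a plain sum of raw marks, and a sum of marks standardised within each subject. None of the names, marks or models is drawn from any actual examination system; the point is only that the rank order shifts with the model chosen.

```python
from statistics import mean, pstdev

# Hypothetical marks as (subject A, subject B). Subject A has a wide
# spread of marks, subject B a narrow one.
marks = {
    "Ann": (90, 50),
    "Bob": (60, 58),
    "Cat": (30, 63),
}

def raw_totals(marks):
    """Model 1 (assumed): add the raw marks."""
    return {name: sum(scores) for name, scores in marks.items()}

def standardised_totals(marks):
    """Model 2 (assumed): convert each subject to z-scores, then add."""
    columns = list(zip(*marks.values()))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return {
        name: sum((x - m) / s for x, (m, s) in zip(scores, stats))
        for name, scores in marks.items()
    }

def rank_order(totals):
    """Names sorted from highest to lowest aggregate."""
    return sorted(totals, key=totals.get, reverse=True)

raw_ranking = rank_order(raw_totals(marks))
std_ranking = rank_order(standardised_totals(marks))

# Students whose position depends on the aggregation model chosen.
displaced = [n for n in marks if raw_ranking.index(n) != std_ranking.index(n)]
print(raw_ranking, std_ranking, displaced)
```

With these invented marks the raw sum places Ann first, while the standardised sum places Bob first. Which aggregate is "correct" is precisely the question of meaning and privileging raised above.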
 

10. Prediction errors

Implicit in most assessment, and explicit in some, is the notion of prediction. Whilst the idea of generality contains some element of logic in its derivation, prediction can be pure magic - correlation without connection is very possible, and is not predicated on causal relationship. It has been reported that the number of storks sighted over London is correlated with the number of births in that city, and thus may be used as a predictor. The causal relation here is moot.

More seriously, many assessment descriptions are overtly or covertly connected to expectations about future performance. High school grades are presumed to be related to success at College or University. School performance is expected to relate to job success. Trade courses are designed to improve quality of performance in the workplace. So assessments on those courses might be expected to correlate with later performance. Yet even if they do, this in no way proves there is any causal link.

The criterion measures themselves are often problematic; most practical criterion measures themselves involve an assessment, subject to all of the sources of invalidity and error that dogged the original assessment. High predictive correlations may occur because both assessments are measuring something other than what they are described as measuring; for example, the ability to perform in competitive, written events, independent of the content. And low predictive correlations may mask genuine positive relationships because of all the errors entailed in the assessments, though such "genuine relationships" must forever be hidden, relegated to fantasy because divorced from empirical sustenance. Alternatively low correlations may mask the reality of relative homogeneity of performance status, or of genuine multi-dimensionality of that performance.

So interpreting the meaning of high correlations can be quite tricky. For example, if the rank order of students on a university entry examination in Physics correlates 0.9 with their first year Physics results at University this could be interpreted as an enormously successful outcome in terms of educational prediction. It is also completely consistent with the implication that no new Physics has been learnt, or that the University course has been completely unsuccessful in compensating for initial inequities in knowledge and opportunity.
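The ambiguity can be sketched numerically. The following assumes five hypothetical entry scores and three invented first-year outcomes: no learning at all, a uniform gain for everyone, and a widening of the initial gaps. The Pearson coefficient is unchanged by any positive linear rescaling of either variable, so it cannot by itself distinguish between these outcomes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical entry examination scores (illustrative only).
entry = [40, 55, 60, 72, 85]

# Three invented first-year outcomes.
no_learning = list(entry)                       # nobody learns anything
uniform_gain = [s + 10 for s in entry]          # everybody gains 10 marks
widened_gaps = [1.5 * s - 20 for s in entry]    # initial inequities grow

r_none = pearson(entry, no_learning)
r_gain = pearson(entry, uniform_gain)
r_wide = pearson(entry, widened_gaps)
print(r_none, r_gain, r_wide)
```

All three correlations come out the same (to within floating point rounding), although the three outcomes carry very different educational meanings.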

What becomes apparent is that this area of prediction, which on the surface seems very amenable to empirical verification, is fraught with errors of interpretation which are neither measurable nor resolvable. Positioning and power relations will largely determine the trend of the discourse, and whether such discourse becomes a verification of validity, or an explication of error.

Explicit or implicit in most assessments is the claim that they relate to some future performance, that they predict a particular product from some future event, a quality of some future action. Prediction error is the extent to which these predictions, and the subsequent events, are not identical.

Practically, prediction error is indicated by the differences between what is predicted by the assessment data, and what is later assessed as the case in the predicted event.
 

11. Logical type errors

Test scores are often interpreted as giving specific information about what a student can or cannot do. Yet a score of 90 per cent on a spelling test, for example, gives no information about whether any individual item on the test was actually spelt correctly by a particular student. Any assumption to the contrary is a logical type error. Similarly, a score of 80 per cent on a mastery test gives no information about what information or skill has been mastered. Common inferences made from test scores are riddled with such logical type errors.

Logical type errors occur whenever there is confusion between statements about a class of events, and statements about individual items of that class.
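As a small illustration, assume a hypothetical ten-item spelling test and two invented students. Both obtain 90 per cent, yet the score, being a statement about the class of items, cannot say which item either of them misspelt.

```python
from math import comb

def percent_score(responses):
    """Percentage of items answered correctly."""
    return 100 * sum(responses) / len(responses)

# Hypothetical item-by-item results (True = spelt correctly).
ann = [True] * 10
ann[2] = False   # Ann misspells item 3
bob = [True] * 10
bob[6] = False   # Bob misspells item 7

same_score = percent_score(ann) == percent_score(bob)
same_items = ann == bob

# Number of distinct response patterns that yield 9 correct out of 10.
patterns = comb(10, 9)
print(percent_score(ann), same_score, same_items, patterns)
```

The same total is compatible with ten different response patterns on this invented test; to read any one item off the total is to confuse the class with its members.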

Practically, logical type errors are indicated by examining the explicit and implicit truth claims of a particular assessment and exposing any confusions of logical type among them. Such exposure may invalidate such claims.
 

12. Value errors

All tests and examinations involve the construction of questions and the interpretation and valuation of answers. As such they are explicit and implicit statements about value; these particular questions, and these favoured answers, are implicit statements about what knowledge, actions, processes and interpretations are valued. And by implication, which are not so valued. Such implications move well beyond content; style and form and medium are of equal or more importance.

To the extent that the values implicit in the assessment event are not explicit, are contested, or are contradictory, to that extent is the assessment event invalid with respect to value. To the extent that the assessment event(s) and the event about which inferences are made are incongruent in terms of their value assumptions and emphases, to that extent is the error component engorged.

Practically, value errors are indicated by making explicit the value positions explicit or implicit in the various phases of the assessment event, including its consequences, and specifying any contradiction or confusion (difference) that is evident.
 

13. Consequential errors

Messick (1989a) and Cronbach (1988) both accept that the effects of testing have to be taken into account when assessing the validity of testing. It follows that any distortion of learning through the assessment process constitutes a source of error.

To take this view, however, is to make an extension to the meaning of validity, or of invalidity. For we have to ask, in what way does such distortion of learning detract from the appropriateness, usefulness, or meaningfulness of the inferences made from the test scores? Are the test scores less useful because they have distorted the learning process? Certainly in such a situation the testing process has been counterproductive, which is a good reason for dismantling it, if learning is a major purpose of education. However, earlier chapters have shown this to be a naive proposition. Assessment has other more important if less salubrious social purposes.

Logically, distortion of learning increases error only if we take error to include not only the differences between what the test measures and what is or might be, but also between what the test measures and what might have been. This seems to take us into a rather transmogrified realm. Even so, any distortion of learning possibilities contributes to the violation of those persons whose learning, and possibility of growth, is thus diminished. And as that very learning is part of the event that the assessment presumes to measure, then it is legitimately included as inappropriate, and thus a source of error, a (retrospective) interactive interference effect.

Consequential errors involve all those negative effects on a student's learning and a teacher's teaching that are attributable to the assessment event. (To the extent that it produces inequity among sub-groups, positive effects on learning may also be involved).

Practically, at a simplistic level, consequential errors are indicated by the differential positive and negative effects that individual teachers and students attribute to the assessment process. At a more profound level it involves an explication of the very construction of their individuality, and all of the potentially violating consequences of those constructions. (See Chapters 4 & 5)
 

Invalidity according to Messick

Messick's (1989a) treatment of Validity in Educational Measurement is an excellent review of the current (theoretical) state of the art, progressive in stance, and its implications vastly surpass current practice.

In this section Messick's work is looked at from the standpoint of invalidity, in order to indicate that the sources of invalidity indicated above are indeed well-established, if somewhat opaquely discerned, in the literature on validity.

    Temporal error

Here are two passages from Messick that illustrate some of the temporal problems of validity. The first relates to the lack of necessary conjunction between construct meaning on the one hand, and stability of measure on the other:

In regard to temporal generalizability, two aspects need to be distinguished: one for cross-sectional comparability of construct meaning across historical periods . . . and the other for longitudinal continuity in construct meaning across age or developmental level. It should be noted that individual differences in test scores can correlate highly from one time to another (stability) whether the measure reflects the same construct on both occasions (continuity) or not. Similarly, scores can correlate negligibly from one time to another (instability), again regardless of whether the measure reflects the same or a different construct (discontinuity) on the two occasions (p57).

So even if the measure remains the same at different times, it may mean different things. And if the measure is different at different times, it may mean the same!

Here is the second example. Messick argues that it is not necessary to assume that

the more generalizable a measure is, the more valid. This is not generally the case, however, as in the measurement of such constructs as mood, which fluctuates over time; or concrete operational thought, which typifies a typical developmental stage (p57).

From the standpoint of invalidity, that a test is invalid unless proved otherwise, how could the measurement of such an ephemeral quality ever be validated?

    Contextual error

Contextual errors receive a lot of attention from Messick. Here is one example:

Tests do not have reliabilities and validities, only test responses do. . . . test responses are a function not only of items, tasks, or stimulus conditions but of the persons responding and the context of measurement. This context includes factors in the environmental background as well as the assessment setting. . . . Thus, the extent to which a measure displays the same properties and patterns of relationships in different population groups and under different ecological conditions becomes a pervasive and perennial empirical question (p14-15).

This certainly captures the idea that the data belongs to a complex event, even though Messick does not follow through to the logical conclusion that the test score data cannot then be detached from the event and attached to an individual.

Moreover, in terms of error in individual measures he misses the point; for even with knowledge of the relationships between test measure - group - context, we still have no knowledge about the specific error in an individual score. (In group terms it could be anywhere within plus or minus three standard errors from the estimate).
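The width of such a band can be sketched under stated assumptions. The formula used here is the classical test theory standard error of measurement, SD multiplied by the square root of (1 - reliability), which is my assumption rather than anything Messick specifies; the standard deviation, reliability and observed score are invented for illustration.

```python
from math import sqrt

def standard_error(sd, reliability):
    """Classical test theory standard error of measurement (assumed model)."""
    return sd * sqrt(1.0 - reliability)

def error_band(score, sd, reliability, k=3):
    """Interval of plus or minus k standard errors around an observed score."""
    se = standard_error(sd, reliability)
    return (score - k * se, score + k * se)

# Hypothetical test: standard deviation 15, reliability 0.91.
se = standard_error(15, 0.91)
low, high = error_band(100, 15, 0.91)
print(round(se, 2), round(low, 1), round(high, 1))
```

Even with a reliability that would conventionally be called high, the band around this invented score of 100 runs from roughly 86.5 to 113.5, and nothing in the group data says where within it a particular individual falls.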

Here is another example that raises the more fundamental issue of context as boundary condition:

studies of the transportability of measures and findings from one context to another should focus on identifying all of the boundary variables that are a source of critical differences between the two contexts, as well as gauging the potency and direction of the effects of these boundary variables on events in the two conditions (p58).

Indeed, for science is nothing if it cannot adequately define the boundary conditions within which the limited experimental events that define its world can be controlled. So the assessment is invalid unless all the boundary conditions (that cause unexplained variance) can be specified. And, of course, they never can be.

    Construction errors

Construction problems are often dealt with in terms of content validity. Messick comments that "the heart of the notion of so-called content validity is that the test items are samples of a behavioural domain or item universe about which inferences are to be drawn" (p36). He has some problems with this, for "to achieve representativeness . . . one must specify not only the domain boundaries but also the logical psychological subdivisions or facets of the behaviour or trait domain" (p39). Furthermore, "in point of fact, items are constructed, not sampled" (p40). And finally, Messick's crunch point:

knowing that the test is an item sample from a circumscribed item universe merely tells us, at the most, that the test measures whatever the universe measures, and we have no evidence about what that might be, other than a rule for generating items of a particular type (p40).

So even the apparently simple task of getting some test questions together is fraught with difficulties, again justifying an invalidity label until compelling evidence is presented that these problems have been solved.

    Labelling errors

Messick is adamant that "the meaning of the measure . . . must always be pursued - not only to support test interpretation but also to justify test use" (p17).

At least some of this meaning is carried by the construct label, and "constructs are broader conceptual categories than are test behaviours, and they carry with them into score interpretation . . . the evaluative overtones of the construct labels" (p59).

One such problem with the label is how broad to make it. Messick spells out the dilemma:

In choosing the appropriate breadth or level of generality for a construct and its label, one is buffeted by opposing counterpressures toward oversimplification on the one hand and overgeneralization on the other. . . . choices on this side (of oversimplification) sacrifice interpretative power and range of applicability as the construct might be defensibly viewed more broadly. At the other extreme is the apparent richness of high-level inferential labels such as intelligence, creativity, or introversion. Choices on this side suffer from the mischievous value consequences of untrammelled surplus meaning (p60).

Another problem with a label that applies to everybody is that different people do things in different ways:

In numerous applications of these various techniques for studying process, it became clear that different individuals performed the same task in different ways and that even the same individual might perform in a different manner across items or on different occasions. . . that is, individuals differ consistently in their strategies and styles of task performance. . . this has consequences for the nature and sequence of processes involved in item responses and, hence, for the constructs implicated in test scores. . . test scores may mean different things for different people. . . for different individuals as a function of personal styles and intentions. . . Indeed, . . . a test's construct interpretation might need to vary from one type of person to another (p54-5).

So why not from one person to another? In this regard note that validity has always been a group concept. Human rights, with its associated absence of violence, is a term that applies to individuals and not to groups; to claim that 95 per cent of a population is not subjected to human rights violations such as torture, incarceration and extermination is hardly a claim for a good human rights record. Why is assessment any different?

It would seem from Messick's own example that the label must be individualised in meaning before it can validly be applied to an individual person.

    Attachment errors

The idea that assessment data gives information about an event rather than about a person is contrary to the very conception of assessment in general, and to psychometrics in particular. However, there are glimmerings of light in Messick's work that are encouraging. Here are two examples:

The possibility of context effects makes it clear that what is to be validated is an interpretation of data arising from a specified procedure (p15).

. . . the important validity principle embodied by this term (trait validity) might be mistakenly limited to the measurement of personal attributes when it applies as well to the measurement of object, situation and group characteristics (p15).

In the first quote the data is seen to be related to a procedure, that is, an event involving relationships; in the second the validity, if not the data, is seen clearly not to be limited to the personal.

    Frame of reference errors

Messick does not mention frame of reference errors in the form that I have developed them in this dissertation. However, he does talk of the various theoretical frameworks for intelligence, including the two well-known "geographic" models of intelligence as a single dimension, or as multiple discrete abilities. He then goes on to mention a computer model, an anthropological model, a sociological model and a political model, and comments:

If two intelligence theories sharing a common metaphorical perspective - such as uni-dimensional and multi-dimensional conceptions within the so-called geographical model - can engender the different world phenomenon of investigators talking past one another, as we have seen, just imagine the potential babble when more disparate models are juxtaposed (p61).

A close inspection of the literature on assessment obviates the necessity to imagine, for fact is indeed stranger than fiction, and indicates how massive the invalidity from this source is.

    Instrument errors

Instrument errors as such don't get much attention in this work, perhaps because, as I have defined them, they are an aspect of reliability rather than of validity, and so are dealt with in a different chapter in Educational measurement (Linn, 1989a).

However, he does note that "the very fact that one set of behaviours occurs in a test situation and the other outside the test situation introduces an instrument error" (p37), indicating that he is aware of a fundamental shift in context that pervades the use of tests for assessment.

    Categorisation errors

About the validity of any particular categorisation Messick is remarkably silent. A short section on decision models of cost - benefits is all that scratches the surface of the chasm of silence (p78-80). This despite the fact that in practice the meaning of the categorisation assumes more importance than the meaning of the construct; to the individual student the distinction, or the failure, is more important than whether the assessment measured what it claimed to measure.

The substantiality of the standard is a necessary prerequisite to the allocation of a measure to a category. Or, for that matter, of the conversion of a category to a measure, as in a conversion of "better or worse" to "more or less." Are standards then irrelevant to construct validity, which in Messick's model is all validity? For surely the construct meaning given to a test score is submerged in the social world, in most cases, under the weight of its categorisation as a grade. To limit the definition of validity to test scores hardly affects the issue, because surely the categorisation then becomes the first interpretation, the first utility, the first action, and hence a crucial element in the validity discourse.

Should I really have been so surprised, as I most genuinely was when I realised for the first time, as I wrote the two preceding paragraphs, what had occurred? Was it a conscious decision on Messick's part not to include the categorisation issue in his extremely comprehensive study? Or is the erosion of the problem of the standard from professional and public memory so complete? Certain it is that though I have been very familiar with Messick's chapter for four years, and standards are my major area of interest, I had not noticed the almost complete omission of any treatment of the issue in his definitive paper on validity till now.

Whatever, categorisation errors remain a major source of invalidity in assessment, and without clear evidence to the contrary, must be assumed to be very large indeed, making most categorisations of individuals invalid.

    Comparability errors

Now whilst Messick is certainly aware that "a single total score usually implies a unitary construct and vice versa" (p44), he does not develop many validity implications of this until he begins to discuss test-criterion relationships. He makes the point that "criterion measures must be evaluated like all measures in terms of their construct validity" (p70). He seems to accept that most criterion measures are "multiple and complex." He points out that it does not "make much sense logically to combine several relatively independent criterion measures . . . into a single composite as if they were all measuring different aspects of the same unitary phenomenon" (p74). He goes on to state that:

On the contrary, the empirical multidimensionality of criterion measures indicates that success is not unitary for different persons on the same job or in the same educational program or, indeed, for the same person in different aspects of a job or program. Furthermore, because two persons might achieve the same overall performance levels by different strategies or behavioural routes, it would seem logical to evaluate both treatments and individual differences in terms of multiple measures (p74-5).

Easy to say, of course, but much harder to do. Because this leads inevitably to the use of "judgmental weights that reflect the goals or values of the decision maker" (p75), which leads directly into all the confusions and errors dealt with in the chapter on comparability.

    Prediction errors

Messick discusses prediction errors under the general rubric of test-criterion relations and decision making (p69-88). He points out that "the major threats to criterion measurement . . . are basically the same as the threats to construct validity in general" (p73). In other words, errors are compounded in prediction errors because the errors in the test are multiplied by the errors in the criterion measure. In addition "other biasing factors include inequality of scale units on the criterion measure, which is a continual concern when ratings serve as criteria, and distortion due to improperly combining criterion elements into a composite" (p73). He talks of "inappropriate weights . . . applied to various elements in forming composites" (p73), yet who could say what an "appropriate" weight was?

So one source of confusion is whether the criterion domain "entails a single criterion or multiple criteria" (p74). He concludes that:

use of measures of multiple criterion dimensions or components affords a workable approach to composite criterion prediction . . . by combining correlations between tests and separate criterion dimensions using judgmental weights that reflect the goals or values of the decision maker (p75).

Maybe, but this takes us into further sources of confusion related to differing values, differing goals, of different decision makers, and a concomitant further proliferation of error.

    Value errors

In terms of the validity of tests, Messick is adamant that "the issue is no longer whether to take values into account, but how" (p58). It follows that "because validity and values go hand in hand, the value implications of score interpretation should be explicitly addressed as part of the validation process itself" (p59).

He is also clear that "data and values are intertwined in the concept of interpretation" (p16), and furthermore, "values . . . influence in more subtle and insidious ways the meanings and implications attributed to test scores with consequences not only for individuals but for institutions and society" (p59). So it is not only obvious biases expressed in interpretations that we are dealing with here, but "more subtle" mechanisms.

For example, not only are "some traits . . . open to conflicting value interpretations" (p60) (shouldn't this read "all traits"?), but "the tenability of cause-effect implications is central, even if often tacitly, to the construct validation of a variety of educational and psychological measures such as those interpreted in terms of ability, intelligence, and motivation" (p58). So if cause-effect thinking is shown to be simplistic and epistemologically bankrupt in a more ecological world-view, where does that leave such "traits"?

So Messick centred his attention

on the value implications of test names, construct labels, theories and ideologies, as well as on the need to take responsibility for these value implications in test interpretations. That is, the value implications, no less than the substantive or trait implications, of score-based inferences need to be supported empirically and justified rationally (p63).

Here Messick makes a brilliant case for the fundamental invalidity of all test data on the basis of value confusion and the consequent inability to interpret test measures meaningfully.

    Consequential errors

Messick pays considerable attention to the consequential basis of test validity (p58-63). By this he means "the often subtle systematic effects of recurrent or regularised testing on institutional or societal functioning" (p18). He is firm that "social consequences cannot be ignored in considerations of validity" (p19). He then spells it out in more detail:

The consequential basis of test interpretation is the appraisal of the value implications of the construct label, of the theory underlying test interpretation, and of the ideologies in which the theory is embedded. A central issue is whether or not the theoretical implications and the value implications of the test interpretation are commensurate (p20).

This may well be a central issue, but surely not the central issue. They may be commensurate and yet be utterly inequitable to groups or to individuals. Messick himself acknowledges this later when discussing cost-benefit decision making:

This concern with minimizing overpredictions, or the proportion of accepted individuals who prove unsatisfactory, is consistent with the traditional institutional values of efficiency in educational and personnel selection. But concern with minimizing underpredictions, or the proportion of rejected individuals who would succeed if given the opportunity is also an important social value in connection both with individual equity and with parity for minority and disadvantaged groups (p80).

Exactly, and Messick is equally precise when on the next page he concludes that "in practice, however, such balancing of needs and values comes down to a political resolution" (p81). That is, a solution based on power relations, which are inevitably asymmetrical. So if we are to be clear about invalidity errors of a consequential nature, we had best be mindful of the mechanisms through which such power relations are distributed and applied.

Messick's fudged solution

As briefly indicated above, Messick's chapter on Validity is a chamber of horrors, a gruelling journey through deep and varied sources of invalidity that would surely deter any rational person from ever attempting to show that any test was valid. Yet again and again he slides back into psychometrics, into "multiple choice" tests, into technological fixes, into the fudged solution.

Here is one such: "Tests," explains Messick, "are imperfect measures of constructs because they either leave out something that should be included according to the construct theory or else include something that should be left out, or both" (p34).

Not so. Messick has, conveniently, left out the fourth alternative, "or neither." And surely this is the alternative most congruent with his own analysis. By doing this he has assumed the very thing that is in doubt - that the construct can, in fact, be measured at all, in the light of epistemological issues, multi-dimensionality problems, value confusions, comparability errors, and so on.

Summary

To summarise, the notion of error is circumscribed by the construction of the event being described, just as it is boundaried by the epistemological assumptions of the judgment process.

Theoretically, error in assessment contains within its ambit all those ontological inadequacies, all those epistemological slides, all those logical contradictions, all those semantic obfuscations, all those definitional fudges, all those ideological camouflages, all those value variations, as well as all those potential empirical falsifications of implicit truth and accuracy claims, that characterise the field.

Practically, the description (measurement) of error is not dependent on any notion of a single truth, but rather on one of differences between multiple truths, all with some claim to legitimacy; these are implicit in the production of the assessment event, in the interpretations of the assessed and the assessor's experience of that event, including categorisations, and in the particular intended and received meaning of the communication of that judgment to others. The error becomes explicit when all of these phases of the assessment event are pluralised; when genuinely independent events are constructed; when independent categorisations are produced by participants in the event; when the judgments, and the meanings given to those judgments by involved persons, are compared.
 

Conclusion

Thus whilst the theoretical aspects of validity may indeed be fully discursive as Cherryholmes (1988) argues, the practical extent of invalidity is demonstrable as an empirical reality in the material world, partly as a result of that very discursiveness. For example, the analysis presented earlier of the electrical automotive test presented irresolvable complexities in determining what empirical meaning could be given to the validity of the assessment. As the notion of validity is currently constructed, it would be resolved, if it were attended to at all, by the validity advocate giving an expert and coherent case for the defence, which would be unchallenged. That is, it would be resolved by resort to the Judge's frame of reference, and by ignoring the other frames.

From the standpoint of invalidity, there is no such confusion. All of the suggested measures are useful measures, and the range of estimates that they produce for any one trainee indicates the range of error within which that person is being categorised. And we should not be surprised if at times this range covers the whole range of categories available.
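The point can be illustrated with a minimal sketch. Everything in it is hypothetical: the cut scores, the category labels and the four estimates for one trainee are invented for illustration and are not the figures from the electrical automotive test analysed earlier.

```python
from bisect import bisect_right

# Hypothetical cut scores and category labels, for illustration only.
CUT_SCORES = [50, 65, 80]
CATEGORIES = ["fail", "pass", "credit", "distinction"]


def categorise(score: float) -> str:
    """Return the category whose band contains the score."""
    return CATEGORIES[bisect_right(CUT_SCORES, score)]


def categories_spanned(estimates: list[float]) -> list[str]:
    """List every category between the lowest and highest estimate."""
    lowest = bisect_right(CUT_SCORES, min(estimates))
    highest = bisect_right(CUT_SCORES, max(estimates))
    return CATEGORIES[lowest:highest + 1]


# Hypothetical estimates for one trainee from four different measures.
trainee_estimates = [48.0, 62.0, 71.0, 83.0]
spanned = categories_spanned(trainee_estimates)
print(min(trainee_estimates), max(trainee_estimates), spanned)
```

With these invented figures the estimates run from 48 to 83, so the one trainee could defensibly be placed in any of the four categories, depending on which measure is privileged.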

As indicated in earlier chapters, the categorisation of persons has enormous effects on people, both in terms of their conceptions of themselves, and in their subsequent implicit and explicit exclusion from occupational opportunities. Such exclusion is not a discursive practice, but a very practical reality, though doubtless language is a significant factor in the acceptance of the violation. Further, the immense uncertainties associated with such categorisations are both demonstrable and measurable.

I have argued that validity discourse is currently constructed in such a way as to deny this demonstration. Invalidity discourse, based on the detailing of error components as presented here, is an advocacy for the defence of the examined rather than the examiner. As such it tends to redress the power imbalance, and hence reduce structural violence and increase social justice.

