Part 4: Error analysed

Chapter 13: Four faces of error
Chapter 14: What do tests measure?
Chapter 15: The psychometric fudge
Chapter 16: Validity and reliability

 

Chapter 13: The Four Faces of Error

Synopsis

The meaning of error in each frame of reference for interpreting assessments is now considered. In the Judge's frame the phrase "error in the Judge's frame" is recognised as an oxymoron; in the General frame error is conventionally defined in statistical terms that ignore or underestimate some of the considerations, and the unattainable true score is seen to be a theoretical construct that need not relate to any external reality; errors are hidden in the Specific frame, and some of the Pretenders to this frame, namely mastery tests, criterion referenced tests, and competency standards, are briefly examined; finally the meaning of error in the Responsive frame is considered. As this frame involves human interaction and discourse, error is what disrupts or disturbs movement towards clarification of meaning.

Assessment discourse is necessarily confused and confusing when the frame of reference within which the discourse is occurring is not specified, or when it involves definitions and methods where the actual frame being used is misrepresented.
 

The meaning of error in different frames

As soon as assessment data are committed to paper, their material permanency is dramatically increased. Likewise, the span of their associations is spread and emphasised. No longer just a description of a particular performance, the assessment becomes interpreted as a measure of knowledge and ability, an indicator of achievement on a course of study, and a predictor of future success or failure. Participation in an event has been transformed into an attribute of a person.

To estimate error is to imply what is without error; and what is without error is determined by what we define as true, by the assumptions of the frame of reference that forms our epistemological base.

There are at least four frames of reference for assessment: four different sets of assumptions about the nature of the exercise. So within each of these frames the meaning of error, as defined by the assumptions of that frame, is different, just as the meaning of error within each frame will be different again if judged by the assumptions of another frame. It is these differences that will be examined in this chapter.
 

Error and the Judge

The Judge assumes omnipotence and infallibility within limits. The limits are defined by the particular performances with which the Judge is presented. These are the facts of the case. The task of the Judge is simple. He examines the performance of the accused, in whatever form it may be presented; he relates this performance to the standard; and then he describes it accordingly.

He does this without error.

So problems that relate to error such as labelling, construction, stability, generality, prediction, categorisation, values and distortion of learning are, to the Judge, irrelevancies. For Judges are practical people, concerned with the realities, with what is, rather than what might be. And for them reality is the answers written on paper, is the art poster presented, is the motor repaired; in short, is the performance or artefact with which they are presented.

Questions of ability and stability, of looking to the past or to the future, are both irrelevant and unsettling. Irrelevant because they are outside the limits of their scrutiny. Unsettling because they trigger notions of a subject.

What sort of jargon is that?

Is what?

Trigger notions of a subject, for God's sake!

You find that a bit obscure?

I find that absolutely obscure.

I was alluding to the difference between subject and object.

I'm none the wiser.

An examination paper is an object. A grade is an object. A standard is an object. The Judge relates these objects. And he claims to do it quite objectively. A computer, programmed correctly, would also do it objectively. Objectively in this context means that the process is purely rational, untainted by emotion or expectation of any kind. The Judge is firmly positivist in his stance; he rationally assesses what is out there in the real world to be described.

Seems eminently reasonable.

Indeed, if somewhat inhuman. An observer in another frame of reference might see the Judge as myopic and deluded. He might see the Judge immersed in a totally subjective world triggered by the statements, now confined to paper, of the person being assessed. Further, he might see the comparison with the "standard" as an intuitive rather than rational process, affected by images, emotions and expectations stimulated by the script, time and style of the answers as much as by their content.

That also seems eminently reasonable.

Regardless, it is necessary for the Judge to deny such subjectivity in order to maintain the role of impartial expert, of perfectly calibrated measuring instrument. The Judge considers his work as objective, and so is unsettled by the notion of the subject: the four dimensional person who is assessing, and the four dimensional person who is being assessed.

Most teachers marking tests and assessing student work, and most public examiners, work within this frame. So most educational assessment is, by definition, error free.

Sometimes it is necessary, because of numbers of students, to have more than one Judge. There may be a number of Lesser Judges and a Chief Judge. In such situations it is accepted that ratings from Lesser Judges could contain some error, of the order of one or two marks in a hundred. To minimise this possibility, sample answers for questions might be prepared, with detailed marking schedules.

Sometimes a further check is made of papers just one or two marks below the cut-off points for failure. The Chief Judge will examine these to ensure that there has been no error, thereby restoring the myth of infallibility.

Reducing error in the Judge's frame of reference is not a problem. There is no error, except in the special cases of Lesser Judges and crucial decisions. In those cases the error is the difference between the original assessment and that of the Chief Judge.
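
The arithmetic of this definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper names, the marks, the cut-off of 50 and the two-mark recheck band are all hypothetical, chosen to mirror the "one or two marks below the cut-off" check described above.

```python
# All names and numbers here are hypothetical, for illustration only.
CUT_OFF = 50        # assumed pass mark out of a hundred
RECHECK_BAND = 2    # papers up to this many marks below the cut-off are re-examined

# Marks awarded by Lesser Judges.
lesser_marks = {"paper_a": 49, "paper_b": 48, "paper_c": 62, "paper_d": 35}

# Marks the Chief Judge awards to the re-examined papers.
chief_marks = {"paper_a": 50, "paper_b": 48}


def papers_to_recheck(marks, cut_off=CUT_OFF, band=RECHECK_BAND):
    """Return the papers that fall just below the cut-off."""
    return sorted(p for p, m in marks.items() if cut_off - band <= m < cut_off)


def judge_frame_error(original, chief):
    """Error exists only where the Chief Judge has re-marked a paper:
    it is the Chief Judge's mark minus the original mark."""
    return {paper: chief[paper] - original[paper] for paper in chief}


recheck = papers_to_recheck(lesser_marks)
errors = judge_frame_error(lesser_marks, chief_marks)
print(recheck)
print(errors)
```

Note what the sketch leaves out: papers that are never re-examined have no entry at all, which is exactly the sense in which, within this frame, they contain no error.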

Note that the Judge is infallible regardless of the form in which he presents the assessment. He may compare with the standard in any way he thinks desirable. The Judge is perfect in his rank orders, scores, grades, or other normative classifications. He is equally impeccable should he present his assessment in any other form, such as verbal description, moral tirade, or hologrammed logo.

The important point to understand is that the Judge is part of a social and political structure in which the inviolability and accuracy of the Judge's decisions are crucial elements. To suggest that the Judge may be in error threatens the stability of that structure and its accompanying mythology, so it is an act both treasonable and blasphemous: treasonable because it undermines the structure of society; blasphemous because it denigrates one of its icons.

In the hundreds of letters I have read in newspapers complaining about examinations, I have never seen one that suggested that the Judge, because he is a normal person, may make whopping big errors! So to the general public the Judge is not a normal person, and makes no errors.
 

Error and the General

Most of the book space and discourse time about this frame has been appropriated by those associated, corporately or academically, with the test construction industry: by those who produce and sell achievement and ability tests of many and varied kinds, or by those who play in a scholarly way with mathematical models that might be used by those who construct such tests (Nairn 1970). I shall deal with this world specifically in Chapter 15, The psychometric fudge.

Within this frame, as constructed by psychometricians, the error is the difference between the true score and the estimated score.

However, the logic of the frame does not require such elegant and complex mathematical manipulation. The mathematical models have, overall, been counterproductive. Their theoretical elegance has hidden their inapplicability to most practical learning and teaching situations; the mystification of their statistical constructs has hidden from teachers, students and public alike the enormous extent of rank order inaccuracies and grade confusion, and the arbitrary nature of all cutoffs and standards.

One further point needs to be emphasised here. The General frame contains no notion of Standard. It is about creating stable rank orders of students. Anyone, anyone with sufficient authority that is, is at liberty to arbitrarily define a standard somewhere along that rank order. But a standard so defined is obviously a relative, not an absolute, division.
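
The relative nature of such a standard can be shown with a small Python sketch. It assumes, purely for illustration, that the authority defines the standard as "the top three quarters of the rank order pass"; the cohorts, names and scores are invented.

```python
# Hypothetical cohorts and an assumed pass rule, for illustration only.

def rank_order(scores):
    """Rank students from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda name: (-scores[name], name))


def passes(scores, pass_fraction):
    """Students above a cut-off placed at a chosen fraction of the rank order."""
    ranked = rank_order(scores)
    n_pass = int(len(ranked) * pass_fraction)
    return set(ranked[:n_pass])


# Cat earns the same score, 58, in two different cohorts.
cohort_a = {"Ann": 71, "Ben": 64, "Cat": 58, "Dan": 40}
cohort_b = {"Eve": 90, "Fay": 85, "Gus": 80, "Cat": 58}

print("Cat" in passes(cohort_a, 0.75))
print("Cat" in passes(cohort_b, 0.75))
```

The same performance meets the standard in one cohort and fails it in the other, because the cut-off is a position in a rank order rather than a property of the performance.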
 

Error and the Specific

In this section we will look at error in the Specific frame in its purest form of specific behavioural objectives, as well as in its degraded states of mastery testing, criterion referenced testing, and competency standards.

In this frame there is only one correct description of performance, and that is the unambiguous learning outcome defined in advance. It is assumed that learning outcomes can be defined so clearly that there is no doubt about whether a student has, or has not, matched behaviour to objective. In such a situation there should be no problem with labelling error because there is no labelling. Each objective stands alone, pure and clear in its pristine self description; context, task and standard clearly enunciated. (Mager, 1962)

Construction errors are another matter. Whilst it is assumed in this frame that any outcome relevant to a particular course of study can be so specified, it is not claimed that all such relevant outcomes are in fact described. In some cases only those outcomes that all students are expected to attain are specified. Then we have a set of minimal learning outcomes. In asking "who makes this decision" we indicate a construction error. Why these particular objectives? And why these particular cut-offs for adequacy? It is apparent that behind the asserted certainty and objectivity of these objectives lies the usual minefield of idiosyncratic and arbitrary construction errors.

In other cases, a set of possible outcomes may be taken as indicators, and attainment of these is taken as evidence of achievement of related ones not directly assessed. And of course, no performance is ever a perfect indicator of a related performance, so hiding behind this wall of tightly specified objectives are all of the errors related to generality as well as to construction.

These construction errors, however, are all quite small compared to the massive one involved in the basic assumption of this frame: the assumption that any outcome pertaining to a course of study can be specified according to this frame; that all important outcomes can be specified in the form of a specific behavioural objective. In practice, it is just not so. This is what Messick (1989, p. 63) refers to as "construct underrepresentation".

This method of description is appropriate for situations where there are a finite number of tasks. Conceptually we are limited to tasks involving low level comprehension. As soon as we move into problem solving, analytic, application, or creative activities, there are an infinite number of possible task situations in which a student may be put in order to assess whether the student can demonstrate these more complex cognitive and practical operations. The tasks are limited only by the imagination of the test setters. And if we choose any one of these tasks, and describe it in such a way that it can be "taught" as a specific objective, then the task becomes one of low level comprehension. In other words, it must be a new task, a task previously unspecified, if these higher level performances are to be indicated (Bloom, 1956).

A student may attempt the task on a number of occasions if necessary, so usually irregularities in the performance of a particular student are not considered significant; unless, of course, a requirement of regularity over time is built into the objective. So errors in the temporal dimension are not applicable, unless we wish to infer that because a student has done the task, the student can do the task not only now but on all occasions in the future. Such inferences are often made, of course. And they are utterly indefensible.

Prediction errors for an individual objective are enormous. But then, a specific objective does not claim that it would alone, or even in conjunction with other objectives, predict anything. On the other hand, as soon as it starts to describe itself with other adjectives, such as minimum, or essential, then it does open the way to predictive estimates of error.
 

Error and the Responsive

In the Responsive frame for any student there are many descriptions that are accurate and adequate to a particular purpose. Adequacy means that the description conveys sufficient information to carry the intent of the assessor and/or assessed into effect.

In this frame there is no competitive element, nor are the outcomes predefined in detail. Rather the assessor responds to the situation in terms of a particular purpose, which might be to describe how the student could improve the performance the next time (descriptive assessment). Or a responsive assessment might lead to a student's involvement in planning and assessing a course about maintaining a tractor (work required assessment). Or a responsive assessment might involve sharing a personal non-judgmental response to the student's work (detailed audience response).

While sometimes the criteria used for a responsive assessment might be preconceived, this is often not the case. The criteria emerge out of the totality of the situation, and so depend on the assessor's sensitivity, empathy and sense of quality. In addition, notions of adequacy are in general accepted for the subjective entities they indeed are, so become notions for considered opinion and discussion, rather than pretending to be absolute, accurately measurable qualities.

Responsive feedback then is part of a communication process which involves observation or other sensory input, interpretation, and response. It may in addition involve ongoing dialogue. Inaccuracy, in the sense of misinterpretations or misunderstandings, may occur at any of these stages, as may obfuscations, denials, irrelevances, or contradictions. Empirically, this reduces to differences in interpretations, and there is no necessity in most cases to assume that there is some "true" interpretation or description. The aim is not to accept or reject the other's meaning, but to understand it.

In this frame, the person being assessed is also a potential observer and assessor, so self assessment can be an important part of the process. The communication process tends to be self-correcting, as the parties to the interaction are both concerned to clarify and understand what is being communicated. Accuracy then is concerned with the clarification of meaning, and error is reduced through openness of the communication channels.

Adequacy can only be determined by consequences; that is, by the extent to which effect conforms to intent. Again, error is reduced inasmuch as the assessed can feed back to the assessor the effect of the assessment, so that modification either of the description or the purpose can occur if necessary. This assumes that the assessed is aware of the purpose of the assessor's comments, and has reflected on their effects. So the continuity of open communication is as necessary as its initiation.

Keeping all communication channels open is of course more easily said than done, particularly in the social milieu that pervades most teaching-learning situations. For optimum reduction of error in this frame, both teacher and student would need to value openness over protection, autonomy over control, uniqueness over standardisation, complexity over simplicity, and tentativeness over certainty. In addition, each would need to be conscious of the potentially debilitating effects on open communication of the hierarchical structure in which their relationship is probably embedded.

More importantly, each would be wise to be aware of the potentially destabilising effects of their open communication on that structure, and of the social risk involved in such radical activity.
 

Summary

As the meaning of error changes with assessment mode, so do the methods designed to reduce such error. From a perspective of oversight of the whole assessment field, this is itself yet another source of confusion and invalidity, particularly as it is rare for any practical assessment event to remain consistently within one frame of reference.
