Chapter 11: Rank orders and standards

Synopsis

In this chapter the relationship between rank order and standard is teased out in more detail: in particular, the meanings given to the standard in the Judge and General frames of reference; how logical confusions proliferate when discourse jumps from one frame to the other; and how the differences in meaning are connected logically.

At the end of the chapter a post-modern myth of the situation is presented.
Personal day-dream

I was about fourteen when I first pondered the sticky issue of the elusive standard. The context was heavenly, rather than earthly, theological rather than educational.

It concerned St Peter. It seemed to me he had a problem. Here he is at the pearly gates as the newly dead file by and do their thing - state their case. And Peter, judge extraordinaire, gives his verdict; pass, fail, pass, fail, fail, fail, etc, etc for millions and millions of people. And somewhere, among all of those millions were two people, so very close together in the merit of their lives. Oh, so very close! Yet their destiny so very different. For one, just scraping through, the joys of heaven for ever. And for the other, eternal damnation. But it didn't end there. For as thousands and thousands of years pass, and more and more millions queue at the gate, even between these two he must make finer and finer discriminations. I didn't doubt he could do it, mind you. Well, it'd be more accurate to say that I considered that if anyone could, he could. But I wondered why he'd want to!

Fifty years on, these are still the two fundamental questions I have about the notion of a standard: the people who define a standard do in fact have St Peter's god-like omnipotence, but do they have his infallibility? And why do they want to engage in a process that is so manifestly unjust?
Order and standard

Let's go back a bit and tease out this relation between standard and rank order of merit. A relation that I intuited at fourteen, but only recently have systematically thought through.

The relationship is not immediately apparent. There are some judges who are adamant that they can recognise standards and this has nothing to do with relative merit. In fact, to them the word relative is anathema. For them, standards are absolute. They are as solid as a winning post, they are a fact established, a sign as recognisable (to them) as a green light at an intersection. Recognising that some people play games, run races, create rank orders and random distributions and normal curves, they see themselves doing work of a higher order; as maintaining absolute quality in a world trivialised by concepts of the average, the normal, the relative.

So let's push them with a bit of Socratic dialogue. Or is it Hegelian dialectic?

Yes.

Could you always recognise it?

No.

So how did you come to reach this state of clear recognition?

Through many years of study, reflection, and discourse with other scholars and experts. The senses become refined, the observation sharpened, the criteria established, as slowly, with increasing precision, the standard for quality becomes defined.

Let's assume all this is true, and you can in fact recognise the standard. So if I were to show you a work that was well above the standard, you would recognise it as such?

Of course.

Similarly, if you were to be presented with a work well below the standard?

Naturally.

It would, of course, be apparent that the first work was better than the second work.

True. But this is a consequence of my recognition of the standard, and has nothing to do with its cause. It is, you might say, an irrelevant corollary.

Possibly. Now let's take a work that is very close to the standard. You would know whether it was just above or just below, would you not?

Yes, I could make that judgment.

And if I were to present you with another work very close, you would know whether that was just above or just below?

Certainly.

So if one were just above the standard and one were just below, and I were to present you with a third work somewhere between these two, you would know whether it was just above or just below the standard, and you would know that it was between the other two in merit?

I would know that, but only by comparing them all to the standard. Not by comparing them to each other.

Quite so. Now we have talked about five pieces of work. So if I were to present these five pieces of work to you again, you would of course give the same decision regarding each of them.

Certainly.

And incidentally, after the event in your view, you would have them in the same rank order of merit.

Agreed.

Now if they were in a different order of merit the second time, would this not show that there was no absolute standard to which you were able to compare the works?

It would certainly throw doubt on that contention.

And if you can do it with five, in principle you should be able to do it with fifty?

If necessary.

Or even five hundred or five thousand?

Some public examiners do indeed take on that sort of responsibility.

Can we agree then, that regardless of whether the rank order of merit of the works is produced after they have been compared to the standard, or whether the standard is constructed as an artefact of the rank order of merit, in either case the whole notion of standard is in jeopardy unless the rank order of merit is a stable one?

This would seem to be a valid argument.

Would you be willing to put it to the test then?

Put what to the test?

Would you be willing to rank fifty pieces of work in their order of merit (based on their respective distances from the absolute standard), and then do the same task six months later?

Me personally?

You personally.

I'm a very busy person, and it would quite frankly be a waste of time. The result would be obvious. It is self-evident. The orders of merit would be the same.

You're certain of that?

As certain as I am of my professional competence.

It is also clear that the last sentence
is not just a rhetorical device, an appropriate metaphor. It is rather
a literal truth specified by the very role of Judge. The whole notion of
professional competence is dependent on this ability to judge the value
of work in the area. To question that competence, then, is to remove the
very foundations of the Judge's professional existence. It is an act, therefore,
of extreme danger that we would expect to be resisted with great strength,
and considerable emotion.
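
The test proposed in the dialogue above, ranking the same works twice and comparing the two orders, can be sketched in a few lines. What follows is a minimal illustration, not a procedure described in this chapter: the choice of Spearman's rank correlation as the measure of stability, the function name spearman_rho, and the five lettered works are all assumptions made for the example.

```python
def spearman_rho(first_ranking, second_ranking):
    """Spearman rank correlation between two rankings of the same works.

    Each ranking is a list of work identifiers, best first, with no ties.
    """
    if sorted(first_ranking) != sorted(second_ranking):
        raise ValueError("both rankings must contain exactly the same works")
    n = len(first_ranking)
    if n < 2:
        raise ValueError("at least two works are needed")
    # Position of every work on the second occasion.
    second_position = {work: i for i, work in enumerate(second_ranking)}
    # Sum of squared differences in position between the two occasions.
    d_squared = sum((i - second_position[work]) ** 2
                    for i, work in enumerate(first_ranking))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical example: five works ranked now and again six months later,
# with two neighbouring works changing places.
first_occasion = ["A", "B", "C", "D", "E"]
second_occasion = ["A", "C", "B", "D", "E"]
print(round(spearman_rho(first_occasion, second_occasion), 3))
```

A coefficient of 1 means the two orders are identical; anything less means that on the second occasion the Judge has placed at least one work differently relative to the others.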
Quality or boundary

In practice our confidence in the standard defined by a Judge cannot be greater than the accuracy with which the Judge can place works, performances, or people in a stable rank order of merit. Our confidence can, of course, be much less than that, but it cannot logically be greater.

That being so, we may think of the standard in two ways: as the lower limit of adequacy, or excellence; or as the line that divides, as the boundary between classifications. Which way we see it is more than a trivial semantic difference. It is an essential point of discrimination between the frames of reference of the Judge and the General, which entail quite different conceptions of the task being undertaken. For the Judge claims to judge quality, and if necessary the classifications of quality (as inadequate, or good, or outstanding), and the stable orders of merit are a consequence of this.

In the General frame these claims of the Judge are denied. In this frame it is assumed, and the assumption has much empirical evidence to support it, that a judge produces different rank orders of the same works at different times. This indicates at the least considerable fuzziness of standard, and at the most a disintegration of the very concept of the standard. In addition, different judges produce very different rank orders, as well as very different "standards" around which they appear to be, rather randomly and quite widely, distributed. So in the General frame the first task is to stabilise the rank order as much as possible, and then decide the cut-off, the boundary between the classifications of adequate/inadequate or whatever.

The point that I want to make here is that these two frames of reference are not compatible, and cannot both be used in the same mechanism of assigning a standard without introducing an inherent contradiction into the whole process. The frames are of different logical types; the Judge is a member of the General class. So contradiction is inevitable when the discourse boundaries between them are not clearly separated.

More specifically, we cannot use the General frame of reference to obtain a more stable rank order of merit, and then use the Judge's frame of reference to decide the standard, by looking, for example, at some examination papers around what is assumed (from the General frame) to be close to the boundary line. For the use of the General frame has assumed that any judge is inaccurate, and has already produced not a boundary line, but a broad boundary band, within which the Judges' (many and varied and implicit) definitions of standard are to be found. The price we have paid for the more stable rank order is to make clear the instability and variability of the Judge's "standard." We cannot now go back to the Judge to determine the many (disguised as the few) indeterminate cases by using his/her ability to recognise the absolute standard, an ability already discredited by the assumptions used to make the rank order more stable.

This has not deterred public examining
authorities and professional test agencies from doing just that.
Empirical evidence

Facts are less dangerous than theory; despite the promise of the Enlightenment, most people use up far more energy defending their mythologies than in searching for facts; the world is full of answers looking for questions, and significant questions are rather an endangered species.

There is no doubt about the empirical evidence for the extreme vulnerability of any single Judge in producing a stable rank order across concurrent rank orderings of the same tests, or for the great differences in rank orderings between different Judges. And this is just for marking (Hartog, 1936; Cox, 1965; Rechter, 1968; Halpin, 1983).

On the other hand, those plain statements are sanitised by such mathematical constructs as reliability coefficients, some of which become acceptable because they are higher than others; certainly not because they have solved the problem of the stable rank order. In the literature, reliability coefficients of 0.7, and validity correlations of 0.4, are considered very good. They don't look so good when we realise that 0.7 is fifty percent better than chance, and 0.4 is only sixteen percent better than chance.

Now I want to focus on just one aspect
of this issue, which relates to the increased stabilisation of rank order
obtained through standardised marking procedures, and show how such collusion
of Judges produces confusion in the General frame.
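
One plausible reading of the "fifty percent" and "sixteen percent" figures above is that they are the squares of the coefficients, that is, the proportion of variance that two sets of marks share. That reading is an assumption; the text does not spell out the arithmetic. Under it, and with a hypothetical helper named shared_variance, the calculation is simply:

```python
def shared_variance(r):
    """Proportion of variance shared by two measures whose correlation is r.

    Assumption: this squared-correlation reading is what lies behind the
    percentages quoted in the text.
    """
    return r * r


for label, r in (("reliability", 0.7), ("validity", 0.4)):
    print(f"{label} coefficient {r}: shared variance {shared_variance(r):.2f}")
```

On this reading a coefficient of 0.7 accounts for 0.49 of the variance, rounded in the text to fifty percent, and a coefficient of 0.4 for 0.16.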
The fool-proof marking scheme

The Judge's sense of infallibility in his own ability to recognise standards does not extend to his view of other Judges. It can't, of course, because some of them will disagree with him and then they can't both be infallible. It is necessary then, in any particular situation, if one Judge is to be infallible, for all other Judges to be fallible. Thus the requirement in any large scale marking exercise to have fool-proof marking schemes, devised, or at least accepted, by the Chief Judge.

In this way the lesser Judges take on some of the aura of perfection of the Chief Judge. And certainly, such schemes do have a considerable effect in stabilising the rank order of students being assessed. And of course, it is easier to determine the detail of such marking schemes in such subjects as Mathematics and Physics than it is in English Expression and Art and History. At least, one unused to the cognitive gymnastics of examiners might tend to so believe.

Regardless, a Chief Judge who sets a test paper and then devises a marking scheme could, one would hope, be fairly specific about what content and processes were important, and what criteria were being used to assess the students. These particular values, or prejudices, or idiosyncrasies are then passed on to the other Judges through the marking scheme. It is obvious that this will decrease the differences between rank orders when papers are marked by different lesser Judges. Statistical data can then be produced showing how "good" marker reliability is. And within the Judge's frame it is certainly true that rank order discrepancies have been reduced.

What is not so immediately obvious is that within the General frame the discrepancies have been increased. Within the General frame the rank order shows less variation the more independent Judges there are. The whole point of having many Judges is to "iron out," to balance out, individual discrepancies and prejudices. By effectively reducing the number of independent judges through the marking scheme, the generalizability of the rank order produced to another similar situation is reduced, not increased. For example, we can easily imagine another Chief Judge, with different priorities about the course of study being tested, and different criteria for assessment, producing a very different marking scheme, which would then produce a quite different (though equally consistent) rank order of students.

This problem is not solved, though
it may be slightly alleviated, through a more "democratic" production of
the marking scheme under the eagle eye of the Chief Judge. The hierarchical
structure of the committee, the press to conformity and the expectation
of a consensus, will necessarily erode genuine independence on the part
of the lesser Judges. Regardless, such "consensus" is not equivalent to
the averaging out of independent judgments.
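
The contrast drawn in this section, between markers who agree because they share one Chief Judge's scheme and panels of genuinely independent Judges, can be explored with a toy simulation. Everything in it is an assumption made for illustration rather than anything taken from the studies cited: achievement is modelled as four independent aspects, each Judge's priorities are random weights on those aspects, a marking scheme is simply a Chief Judge's weights imposed on every marker, and agreement is measured with Spearman's rank correlation.

```python
import random

ASPECTS = 4    # hypothetical number of aspects of achievement
STUDENTS = 200
PANEL = 20     # markers per panel
TRIALS = 30    # repetitions, averaged so that one odd draw does not dominate
NOISE = 0.2    # assumed marker-to-marker noise


def ranks(scores):
    """Place of each score in the order of merit (0 = highest)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    place = [0] * len(scores)
    for position, student in enumerate(order):
        place[student] = position
    return place


def spearman(a, b):
    """Spearman rank correlation between two lists of scores (ties ignored)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def priorities(rng):
    """One Judge's weighting of the aspects, drawn at random."""
    return [rng.random() for _ in range(ASPECTS)]


def mark(students, weights, rng):
    """One marker's scores: weighted sum of aspects plus marking noise."""
    return [sum(w * x for w, x in zip(weights, s)) + rng.gauss(0, NOISE)
            for s in students]


def panel(students, weights_per_marker, rng):
    """Average the scores of several markers, student by student."""
    sheets = [mark(students, w, rng) for w in weights_per_marker]
    return [sum(column) / len(column) for column in zip(*sheets)]


def simulate(seed=1):
    rng = random.Random(seed)
    totals = {"same scheme": 0.0, "different schemes": 0.0,
              "independent panels": 0.0}
    for _ in range(TRIALS):
        students = [[rng.gauss(0, 1) for _ in range(ASPECTS)]
                    for _ in range(STUDENTS)]
        chief_a, chief_b = priorities(rng), priorities(rng)
        # Two lesser Judges bound by the same Chief Judge's scheme.
        totals["same scheme"] += spearman(mark(students, chief_a, rng),
                                          mark(students, chief_a, rng))
        # A panel bound by one scheme against a panel bound by another.
        totals["different schemes"] += spearman(
            panel(students, [chief_a] * PANEL, rng),
            panel(students, [chief_b] * PANEL, rng))
        # Two panels whose members each follow their own priorities.
        totals["independent panels"] += spearman(
            panel(students, [priorities(rng) for _ in range(PANEL)], rng),
            panel(students, [priorities(rng) for _ in range(PANEL)], rng))
    return {name: total / TRIALS for name, total in totals.items()}


for name, value in simulate().items():
    print(f"{name}: mean rank correlation {value:.2f}")
```

Under these assumptions, markers sharing a scheme agree closely with one another, yet the rank order their panel produces agrees less well with the order produced under a different Chief Judge's scheme than two averaged panels of independent Judges agree with each other. The model says nothing about real examiners; it only shows that the argument is internally coherent.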
Quantum of error

The Judge can be very specific, at least rhetorically, about what is being assessed. And then the error, as defined by the differences between the rank order produced and that of other independent Judges, is large.

In the General frame, we can reduce the discrepancy between rank orders by averaging out the rank orders produced by a number of independent Judges. But then, because they are individually emphasising different criteria, we cannot be very specific about what we are measuring. Test agencies and Public Examination systems always assume they are measuring what they are being paid to measure, so regard any improvement in stabilisation of the rank order as a good thing. Pirsig (1976), in Zen and the Art of Motorcycle Maintenance, assumed that this more "stable" rank produced by averaging was indeed a measure of the elusive "quality" which he sought. I find such interpretations exceedingly suspect, examples of wishful thinking.

The fact is that the more precisely we prescribe one aspect of the intricate web in which the spider variously called achievement or ability or quality of performance lies hidden, the more diffuse other aspects become. We tighten up marking schemes and lose generalizability to other marking schemes. We use many judges and lose specificity about what it is we are measuring. We specify behavioural objectives and lose definition of problem solving. We use multiple choice answers and construction and synthesis get lost. We create a test and lose most of what we are trying to test.

This sort of phenomenon is well known in the sub-atomic world. According to Heisenberg's Uncertainty Principle, you can know the exact position of a particle, but then you lose information about its momentum. Or you can know its momentum, but then lose information about its position. And the amount of fuzziness, the quantum of error, is a constant. A reason for this is that to collect information about sub-atomic particles, they must be interacted with in some way. And the very process of interaction produces a change in the "original" state.

We are in an analogous situation with tests. The very process of giving a test displaces the person from the "original" situation that the test is meant to describe. We have created an interference by the very process of the experiment, and in so doing have activated an irreducible quantum of doubt concerning our "measures," that can never be appreciated by examining just one measure. On the contrary, reducing the error in just one measure may necessarily increase it in another area. For example, reducing the error in rank order may necessarily increase the error in sampling from all aspects of achievement.

Probably the biggest contribution to this quantum of error is to be found in the boundaries of the test situation itself, regardless of the frame in which it occurs. Such boundaries represent a separation from the everyday learning or working world in which people interact in particular contexts. Knowledge is not something a person has, but rather one aspect of a response, appropriate or not, to a particular environmental context. Test situations invariably remove the person from that real context to produce some sort of controlled, simulated, and hence different context. It is this largely unexamined and unestimated discrepancy that represents a large and irreducible portion in the quantum of doubt.

The enormous popularity (as distinct
from reason or purpose) of tests is to be found in its point of congruence
with most other myths; in its implicit promise of deliverance from a world
permeated with uncertainty, in its claim to reduce human complexity to
a simple story line. In this case the story line of simple numbers.
Judge and jury

I haven't?

Of course you haven't. All you've done is to show that some judges aren't as good as they thought they were, and that anyone can be a judge so long as they know something about the topic they're judging on.

So I haven't really got rid of the Judge?

Not really. You've just democratised the process of judging. You've let more people into the club, and then asserted that the average of their marks is a better estimate of the true score than the judgment of any one of them.

You think I've become a victim of my own ideology?

Let me put it this way. If you're convicted of murder, does it matter whether the Judge or the jury convicted you?

The error and the standard

It's at this point that the metaphor becomes shaky. For whilst there was indeed a real crime in the case of the criminal, as evidenced by the dead body of the victim, there is less evidence that there is a real order of merit, a true score. Now if there isn't a true score, then necessarily there can't be a true standard. And even if there is a true score, it doesn't follow that there is necessarily a true standard. As we have seen, the error in the estimate of the standard can't be less than the error in the estimate of the true score. And it will certainly be more, because different judges will differ about where to put it.

How would we do that?

Get a number of judges to identify the standard, and then average them out.

You mean assume there is a true standard, and then see how well we can estimate it?

Isn't that what we did with the rank order?

Certainly.

Then why not do the same thing with identifying the standard?

Let's start from the beginning. In the General frame of reference, we assume there is a true score, which mirrors a true attainment, or ability, or trait, or predisposition, or whatever. And starting from that assumption, we can show, both theoretically and empirically, that we can never measure it. We cannot specify what it is. We can never specify the true rank order of merit. We can only obtain estimates of it, and indicate how far away from our true rank order it probably is.

Now whether there is "really" a true score or a true order of merit of the group being assessed, must forever remain moot. Assumptions of theories do not have to accord with some relationship between variables that have substantive existence in the world. So assumptions of theories related to people do not necessarily relate to any actual qualities or measurable quantities or substantive aspect or observable behaviour of real people. Theories are useful or not according to whether their outcomes, their conclusions, have some links with the observable world. Their assumptions are just that. Assumptions. However, if we had clear evidence that the assumption was incorrect, then there would seem to be an inbuilt contradiction of our theory to the world that it purports to mirror.

Now if we wish to use the General frame of reference to define the standard, we need to assume that the rank order is the true rank order. For the true standard requires that preliminary assumption.

The claim of the Standard is not
the claim of a broad fuzzy space, but of a thin red line. The Standard,
if it means anything, means a point on a stable steel scale, not a probability
on shifting beach sand.
Defining standards

And we have seen that we can never present the judge or jury with that true rank order. Our own theory has negated the possibility of locating the standard, because it has negated the possibility of finding the true rank order of merit on which the delineation of the standard, in this frame of reference, depends. It is not moot whether the true order of merit has empirical existence. It does not.

What do you mean, stuck?

We can't use our rank order, inaccurate as it is, to find a standard.

Not altogether true. We can define the standard in terms of our true score. In terms of our true rank order.

Whose existence is still moot.

Exactly.

How do we do that?

Very simply. If we wish to use grades, for example, we can just define an A as any score or rank order in the first five percent, and an E as the bottom twenty percent, of the population we are testing.

Why five and twenty?

Make it twenty and five if you like. It doesn't matter. It's arbitrary. The important thing is to define it, so that everyone is talking about the same thing when they're talking about the grade.

Won't there be an error in the definition?

Not in the definition. The definition is in terms of the true score. So it is exact, as a Standard must be. Of course, in practice there is always an error.

So each person is truly at some Standard, but we can never be sure exactly what that Standard is?

The second part of your sentence is true. The first part may be true, or false, or just a silly question.

Reducing absurdity

Let's briefly summarise what we know about standards, and their relationship with assessment, to this point.

First of all, we know that empirically an individual judge cannot consistently recognise a standard, nor can he consistently rank students in the same order. These differences between rank orders, and the position of the standard related to them, are increased if different judges are asked to recognise a standard, or rank order students.

The claim of the Judge that he can do these things is thus seen to be untrue as an empirical fact in the real world. It is a fantasy that he has about his own ability that is shared by many people in society. This does not make it less untrue. It does make it less likely that he will admit to its untruth, and more likely that he will take strong measures to disguise the extent of its untruth. For to admit of any error is to destroy the fragile fabric with which the myth of his power and perfection is woven.

In the General frame the error is admitted, though the assumption of an (unattainable) true score is retained. The estimate of the true score is improved by averaging scores from a number of judges. This is vindicated empirically because different estimates obtained by this method are closer together than estimates made by two single judges.

In this frame, it is admitted both theoretically and empirically that any rank order of students is not the true rank order, but an estimated one with built-in error. Thus it makes rational sense to define some standards, some grades, which admit of no error, in terms of percentiles of this true rank order. Even so, in practice we would have to indicate clearly the errors in our estimated grades. And we would have to indicate clearly that these standards are unrelated to any judgments of "quality" as defined by Judges. They are merely cut-off points at various percentiles of a specified population of testees.

What would not be rational would be to get judges to estimate the cut-off points for standards by presenting them with a scale that was admitted to be inaccurate. The Judge claims to recognise the standard, and the production of a stable rank order is a necessary corollary of that claim. We have rejected that claim in our production of a more stable, but still inaccurate, rank order through generalizability assumptions. It is absurd to now reinstate the judge to determine the standard. It's asking the judge to do something that's demonstrably crazy.

(Not that it's unusual to engage
in crazy activities. It would surely be utterly irrational to expect humans
to act rationally. The expectation of rationality is the epitome of delusion.
It can lead only to despair at the human condition. To applaud rational
behaviour in its rare moments of emergence from the mire of human craziness
will provide a firmer path to human happiness. But that's another story.)
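
The percentile definition of grades discussed above, an A for the first five percent and an E for the bottom twenty percent, can be written down directly. This is a minimal sketch under stated assumptions: the function name grade_by_percentile and the catch-all label "B-D" for everyone between the two cut-offs are inventions for the example, and the scores it is given can only ever be estimates, so the labels it returns inherit whatever error those estimates carry.

```python
def grade_by_percentile(scores, top_share=0.05, bottom_share=0.20):
    """Label each score 'A' (top share), 'E' (bottom share) or 'B-D'.

    The shares are arbitrary, as the text stresses; the defaults are the
    five and twenty percent used in the dialogue.
    """
    n = len(scores)
    # Indices of the testees, best score first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_top = round(n * top_share)
    n_bottom = round(n * bottom_share)
    grades = ["B-D"] * n
    for i in order[:n_top]:
        grades[i] = "A"
    for i in order[n - n_bottom:]:
        grades[i] = "E"
    return grades


# Hypothetical population of 100 estimated scores, 0 (lowest) to 99 (highest).
estimated_scores = list(range(100))
grades = grade_by_percentile(estimated_scores)
print(grades.count("A"), grades.count("B-D"), grades.count("E"))
```

Swapping the two shares changes who receives which label but not the logic, which is the sense in which the cut-offs are arbitrary and yet exactly defined.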
Judgments and categorisations in the qualitative world

One more point needs to be made here. Whilst the above argument has focussed on tests and grades as a particular sort of educational event, the arguments made are equally cogent for all categorisations of people, whether these be made in the numerical world of quantitative assessment, or in the more linguistic world of qualitative assessment.

Let us be clear about this. If at any point a qualitative assessment engages in a categorisation, a separation of two groups of people, then it is invoking the notion of a standard, and of the measurement of that standard. And in so doing it is logically engaged in all of the rank ordering and judgment errors that have been discussed.

There are some few genuinely dichotomous
variables on the basis of which most people may be categorised; for example,
blue eyed people and brown eyed people. Most variables used for categorising
people however are continuous and not dichotomous; as such, any such categorisation
requires a standard, the thin red line that defines the categories, and
then a judgment about whether any particular case is above or below that
line. As argued earlier, this logically implies a stable rank ordering,
which constitutes a primitive form of measurement. Categorisations then
involve both standards and measurements, regardless of how much semantic
camouflage is used to disguise this.
Democracy and doubt

As the judge topples from his autocratic pedestal of certainty, it is doubtless pleasing to those of democratic mind to know that what will replace the judge is not chaos, but the will of the people. The rule of the individual will be replaced by the judgment of the group. The idiosyncrasy of the individual will be cancelled out and reveal the pure decision of the majority that is the source of the true, the right and the just!

We have seen how in practice the delineation of the standard cannot be more specific than the fuzziness of the rank order of those being standardized. And we have seen how individual judges vary considerably in their rank ordering of a group of students, especially if they have no information about them other than the set of examination or test papers. A good punter can (usually) pick a good horse from a bad one, in a general sort of way, but he makes lots of errors when trying to rank accurately all of the runners in a particular race. So it is with the judge of human performance.

There is a crucial difference between the punter and our Judge, however. In the horse race the camera can photograph the finish, so that there is a "true" rank order in which the horses run this particular race. It might not be stable if they run this distance next week, or generalizable to other distances. It will certainly be different over hurdles. But at least in this race we know accurately what the rank order is. Further, we know (almost) exactly what distance they have run, because we have a unit of distance with which we can measure. And we know (almost) exactly what time each horse took to run this distance. If we wanted to, we could nominate a "standard" for this distance below which horses could not compete in the equestrian Olympics. It would be an accurate standard. And it would be arbitrary. And we could measure whether a horse had reached that standard with a small, and empirically determinable, error.

Horse racing as we know it is not
a good metaphor for the testing game. So let's develop a better one, a
myth more appropriate than that of the infinitely accurate little black
box that has mystical knowledge of standards, and resides in the head of
the omnipotent judge.
They're racing in Testland

In Testland, races have always been important events. There are no permanent tracks, and unfortunately no way of measuring either distance or time with any accuracy. Some of the more exalted people in Testland do own clocks, but unfortunately they all run at irregular rates, and they all give different times for the same race.

Races are accompanied by due pomp and pageantry. The track is marked with flags and signs saying "this way" and "that way." Horses and riders train hard and are decorated in much colourful finery. There is no starting point and no finishing point, but when the bugle sounds they are off and may the best horse and rider win.

There is no actual finishing point, but everyone knows the general area that the race will finish. Here congregate the Judges: the Standard Judges in their white wigs and purple cloaks impressively flourishing their clocks; and the Placement Judges so serious in their blue serge working suits all constructing their own lines of sight so they can accurately record the order of finishing. Some of these, aware of the subjectivity of human vision, have cameras with which to record the finish in a truly objective way.

In the good old days in Testland there were many more Judges than horses. Everyone would have a great time picking the winner, and recording the orders and times. Then they would happily argue for the rest of the day about who had won and come second and so on. Because all of the judges were viewing the race from different positions and at different angles, because it was unclear which part of the horse had to get past the finishing line to complete the race, and because the signs on the track often had horses running in opposite directions by the time they reached the finishing area, every rider could find some judges who thought they had won the race. So race days were days of celebration and festivity, until . . .

Nobody knows quite when the rot started, when the question about who really won the race became a problem for decision rather than an excuse for argument. Some thought it was when someone suggested that prizes should be given only to the first three horses and not shared equally as was the custom. Others thought it stemmed from a misunderstanding of a remark made by one Sir Henry du Princely, the Queen's sometime lover; another Judge thought Sir Henry said he had the best clock in Wonderland, and took umbrage. But most saw it as the inevitable march of progress and civilisation as Testland lurched forward into an uncertain future; just another example of the dominance of the three e's in the post-industrial era; engineering, efficiency, and expediency.

Regardless of the reason, the facts are clear. Word got around that there was a real winner, and a true rank order in the race. There had to be, because it was self-evident that some things were better than others. It followed that some horses and riders were better than others. Thus no-one but an idiot would argue with the blinding clarity of the truth that there was a unique winner, and a verifiable placement order, to every race. The race, everyone knew, was to the swiftest. It became the task of the Judge, therefore, to determine that swiftest.

Sir Henry, who had the ear, as it were, of the Queen, and had been under some flak from other Judges because of the misunderstanding previously referred to, made a unilateral decision that henceforth and from hereon only one clock would be used in adjudging horse races and that one would be his. One or two other Standards Judges who contested this pronouncement found that their clocks mysteriously disappeared, leaving them, clearly, without a tick to tock on, or alternatively a tock to tick on, depending on which University in Testland you went to.

Changes of this magnitude are not implemented easily, of course. At the next race meeting Sir Henry clocked the winning horse and for obvious reasons no other Judge queried his timing. However, the Placement Judges argued that, through no fault of his, he had clocked the wrong horse. Obviously, Sir Henry had underestimated the complexity of the task. He needed the Placement Judges in his pocket as well as his clock.

It was at this point that
Sir Henry's brilliance shone through with a remarkable insight which ensured
his historical survival in the annals of Testland. He let go a double-bunger
of a pronouncement that in one fell swoop solved the otherwise irresolvable
time and space problems. He defined the finishing line as being where his
clock was, and in the direction in which he pointed. By these means Sir
Henry succeeded in defining a unique standard and producing a unique placement
system at the same time. Truth was now defined. It was what Sir Henry did.
He had constructed a new view of reality. A world of winners and losers,
scientifically classified.
In conclusion

The astute reader will recognise here the birth of the Judge's frame in its modern form. More importantly, they will see, from their helicopter oversight, that the race has not changed. From above, the chaotic nature of the race is evident, and Sir Henry and his little team of supporters can be seen to be doing what they are in fact doing; co-creating a fantasy about a winner where there is none, blinkering vision to substantiate a myth of order, and imposing truth by political assertion.