Part 3: Tools of analysis

Chapter 7: Four frames of reference
Chapter 8: Equity, frames and hierarchy
Chapter 9: Instrumentation
Chapter 10: Comparability
Chapter 11: Rank orders and standards
Chapter 12: An inquiry into quality
 
 

Chapter 7: Four frames of reference

Synopsis

In this chapter four different frames of reference are defined; four different and largely incompatible sets of assumptions that underlie educational assessment processes as currently practised.

First is the Judge's frame, recognised by its assumption of absolute truth, its hierarchical incorporation of infallibility; second is the General frame, embedded in the notion of error, and dedicated to the pursuit of the impossible, that holy grail of educational measurement, the true or universe score; third is the Specific frame, which assumes that all educational outcomes can be described in terms of specific overt behaviours with identifiable conditions of adequacy, and what can't be so described doesn't exist; fourth is the Responsive frame, in which the essential subjectivity of all assessment processes is recognised, as is their relatedness to context. Here assessment is a discourse dedicated to clarification, rather than the imposition of a judgment, or the affixation of a label.
 

Mythology

In the myth of meritocracy the examination is both a major ritual and a significant determinant of success. At the heart of this ritual, between the practice and the judgment, between the stress and the catharsis, is the great silence, the space where the judgment is processed.

The myth gives hints of what moves in this silence, for the myth makes three claims: the race is to the swiftest; the judgment is utterly accurate; and success is a certification of competency.

These hints tap the bases of the three frames of reference for assessment that assume objectivity. However, other assumptions of these frames make them mutually contradictory. This in itself would be good reason for keeping the process implicit. For the assumption that inside the black box hidden in the silence is a mechanism, an instrument of great precision, may be difficult to sustain, if it contains major contradictions within its workings.

Four assessment systems, with four different frames of reference, have staked their claim to exclusive use of the black box, their claim to be the best foundation for the precision instrument to measure human - what? Bit hard to say what exactly. To measure, perhaps, human anything. It may be sufficient just to measure. Or even just to pretend to measure, to assert that a measurement has been made, so that a mark may be assigned to a person.
 

Frames, myths, and current practice

The Judge's frame is far more often evoked than talked about. The focus is on the assessor's judgment of the product. The major activity is in the mind of the assessor. Such terms as expert and connoisseur are essential to the construction of the accompanying myth. Faith is the requirement of all participants. It is explicit in discourses about teacher tests, public examinations, and tertiary assessment, and implicit in all human activities that involve the categorisation of people by assessors.

The General frame is the basis for educational measurement, for psychometrics. The focus is on the test itself, its content and the measurement it makes. Such terms as reliability and ability are essential to its mythological credibility. It purports to be objective science, and hence independent of faith. As such the world it relates to is static, so there is no essential activity. It is explicit in discourses about educational measurement, standardised tests, grades, norms; it is implicit in most discourses about standards and their definitions.

The Specific frame is about the whole assessment event, and is the basis for the literature that derived from the notion of specific behavioural objectives. The focus is on the student behaviour described within controlled events; in these events the context, task, and criteria for adequate performance are unambiguously pre-determined. Reality is observable in the phenomenological world; the essential activity is what the student does. This frame is explicit in discourses about objectives and outcomes; it is implicit, though rarely empirically present, in discourses about criteria, performance, competence and absolute standards.

The Responsive frame focuses on the assessor's response to the assessment product. Unlike the other frames it makes no claims to objectivity; as such its mythical tone is ephemeral, its status low. This frame is explicit in discourses about formative assessment, teacher feedback, qualitative assessment; it is implicit though hidden in the discourses within other frames, recognised by absences in logic and stressful silences in reflexive thought. Within the confines of communal safety such discourses are alluded to, skirted around, or at times discussed; on rare occasions such discourses emerge triumphantly as ideologies within discourse communities.
 

The Judge

Most assessment in education is carried out within the Judge's frame of reference. The chief characteristic is that one person assesses the quality of another person's performance, and this assessment is final. By definition the Judge's assessment is free of error, and therefore any check of the Judge's accuracy would represent a contradiction of his function. So such a check is not only unnecessary, it is immoral, in that it is an act likely to destabilise the whole assessment structure by calling into question its most hallowed assumption.

The Judge's assessment may be verbal and on-site, eschewing numeration and a special testing context. However, performance is usually assessed with tests and examinations, with merit graded in some way. It is assumed that adequacy or excellence in performance is described accurately by the Judge. For this to be true, it must also be assumed that the test measures what it purports to measure, and that the marking, whether by the Judge or his assistants, is reliable. Again, therefore, checks of validity, that the test measures what it purports to measure, or of reliability, that the test will give the same result if repeated, are not only unnecessary, but are unacceptable and demeaning.

Judges must stand firm on the absoluteness and infallibility of their judgments, for this is the essence of their power, the linchpin of their role, the irreducible minimum of their function.

Thus they are duty bound to recognise standards, to perceive with unerring eye that thinnest of lines that separates the good from the bad, the guilty from the innocent, the excellent from the mediocre, the pass from the fail.

Talk to them of normative curves or rank orders or percentiles, all of which imply relative standards, and they will hear you out, wish you well, and with scarcely disguised disdain send you on your way. In their absolute world such matters are irrelevant. They know what the standard is, and therefore their job is simple. Simply to allocate students, or their work, to various positions above or below that standard.

Set hard in a rationalist world view, this is a black and white world, a fundamentalist cognitive universe. The assumptions deny the possibility of reality checks, so the collective fantasy easily becomes the perceived truth, as human minds and bodies contort themselves to deny their more immediate experience.

So let us see what that more immediate experience might tell us if another frame of reference is chosen.
 

The General

The second frame of reference is called the General frame. I used to call it the generalizability frame, but that word has been hijacked by psychometricians. The general has been privatised and corporatised by mathematicians. The bird has been tamed and lost its wings. The general has become severely contained in mathematical armour.

What I am calling the General frame of reference is blatantly egalitarian and inherently relativistic in its conception, but has become constricting, reductionist and inequitable in its mathematical application. In one form or another it has dominated the academic literature in educational assessment for over sixty years. Within this frame is contained most of the received wisdom from thousands of studies in educational measurement and evaluation.

Its two initial assumptions are shattering. One Judge is as good as another. And all Judges are inaccurate. God is dead!

Now as Little Jack Horner understood quite well, you can't just stick in your thumb and leave it there. If you stick in a thumb you've got to pull out a plum or no one will say you're a good boy. And the plum was the third assumption: There is a stable rank order of merit. So there is a true score.

And there is a stable standard. It's just that, sorry old chap, it's just that the jury does it better than the judge. Or perhaps it would be more accurate to say that we measurement experts, we psychometricians, can do it, with the jury's help, much more accurately than you can.

Judge You can, can you?

General Yep.

Judge Whose assumptions are you using?

General Ours.

Judge Whose definition of a true score?

General Ours.

Judge Whose definition of error?

General Ours.

Judge And whose definition of standard?

General Ours.

Judge And you say I live in a fantasy world?

General That's what we say.

Judge I rest my case.

A bit unfair. But more than a grain of truth in all that. Even so, let's put a little more flesh on the skeleton of the General.

There is a true score: This notion has implications well beyond the psychometric. It is assumed that we are not measuring what a person can do, but rather a sample of what the person can do. If we could measure all the things (exactly) then we could find the true score directly. But as we can't there will always be some random error. In other words, if we had selected a different set of tasks the person would have done, probably, a little better or a little worse. Or even (softly now) a lot better or a lot worse.

This is all pretty obvious when you think about it. In almost any area of human activity, or study, there are an infinite number of possible tasks that could be required, questions that could be asked, limited only by the imagination of the examiners. And obviously, in a test situation, only a few may be chosen, from which a generalisation can be made about the rest. But the more tasks chosen, and the more they are a random sample of the total possible universe of questions, the closer you can get to the "true score". Further, your choice is a biased choice. Different people will choose different samples with different biases. So again, the more people involved in the setting of the examinable tasks, the closer we get to the replicable rank order, and hence to the true score.

We can't just stop at the questions, however; different markers rate answers differently. So markers also have to be sampled.

And contexts affect the result. Physical setting often affects performance. Some will perform better at home, some at school, some in an unknown environment. Some produce better work when isolated, as in a "normal" test situation. Others require stimulation in a group, which approximates more "normal" work situations.

The interactional medium is sometimes crucial. Some express themselves better with the written word; others are much more comfortable with visual, aural-oral or more physical communication. Meanings can be communicated through many sensory modes. So if we are concerned to assess understanding of some area we would logically need to check across all of these modes.

And the time is important. They might do it well before lunch, badly after; successfully today, unsuccessfully in a month's time.

So assessments (marks or grades or rank orders) are required in all these different ways if we are to get a true estimate of a person's attainment or ability.
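
The sampling argument above can be made concrete with a small simulation. It is only a sketch, and it takes the General frame's own assumptions at face value: that each person has a fixed true score, and that every mark is that score plus independent random error. The true score of 60, the error spread of 15, and the function names are invented for illustration; none of them come from the text.

```python
import random
import statistics


def observed_mean(true_score, n_tasks, error_sd, rng):
    """Average of n_tasks marks, each the assumed true score plus random error."""
    marks = [true_score + rng.gauss(0, error_sd) for _ in range(n_tasks)]
    return statistics.mean(marks)


def spread_of_means(n_tasks, true_score=60.0, error_sd=15.0, trials=2000, seed=1):
    """Standard deviation of the averaged mark over many imagined re-testings."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = [observed_mean(true_score, n_tasks, error_sd, rng) for _ in range(trials)]
    return statistics.pstdev(means)


if __name__ == "__main__":
    # More sampled tasks (or markers, contexts, times) -> a less variable average.
    for n in (1, 4, 16, 64):
        print(n, round(spread_of_means(n), 2))
```

Under these assumptions the spread of the averaged mark shrinks roughly with the square root of the number of tasks sampled. Note what has been assumed to get there: everything that varies across tasks, markers, contexts and times was treated as error from the start.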

Whoops

Whadaya mean, whoops?

I saw that.

Saw what?

Saw you pull that card out of your sleeve.

What card?

That one with the word "ability" on it.

I didn't pull it out of anywhere. I materialised it. I created it.

You made it up.

I created a useful concept. We all do it all the time.

Useful to who?

Useful to me.

Why is it useful to you to make up a concept called ability?

Because I've created a mess. A conglomerate of numbers based on myriads of interactional and contextual incidents. And I know how to turn it into one fairly stable number. But then I've got to write it on a label and pin it on someone.

Why?

Why?

Yes, why?

Well, if I can't pin it on someone then I would have done all that work for nothing, because it's obvious that although all these scores and grades were supposed to be measuring the same thing, they were actually measuring different things.

And you've got to have them measuring the same thing?

Obviously, otherwise I can't add up all the marks to get one stable mark, can I?

I suppose not.

So I made up a name.

Ability?

Ability.

And no doubt you specified the ability as being identical to the task area you were assessing?

Of course.

So ability is what the total (average) number is measuring?

Absolutely.

Relatively, you mean.

Yes, it would be fairer to say relatively.

And if you know their ability you know what particular things they can do?

No, I wouldn't say that.

Perhaps you know what particular things they can do better than someone else?

No, not that either.

What do you know then?

Well, if you were to take all the possible things that a person might be required to do in a particular area of activity that is more or less described by the ability, then you could say that, on average, and very consistently, a person with a high score on that ability would do better than a person with a low score.

Whoops, you've done another shift. All this information isn't about the person. It's about the interaction of the person with the task with the assessors. How are you justified in pinning it on the person doing the tasks? Why isn't this information about the whole contextual community?

Initially it is. But when we average out all the individual scores, they stabilise for each person. Regardless of the context, and regardless of the particular assessors. And the only other stable objects in the whole shebang are the people being tested, and the thing we're supposed to be measuring. So it makes sense. Ability is the stable label.

What does that ability score tell you about specific things that they can do?

In terms of specific tasks I would have to admit, if pressured to do so, that I could, from their ability score, predict very little.

So you began with lots of information about differences.

Indeed.

And you finished up with one bit of information and a name attached to a person. One bit of information about a constancy.

True.

You made a choice. You could have said that a student's true ability was all that variety of things that were very uneven and unstable and changeable. You could have said that the true description of ability was the collection, rather than the summary or summation, of all the information.

I could have done that.

And then the summary, the average, would represent a huge simplification, a reductionist symbol, a monstrous error, rather than a true score?

That follows.

But you chose to define the average, the summary, the abstraction, as the true score, and everything else as error?

Indeed I did.

How do you justify that?

Because the average gives a stable score, and a stable rank order, and this enables us to make a clear classification of the student.

And that's important?

It's crucial. You could say it was the aim of the whole exercise.

I thought the aim of the exercise was to describe a student's learning.

Would you think the best way to do that was with a number?

No.

Well, then!

I have tried to give some of the flavour of the General frame of reference here. To indicate some of its assumptions, some of the things it can do, and some of the things that it can't do. And it is apparent that one of the things that it can't do is give specific information about exactly what tasks a person can or cannot adequately perform.

I have also, in the spirit of this frame, fudged a bit. For example, the scores are not stable; they are stabler after they are averaged than they were before. As are the rank orders. But stabler does not mean stable; more reliable does not mean reliable; more valid does not mean valid. More of this later.
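
The gap between stabler and stable can be illustrated with the Spearman-Brown formula from classical test theory, which predicts the reliability of an average of k parallel measurements from the reliability of one. This is a sketch, not part of the original argument; the starting reliability of 0.5 is an arbitrary assumption, and the formula itself assumes that the measurements are parallel, which is precisely what is in question.

```python
def spearman_brown(single_reliability, k):
    """Predicted reliability of the average of k parallel measurements.

    single_reliability: assumed reliability of one measurement (between 0 and 1).
    k: number of measurements averaged together.
    """
    r = single_reliability
    return k * r / (1 + (k - 1) * r)


if __name__ == "__main__":
    # Hypothetical starting point: one task with reliability 0.5.
    for k in (1, 2, 5, 10, 50):
        print(k, round(spearman_brown(0.5, k), 3))
```

With a single-task reliability of 0.5, averaging ten tasks gives a predicted reliability of about 0.91. The value climbs as k grows but never reaches 1 for any finite number of tasks: more reliable, not reliable.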

I have also expanded the conceptualisation of this frame well beyond most of the theoretical expositions in the literature. Such logical expansion does not lend itself to elegant mathematical modelling, however, so the fudging of psychometricians has reduced, restricted and simplified these concepts to a shadow of their full power.
 

The Specific

The third frame of reference for assessment defines the world of specific behavioural objectives, or specific learning outcomes, and, by implication if not practice, of the more fashionable criterion-based assessment and competency standards.

Here we are far away from the religious world of the judge, and the pseudo-scientific world of generalised ability. Here is a technological space in which a spade is indeed a spade, and to Alice's delight, things are indeed what they say they are. Or so it would seem.

This frame of reference assumes that the task of assessment is to describe what can be done, under what conditions, and what constitutes adequacy. So there is only one correct description of performance, and that is the unambiguous learning outcome that is defined in advance. It is assumed that learning outcomes can be defined so clearly that there is no doubt whether a person has, or has not, matched behaviour to the outcome.

There is no problem here of matching objectives to curriculum, and curriculum to testing. The objectives are the curriculum are the learning outcomes are the test. A rose is a rose is a rose.

Here is the bright fluorescent material world of the technological fix. Reality defined as observable behaviour. A world where doubt and uncertainty are no more. A place of clear goals, purposeful activity, and attainable and unambiguous outcomes.

More than this. This is surely a political revolution. The power to certify or exclude is no longer in the hands of the omnipotent judge or the manipulative psychometrician. It is clearly with the student who can self-certify adequacy, and any intelligent bystander can check that the task has indeed been adequately accomplished.

The technique was first developed to train technicians quickly and efficiently during the Second World War to do a limited number of very specific tasks, and follow through a finite number of carefully specified procedures. In this it was highly successful, and its overflow into the general training area, and the nebulous and vague syllabuses of education, was viewed with delight by many of those who wished a firmer base for guiding and assessing learning. That is, who wanted to control what people learn.

And it was possible to find in most areas of learning, in most specifications of jobs, in most definitions of curriculum, in most topics of study, some irreducible minimum, some particular aspects of performance such that we could say - well, if they cannot do at least these things to this level of skill, or if they do not know at least these particular facts, then we could never certify that they were adequate in this area of functioning. In other words, the frame proved to be very useful where there were a finite number of tasks that could be isolated and specified, with limits of adequacy defined.

However, there were two questions, one technical and one political, which shattered the image of specific behavioural objectives as a democratic panacea for education. The first question was - is it possible to define outcomes specifically in any cognitive or interactional area that involves problem solving or analysis or synthesis? Any activity, that is, involving cognition of more complexity than low-level comprehension?

Note, however, that to ask this question is to step outside the frame. For the assumption of the frame is that all tasks are so specifiable.

And the political question - who defines the objectives? Why these particular tasks? Why this particular context? Of what significance this particular cut-off for adequacy? Have we solved the problems of reliability or adequacy, or merely hidden them behind a dense materialist behavioural smoke-screen, behind which shadowy judges, bureaucratically insidious, silently sit?

Again, to ask this question is to move outside this frame. Within the frame this question is not a contradiction, it is simply irrelevant.
 

The Responsive

The Responsive frame of reference for assessment is manifestly and overtly subjective: no longer are the descriptions and judgments attributed to the performance, the artefact, or the person. What the assessor says is no longer claimed to be a quality of the object produced, or the objectified subject that produced it. What the assessor says is claimed only to be what it indeed is - a response of the assessor to a particular situation or artefact; a verbalisation of a particular human response to an interaction; a construction of the person assessing that says certainly as much about the world view of the person assessing as it does about some abstract quality or behavioural skill of the object or person being assessed.

Within such a frame there is no question of a right judgment, of a correct classification, of a true score. The response might be sensitive or insensitive, sophisticated or ingenuous, informed or uninformed. The verbalisation of that response might be honest or manipulative, its fullness expressed or repressed, its clarity widened or obscured. It still belongs undeniably to the assessor, and the expectation is not towards a conformity of judgment, but a diversity of reaction. The lowest common factor of agreement is replaced by the highest common multiple of difference. The subject of assessment is no longer reduced to an object by the limiting reductionism of a single number, but is expanded by the hopefully helpful feedback of diverse and stimulating and expansive response.

As with the other frames of reference, this one rarely materialises in its pure form. In the evaluation literature it has gained some attention under the rubric of formative evaluation, which occurs during a course of study, a low status cousin of summative evaluation, the final judgment, that more macho space where the real battles are fought, and the important decisions are made. Even so, there is professional literature in plenty, and especially in the rhetoric of "teaching" rather than "assessment", that supports the idea of assessment as feedback and guide, rather than classification and judgment (Williams, 1967).

So it is in this diagnostic and formative function that responsive assessment has found its place; as part of the training program rather than as legitimate description of what has been learnt.

There is good logical reason for this. It is obvious that this frame is a direct contradiction to the Specific frame, in which there is only one description of performance required and that is defined in advance.

It is less obvious, but none the less true, that the frame contains, in its practical functioning, a contradiction of the Judge and General frames, for it denies implicitly the idea of the single accurate order of merit, and hence the notion of some true score, or of some inviolate standard.

There is a further contradiction built into the assumptions of the Responsive frame. For if, in attending to the feedback, the performance of the person assessed is indeed improved, then the quality of performance, the degree of skill, will be changed, and the "true score" will also be changed in the very functioning of the assessment process, making the accurate judgment immediately inaccurate.

It is important to the logic of the Judge, General and Specific frames that no learning takes place after the test, for otherwise the test result becomes invalid, and must surely be dispensed with. On the other hand, within the Responsive frame, it is expected that the responsive feedback from an assessor will interact with the performance and improve the quality of later work, at least in terms of that particular assessor.

In the Responsive frame, this is an act to be applauded; in the other frames, it is a worrying source of error; in this respect the Responsive frame fits into a dynamic, and hence educative, environment. The other frames are predicated on a static universe, and are thus, in a profound sense, anti-educational.
 

Shifting sands

How does the Judge perceive the other frames? To the Judge the General frame is hopelessly relativistic, lacking in authenticity and depth, and devoid of standards. The Specific frame is reductionist and trivial, unable to cope with the cognitive complexity which lies at the heart of any discipline. And the Responsive frame is permeated with that subjectivity that indicates the absence of the objectivity that only comes with true scholarship, which the Judge exemplifies.

How are the other frames viewed from the General perspective? The Judge simply cannot deliver his promise of measuring accurate standards. His idiosyncrasy is legion and his omnipotence is self-delusion. The Specific frame presents information that is scattered, incapable of producing a single dimension of measurement. Any addition of the specific information loses it, and returns the data to the General frame without the usual measurement controls. The Responsive frame presents data that is too diverse and contradictory to be seriously considered as a measurement.

From the Specific frame the Judge may be measuring something but neither he nor anyone else knows what it is. Just so with the General frame, that gets lost in a wilderness of numbers and cognitive abstractions. And the Responsive frame belongs to the world of opinion and gossip rather than scientific description.

The Responsive assessor sees the Judge as a responsive assessor, deluded by a fantasy of objectivity and accuracy. The General frame is seen as mathematical chicanery used to justify unsustainable classifications of individual people. And the Specific frame is seen as an absurd attempt to reduce human experience and performance to a few describable and measurable behaviours.
 

Conclusion

Sensible debate within a particular frame of reference for assessment sometimes occurs. However, rational debate across the full range of frames is a rarity. Part of the reason for this is that people argue from different frames of reference, with their incompatible assumptions, and these are rarely made overt. Not only that, but individual people in a particular discussion shift from one frame of reference to another, sometimes with bewildering speed.

This is why a conversation between a university professor (Judge), a psychometrician (General), an educational software technologist (Specific), and a radical teacher (Responsive) sounds like the sound track from a Marx Brothers movie.

In the next chapter we shall see how these frames are related to concepts of equity and hierarchy.

