Chapter 10: Comparability

Synopsis

In this Chapter I examine the notion of comparability as it applies to the assessment process. Any rank ordering of students, any adding of marks on examinations, any addition across subjects, assumes that comparisons can indeed be made.

The fundamental distinction between more and less, and better and worse, is first elucidated, and this is linked with ideas of uni- and multi- dimensionality and notions of doing or having. This analysis is then applied to ideas of traits, abilities, and skills, and their supposed measurement in tests and examinations. Some fundamental confusions are exposed.

The discussion then moves to what meaning if any can be given to the result when marks or grades are added, how loadings on final rank orders are affected by spread of marks, and how differential privileging of sub-groups occurs with different intercorrelations. Finally, it is contended that for individual students the privileging is non-predictable, and the total score thus meaningless.

Goal kicking skills

George!

Yes coach?

You know why we've lost the last six games?

The other teams were better?

Bad kicking, George. Bad kicking. And with six in a row, someone's got to go.

Gee coach, that's really poetic.

Yeah George, and you're really pathetic. Anyway, do some tests and get me a team ranking on best to worst on goal kicking skill.

No worries, coach. Goal kicking skills, you said?

That's what I said. Get me a best to worst ranking on goal kicking.

What particular aspects of goal kicking, coach?

You're the trainer, George. How far they can kick. How straight they can kick.

Anything else?

Jeez, what do I pay you for? Set kicks, kicks on the run, and snaps. That ought to do for a start.

No worries, coach. I'll work out some tests for each of those and give you a list in a coupla days.

(Two days later).

Here you are, Coach. Here's the list. I've ranked twenty five of them in order of merit on goal kicking skills.

That's great, George. Just what I wanted. Let's have a look at this. Harvey's on top of the list. How many goals has he kicked this season?

None, coach. He's been playing in the back pocket.

Look where you've got Shonker. Twentieth. He's the bloody full forward. He's booted a hundred goals this season already.

Yeah! well, he's missed two hundred.

So he's missed two hundred. He's still booted four times as many as anyone else.

That's because he has ten times as many possessions as anyone else. You didn't ask me about that. You just asked me about goal kicking skills.

Yeah, OK. So who's the longest kick?

Can't tell you that. It got lost in the data.

Who's the most accurate on set shots over 50 metres?

Got lost in the data.

Who's the best snap shooter? No, don't tell me. Got lost in the data.

Hate to tell you, coach, but I think this list is a load of shit.

You can say that again. Who was the idiot who did it?

The idiot who did what some other idiot told him to do.
 

Better or more?

Fundamental to the process of arranging orders of merit is the notion of comparability. As we have seen, the notion of standard implies the notion of order of merit, which implies the notion of more or less, better or worse. For such notions to have a meaning, they must refer to some aspect, some property that is being compared, that is presumably being measured.

However, the first paragraph slid past a fundamental distinction: "more or less" is not the same as "better or worse". More or less are terms related to counting, to mathematics, to scales and measurements. They are loaded with notions of objectivity, and solicit entry to the quantitative world. Better or worse are terms related to value, to goodness. They are permeated with the aura of subjectivity, and are related to the qualitative world, the world of valuing. The concepts are in different domains of discourse. If the criterion is size, then two people may be compared as being more or less heavy; or their weights may be compared in terms of better or worse in regard to health. But the two ratings are unrelated. Or if the criterion is emotionality, we may rate people in terms of whether they are more or less emotional; or we may rate them in terms of the appropriateness or productiveness or empathic clarity of their emotionality. Again the two ratings are conceptually unrelated. Or so it would seem.

What is the essence of this difference? For when we tried to explain what we meant by better, we used words like healthy, productive, empathic, clarity: and the interesting thing is that we may use more or less with any of these words, even though we started off in the better or worse category. And we may also ask of each of these new criteria whether they are better or worse; in this case questions preempted in the predominant paradigm because value judgments of better are already built into the words chosen to describe the criteria.

So what is the essence of the difference? In relation to aspects like size or emotion or clarity, when we ask the question more or less we are asking about intensity, about how much or how many. We are referring to the aspect in isolation from its environment. The event that produces the judgment about more or less involves our sensory relation to that aspect independent of other aspects. More or less questions are answered by focussing on the aspect and on no others. More or less questions are directly answerable. The answer may be incorrect, but such a statement in itself implies that there is a correct answer. More or less has only one meaning in relation to a particular aspect. They can't be more and less at the same time, so the question is convergent, and presupposes a world in which there is a true answer to the question. So logically more or less implies a uni-dimensional aspect, a world of transitive and asymmetric relations (Lorge, 1951, p548).

On the other hand, when we ask the question better or worse, we have to ask another question: In what way better or worse? Because something may be better in some ways and worse in others. Better or worse in what aspects? Or better according to whom? Or better under what conditions? And when we nominate those aspects we can ask of them two questions about any comparison: more or less, or better or worse. And so on. Essentially better or worse implies multi-dimensionality in the aspect under consideration.

What does all this mean? Very simply, when we ask the question more or less there are no further questions to ask. We move straight on to the answer. In other words, more or less questions define the end of discourse; they are a direct invitation to a judgment; they are the signal to stop thinking, and act; and incidentally and significantly, to accept the judgment, which comes after the thinking has stopped.

But the question better or worse logically invites more questions about the first criterion. In what way better or worse? Which introduces more aspects, particular aspects selected in most cases from a much larger set of possibilities. For there are as many aspects as our conceptual imagination may produce (Lorge, 1951, p536). Yet the original aspect is reduced, even as more precision is generated by defining aspects; and as more aspects are conceived, the potential disparities of the judgments concerning them increase. And then for each of those aspects: More or less? Better or worse? And again, the additional questions about positioning and context are generated. So better or worse questions encourage further discourse, and further thought.

All this is not to deny that the power relations in which such discourse is embedded may dictate that the answer to the question better or worse be given at any time and be accepted without further thought. But that in no way invalidates the additional logical questions that the aspect implicitly generates.
 

Having and doing and being

It is obvious, but important, to make the point that whole entities (holons) cannot be directly compared in terms of more or less, only aspects of them (Jones, 1971, p335). One dog cannot be more than another dog. Nor can a stone be more than another stone, nor a stone be more than a dog.

In like manner dogs and stones cannot logically be compared in terms of better or worse, for such a claim is meaningless without a response to the question "in what way better?" A dog cannot be better than another dog. In terms of dogginess, dogs are equally doggy; they are equal by definition, as being classified as dogs. Likewise with stones. And dogs and stones cannot be compared as entities because they are in different classes. It follows that the very act of classifying whole entities (into classes) logically invalidates any comparisons within or between the entities that comprise them. Classes of course can be compared in terms of the numbers of elements they contain, but this is a different matter.

Suppose two people are being compared in terms of the relative merit of their performance of some task. In terms of doing, we may say that one person does it better than the other. This is a statement about relative merit. Or we may say that one person does it more than the other. This is a statement about relative frequency, and not of relative merit. You may drive a car badly many times.

In terms of having, we may say that one person has more of something than the other. This may claim to account for the greater merit. It is essentially a statement about the comparative number of elements in a class. But we would not account for a difference in merit by saying that one person had that something better than the other. Such a statement refers to the whole class and whole classes cannot be compared except by numbers of elements.

So in terms of relative merit, the question of more implies a different mode of description, a different ontology, than does the question of better: Better or worse is a comparison of what people do under certain conditions, made by some person; more or less is a comparison of what people have, or are alleged to have. As such it is logically independent of any contextual or positioning variables. One begins to see the simplistic delusion generated by mathematical modelling.

Logically then better or worse questions cannot be answered definitively until they are reduced to a criterion which comprises a class in which the question better or worse is reduced to the question more or less. Logical here means relations that are transitive and asymmetric.

Pragmatically, better or worse questions can be answered whenever the criteria are sufficiently understood (implicitly or explicitly) to allow consensual subjectivities of judges to give similar answers. However, as we have indicated earlier, such criteria are multi-dimensional. And as is evident from the conversation that began this chapter, little if any meaning can be given to a uni-dimensional description of this multi-dimensional entity in terms of its uni-dimensional elements. As we shall see later, one meaning of such a comparison is dependent on the relative loadings of the different dimensions.

Politically, of course, better or worse questions are answered whenever someone with sufficient status or power gives a decision.
 

Comparing people

It follows that to compare people, whole people, we may compare either some parts that comprise them, or some wholes of which they are parts. If we look at the parts that comprise them, we may look at the person's elements or internal processes; if we look at the wholes of which they are parts, we may examine the person's functions and relations in the wider environment or community, or at the cultural meanings in which their thoughts and actions are embedded (Wilbur, 1996).

Let us compare two people in terms of their relative merit in Physics. We are particularly interested in their relative achievement in a particular course of study at year 12 level. Such a course has a range of content and objectives and involves practical and cognitive operations of varying complexities.

We are obviously in a multi-dimensional world, in which at this stage more or less questions are meaningless. Further, any logical answer to the better or worse question is going to depend on the details of the answer to the prior question: In what way better? What particular aspects? Under what particular conditions? In whose opinion?

And if we intend to give a meaning as well as an answer to a multi-dimensional comparison, what are the relative loadings of each aspect in the final judgment?

Of course, we could simply ask the teacher who taught them, who is better? And the teacher might give a judgment. But in making sense of that judgment in terms of the original question, the implicit questions still hang there; in what way better? So after the judgment, the teacher must logically justify the decision on the basis of criteria; and if one is not better on all possible criteria, then the question of how the criteria are loaded to obtain the final criteria is relevant.

So, either prior to or after the judgment, how might the discourse progress?

In what way is she better?

She knows more facts.

Is that all?

No. She's better at solving problems.

In what way better?

She gets more complex problems right.

Does she get more simple problems right?

No, he gets more simple problems right.

In what ways is he better?

He is more careful, he makes less mistakes.

And so on, and so on. And if we are dealing with twenty or thirty persons, it is clear that different criteria of comparison are possible for each pair, and there is therefore no reason to believe that any final rank order of merit would emerge, for on the basis of different criteria of comparison, A could be better than B on criterion 1, B could be better than C on criterion 2, and C could be better than A on criterion 3. This is an empirically inevitable consequence of multi-dimensionality. It is inevitable because only when every criterion correlates unity with every other criterion will ranking invariance occur. And in that situation we are, by definition, in a uni-dimensional situation. It is the reason that psychometricians fantasise unmeasurable but uni-dimensional true scores.
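The cycle described above can be made concrete with a minimal sketch. The three students, the three criterion names and every score below are invented purely for illustration; nothing here is data from an actual examination.

```python
# Hypothetical scores (higher is better) on three criteria; all values invented.
scores = {
    "A": {"facts": 3, "complex_problems": 1, "care": 2},
    "B": {"facts": 2, "complex_problems": 3, "care": 1},
    "C": {"facts": 1, "complex_problems": 2, "care": 3},
}
criteria = ("facts", "complex_problems", "care")


def better(x, y, criterion):
    """True if student x scores higher than student y on the given criterion."""
    return scores[x][criterion] > scores[y][criterion]


def ranking(criterion):
    """Students ordered from best to worst on a single criterion."""
    return sorted(scores, key=lambda s: scores[s][criterion], reverse=True)


rankings = {c: ranking(c) for c in criteria}

# A beats B on one criterion, B beats C on another, C beats A on a third.
cycle = (
    better("A", "B", "facts")
    and better("B", "C", "complex_problems")
    and better("C", "A", "care")
)

for c in criteria:
    print(c, rankings[c])
print("pairwise comparisons form a cycle:", cycle)
```

Each criterion taken alone yields a transitive ranking; it is only when the criterion is allowed to change from pair to pair that no single order of merit survives.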

Viewed from this perspective, it becomes clear that the more specific, limited and applicable to all comparisons the criteria become, the more possible it is to finally reduce such aspects to those answerable by more or less, and the more possible it is to produce an invariant ranking and a meaning (in terms of explicit loadings) for the original comparison. However, such meaning is at the expense of initially reducing and finally confusing the meaning of the original comparison. This is another example of the essential contradiction between reliability and validity.
 

Traits, abilities and skills

A trait or an ability is a thing that a person has. A trait is a hypothetical entity, an abstract attachment, a comparative label, that is used to explain differences in what people do in terms of something that they have. A trait is described not so much as a performance as a potential performance, as a sort of template of the performance that might emerge under ideal conditions, whatever that may mean; a morphic field that predates performance. This magical property of a trait makes it forever immune to particular environmental conditions, which may indeed influence particular performances, but leave the trait, securely protected within the person, unsullied and unmoved, firmly fixing individual merit in correct relative position in the grand order of things.

A skill is a much more difficult ball of wool to untangle. A skill is something you have, like a verbal reasoning skill. On the other hand, a skill is normally exhibited as something you do, like playing a musical instrument or tennis. And you can have more skill but maybe not better skill (skill here is used as a holon). On the other hand, you can have more skills or better skills, and these two meanings are different, as with the goal kicking skills referred to earlier. Better skills here appears to have more to do with a particular selection of skills relevant to a particular context. Then again, skill seems to refer at times to a particular standard in a more-less or better-worse ranking; unskilled refers to rankings below the standard. It is clear from all this that the word skill is a very useful word to have in any discourse that wishes to imply precision even whilst it multiplies confusion. Norris (1991) notes a similar confusion in the notion of outcomes:

The precise specification of performance or outcomes rests on and leads to a mistaken view of both education and knowledge. Mistaken because there is a fundamental contradiction between the autonomy needed to act in the face of change and situational uncertainty and the predictability inherent in the specification of outcomes (p335).

The world of objective tests

Objective tests, which often claim to be value free, necessarily do not ask better or worse questions. The whole operation is contrived so that only more or less questions are asked and answered. Further, they necessarily deal with what people have, not with what they do. Thus it is not so much a desire to deceive that drives the psychometrician to imagine constructs such as ability or traits or skills, but a logical necessity of the world they have constructed.

For it follows that if there is to be an answer, rather than a multitude of answers, to a comparison of two people, it is essential that the question better or worse never be asked, and all comparisons be reduced to the question more or less.

So the world of objective tests, like the world of chess, and the world of mathematics generally, is certainly internally logical. Whether it relates to anything that actual people do in the world, apart from answering objective tests, or playing chess or doing mathematics, is another question.
 

The world of public examinations

Examinations live in far more dangerous territory. The constructors and markers of examinations are far less isolated from the front line of educational activity than are test writers. Their language is less precise, their pragmatism more up-front, their compromises and contradictions more overt. So they are far more likely to slide uneasily between concepts of better or worse, and of more or less, according to the pragmatics of phases of the assessment.

Consider the marking of essays. Whilst guidelines for marking may be given, ultimately notions of better or worse must be utilised by examiners in deciding what mark to give. Such guidelines are designed to circumscribe the answers to the question "what aspects?," to limit variability in the question "who says it's better?," and hopefully bypass entirely the question of the effects of the conditions on the essay's production.

So in stage one, the answer to the question of "better or worse," which establishes the ranking of students on a particular question, is used to determine the answer to the question "more or less," which is the mark given. Now the marks are added to give a total score, which is then interpreted as being better or worse according to whether it is more or less. Finally, if the grades are not distributed statistically, someone must look at whole papers around the grade boundaries to decide which are in their opinion better than the standard that defines the boundary, and which are worse.

Now, it is clear that this procedure only makes sense if the notion of better or worse, and the notion of more or less, are synonymous within the series of events that comprise the examination. In other words, if better means more within the context of the examination. Practically, this makes it now impossible to untangle the interaction between the two notions, or deal with the complexities involved when multi-dimensional aspects are mapped onto uni-dimensional scales.

It is not my intention to suggest a solution. It is my intention to establish a confusion, and to note that such confusions must invariably lead to more invalidity and uncertainty about what is being described here. In other words, here we have another, crucial and fundamental, source of error.

We are tapping here one of the distinctions between quantity and quality, two concepts often fused together in discourse on measurement and evaluation. At this point it is sufficient to note that big is not necessarily better; getting more sums correct than somebody else does not necessarily make you better at mathematics; nor does getting more spellings correct make you better at writing, or getting more multiple choice questions correct on a philosophy test make you better at philosophy, or a better philosopher. To suggest otherwise is to perpetrate a category confusion. The matters raised in this paragraph are further elucidated in Chapter 12.
 

What can be compared? What can be added?

So in terms of "more or less" we can compare any events that have a common aspect, that have a criterion on the basis of which we can rank them in terms of having more or less of that common aspect. A criterion, that is, that can be considered uni-dimensional.

Two questions then arise, which are fundamental to the whole notion of testing, examining and credentialling. The first question is, what happens when we add measures or ranks that relate to the same aspect? The second question is, what happens when we add measures or ranks that relate to different aspects?

Let's compare swimming pools in terms of two aspects that are comparable in terms of the same measurement units, a claim incidentally we could rarely make in the human measurement field; we could compare the pools in terms of length, or in terms of depth. In both cases they may be measured accurately (to within one millimetre) in metres. Now we could obviously compare our pools in terms of length, and we could compare them in terms of depth. The question is, could we use these criteria to obtain a single measure in terms of which they could be compared? This is in many ways an ideal situation; we have an accurate scale and measuring device, and our two aspects can be accurately compared on the same scale. So we could add the measure of length and the measure of depth. But what would it mean?

We could classify swimming pools uni-dimensionally in terms of the sum of their length and their depth. In terms of the initial components we have now lost any meaning, but the process (the addition) does enable us to imply another meaning; in this total positioning length and depth were equally valued, because we added the two measurements together, each with a loading of one. Or so it would simply appear. But things are not always what they seem and in this instance this would be an erroneous inference.

The relative valuing of the two components may be looked at in two ways; in terms of absolute value of the combined measure, or in terms of the influence on the rank order of the combined measure. Let's look at the absolute measures first.

If the depths of the pools varied from 1 metre to 2 metres, whilst the lengths varied from 10 metres to 100 metres, the magnitude of the addition would be almost entirely defined by the length measurement. Alternatively, if the lengths of the pools were all between 15 metres and 16 metres, and the depths varied from 1 metre to 5 metres, then again the length would contribute most to the total measure.

However, in the second case the final rank order of the total measures would be most influenced by the depth measurement, which has a bigger range. So whilst the loadings for absolute values of the sum of measures are determined by the absolute values of the components (which could statistically be characterised by their mean value, if we wanted to lose a lot of information), the loadings for determining the final rank orders are determined by the standard deviations of each component (Guilford, 1965, p424).
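A small numerical sketch may help to separate the two kinds of loading. The six pools below are hypothetical, with lengths between 15 and 16 metres and depths between 1 and 5 metres as in the second case; the figures are invented and chosen only to make the effect visible.

```python
from statistics import pstdev

# Hypothetical pools, measured in metres; all figures invented for illustration.
lengths = [15.0, 15.2, 15.4, 15.6, 15.8, 16.0]
depths = [3.0, 1.0, 5.0, 2.0, 4.0, 1.5]
totals = [length + depth for length, depth in zip(lengths, depths)]


def order(values):
    """Indices of the pools, from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Absolute loading: how much of the summed totals comes from length.
length_share = sum(lengths) / sum(totals)

# Rank loading: which component the final rank order follows.
follows_depth = order(totals) == order(depths)
follows_length = order(totals) == order(lengths)

sd_length = pstdev(lengths)
sd_depth = pstdev(depths)

print(f"share of the totals due to length: {length_share:.2f}")
print(f"standard deviation of length: {sd_length:.2f}, of depth: {sd_depth:.2f}")
print("rank order of totals follows depth:", follows_depth)
print("rank order of totals follows length:", follows_length)
```

In this invented set, length supplies most of the magnitude of every total, yet the rank order of the totals is exactly the rank order of the depths, because depth has the larger standard deviation.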

In this situation, the rank ordering of the total can be given a (process rather than content) meaning in terms of the relative valuing of the two components; and that valuing is implicitly determined by the standard deviations of their measures. We may adjust this by loading one of the measures. For example, a diver may greatly value depth over length in his pool, so may want the addition to mirror that valuing. So the diver may want to load the depth scores (by multiplying by a certain number) so that the standard deviation of the (loaded) depth measure (before addition), is 5 times that of the length measure. On the other hand, a long distance swimmer may want the two dimensions loaded the other way. In both cases the specific loadings are arbitrary, and in both cases they are related to function. And in both cases the final measure has no meaning other than that attributable to the relative contribution of each component to the final measure. (Of course, in this case the addition was completely unnecessary to the function; it would have been more rational for the diver to specify a minimum depth and minimum length, and for the long distance swimmer to do likewise; but that would have left us with no single variable with which to compare pools. And as mentioned elsewhere in this thesis, that may be the whole point of the exercise).
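The diver's loading can be sketched as follows. The pool figures and the function name loading_factor are hypothetical, introduced only for this illustration; the multiplier is simply whatever number makes the standard deviation of the loaded depths five times that of the lengths.

```python
from statistics import pstdev

# Hypothetical pools, measured in metres; all figures invented for illustration.
lengths = [15.0, 15.2, 15.4, 15.6, 15.8, 16.0]
depths = [3.0, 1.0, 5.0, 2.0, 4.0, 1.5]


def loading_factor(component, reference, target_ratio):
    """Multiplier that gives `component` a standard deviation equal to
    `target_ratio` times that of `reference` (assumes the component varies)."""
    return target_ratio * pstdev(reference) / pstdev(component)


# The diver wants depth to carry five times the spread of length.
k = loading_factor(depths, lengths, 5.0)
loaded_depths = [k * depth for depth in depths]
diver_totals = [length + d for length, d in zip(lengths, loaded_depths)]

achieved_ratio = pstdev(loaded_depths) / pstdev(lengths)
print(f"multiplier applied to depth: {k:.3f}")
print(f"ratio of standard deviations after loading: {achieved_ratio:.3f}")
```

The choice of five is as arbitrary here as it is in the text; a long distance swimmer would call the same function with the roles of the two lists reversed.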

Let me generalise a little from this very simple case:
 

  1. Any measure implies a ranking. Rankings imply transitive and asymmetric relations.
  2. Rankings of a single aspect have a meaning, in terms of relative size or intensity of that aspect, which we can specify as more or less, and hence by numbers.
  3. Rankings of different aspects may be added, but the addition has no meaning in terms of either of the aspects taken separately; the addition can be given a meaning in terms of the relative contribution of the two aspects to the total.
  4. The relative contribution to ranking is determined by the loadings, equal to standard deviation multiplied by an arbitrary number.
 

The effect of correlations on loading

Let's go back to test and examination scores. We have three sets of scores (L, M, N) for the same group of people. The scores have the same standard deviation. We wish to add them to get a total score. Our theory tells us that they will have equal loadings on the final score.

Assume L and M scores correlate zero. Then when we add the L scores to the M scores, rank orders of both are changed, and it looks as though they contribute equally in determining the final rank order.

Assume M and N scores correlate one. Now when we add the N scores to the M scores the rank order of the M scores is unchanged. We could argue that the N scores have contributed nothing to the final rank order.

But then, if we add the M scores to the N scores, we could argue that the M scores contributed nothing to the rank order. A paradox. It is not necessary to resolve the paradox to realise that in this case the loading is determined by what is being added to.

It is also very clear that the final rank orders are very different in the two cases of zero correlation and unity correlation. Regardless of the loadings (statistically determined by the standard deviations), different students have been privileged in the two situations described. In the uncorrelated case (r = 0), no particular sub-group of the M score group is being privileged, or under-privileged, by the addition. However, in the perfectly correlated case (r = 1), the students who do better in M scores are all privileged when the scores are added, and the students who do worse do worser when the scores are added. This is in addition to the fact that the standard deviation of the composite score is about 1.4 times as great in the case of the perfectly correlated group, giving it just that much extra loading as a composite when compared to the other total (Guilford, 1965, p418).
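The factor of roughly 1.4 follows from the standard formula for the standard deviation of a sum of two sets of scores. A minimal check, assuming equal standard deviations (set to 1.0 here purely for convenience) and a hypothetical function name composite_sd:

```python
import math


def composite_sd(sd_a, sd_b, r):
    """Standard deviation of the sum of two score sets whose standard
    deviations are sd_a and sd_b and whose correlation is r."""
    return math.sqrt(sd_a ** 2 + sd_b ** 2 + 2 * r * sd_a * sd_b)


sd_uncorrelated = composite_sd(1.0, 1.0, 0.0)  # the L + M case, r = 0
sd_perfect = composite_sd(1.0, 1.0, 1.0)       # the M + N case, r = 1
ratio = sd_perfect / sd_uncorrelated

print(f"composite SD with r = 0: {sd_uncorrelated:.3f}")
print(f"composite SD with r = 1: {sd_perfect:.3f}")
print(f"ratio: {ratio:.3f}")
```

The ratio is the square root of two, so a perfectly correlated composite spreads students about 1.4 times as widely as an uncorrelated one built from components of the same spread.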

So what does all this mean when both L and N scores are added to the M scores to obtain a single rank order? The L and N scores both have equal loadings to the M scores; but this is a group phenomenon, and tells us little about individual students or sub-groups of students. We have seen that the L score loadings are more or less equally distributed across the M scores, but the N scores have privileged the top sub-group (according to M scores) and down-graded (with respect to the total score) the bottom sub-group. By interpolation we can see that this phenomenon will have a differential effect over the whole range of possible correlations and will be greater as the correlation with the scores added to increases.
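The differential effect on sub-groups can be seen in a small constructed example. The eight students and their scores are invented: L is a reshuffling of the same numbers as M, arranged to correlate exactly zero with M, and N is identical to M so that it correlates one. All three sets therefore have the same mean and standard deviation.

```python
# Hypothetical scores for eight students; all values invented for illustration.
M = [1, 2, 3, 4, 5, 6, 7, 8]
L = [4, 8, 1, 5, 3, 7, 2, 6]  # same values as M, zero correlation with M
N = [1, 2, 3, 4, 5, 6, 7, 8]  # identical to M, correlation of one


def order(values):
    """Indices of the students, from the highest score to the lowest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


def half_gap(scores, reference):
    """Mean score of the top half minus the mean score of the bottom half,
    where the halves are defined by rank on `reference`."""
    ranked = order(reference)
    half = len(ranked) // 2
    top = [scores[i] for i in ranked[:half]]
    bottom = [scores[i] for i in ranked[half:]]
    return sum(top) / len(top) - sum(bottom) / len(bottom)


M_plus_L = [m + l for m, l in zip(M, L)]
M_plus_N = [m + n for m, n in zip(M, N)]

gap_M = half_gap(M, M)
gap_after_L = half_gap(M_plus_L, M)
gap_after_N = half_gap(M_plus_N, M)

print("gap between top and bottom halves on M alone:", gap_M)
print("gap after adding L (r = 0):", gap_after_L)
print("gap after adding N (r = 1):", gap_after_N)
print("rank order unchanged by adding N:", order(M_plus_N) == order(M))
print("rank order unchanged by adding L:", order(M_plus_L) == order(M))
```

Adding L reshuffles individual students but leaves the gap between the top and bottom halves where it was; adding N leaves every rank untouched and doubles the gap. Both additions carry an equal loading for the group as a whole.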

In addition, to the extent that the means of the L and N scores are different, to that extent will the addition scores generally privilege the group with the higher mean.

It is clear that the statistical notion that relative standard deviations determine loadings is a vast oversimplification when applied to complex comparison situations.
 

Comparability, true score, and error

Here we have presented, in very simple form, one of the dilemmas of public examiners who must cope with adding different scores, from different subjects, or from the same subject marked internally and externally, and end up with some final rank order of marks because someone has said this is what they must do.

I have argued that such a total score can have no meaning other than that inherent in the loadings attributable to each component added; and I have shown that whilst the loadings of the whole group from any one school may be controlled through controlling the standard deviation of the marks, the correlations of the score with the score added to will influence the subgroups which are over or under privileged by the addition.

There is another paradox evident in the conclusion, especially in regard to internal-external scores. To expose the paradox two further facts need to be known.

Firstly, the rationale for internal assessment is that something different (broader, deeper, more complex, more varied) is measured by the internal assessment. Secondly, we can assume that in most public examinations some twenty to forty percent of students will be deemed to have failed, and to that extent the rank orders of their final scores are irrelevant in respect to the grades of those who pass; so the pragmatic teacher might argue that to underprivilege students who will fail anyway "does not matter."

In such a situation, it is rational (if somewhat inhuman) for schools to aim for maximum correlations with the external examination in order to privilege those who will most benefit from such privilege (that is, the best students). However, in order to do this they must invalidate the internal examination; for such an examination is surely more valid the less it correlates with the external scores, because it is supposed to be measuring something different. In short, the price of success is invalidity.
 

The middle way

That's all very well for the front runners, but most of the kids I teach are more middle of the road. I just want to get as many as possible past the cut-off point for entry to University or TAFE.

Well, you've got a different problem then. You want to maximise opportunity for the middle group, not the top group.

I suppose you could put it that way. So how do I do that?

Easy. Just take out that middle slab of students and put them at the top of the rankings.

Just like that?

Just like that!

But isn't that unethical? Doesn't that make the whole examination invalid?

Sure. But as I've explained, it's invalid already because of what many schools are doing for their top students.

Are they really aware of what they are doing?

What's the difference? I don't accept the view that in this case bliss in ignorance makes the position less unethical. It certainly doesn't make the practice less invalidating, or the errors less significant.

When equal loadings are unequal

I have shown how equal loadings for a group may take on different shapes according to the correlations. Equal loadings for a group do not in practice mean equal loadings for all subgroups of that group. And in terms of individual students they don't have any particular meaning.

The question then arises, does equal loading for the whole group of students mean equal loadings for each separate school? Surely some school groups are really better than other school groups so should be differentially loaded? Some school groups might have higher means, and some may have larger or smaller standard deviations in the sets of marks that indicate their comparative attainments. And these might mirror differences in intrinsic ability, whatever that means, or might be a function of very good, or very bad, teaching, whatever that means. But if such students are tested internally, how would we know about their differential potential, or their differential attainment, as distinct from differential testing effects? And especially how would we know if they study and emphasise different things, and value different criteria, so that their results are essentially non-comparable? Or if they study different subjects, with utterly different realms of discourse, such as chemistry and Japanese?

Now there are a number of ways of trying to solve this problem, all of them more or less inadequate. McGaw (1996) summarises them well: use some external examination (either the specific one related to the subject, a single "scholastic ability" test, or some grand total score on all external examinations) to statistically adjust the internal school results; this is statistical moderation of the school-based assessments. Or alternatively "use some external review and checking of schools' assessment results by teachers from other schools or authorised assessment experts to control the level and distribution of school-based results (ie consensus moderation)" (p82).

Such moderation systems provide different processes for modifying the means and standard deviations of school scores on the basis of comparison with other scores or other schools or other students. To the extent that the correlations with the criteria (whether the criteria are scores or actual criteria in the minds of the moderators) are high, to that extent is the moderation reasonable, and possibly invalid. And to the extent that correlations with the criteria are low, or differential, to that extent is error compounded, as we have indicated in the previous discussion.
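
To make concrete what "modifying the means and standard deviations" can involve, here is a minimal sketch of one simple form of statistical moderation: a linear rescaling of a school's internal marks so that they take on the mean and spread of the same students' external marks. The function name `moderate` and the marks are hypothetical, and actual moderation schemes may well use other adjustments; this is an illustration under those assumptions only.

```python
from statistics import mean, pstdev

def moderate(internal, external):
    """Linearly rescale internal marks to the mean and spread of the
    external marks obtained by the same group of students.
    (Hypothetical helper; one simple model among many possible ones.)"""
    m_int, s_int = mean(internal), pstdev(internal)
    m_ext, s_ext = mean(external), pstdev(external)
    if s_int == 0:
        # No spread in the internal marks: every student gets the external mean.
        return [float(m_ext) for _ in internal]
    return [m_ext + (x - m_int) * s_ext / s_int for x in internal]

# Hypothetical marks for four students at one school.
internal = [60, 70, 80, 90]   # school-based assessment
external = [45, 50, 55, 70]   # external examination

moderated = moderate(internal, external)
print([round(x, 1) for x in moderated])
```

Note that the rescaling leaves the school's internal rank order untouched; only the level and the spread are changed, and it is the spread that governs the loading of these marks in any later total.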

I do not intend to enter into the debate as to which of these is the "best" way to go, or indeed whether they all do not produce solutions which are more inequitable than the problem they were devised to solve. My project here is not to indicate how such problems may be best solved, but rather to detail what implications such solutions have for the empirical determination of error.
 

Comparability error

What is clear is that different solutions, including no solution, produce different results. The notion of "true score" depends on the notion of some uni-dimensional trait, and that notion is obviously inadmissible when not only do the additions involve components which have low correlations and do not claim to be about the same thing, but different additions contain different components (that is, marks from different subjects). But the notion of difference in estimates requires no such theoretical underpinning. It is empirical data, demonstrated by differences in empirical rankings or scores under different experimental conditions.

Estimates of comparability error are easily computed. We are given that various forms of inequity are inherent in all measures, both school-based and external; that the meaning of the final rank order rests on relative loadings; and that every means of trying to create equal loadings involves arbitrary assumptions and the subsequent construction of additional inequities. Given these facts, it is relatively simple to construct a number of different aggregates according to the various models available (including the original raw data), and thus to determine the range of ratings (or scores) that these produce for each student. These empirical differences are an estimate of the comparability error. Such a set of scores has the added advantage that it relates to estimates for each individual, and does not confuse such individual differences with group statistics (such as the standard error of the estimate).
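
The procedure can be sketched in a few lines. The example below builds three aggregates from the same marks (a raw total, a total of standardised scores, and a total of range-scaled scores), ranks the students under each, and reports the best and worst rank each student receives. The students, marks, function names, and choice of three models are all hypothetical, chosen only to illustrate the idea; they are not drawn from any actual examination system.

```python
from statistics import mean, pstdev

# Hypothetical marks in three subjects for four students.
marks = {
    "Ann": [82, 55, 61],
    "Ben": [70, 68, 64],
    "Cal": [58, 74, 60],
    "Dee": [65, 60, 75],
}

def columns(marks):
    """Marks regrouped by subject rather than by student."""
    return list(zip(*marks.values()))

def raw_total(marks):
    """Model 1: add the raw marks as they stand."""
    return {s: sum(v) for s, v in marks.items()}

def z_total(marks):
    """Model 2: standardise each subject (mean 0, SD 1), then add.
    Assumes every subject has some spread of marks."""
    stats = [(mean(c), pstdev(c)) for c in columns(marks)]
    return {s: sum((x - m) / sd for x, (m, sd) in zip(v, stats))
            for s, v in marks.items()}

def range_total(marks):
    """Model 3: rescale each subject to run from 0 to 1, then add.
    Assumes every subject has some spread of marks."""
    bounds = [(min(c), max(c)) for c in columns(marks)]
    return {s: sum((x - lo) / (hi - lo) for x, (lo, hi) in zip(v, bounds))
            for s, v in marks.items()}

def ranks(totals):
    """Rank 1 is the highest total; ties are broken by listing order."""
    ordered = sorted(totals, key=totals.get, reverse=True)
    return {s: i + 1 for i, s in enumerate(ordered)}

def comparability_error(marks, models):
    """Best and worst rank for each student across the given models."""
    all_ranks = [ranks(model(marks)) for model in models]
    return {s: (min(r[s] for r in all_ranks), max(r[s] for r in all_ranks))
            for s in marks}

result = comparability_error(marks, [raw_total, z_total, range_total])
for student, (best, worst) in result.items():
    print(student, "ranked between", best, "and", worst)
```

With these invented marks Ben is first on the raw totals, but Dee is first once the subjects are standardised or range-scaled. Nothing about either student has changed; only the model of aggregation has. The spread of ranks for each student is the individual estimate referred to above.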

Note that this is not the assessment error. The comparability error is the additional error introduced by the procedures of summing or summarising scores, and it is independent of the other sources of error described elsewhere.
 

The ontological remainder

My description of comparability error here begs the question as to whether the whole process isn't a nonsense, because of the meaninglessness of the total score. In order to examine that notion briefly I will examine the construct, not of academic merit, which might be a name that we could give to the sum of marks on test or examination performance in various academic subjects, but rather the idea of athletic merit, a similar construct we might conceive in the field of more physico-social endeavour. Concerned at the physical flabbiness of our youth, the party in power in the Federal Government, as part of its election platform for 1998, promised to improve the nation's health by removing the flab.

Thus in the year 2000, two lists of year 12 students were produced by Education Departments in each State: one for academic merit, and one for athletic merit. For the latter, students were required to nominate three areas of physical prowess. To ensure some breadth they had to include at least one area from athletics or swimming, and one from team sports.

Brad and Diana made their choices. Brad, who does not like running, and is not very strong, chose walking as his athletics choice, doubles bowls as his team game, and pistol shooting as his third choice. Diana chose the hammer throw for athletics, basketball for a team sport, and golf for the third choice. Diana is not very fast or indeed very agile, but she is 1.8 metres tall and weighs 95 kg.

Brad and Diana both covered the curricula designed around their choices, and completed the various tests designed to measure their skills in the designated areas. After some statistical corrections, their separate scores were added to give a final mark. They both obtained the same score of 189 points, which was about half a standard deviation above the mean for all year 12 students in Australia.

Independently of this (obviously), they were both offered scholarships at the Australian Institute of Sport: Brad because his pistol shooting scores placed him in the world's best ten; Diana because the previous year she had broken the Australian Women's open hammer throw record.

This story is important because it is about individual students and not about groups of students. All of the talk of equal loadings and fairness is in the "equal ends" definition of equity. It attempts to address inequities involving groups of students, but in no way addresses the inequities done to individual students. And just as attempts to address inequities between whole school cohorts invariably lead to other inequities in terms of sub-groups within the school, so any attempt to reduce "better or worse" questions to "more or less" questions, or to reduce multi-dimensional entities to uni-dimensional ones, must invariably discriminate against some students more than others, and utterly confuse the meaning of what the final ranking is really about.

The second aspect of the apocryphal story that I want to draw attention to is its obviousness. It is obvious that all of these physical activities are different from each other and that whilst comparisons of aspects within a single sport may sometimes be meaningful, between sports such comparisons are meaningless.

What is not so obvious perhaps is that the complexity and possibilities of difference within cognitive endeavours have much more span, and much more depth, than do those of a largely physical nature. For this field encompasses the whole universe of cultural experience and knowledge. And the ideologies of schooling, if not the practices, assure us that students will have the opportunity to tap this richness. Even so, at the end of the day it all gets reduced to a uni-dimensional list. And both the tragedy and the absurdity of this get lost in its normality.

