Chapter 9: Instrumentation
 

Introduction

Assessments in the Responsive mode do not necessarily involve standards or measures. In this frame, assessors may be content to describe without measuring, to give feedback without judgment, to respond with blatant subjectivity.

However, in the political and technocratic world in which evaluation thrives, such 'soft' assessments are scorned, and the claim to measure, to rank, and to compare to a standard is what gives status and power to the evaluation process. Sydenham (1979) points out that even in the physical sciences

a great deal of modern instrumentation is used to control, rather than gain, new knowledge in the scientific sense. . . it would seem that man seeks to extend the body of knowledge to make eventual use of it to subjugate his environment to suit man's needs (pp. 30-34).

In the social world, it is people, regardless of any particular label, who are subjugated.

Measurements in physics

To measure any quantity or quality in the physical world we use an instrument, and the instrument must be calibrated. To measure length we need a ruler, and on the ruler is the scale. To measure time we need a clock, and on the clock face is the scale in seconds. To measure current we need an ammeter, calibrated in amperes. The electricity meter measures electrical energy consumed and is calibrated in kilowatt hours.

To calibrate the instrument there are three requirements. The first relates to scale, the second to replicability, and the third to theory-practice bridging.

Whilst scales do not have to be linear (they may be logarithmic or indeed of any other mathematical or ordered function), the nature of the scale does need to be known if any sensible interpretation of the scale is to be made. I will discuss only linear scales here, as they are the simplest and the most common, keeping in mind that the general argument would apply to any other scale for which a mathematical function applies with which to interpret differences.

For a linear scale, equal gaps represent equal quantities of the thing being measured. The gap between 3m and 4m is exactly the same as the gap between 6m and 7m. The period of time represented between 9.1 sec and 9.2 sec on the stop watch is identical to the period represented between readings of 12.8 sec and 12.9 sec. The 5 kW hr of electrical energy represented by the difference in meter readings of 39.4 and 44.4 is identical to the 5 kW hr of electrical energy represented when the meter reading goes from 44.4 to 49.4. As we pay for the electrical energy that we use, we would want to be sure that this equation was true. We would want to be sure that equal differences on the scale equated to equal differences in energy consumption. And when measures are added we would want to be sure that the laws of arithmetic applied.

We would also want to be assured that our meter gave the same reading as any other meter. It wouldn't need to look the same, or even be constructed the same, but we would want to be certain that if other people used up the same amount of electrical energy that we did, their meters would also indicate that 5 kW hr had been used. So other meters and other occasions must give identical differences for the same energy consumption. Yesterday's 5 kW hr on one meter must be identical to tomorrow's 5 kW hr on another meter.

And finally, after being convinced that the scale was calibrated accurately and the results were replicable, we would want to be assured that the meter really was measuring electrical energy in the units described. We would not want to pay for 5 kW hr of electrical energy if we were only using three. If all the meters are over-reading we are all being equally ripped off, but we are still being ripped off.

To ensure this accuracy we would require comparison with some standard instrument, against which all others could be compared. Such a standard instrument would itself incorporate both the meaning and the value of the thing we are measuring. That is, the standard includes within its operation both the theory of its definition and the practice of its measurement. For example, a standard metre rule is both a practical measure of a metre, and incorporates the theory that equal distances along its length are of equal value. A standard ammeter, designed to measure electrical current, incorporates within its operation both the numerical value of current marked on its scale, and, within its mechanism, the definition of the ampere as a particular force acting between two conductors a certain distance apart carrying electrical current. And our kilowatt-hour meter gives us a reading on the scale, and incorporates into its mechanism the definition of electrical energy as the product of voltage, current, and time.

Strictly speaking, such instruments (as instruments) incorporate sub-standards rather than Standards; that is, because they are instruments, they necessarily incorporate an error, which in the cases cited is very small. Because the Standard, which is some fixed point on the scale, is by definition error free, it follows that the Standard must be defined in terms of some mathematical theory (or some replicable event that is more accurately measured than the instrument); that is, in terms of theory or events which have been empirically shown to have specific linkages with other measurable aspects of the physical world.

The standard and the measure

At this point it seems important to clarify the fundamental difference between any standard, and the measurement of that standard, for it is in the failure to appreciate this fundamental distinction that much of the confusion (and manipulation and mis-information) about the measurement of human 'ability' and 'standards' is rooted.

The standard is arbitrary, and is completely accurate. It is not arbitrary in the sense that it is capricious or random. It is arbitrary in the sense that it is based on opinion, and is merely one of a very large number of standards that could have been chosen. However, once the standard is defined as the standard, then it is that exact value. The value of the standard measure is completely accurate not because it has been measured completely accurately; the value of the standard measure is completely accurate because it is a definition, and not a measurement (Sydenham, 1979, p26).

If now we wish to measure a particular thing, we may ask whether it is above or below the standard measure, and by how much. In order to do that we must measure it with an instrument of some kind, or make calculations that involve such measurements. And such measurements will always contain some error, for such is the nature of measurement, because measurements are made along a continuum, unlike counting, which occurs in discrete leaps. We may count the number of bricks, and may do this without error. But no two bricks will be of exactly the same weight. One will have a few more grains of sand or clay than another. And even if two were of exactly the same weight, we could never know that, for the instrument with which we weigh them also contains errors in its scale, in the calibration of that scale, and in the reading of the value of the scale. Two bricks for which we obtained equal weights could indeed be of different weights if measured on another scale of equal accuracy. And two bricks for which we obtained different weights could indeed be the same (within the order of accuracy of that measuring instrument) if measured on a scale of greater accuracy.

One of the party tricks used by educators and others who wish to defend their indefensible measurements is to give examples that reduce measurements to counting. Surely 16 out of 20 correct spellings is 80 percent! Surely number facts in addition or multiplication are either right or wrong! And then they stop. For in the whole field of education they can't think of any other examples where measurement may be so reduced to a counting procedure. Not to mention the sidestepping of the question, eighty percent of what?

The case of the digital watch

Increasingly, instruments use digital electronic mechanisms which use counting methods to give their scale readings. However, these jump from one number to the next, just as watches with visual dials jump forward in leaps of one second or one tenth of a second. Time, however, does not jump forward in such leaps, but is measured on a continuum, as are most of the other quantities that we measure. So the best accuracy such an instrument can claim is the gap represented by the jump. The actual error may be much greater.
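The point about the jump can be made concrete with a minimal sketch. It assumes a hypothetical stopwatch that displays the last whole tenth of a second reached; the function names and the 0.1 sec step are illustrative only. Two different true times give the same reading, so the reading can only locate the true time somewhere within one step.

```python
def digital_reading(true_seconds, resolution=0.1):
    """Value shown by a counting display: the last whole step reached."""
    # The small epsilon guards against floating-point division landing
    # just below a whole number of steps.
    steps = int(true_seconds / resolution + 1e-9)
    return steps * resolution


def possible_true_range(reading, resolution=0.1):
    """All true values that would have produced this reading."""
    return (reading, reading + resolution)


# Two different true times are indistinguishable on the display.
for true_time in (9.13, 9.19):
    shown = digital_reading(true_time)
    low, high = possible_true_range(shown)
    print(f"true {true_time} sec -> shown {shown:.1f} sec "
          f"(true value lies between {low:.1f} and {high:.1f})")
```

This sketch covers only the error due to the jump itself; errors in the oscillator, in calibration, and in starting and stopping the watch come on top of it.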

The interference effect

It is a truism of science, often conveniently forgotten, that any measuring instrument distorts the field it is intended to measure. This is obvious when we think about it. For the measuring instrument to operate, it has to interact - that is, interfere - with the field it is measuring. Newton's Third Law is a universal principle: every action has an equal and opposite reaction; if the field acts on the measuring instrument, then the measuring instrument simultaneously acts on the field.

The effect may be relatively small - a thermometer inserted into a large container of hot water will not much affect the temperature of the water, though it will affect it. However, a very cold thermometer inserted into a very small cup of warm water may cause the temperature to drop appreciably. The temperature thus measured is not that of the water alone, but that of the water-thermometer system.

In this particular case, it is possible to estimate the imprecision caused by the measuring instrument, if we know the masses and specific heats of water and container and mercury and glass, and the temperature of the surrounding air and the time taken for the thermometer to give its highest reading and the rate of heat loss from the container. Then we may estimate the temperature of the water at the moment the thermometer was inserted. However, even in this simple case, it is necessary to use a theory that is itself, of necessity, subject to some imprecision.
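A minimal sketch of such an estimate follows. It makes a strong simplifying assumption: heat is exchanged only between the water and the thermometer, so the container, the surrounding air and the time taken to settle are all ignored. The thermometer's heat capacity (2.0 J per degree) and the quantities of water are hypothetical values chosen for illustration; only the specific heat of water is a standard figure.

```python
C_WATER = 4.186  # specific heat of water, joules per gram per degree C


def equilibrium_temperature(m_water, t_water, heat_cap_thermo, t_thermo):
    """Temperature of the water-thermometer system after heat is shared.

    Assumes no heat is lost to the container or the air.
    """
    water_capacity = m_water * C_WATER
    total_heat = water_capacity * t_water + heat_cap_thermo * t_thermo
    return total_heat / (water_capacity + heat_cap_thermo)


def original_water_temperature(reading, m_water, heat_cap_thermo, t_thermo):
    """Work back from the reading to the water temperature before insertion."""
    water_capacity = m_water * C_WATER
    return (reading * (water_capacity + heat_cap_thermo)
            - heat_cap_thermo * t_thermo) / water_capacity


# Hypothetical cold thermometer (5 degrees, 2.0 J per degree) in 40 degree water.
for label, grams in (("large container", 10_000), ("small cup", 20)):
    reading = equilibrium_temperature(grams, 40.0, 2.0, 5.0)
    print(f"{label}: reading {reading:.3f}, distortion {40.0 - reading:.3f} degrees")
```

Even this correction rests on the heat-balance theory and on measured masses and heat capacities, each of which carries its own imprecision.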

Sometimes the instrument is permanently incorporated into the system, and can then be defined as part of the field. Our electricity meter is a case in point. It is a permanent part of the electrical fixtures in the home. Nevertheless, it does use up energy in its very operation, thus increasing the energy needed for the house. It does distort the field. And as we might expect, it is the consumer, and not the electricity company, who pays for the distortion.

So how big is the interference effect when a 'test' is used to measure some human 'attainment' or 'ability'? How precise is the theory that links the measuring instrument to the thing it is supposedly measuring? And does the test introduce a small distortion into the field it is supposedly measuring, or is it of the same order of magnitude as the field? Are we putting a warm thermometer into the ocean, or into a little test tube of cold water?

Boundary conditions

Another fact of Science often conveniently forgotten is that the precision of the physical sciences - that is, their ability to obtain (almost) identical results in replicated experiments - is directly related to our ability to control the boundary conditions of the experiment: to prevent heat loss, to create a vacuum, to maintain a constant magnetic field, and so on. The precision of physics is specifically related to our ability to create a completely controlled (and hence artificial) environment in which to construct and conduct the experiment. The formulas of dynamics are very accurate in predicting the velocities of objects in free fall in a known gravity field in a vacuum. They are hopeless in predicting such velocities for a skydiver who jumps from a real aircraft in a real atmosphere. She will not reach the ground at the same time as a bunch of feathers or a lead ball thrown out at the same time, nor, luckily for her, at any time predicted by the formulas of simple dynamics. The point to note is that controlling the boundary conditions often produces an artificial environment which makes the data unusable in the 'uncontrolled' world.
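The gap between the controlled and the uncontrolled case can be sketched numerically. The vacuum formula is v = g t. For the real atmosphere the sketch assumes a simple quadratic air-resistance model, for which the speed from rest is v = v_T tanh(g t / v_T), and it assumes a terminal speed v_T of 55 metres per second for the skydiver. Both the model and that figure are illustrative assumptions, not values from the text.

```python
import math

G = 9.81  # gravitational acceleration, metres per second squared


def vacuum_speed(t):
    """Speed after t seconds of free fall from rest in a vacuum."""
    return G * t


def drag_speed(t, terminal_speed=55.0):
    """Speed after t seconds from rest with quadratic air resistance.

    terminal_speed is an assumed figure for a skydiver, not a measured one.
    """
    return terminal_speed * math.tanh(G * t / terminal_speed)


for t in (1, 5, 10, 30):
    print(f"t = {t:2d} sec: vacuum {vacuum_speed(t):6.1f} m/s, "
          f"with air resistance {drag_speed(t):5.1f} m/s")
```

In the first second the two predictions nearly agree; after half a minute the vacuum formula is wrong by a factor of about five, and the drag model itself is still only an idealisation of a real fall.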

This excursion into elementary physics is occasioned not only by nostalgia, but by a desire to clarify some of the relationships between instrument precision and measurement precision in that most precise of sciences. Precision in Physics certainly cannot be greater than that of the measuring instrument, and any calculation based on that measurement is limited by the empirical accuracy of the attendant theory. In most cases, however, these two variables are not the main limitation on replicable accuracy. It is rather the stability of boundary conditions, the physical scientist's ability to artificially freeze all other significant variables, that allows such precision, predictability and control in these sciences.

And this is the precise problem we face when we try to measure people. For the boundary condition for stable human behaviour (and all measurement of people, all assessments, all tests, all examinations, must elicit or refer to some form of behaviour), is a stable human mind. But the individual human organism is not a computer. It does not produce a unique response to the same situation, if for no other reason than that the 'same' situation never recurs. Perception and conception, and hence response, to 'identical' situations invariably differ, as the variables that affect such reactions - attention, mood, focus, metabolic rate, tiredness, visualisations, imagination, memory, habit, divergence, growth etc. - come into play.

As Kyburg (1984) describes it:

measurement makes sense only when the standards are reproducible, permanence over time being considered a form of reproducibility. Furthermore, the usefulness of measuring according to this scale depends on some form of reproducibility or permanence among the objects or processes being measured. (p190).

So the very concept of a 'true' measurement resides in the assumption of a stability and permanence in the characteristic being measured, and the boundary conditions of the measurement. Lack of these conditions does not represent so much an error of measurement, as a discrepancy with fundamental assumptions.

Where does the data come from?

Before dealing in more detail with the specific problems in measuring human ability, there is one more point to clarify. Where does the data come from? Where does it belong?

Data are not out there; they are events interpreted. What constitutes data and what constitutes garbage depends upon frame of reference, aim and method. Furthermore, data are not collected, they are constructed. Data require interpretation and represent the results of a construal, not simply a discovery (Eisner, 1990, p 183).

What Eisner is saying here is very important. The data, the measures, are not out there in the object being measured. They are measures that we have generated through a particular mechanism that includes the measuring instrument and the theory and some aspect or property of the thing being measured. Any claim to 'scientific' truth involves a further implication that a similar mechanism would produce similar data on another occasion with the same person. Or more accurately, with the person that person has now become.

So the temperature is not only some aspect of the object being measured; it is also and equally a meaning generated by a certain way of construing the world (the theory), and a certain way of interacting with it (the mechanism which includes certain actions with instrument and object). As Pawson (1989) expresses it, the only alternative is "to retain the notion of an observable realm that is independent of us yet knowable, . . . (and) to propose some automatic, pre-established harmony between subject, language and world"(p 61).

In like manner, if we are able to measure some aspect of a person called their ability, we are not measuring something they have. We are generating data that is also determined by the mechanism of the instrument-person interaction, as well as by a certain way we, the assessors, have of construing the world. In other words, we ask them to live in our little experimental world for a time, and make a measure in that world. To pin the label on them apart from that world is to misrepresent the experiment: the data, the label, belongs not to them, but to the whole theory-experiment-instrument-object interaction.

Measuring human ability

The rather detailed account of the properties that measuring instruments must have if they are to be useful in the study of the physical world enables us to look more adequately at the measurements being used in the study of human ability or human attainment. We might expect such instruments also to incorporate the same three necessary elements: a generally acceptable theory that enables the gap between theory and practical measurement to be bridged, in which a standard measure is defined; an instrument that is itself replicable in terms of the theory, and gives replicable results when measuring the same thing on different occasions; and a scale on which equal differences either represent equal 'ability' differences, or can be translated into some meaningful comparison by a known mathematical relationship. This last becomes particularly important if we wish to use the measure to make a categorisation, or to add it to some other measure.

Standards and standards

Before examining how the Judge, General, Specific and Responsive frames for assessment stand up in relation to these three elements, I want to clarify the meaning of the word 'standard' in relation to human products. This 'standard' relates to a point on a scale, to a point below which the product is unacceptable. The standard thus indicates the lowest limit of acceptability. It requires a scale to define it.

This 'standard' is utterly different to the 'Standard' which is the basis of the scale, and hence of the measures made by the scale. This 'Standard' defines a difference between points on the scale, and can be used therefore to check the replicability of instruments. So we have a 'Standard' metre length, a 'Standard' second of time, a 'Standard' kilogram of mass. I have (arbitrarily) differentiated this Standard with a capital S. Such Standards are useless unless measuring instruments of great accuracy are available to sub-divide and expand the scale embedded in the Standard. However, the specification of any Standard does not guarantee the existence of a suitable measuring instrument (Sydenham, 1979, p 26).

The tendency we have to attribute guilt by association is well known. We are less wary of the tendency to attribute innocence by association. Our Standards of length and time are immensely accurate, as any Standard that defines a scale must be. Indeed, Standards of this sort are infinitely accurate because they are definitions and not measurements. The sub-Standards do involve measurement. And as the sub-Standards also provide bases for scales, the measurements they make must be very accurate and precise. We tend to attribute similar accuracy of measurement to those quite different 'standards' that are used to describe minimum acceptability.

Most industry product 'standards' of minimum acceptability are based on criteria for which very accurate measurements can be made. That is, we can measure very accurately whether our product is minutely above or below the stated standard. And that tends to make us forget that the standard itself is not a measurement but is a definition, and is arbitrary. Any amount of a particular additive to food could be harmful to a particular person. All exposure to radiation, even background radiation, has an effect on living organisms. Any bridge will collapse under some particular conditions. Product standards are always statements about a compromise. They represent the arbitrary point at which safety, conservation, style, cost, expediency and whatever strike an uneasy, indeterminate, and hence arbitrary balance. At which point they assume a solidity and stability that denies and contradicts their genesis.

Any standard of acceptability is a political entity, as much in its production as in its enforcement. The myth of certainty that surrounds measures of people is achieved partly by its association with the Standard that defines accurate scales, with the standard that is a definition of acceptability, and with the standard we salute as the symbol of authority, as referred to in chapter 6.

Judge's frame

Whilst the Judge often uses a student's written work, in assignments or tests, as a basis for measurement, the Judge would not see the test as an instrument. Nor would he claim to make a measurement. What is written is merely a vehicle for showing him what the student is capable of. The Judge would claim to be able to use any such example as a basis for indicating the level that the student had attained. The Judge is not even particularly concerned to have a sample, random or otherwise. Any example, according to the Judge, can be judged according to its relation to the standard.

In scientific terms, the cognition of the Judge is the instrument, and incorporates the Standard, the scale, the theory-practice gap, the standard of acceptability, as well as the actual measurement, all within its own internal mechanism. Putting it more bluntly, the Judge simply does not operate on a scientific paradigm. Rather the Judge is a mystic who claims to 'know' the definition of standard, rather as one may 'know' the presence of God. A student's level of attainment may then also be 'known' and hence judged accurately, through the union of his/her own consciousness and that of the person being assessed, the example of the work judged being the medium through which this communion occurs, rather in the manner in which tea-leaves activate the astral consciousness of the psychic. Such a process is sometimes conceptualised and rationalised by considering the permutations of such value imponderables as style and form, understanding and creativity, texture and design, understanding of the field, or whatever. Many, if not most judges, would admit however that such variables were used to justify their intuitive judgments, rather than to logically develop their proofs.

From the point of view of the scientific paradigm, the work of the Judge is aesthetic rather than scientific. As such, it belongs logically to the Responsive frame with all the limits and advantages of the overt subjectivity of that frame. Creative reflections on their work by others can be of great value to a student's learning. However, when given in the form of absolute judgments rather than helpful feedback, such reflections are more likely to stifle learning than to expand it, more likely to inhibit creativity than encourage it, more directed to conformity than diversity.

What stops such classification into the Responsive frame is the refusal of the Judge to admit such idiosyncratic subjectivity, and to insist on the truth and objectivity of his judgments as measures of human performance or ability, by invoking the ideology of the absolute standard and the expert judge, and assuming, in both senses of that word, a state of mystical communion.

More recent post-modern conceptions of the Judge's frame use the notion of the interpretative community to defend the position of the Judge. Here quality is determined by a discourse embedded in the language of the field, and various criteria or aspects of quality may be so discussed. However, despite the acceptance within the community of the ephemeralness of the notions it produces, the end result is still the categorisation of the product and/or the student; a solid dichotomous categorisation that denies the tentativeness of its genesis, and, certainly outside that community, and I suspect also within it, is not regarded as problematic (Fish, 1980).

General frame

The General frame pays considerable attention to problems of scale and replicability, and the theory-practice gap. Theoretically (though almost never in practice) it uses random sampling theory and practice, and assumptions about the distribution of attainment, to produce an instrument (a test), define a scale (normalised score), and estimate replicability (standard error or correlation). In terms of 'ability' measures various standards can also be defined in this model to comprise certain grade levels, in terms of percentiles of defined populations.

Now this is more or less what 'standardised' tests do. In my view they vastly underestimate the error, both in its theoretical definition, as well as in its representation (or more accurately its non-representation) to student and faculty. Some specific details of this are given in Chapter 15 on the psychometric fudge. Rarely do the instruments satisfy the requirements of theory (random selection of items), nor do the populations on the basis of which they are calibrated (random selection of the population). Even so, they do tend to satisfy some of the general requirements for a measuring instrument, as required by the physical sciences, even though the errors in these instruments, if made explicit in public knowledge, would make them useless for the purpose for which they are designed.

There are, however, three more fundamental sticky points, points at which the whole exercise becomes very suspect, or unrealistic. The first is inbuilt, and concerns the assumption about normal distribution of performance (or indeed any other assumption that might replace it) built into the theory. There is absolutely no reason to believe that in any area of educational activity the end point should be represented by a normal distribution (which is the same shape as a random distribution) of attainment. In fact, the better the educational environment, the more likely we are to obtain a very skewed, lop-sided, distribution of attainment.

The second occurs when the scores, which are defined in terms of the distribution, are presumed to relate to some 'standard of competence' for an individual student. This latter represents an error in logical typing, but might be more truthfully described as a semantic confidence trick.

Perhaps the most blatant example of this is the distribution grades that are labelled A B C D F. These grades may be defined in terms of percentile distributions, so that A represents the top 5 percent of the rank order of students (or whatever other arbitrary percentage is chosen), B the next 20 percent, and so on. Logically then, F represents the last 5 or 10 percent or whatever. So why not E? Because F also stands for 'fail', a statement about competence and not distribution. And historically, as we know, A and B have connotations of excellence that C does not have, though there is nothing in the distribution that implies either that A is an excellent performance or that C is a mediocre performance. For example, if a group of professional sprinters throw the javelin and are then graded in terms of their rank order, we would not expect those obtaining an A to have reached the Olympic 'standard'. On the other hand the person who runs last in the Olympic 800 metres final is hardly a mediocre runner, or a failure.

For even if we accept the notion of a 'normal' distribution, the sticky question still remains: a normal distribution of which group? All the people in the world? All the educated people? All the people still at school? All the fifty year olds? All the people at a particular grade level? In a school? In a city? In a country? Without this detailed information the 'standards' cannot be given a meaning. And even with them, they can be given no meaning other than that defined for them. That is, their meanings can only relate to distribution, and not to competence.
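A short sketch shows why a grade defined by distribution carries no information about competence. The top 5 percent for A, next 20 percent for B and last 5 percent for F follow the example in the text; the shares given to C and D, the student names and the scores are hypothetical. The same performance is graded against two different groups.

```python
# Percentage of the rank order allotted to each letter. A, B and F follow
# the example in the text; the C and D shares are assumed for illustration.
BANDS = (("A", 5), ("B", 20), ("C", 50), ("D", 20), ("F", 5))


def distribution_grades(scores, bands=BANDS):
    """Assign letters purely by position in the rank order.

    scores maps each name to a score. Ties are broken by insertion order.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    for position, name in enumerate(ranked):
        cumulative = 0
        for letter, percent in bands:
            cumulative += percent
            # Integer comparison avoids floating-point trouble at band edges.
            if position * 100 < cumulative * n:
                grades[name] = letter
                break
    return grades


# Hypothetical groups: one performance of 70, judged among strong or weak peers.
strong_group = {f"peer{i}": 71 + i for i in range(19)}
strong_group["Kim"] = 70
weak_group = {f"peer{i}": 40 + i for i in range(19)}
weak_group["Kim"] = 70

print("Among strong peers:", distribution_grades(strong_group)["Kim"])
print("Among weak peers:  ", distribution_grades(weak_group)["Kim"])
```

Nothing about the performance itself changes between the two runs; only the reference group does, and with it the letter.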

Even with such information about the nature of the sample population, there is, and can be, no formula, no equation of equivalence, between grades defined by distribution on a rank order, and some pre-specified level of attainment of an individual student (Airasian, 1979, p 42; Jaeger, 1980, p 64; Glass, 1978; Levin, 1978, p 314; Burton, 1978, p 263; etc.).

In addition, the differences in logical type in attempting to make linear measures of complex qualities generate paradox and confusion and hence strong emotion and unresolvable debate (See Chapter 12). This makes the topic utterly suitable for creative endeavour and satirical humour, but impossible for scientific measurement.

The third point is more fundamental, and may well make the other two points trivial. There is no Standard against which the scale can be calibrated. There is no theory that enables a definition of some point on the scale to be distinguished, against which the scale might be calibrated, along with other scales purporting to measure the same thing. The test scale floats freely in space, relating solely to its own assumptions with no Standard rope to bind it to the earth. What we have here is not a scientific instrument, but a very suspect ordinal scale pretending to derive from a scientific measurement.

Specific frame

In the 'pure' Specific frame, a person's 'ability' or 'performance' or 'attainment' is reduced to a finite number of specific behaviours, for each of which a 'standard' is clearly defined. Thus we are, in theory, able to specify exactly which 'objectives' have been achieved to the specified 'standard'. The notion of scale, Standard, and measuring instrument is (apparently) sidestepped by postulating a dichotomous variable, requiring not a scale, but rather an on-off switch, to categorise its measure. We shall come back to this in Chapter 11, where it is argued that all categorisations imply measurements.

However, in most areas of human endeavour such reductionism to specific behaviours results in trivialisation of the task. Further, specification of the 'standard', even in such a narrow and specific thing, is still very difficult in most cases, as the measurement instrument does not exist, and it is finally fallible human judgment which in practice must decide whether the standard has been achieved for each objective. Further, the basic assumption is erroneous; the variable being measured is continuous, not dichotomous, so the measurement error still exists, disguised though it might be. We are back again to the Responsive frame, requiring a subjective decision, which is covered up by pretending to be the Judge's frame, requiring an unambiguous omnipotent objective decision, which is in turn covered up by pretending to be an example of unambiguous standard in the Specific frame, derived from a definition of standard which pretends to be dichotomous and pretends to be nonarbitrary.

To further confuse the issue, what often now happens is that specific information about which particular objectives have been achieved is lost when measurement is reduced to counting, and the number of objectives achieved is the only information recorded. This creates the illusion of exactness and error-free information by disguising the fact that the exactness of the 'standards' of individual objectives is, in practice, illusory.
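The loss of information can be illustrated with a minimal sketch. The objective names and the two student records are hypothetical. Each record says which objectives were judged to have been achieved; the count is the only thing that survives the reduction.

```python
def objectives_achieved(record):
    """Reduce a record of individual objectives to a single count."""
    return sum(1 for achieved in record.values() if achieved)


# Hypothetical records for two students on the same four objectives.
student_one = {"spelling": True, "punctuation": True,
               "argument": False, "evidence": False}
student_two = {"spelling": False, "punctuation": False,
               "argument": True, "evidence": True}

print(objectives_achieved(student_one), objectives_achieved(student_two))
print("Same record?", student_one == student_two)
```

Both students score two out of four, although they have no achieved objective in common. The count also hides the fact that each True or False was itself a fallible human judgment.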

Responsive frame

In the Responsive frame the person's work, or inferences about the person's capacity or ability, are described but not measured. Further, these responses are ideally owned by the responder, and not projected onto the producer, or the producer's work. They may describe how the person's performance relates to certain criteria, how then the performance might be improved, and to what extent, in terms of such criteria, and in the opinion of the responder, success has been achieved.

The responder may also offer some opinion about whether the work of the person being assessed is of inferior or superior quality, or whether they are skilled enough to practise in a certain field of work. However, again, this does not purport to be a measurement of some clearly defined standard, but merely the informed view of a particular person who for some reason or another has views worthy of hearing. As Stake describes it:

People do not just disagree, they live in different realities. People live quietly and often proudly with their peculiar ways of seeing things. The evaluator errs in too noisily depicting the peculiarities as much as too quietly. . . . multiple views help legitimate resistance to bureaucratic standardization. (Stake, 1991, p 85).

But note how quickly Stake modifies the insight of his first sentence with the caution of the second. In whose interest is this emphasis on quietness? Why this concern to legitimate resistance rather than stridently call for reform? Who might hear strident voices, that quieter ones may not discern? And whose voices go unheard in the quest for quality, and the demand for categorisation?

And note also the very narrow gap between offering an opinion on whether the performance is adequate for some purpose, and categorising the student. We are here at the very edge of the Judge's frame of reference, a boundary crossed over as soon as the categorisation is made.

Summary

In this chapter we have looked at the invariances required in events involving measuring instruments if such events are to have credibility. In particular, we examined the notion of a Standard that theoretically defines the scale, and how it is not to be confused with a standard of acceptability, which is to be measured by the instrument, and which requires a scale in order to be located. We also noted the importance of the specification of boundary conditions and interference effects, and that the price of invariance and tight theory-practice links was artificiality.

The various assessment modes were then analysed in terms of their instrumental error. All were found to be invalid, on the grounds of not satisfying the conditions of adequate instrumentation.

 

