Part 1: Positioning
Chapter 1: Positioning the study: content and methodology
Chapter 2: Positioning the writer: experience
Chapter 3: Positioning the writer: philosophy and value
Chapter 1: Positioning the study - content and methodology
Summary of the study
The project grew out of a general critique of assessment theory and
practices, and in particular of the way in which the notion of error in
measurement is obfuscated.
The fundamental research question that informed this study is:
How is error in measurement of standards obscured in most practical
events involving assessment of persons?
The study that subsequently developed
- Clearly positions the writer in terms of the experience, philosophy
and values that he brings to this study.
- Develops some tools of analysis of the educational assessment process
that enable a more stringent critique of the nature and extent of error
in the measurement of standards.
- Establishes the centrality of the notion of the educational standard
to the categorisation, production and control of the individual in society.
- Shows how the professional literature on educational measurement is
based on the notion of error, and at the same time trivialises that notion.
- Re-examines some of the fundamental assumptions of educational assessment
generally and psychometrics in particular. Indicates some of their most
blatant self-contradictions and fudges.
- Reconceptualises the notion of invalidity, and positions the field
of educational categorisation here, from the perspective of the examined,
rather than with validity, which is an advocacy for the examiner.
- Applies some of this analysis to a study of competency standards in
general, and in particular University grades, and national literacy testing
as developed in the Australian context during the 1990s.
As can be seen, the initial research question has generated action
as well as understanding, a tool to repair the damage resulting from the
critique, and a way to reduce some of the violence it implies.
Relevant Literature
The relevant literature is extensive as well as intensive, as the Bibliography
shows. The extensiveness was necessary, as many of the misconceptions and
fudges and contradictions that characterise the field of educational assessment
have been caused by a myopia regarding knowledge outside the arbitrary
boundaries within which the field encloses itself.
Within the field of educational measurement the critical studies which
most overlap mine are: in the United Kingdom, Hartog & Rhodes (1936),
Cox (1965); in the United States, Hoffman (1964), Nairn (1980), Airasian
(1979), and Glass (1978); in Australia, Rechter & Wilson (1968).
The Hartog & Rhodes study clearly showed the enormous instability
of the measurement of standards in Public Examinations in England. The
sneakiness of some of the research techniques in no way detracts from the
dramatic incisiveness of the data. Cox did a similar job and ended up with
a similar horror story on measurements of University grades. Hoffman directed
his critical attention to the detail of multiple choice testing. Nairn's
critique of the work of Educational Testing Service, and in particular
the part it plays in College Entrance, is devastating in its implications.
Airasian's book is a comprehensive critique of competency testing. Glass
attacks the measurement of standards at its most vulnerable point; there
are no standards, or at least none that psychometrics can produce. And
Rechter & Wilson's study indicates the confusion about how to reduce
error that accompanies public examining in Australia.
On the other hand, most of the literature on reliability and validity
is pertinent to this study, because, when its discourse is repositioned
from examiner to examined, it provides more than enough invalidity information
to self-destruct.
Most studies of error in the measurement of standards are however much
more specific in their focus than is mine. Their minimal effect on practice
has perhaps partially been due to the fact that their critiques were in
terms of their own discipline of educational measurement; a discipline
that owes its very existence to the claim to accurate judgments. In terms
of general style and scope this study is perhaps closer to the work of
Pirsig (1975; 1991), who delved, articulately if deviously, much more deeply
into the notion of quality.
Within the field of power relations and the construction of the individual
the studies most similar are those published in Foucault and Education
(Ball, 1990), in particular those that take off from Foucault's placement
of the examination as a central apparatus of power/knowledge.
This study is significant in that it brings these two diverse fields
of educational assessment, and the power relations that pervade education,
into much closer contact, to expose their interrelations, and allow the
critique to cross-fertilise.
Importance of the study
The initial question addressed is how the whole matter of error in
measurement of standards is obscured in most practical events involving
assessment and measurement.
This is directly related to the centrality of the notion of the educational
standard to the categorisation, production and control of the individual
in society. For if the notion of the standard is crucial to the maintenance
of power relations, and its empirical realisation is prone to enormous
error, then the whole apparatus of power/knowledge that depends on it is
in jeopardy.
I argue in Chapters 4 and 5 that the examination normalises and individualises,
and is impotent without the notion of the measured standard, the sword
that divides, the wedge that produces the gaps; and how important it is
that these measures of standards be seen as accurate if current societal
structures are to be maintained.
One view of immorality is that it is behaviour that destabilises a
social system. So if playing the game is inevitable, is questioning the
rules not so much dangerous as despicable, immoral to the point of being
unthinkable? Is this the reason for the great silence about the enormous
errors in any measure of standards? Does this account for the erasure from
public consciousness and discourse of the obvious fact that educational
standards as a thin accurate line have no empirical existence, and attempts
to measure in relation to that line no instrumental reality?
In Chapters 6 to 17 thirteen sources of invalidity that contribute
to the error and confusion of all categorisations of individual persons
are detailed and elucidated, indicating how this silence in professional
and public consciousness might be filled with a deafening noise.
In Chapters 18 and 19 of this study I apply some of the analytic tools
developed to the contemporary scene in Australia, and demonstrate how the
noise may be turned into a coherent critique of practice. In 1997 competency
standards, as a form of assessment, have become, and are becoming, the
major credentialing instrument for both educational and vocational courses
and jobs. In addition, they are now the basis for job descriptions. In
defining what training is required for a job, what prerequisites are required
to attempt a job, what the job is, and how performance on the job is to
be assessed, the cycle of fantasy created by this controlled semantic reductionism
is complete; the material world of education and employment has become
textualised in terms of competencies (Collins, 1993; Cairns, 1992). The
fragility of this theorising is exposed when examined in terms of the reconstructed
notion of invalidity developed in this study.
In Universities students are still categorised in terms of grades loosely
defined. What do they mean? How error prone are they? And in the schools
all Australian states have agreed to introduce tests of literacy. Certainly
they will introduce tests. But what will they measure? And with what accuracy?
Again the reconstructed notion of invalidity is used to critically evaluate
such questions.
Methodology and the critique of practice
The study roves beyond the artificial constraints of psychometric theory
and test practice; into ontology, epistemology and the metaphysics of quality;
into the nature of instrumentation; into the relations between equity and
assessment frames of reference; into the fundamental notion of comparability;
into the detail of the relation between rank orders, standards and categorisations;
and into the minefield of the psychometric fudge.
Is there method in this diverse madness? Where is the methodology that
informs this wild profusion? The study aims to expose the madness that
underlies much of the current method. So what is a methodology that undermines
methodologies?
One such method is critical analysis, the analysis of the educational
discourse that comprises the field of assessment. The policies and practices
of educational assessment become fused in the discourse in which they are
embedded (Ball, 1994).
Discourses are about what can be said, and thought, but also about who
can speak, when, where and with what authority. Discourses embody the meaning
and use of propositions and words. Thus, certain possibilities for thought
are constructed . . . We do not speak a discourse, it speaks us. We are
the subjectivities, the voices, the knowledge, the power relations that
a discourse constructs and allows (p22).
Analysis of such discourses may not be used to determine the truth.
Yet such analyses may be very sensitive to the uncovering of untruths,
by determining the extent to which they embody "incoherencies, distortions,
structured omissions and negations which in turn expose the inability of
the language of ideology to produce coherent meaning" (Codd, 1988,
p245).
How would such untruths be established?
- First, by uncovering self-contradictions, within the overt discourse,
or between the unstated assumptions of the discourse and the facts that
the discourse establishes.
- Second, by exposing false claims, claims that may be shown with empirical
evidence constructed within its own frame of reference to be untrue.
- Third, by detailing some of the psychometric fudges on which many assessment
claims depend to maintain their established meaning.
- Fourth, by indicating how repositioning the discourse may dramatically
change its truth value.
- Fifth, by establishing four discrete epistemological frames of reference
for assessment discourse as currently constructed, and indicating the confusion
when one frame is viewed from the perspectives of the others.
- Sixth, by noticing frame shifts within a particular discourse, with
the resulting confusion of meaning.
- Seventh, by exposing the ontological slides and epistemological camouflages
necessary to sustain many truth claims.
So in this study I will substantiate the contention that some of the
explicit and implicit "truths" embedded in assessment practices
are falsifiable; that empirical data constructed from their own assumptions
denies the accuracy they assume; that this data is not only adequately
detailed in the literature, but further, that the notion of error is the
epistemological basis of much of that literature. All of which makes the
public silence about the presence of error even more puzzling.
I shall show that the epistemological and ontological grounds for the
whole field of assessment of individual persons are enormously shaky. I
shall also explain how the literature about the very notion of validity
is founded on a biased position, so that the sources of invalidity are
much deeper and wider than is admitted in practice, even though clearly
implied in theory and its attendant discourse.
I shall indicate the complexity of the notion of invalidity, with its
practical face of error. Error includes all those differences in rank ordering
and placement in different assessments at different times by different
experts; all the confusions and varieties of meaning attached to the "construct"
being assessed; and all those variabilities arising out of logical type
errors, issues of context, faulty labelling, and problems associated with
prediction. To further complicate the matter, error has a different meaning
depending on the assessment frame of reference. And I will show that the
extent of the confusion along many of these dimensions may be easily
estimated.
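As a rough illustration of how such disagreement might be counted, the following sketch takes marks given by two hypothetical markers to the same six scripts. It counts how many pairs of scripts the markers place in opposite order, and which scripts change category at an assumed pass mark of 50. The marks, the script labels, the pass mark and the function names are all invented for illustration; they are not data or methods from this study.

```python
from itertools import combinations

# Hypothetical marks for the same six scripts from two markers (invented numbers).
marker_a = {"s1": 72, "s2": 65, "s3": 58, "s4": 51, "s5": 49, "s6": 40}
marker_b = {"s1": 64, "s2": 70, "s3": 47, "s4": 55, "s5": 52, "s6": 43}
PASS_MARK = 50  # assumed cut-off, chosen only for the example


def discordant_pairs(a, b):
    """Count pairs of scripts that the two markers place in opposite order."""
    count = 0
    for x, y in combinations(sorted(a), 2):
        # A negative product means one marker prefers x and the other prefers y.
        if (a[x] - a[y]) * (b[x] - b[y]) < 0:
            count += 1
    return count


def category_changes(a, b, cutoff):
    """List scripts that pass under one marker and fail under the other."""
    return [s for s in sorted(a) if (a[s] >= cutoff) != (b[s] >= cutoff)]


print(discordant_pairs(marker_a, marker_b))
print(category_changes(marker_a, marker_b, PASS_MARK))
```

With these invented marks, three of the fifteen pairs of scripts are ordered differently by the two markers, and two of the six scripts (s3 and s5) change category at the assumed pass mark.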
This is a critical study. Foucault (1988) says:
There is always a little thought even in the most stupid institutions;
there is always thought even in silent habits. Criticism is a matter of
flushing out that thought and trying to change it: to show that things
are not as self-evident as one believed, to see that what is accepted as
self-evident will be no longer accepted as such. Practising criticism is
a matter of making facile gestures difficult (p155).
Using Foucault's terminology, this is a critical study designed to
make facile assessment gestures about standards difficult.
Methodology and inquiry systems
After a twenty-three page discussion on data and analysis relevant
to construct validation, which to him means all validation, Messick (1989)
concludes:
. . . test validation in essence is scientific inquiry into score meaning
- nothing more, but also nothing less. All of the existing techniques of
scientific inquiry, as well as those newly emerging, are fair game for
developing convergent and discriminant arguments to buttress the construct
interpretation of test scores (p56).
I would broaden this to refer to any categorisation produced by transforming
a continuity into a dichotomy. And for now I want to leave aside the obvious
bias in the word "buttress," and focus here on inquiry systems
themselves. For Messick (1989), conservative as he is, accepts that
- because observations and meanings are differentially theory-laden and
theories are differentially value-laden, appeals to multiple perspectives
on meaning and values are needed to illuminate latent assumptions and action
implications in the measurement of constructs (p32).
Churchman (1971) elucidates five such scientific inquiry systems of
differential values and epistemology, roughly related to philosophies espoused
by Leibniz, Locke, Kant, Hegel and Singer. Mitroff (1973) has developed
and summarised Churchman's systems. Very briefly, the Leibnizian inquiry
mode begins with undefined ideas and rules of operation, ending with models
that count as explanations. The Lockean mode begins with undefined experiential
elements, and uses consensual agreement to establish facts. The Kantian
system shows the interdependence of the Leibnizian and Lockean modes, and
uses somewhat complementary Leibnizian models to interrogate the same Lockean
data bank, to ultimately arrive at the best model. The Hegelian mode uses
antithetical models to explain the same data, leaving it for the decision
maker to create the most appropriate synthesis for a particular purpose.
In this mode values of enquirer and decision maker become exposed. Finally,
the inquiry system of Singer (1959) is one of multiple epistemological
observation, where each inquiring system is observed from the assumptions
of the others, and each methodology is processed by those of the others.
Churchman (1971) paraphrases Singer clearly and cleanly: "the reality
of an observing mind depends on it being observed, just as the reality
of any aspect of the world depends upon observation" (p146).
How do these inquiry systems link to the seven ways of demonstrating
untruths, or nonsense, detailed in the previous section? It is the Singerian
inquiry mode that best characterises this study as a whole. Although particular
modes have been utilised for particular critical purposes, this is in itself
justified by the Singerian inquiry mode.
So whilst the first three methods listed are clearly in the Leibnizian
and Lockean modes, the other four involve the explication of shifting sets
of assumptions, and belong to the Singerian mode. In particular the examination
of compatibilities between the four frames of reference for assessment
on the one hand, and equity definitions, power relations, instrumentation
requirements, and notions of comparability and quality on the other, demonstrates
clearly that to the Singerian enquirer, "information is no longer
merely scientific or technical, but also ethical as well" (Mitroff,
1973, p125).
The "conversation pieces" and "stories" used to
demonstrate the absurdity of some assessment claims belong to the Hegelian
mode. Churchman (1971) explains:
The Hegelian inquirer is a storyteller, and Hegel's thesis is that the
best inquiry is the inquiry that produces stories. The underlying life
of a story is its drama, not its "accuracy". Drama has the logical
characteristics of a flow of events in which each subsequent event partially
contradicts what went before; there is nothing duller than a thoroughly
consistent story. Drama is the interplay of the tragic and the comic; its
blood is conviction, and its blood pressure is antagonism. It prohibits
sterile classification. It is above all implicit; it uses the explicit
only to emphasise the implicit (p 178).
Strategy of deterrence
The general strategy used to make the case for the invalidity of most
current assessment practice is borrowed from military policies of nuclear
deterrence. It is a strategy of overkill. Of the thirteen sources of invalidity
developed in this study, any one would, if fully applied to current assessment
practices, take them out, neutralise them, render them inoperable. To nullify
this attack on validity of tests, examinations and categorisations generally,
it is necessary to destroy not one missile, but all of them.
Methodology and structure of the study
The study has been presented in seven parts: Positioning, Context,
Tools of Analysis, Error Analysed, Synthesis, Application, and a Concluding
Statement.
Part 1 - Positioning: All descriptions of events, all writing, is
positioned; makes certain assumptions, is viewed from a particular perspective.
Part one positions the study in terms of focus and method, and the writer
in terms of experience and philosophy.
In this opening chapter I position the work in terms of its general
content and methodology, and show how it all fits together. So Chapter
1 briefly summarises what the study is about, what literature is most similar
in both content and style, what is the importance of the study and its
possible impact, and in this section how it is structured.
In Chapter 2 I show how the study is positioned in terms of some of
the learnings accrued from the professional and life experiences of the
author.
In Chapter 3 I indicate how the study is positioned in terms of philosophy
and value, and how that relates to some contemporary literature.
Part 2 - Context: Assessment involves events that occur in, and are
given meanings in, a social context. In Part 2 I elucidate some aspects
of that context.
In Chapter 4 I focus on the way power relations both violate and produce
those who act out their lives within their influence. In particular the
centrality of the examination is exposed in the production of the modern
individual, defined as an object positioned, classified and articulated
along a limited set of linear dimensions.
In Chapter 5 the argument in Chapter 4 is applied and developed in
terms of educational assessment. In particular I examine the crucial part
that the standard plays in the whole mechanism of defining cut-offs for
abnormality and non-acceptance, and how important it is that these standards
be seen as accurate if current societal structures are to be maintained.
In Chapter 6 I focus on the cultural meanings that attach themselves
to the notion of the standard, and assign the idea of the human standard
to the mythological sphere, a place apart from critical thought. I examine
the emotional intensity of discourse about the standard, its significance
as an article of faith, and how this is related to the maintenance of control
and good order.
Part 3 - Tools of analysis: In Part 3 some tools for looking at specific
assessment events are developed. In Chapters 7 to 12 I examine four different
epistemological frames of reference for assessment, and relate these to
notions of equity, to hierarchical structures, instrumentation, comparability,
rank orders and standards, logical types, and quality. These chapters introduce
some independent, fundamental, and rarely discussed aspects of underlying
assumptions involved in events culminating in the assessment of students.
Inadequacies in any one of these aspects would, in a rational world, be
enough to destroy the credibility of most student assessments. I will contend
that all practical assessments of people contain major inadequacies in
most of them.
In Chapter 7 four different frames of reference are defined; four different
and largely incompatible sets of assumptions that underlie educational
assessment processes as currently practised: First is the Judges frame,
recognised by its assumption of absolute truth, its hierarchical incorporation
of infallibility; second is the General frame, embedded in the notion of
error, and dedicated to the pursuit of the true score; third is the Specific
frame, which assumes that all educational outcomes can be described in
terms of specific overt behaviours with identifiable conditions of adequacy;
fourth is the Responsive frame, in which the essential subjectivity of
all assessment processes is recognised, as is their relatedness to context.
Because of their contradictory assumptions, slides between frames result
in confusion and compound invalidity.
Chapter 8 shows how certain assessment frames are inherently contradictory
to certain definitions of equity, themselves contradictory to each other
and to the power structures in which they are enmeshed. As such, those
assessment frames and notions of equity that contradict the enveloping
hierarchical structure will be seen, accurately and probably unconsciously,
as potentially destabilising, and will consequently be ignored, nullified,
or corrupted into acceptability.
Chapter 9 looks at Instrumentation. In this chapter we look at the
conditions and invariances required in events involving measuring instruments
if such events are to have credibility; in particular the notion of a Standard
that theoretically defines the scale, and its confusion with a standard
of acceptability, which is to be measured by the instrument, and which
requires a scale in order to be located.
The various assessment modes are analysed in terms of their instrumental
error. On these grounds alone all are found to be invalid.
Chapter 10 takes up the issue of comparability. What can be compared?
Fundamental distinctions between more and less, better and worse are examined,
their relations with uni- and multi-dimensionality shown, and the implications
for rank ordering of students in tests and examinations unearthed. This
leads to further examination of the differential privileging of sub groups
and individuals when marks are added. The essential meaninglessness of
such additions becomes apparent.
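One way to see why an order produced by adding marks can be arbitrary is the following minimal sketch. It uses invented marks for two students in two subjects and shows that the rank order of the totals depends on how the subjects are weighted before they are added. The student names, marks, weights and the function name are hypothetical, chosen only for the example; they do not reproduce the analysis of Chapter 10.

```python
# Hypothetical marks for two students in two subjects (invented numbers).
marks = {
    "student_x": {"maths": 90, "english": 55},
    "student_y": {"maths": 60, "english": 70},
}


def rank_by_total(marks, weights):
    """Order students by the weighted sum of their marks, highest first."""
    totals = {
        name: sum(weights[subject] * score for subject, score in subjects.items())
        for name, subjects in marks.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Equal weights: x totals 145 and y totals 130, so x is placed first.
raw_order = rank_by_total(marks, {"maths": 1, "english": 1})

# Another assumed weighting: x totals 127.5 and y totals 135, so y is placed first.
rescaled_order = rank_by_total(marks, {"maths": 0.5, "english": 1.5})

print(raw_order)
print(rescaled_order)
```

Neither weighting is obviously the correct one, yet each produces a different student at the top of the order, which is the sense in which the sum privileges some individuals over others.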
In Chapter 11 the relationship between rank order and standard is teased
out in more detail: In particular the meanings given to the standard in
the Judge and General frames of reference; how logical confusions proliferate
when discourse jumps from one frame to the other; and how all categorisations
involve standards and rank ordering, even though many advocates of "qualitative"
assessment methods may want to deny this.
Chapter 12 leads from the implications of the Theory of Logical Types
for assessment practices to an examination of the distinction between standard
and quality. When the standard is seen, realistically, as unable to perform
its function, quality is the notion with sufficient mythical, ideological,
and intellectual status to replace it. This would produce a very different
learning milieu.
Part 4 - Error analysed: In Part 4 the tools developed in Part 3 are
used to discriminate particular sources of confusion and error within assessment
events designed to categorise students.
In Chapter 13 the meaning of error in each frame of reference for interpreting
assessments is considered. As the meaning of error changes with assessment
mode, so do the methods designed to reduce such error. Procedures to reduce
error in one frame are seen to increase it in another. From a perspective
of oversight of the whole assessment field, this is another source of confusion
and invalidity, particularly as it is rare for any practical assessment
event to remain consistently within one frame of reference.
Chapter 14 addresses the question: What does a test measure? In terms
of social consequences the answer is clear. It measures what the person
with the power to pay for the test says it measures. And the person who
sets the test will name the test what the person who pays for the test
wants the test to be named. The person who does the test has already accepted
the name of the test and the measure that the test makes by the very act
of doing the test. So the mark becomes part of that person's story and
with sufficient repetitions becomes true.
My own conclusion is that tests have so many independent sources of
invalidity that they do not measure anything in particular, nor do they
place people in any particular order of anything. But they do place them
in an order, along a single line of "merit," and that is all they are required
to do.
Chapter 15 shows some of the ways in which psychometricians fudge;
by reducing criteria to those that can be tested; by prejudging validity
by prior labelling; by appropriating definitions to statistical models;
and by hiding error in individual marks and grades by displaced statistical
data, and implying that estimates are true scores. A number of specific
examples of fudging are detailed.
In Chapter 16 some of the more recent work on validity is discussed,
and its positioning as advocacy demonstrated. I conclude that in practice
the very existence of validity is established, validity is indeed made
manifest, through the denseness of the arguments about invalidity criteria
used to refute such existence, together with the reassurance that the battle
continues, and some gains have been made.
Reliability is also discussed as a problematic, rather than as an obvious
prerequisite to validity. I conclude that most of the mechanisms designed
to increase reliability necessarily decrease validity.
Part 5 - Synthesis: In Chapter 17 the notion of invalidity is reconceptualised,
having both discursive and measurable components. Thirteen (overlapping)
sources of error are examined, all contributing to the essential invalidity
of categorisations of persons.
Part 6 - Application: In Chapter 18 I apply the philosophical and conceptual
positioning, tools of analysis, and the reconceptualised sources of error
developed in this thesis to the competency based assessment policies and
practices of Australia in the 1990s. I show how the notion of competency
standards is overtly central to the whole competency movement, the introduction
of which is shown to be overtly politically motivated. Thus the crucial
links between political power and educational standards that are argued
for in Chapters 4 and 5 become transparent. I then go on to examine the
invalidity of competency standards in the light of the thirteen sources
of error specified in the previous chapter.
Chapter 19 presents two specific applications of invalidity sources;
the first relates to national literacy testing, and the second to University
grades.
Impact
Assessment practice is permeated with mythology and ideology; with
confusions and contradictions; with epistemological and ontological slides;
with misrepresentations of frames of reference for different assessment
modes; with logical type errors and psychometric fudging, in which the
constructs that determine error--labelling, construction, stability, generality,
prediction--are either ignored or severely constrained in the determination
and communication of error, in those rare cases where personal error and
likely miscategorisation is publicly admitted.
I have no expectations for this study, but some hopes. A whistle blowing
study is like a joke--its impact is a function of timing. And the best
timing can only be determined in retrospect. My hope is that it will lead
to a reduction of the violence that is attributable to the suppression
of error in the categorisation of people.