Part 1: Positioning

    Chapter 1: Positioning the study: content and methodology

    Chapter 2: Positioning the writer: experience

    Chapter 3: Positioning the writer: philosophy and value

 

Chapter 1: Positioning the study - content and methodology

Summary of the study

The project grew out of a general critique of assessment theory and practices, and in particular of the way in which the notion of error in measurement is obfuscated.
The fundamental research question that informed this study is:

    How is error in measurement of standards obscured in most practical events involving assessment of persons?

The study that subsequently developed:
  • Clearly positions the writer in terms of the experience, philosophy and values that he brings to this study.
  • Develops some tools of analysis of the educational assessment process that enable a more stringent critique of the nature and extent of error in the measurement of standards.
  • Establishes the centrality of the notion of the educational standard to the categorisation, production and control of the individual in society.
  • Shows how the professional literature on educational measurement is based on the notion of error, and at the same time trivialises that notion.
  • Re-examines some of the fundamental assumptions of educational assessment generally and psychometrics in particular. Indicates some of their most blatant self-contradictions and fudges.
  • Reconceptualises the notion of invalidity, and positions the field of educational categorisation here, from the perspective of the examined, rather than with validity, which is an advocacy for the examiner.
  • Applies some of this analysis to a study of competency standards in general, and in particular University grades, and national literacy testing as developed in the Australian context during the 1990s.
As can be seen, the initial research question has generated action as well as understanding, a tool to repair the damage resulting from the critique, and a way to reduce some of the violence it implies.

Relevant Literature

The relevant literature is extensive as well as intensive, as the Bibliography shows. The extensiveness was necessary, as many of the misconceptions and fudges and contradictions that characterise the field of educational assessment have been caused by a myopia regarding knowledge outside the arbitrary boundaries within which the field encloses itself.
Within the field of educational measurement the critical studies which most overlap mine are: in the United Kingdom, Hartog & Rhodes (1936), Cox (1965); in the United States, Hoffman (1964), Nairn (1980), Airasian (1979), and Glass (1978); in Australia, Rechter & Wilson (1968).
The Hartog & Rhodes study clearly showed the enormous instability of the measurement of standards in Public Examinations in England. The sneakiness of some of the research techniques in no way detracts from the dramatic incisiveness of the data. Cox did a similar job and ended up with a similar horror story on measurements of University grades. Hoffman directed his critical attention to the detail of multiple choice testing. Nairn's critique of the work of Educational Testing Service, and in particular the part it plays in College Entrance, is devastating in its implications. Airasian's book is a comprehensive critique of competency testing. Glass attacks the measurement of standards at its most vulnerable point; there are no standards, or at least none that psychometrics can produce. And Rechter & Wilson's study indicates the confusion about how to reduce error that accompanies public examining in Australia.
On the other hand, most of the literature on reliability and validity is pertinent to this study, because, when its discourse is repositioned from examiner to examined, it provides more than enough invalidity information to self destruct.
Most studies of error in the measurement of standards are, however, much more specific in their focus than is mine. Their minimal effect on practice has perhaps partially been due to the fact that their critiques were in terms of their own discipline of educational measurement; a discipline that owes its very existence to the claim to accurate judgments. In terms of general style and scope this study is perhaps closer to the work of Pirsig (1975; 1991), who delved, articulately if deviously, much more deeply into the notion of quality.
Within the field of power relations and the construction of the individual the studies most similar are those published in Foucault and Education (Ball, 1990), in particular those that take off from Foucault's placement of the examination as a central apparatus of power/knowledge.
This study is significant in that it brings these two diverse fields of educational assessment, and the power relations that pervade education, into much closer contact, to expose their interrelations, and allow the critique to cross fertilise.

Importance of the study

The initial question addressed is how the whole matter of error in measurement of standards is obscured in most practical events involving assessment and measurement.
This is directly related to the centrality of the notion of the educational standard to the categorisation, production and control of the individual in society. For if the notion of the standard is crucial to the maintenance of power relations, and its empirical realisation is prone to enormous error, then the whole apparatus of power/knowledge that depends on it is in jeopardy.
I argue in Chapters 4 and 5 that the examination normalises and individualises, and is impotent without the notion of the measured standard, the sword that divides, the wedge that produces the gaps. I also show how important it is that these measures of standards be seen as accurate if current societal structures are to be maintained.
One view of immorality is that it is behaviour that destabilises a social system. So if playing the game is inevitable, is questioning the rules not so much dangerous as despicable, immoral to the point of being unthinkable? Is this the reason for the great silence about the enormous errors in any measure of standards? Does this account for the erasure from public consciousness and discourse of the obvious fact that educational standards as a thin accurate line have no empirical existence, and attempts to measure in relation to that line no instrumental reality?
In Chapters 6 to 17 thirteen sources of invalidity that contribute to the error and confusion of all categorisations of individual persons are detailed and elucidated, indicating how this silence in professional and public consciousness might be filled with a deafening noise.
In Chapters 18 and 19 of this study I apply some of the analytic tools developed to the contemporary scene in Australia, and demonstrate how the noise may be turned into a coherent critique of practice. In 1997 competency standards, as a form of assessment, have become, and are becoming, the major credentialing instrument for both educational and vocational courses and jobs. In addition, they are now the basis for job descriptions. In defining what training is required for a job, what prerequisites are required to attempt a job, what the job is, and how performance on the job is to be assessed, the cycle of fantasy created by this controlled semantic reductionism is complete; the material world of education and employment has become textualised in terms of competencies (Collins, 1993; Cairns, 1992). The fragility of this theorising is exposed when examined in terms of the reconstructed notion of invalidity developed in this study.
In Universities students are still categorised in terms of grades loosely defined. What do they mean? How error prone are they? And in the schools all Australian states have agreed to introduce tests of literacy. Certainly they will introduce tests. But what will they measure? And with what accuracy? Again the reconstructed notion of invalidity is used to critically evaluate such questions.

Methodology and the critique of practice

The study roves beyond the artificial constraints of psychometric theory and test practice; into ontology, epistemology and the metaphysics of quality; into the nature of instrumentation; into the relations between equity and assessment frames of reference; into the fundamental notion of comparability; into the detail of the relation between rank orders, standards and categorisations; and into the minefield of the psychometric fudge.
Is there method in this diverse madness? Where is the methodology that informs this wild profusion? The study aims to expose the madness that underlies much of the current method. So what is a methodology that undermines methodologies?
One such method is critical analysis, the analysis of the educational discourse that comprises the field of assessment. The policies and practices of educational assessment become fused in the discourse in which they are embedded (Ball, 1994).

Discourses are about what can be said, and thought, but also about who can speak, when, where and with what authority. Discourses embody the meaning and use of propositions and words. Thus, certain possibilities for thought are constructed . . . We do not speak a discourse, it speaks us. We are the subjectivities, the voices, the knowledge, the power relations that a discourse constructs and allows (p22).

Analysis of such discourses may not be used to determine the truth. Yet such analyses may be very sensitive to the uncovering of untruths, by determining the extent to which they embody "incoherencies, distortions, structured omissions and negations which in turn expose the inability of the language of ideology to produce coherent meaning" (Codd, 1988, p245).

How would such untruths be established?

  • First, by uncovering self-contradictions, within the overt discourse, or between the unstated assumptions of the discourse and the facts that the discourse establishes.
  • Second, by exposing false claims, claims that may be shown with empirical evidence constructed within its own frame of reference to be untrue.
  • Third, by detailing some of the psychometric fudges on which many assessment claims depend to maintain their established meaning.
  • Fourth, by indicating how repositioning the discourse may dramatically change its truth value.
  • Fifth, by establishing four discrete epistemological frames of reference for assessment discourse as currently constructed, and indicating the confusion when one frame is viewed from the perspectives of the others.
  • Sixth, by noticing frame shifts within a particular discourse, with the resulting confusion of meaning.
  • Seventh, by exposing the ontological slides and epistemological camouflages necessary to sustain many truth claims.
So in this study I will substantiate the contention that some of the explicit and implicit "truths" embedded in assessment practices are falsifiable; that empirical data constructed from their own assumptions denies the accuracy they assume; that this data is not only adequately detailed in the literature, but further, that the notion of error is the epistemological basis of much of that literature. All of which makes the public silence about the presence of error even more puzzling.
I shall show that the epistemological and ontological grounds for the whole field of assessment of individual persons are enormously shaky. I shall also explain how the literature about the very notion of validity is founded on a biased position, so that the sources of invalidity are much deeper and wider than is admitted in practice, even though clearly implied in theory and its attendant discourse.
I shall indicate the complexity of the notion of invalidity, with its practical face of error. Error includes all those differences in rank ordering and placement in different assessments at different times by different experts; all the confusions and varieties of meaning attached to the "construct" being assessed; and all those variabilities arising out of logical type errors, issues of context, faulty labelling, and problems associated with prediction. To further complicate the matter, error has a different meaning depending on the assessment frame of reference. And I will show that the extent of the confusion along many of these dimensions may be easily estimated.
This is a critical study. Foucault (1988) says:

There is always a little thought even in the most stupid institutions; there is always thought even in silent habits. Criticism is a matter of flushing out that thought and trying to change it: to show that things are not as self-evident as one believed, to see that what is accepted as self-evident will be no longer accepted as such. Practising criticism is a matter of making facile gestures difficult (p155).

Using Foucault's terminology, this is a critical study designed to make facile assessment gestures about standards difficult.

Methodology and inquiry systems

After a twenty-three page discussion of data and analysis relevant to construct validation, which to him means all validation, Messick (1989) concludes:

. . . test validation in essence is scientific inquiry into score meaning - nothing more, but also nothing less. All of the existing techniques of scientific inquiry, as well as those newly emerging, are fair game for developing convergent and discriminant arguments to buttress the construct interpretation of test scores (p56).

I would broaden this to refer to any categorisation produced by transforming a continuity into a dichotomy. And for now I want to leave aside the obvious bias in the word "buttress," and focus here on inquiry systems themselves. For Messick (1989), conservative as he is, accepts that
    because observations and meanings are differentially theory-laden and theories are differentially value-laden, appeals to multiple perspectives on meaning and values are needed to illuminate latent assumptions and action implications in the measurement of constructs (p32).
Churchman (1971) elucidates five such scientific inquiry systems of differential values and epistemology, roughly related to philosophies espoused by Leibniz, Locke, Kant, Hegel and Singer. Mitroff (1973) has developed and summarised Churchman's systems. Very briefly, the Leibnizian inquiry mode begins with undefined ideas and rules of operation, ending with models that count as explanations. The Lockean mode begins with undefined experiential elements, and uses consensual agreement to establish facts. The Kantian system shows the interdependence of the Leibnizian and Lockean modes, and uses somewhat complementary Leibnizian models to interrogate the same Lockean data bank, to ultimately arrive at the best model. The Hegelian mode uses antithetical models to explain the same data, leaving it for the decision maker to create the most appropriate synthesis for a particular purpose. In this mode values of enquirer and decision maker become exposed. Finally, the inquiry system of Singer (1959) is one of multiple epistemological observation, where each inquiring system is observed from the assumptions of the others, and each methodology is processed by those of the others. Churchman (1971) paraphrases Singer clearly and cleanly: "the reality of an observing mind depends on it being observed, just as the reality of any aspect of the world depends upon observation" (p146).
How do these inquiry systems link to the seven ways of demonstrating untruths, or nonsense, detailed in the previous section? It is the Singerian inquiry mode that best characterises this study as a whole. Although particular modes have been utilised for particular critical purposes, this is in itself justified by the Singerian inquiry mode.
So whilst the first three methods listed are clearly in the Leibnizian and Lockean modes, the other four involve the explication of shifting sets of assumptions, and belong to the Singerian mode. In particular the examination of compatibilities between the four frames of reference for assessment on the one hand, and equity definitions, power relations, instrumentation requirements, and notions of comparability and quality on the other, demonstrates clearly that to the Singerian enquirer, "information is no longer merely scientific or technical, but also ethical as well" (Mitroff, 1973, p125).
The "conversation pieces" and "stories" used to demonstrate the absurdity of some assessment claims belong to the Hegelian mode. Churchman (1971) explains:

The Hegelian inquirer is a storyteller, and Hegel's thesis is that the best inquiry is the inquiry that produces stories. The underlying life of a story is its drama, not its "accuracy". Drama has the logical characteristics of a flow of events in which each subsequent event partially contradicts what went before; there is nothing duller than a thoroughly consistent story. Drama is the interplay of the tragic and the comic; its blood is conviction, and its blood pressure is antagonism. It prohibits sterile classification. It is above all implicit; it uses the explicit only to emphasise the implicit (p 178).

Strategy of deterrence

The general strategy used to make the case for the invalidity of most current assessment practice is borrowed from military policies of nuclear deterrence. It is a strategy of overkill. Of the thirteen sources of invalidity developed in this study, any one would, if fully applied to current assessment practices, take them out, neutralise them, render them inoperable. To nullify this attack on validity of tests, examinations and categorisations generally, it is necessary to destroy not one missile, but all of them.

Methodology and structure of the study

The study has been presented in seven parts: Positioning, Context, Tools of Analysis, Error Analysed, Synthesis, Application, and a Concluding Statement.
Part 1 - Positioning: All descriptions of events, all writing, is positioned; makes certain assumptions, is viewed from a particular perspective. Part 1 positions the study in terms of focus and method, and the writer in terms of experience and philosophy.
In this opening chapter I position the work in terms of its general content and methodology, and show how it all fits together. So Chapter 1 briefly summarises what the study is about, what literature is most similar in both content and style, what is the importance of the study and its possible impact, and in this section how it is structured.
In Chapter 2 I show how the study is positioned in terms of some of the learnings accrued from the professional and life experiences of the author.
In Chapter 3 I indicate how the study is positioned in terms of philosophy and value, and how that relates to some contemporary literature.
Part 2 - Context: Assessment involves events that occur in, and are given meanings in, a social context. In Part 2 I elucidate some aspects of that context.
In Chapter 4 I focus on the way power relations both violate and produce those who act out their lives within their influence. In particular the centrality of the examination is exposed in the production of the modern individual, defined as an object positioned, classified and articulated along a limited set of linear dimensions.
In Chapter 5 the argument in Chapter 4 is applied and developed in terms of educational assessment. In particular I examine the crucial part that the standard plays in the whole mechanism of defining cut-offs for abnormality and non-acceptance, and how important it is that these standards be seen as accurate if current societal structures are to be maintained.
In Chapter 6 I focus on the cultural meanings that attach themselves to the notion of the standard, and assign the idea of the human standard to the mythological sphere, a place apart from critical thought. I examine the emotional intensity of discourse about the standard, its significance as an article of faith, and how this is related to the maintenance of control and good order.
Part 3 - Tools of analysis: In Part 3 some tools for looking at specific assessment events are developed. In Chapters 7 to 12 I examine four different epistemological frames of reference for assessment, and relate these to notions of equity, to hierarchical structures, instrumentation, comparability, rank orders and standards, logical types, and quality. These chapters introduce some independent, fundamental, and rarely discussed aspects of underlying assumptions involved in events culminating in the assessment of students. Inadequacies in any one of these aspects would, in a rational world, be enough to destroy the credibility of most student assessments. I will contend that all practical assessments of people contain major inadequacies in most of them.
In Chapter 7 four different frames of reference are defined; four different and largely incompatible sets of assumptions that underlie educational assessment processes as currently practised: First is the Judges frame, recognised by its assumption of absolute truth, its hierarchical incorporation of infallibility; second is the General frame, embedded in the notion of error, and dedicated to the pursuit of the true score; third is the Specific frame, which assumes that all educational outcomes can be described in terms of specific overt behaviours with identifiable conditions of adequacy; fourth is the Responsive frame, in which the essential subjectivity of all assessment processes is recognised, as is their relatedness to context.
Because of their contradictory assumptions, slides between frames result in confusion and compound invalidity.
Chapter 8 shows how certain assessment frames are inherently contradictory to certain definitions of equity, themselves contradictory to each other and to the power structures in which they are enmeshed. As such, those assessment frames and notions of equity that contradict the enveloping hierarchical structure will be seen, accurately and probably unconsciously, as potentially destabilising, and will consequently be ignored, nullified, or corrupted into acceptability.
Chapter 9 looks at Instrumentation. In this chapter we look at the conditions and invariances required in events involving measuring instruments if such events are to have credibility; in particular the notion of a Standard that theoretically defines the scale, and its confusion with a standard of acceptability, which is to be measured by the instrument, and which requires a scale in order to be located.
The various assessment modes are analysed in terms of their instrumental error. On these grounds alone all are found to be invalid.
Chapter 10 takes up the issue of comparability. What can be compared? Fundamental distinctions between more and less, better and worse are examined, their relations with uni- and multi-dimensionality shown, and the implications for rank ordering of students in tests and examinations unearthed. This leads to further examination of the differential privileging of sub groups and individuals when marks are added. The essential meaninglessness of such additions becomes apparent.
In Chapter 11 the relationship between rank order and standard is teased out in more detail: In particular the meanings given to the standard in the Judge and General frames of reference; how logical confusions proliferate when discourse jumps from one frame to the other; and how all categorisations involve standards and rank ordering, even though many advocates of "qualitative" assessment methods may want to deny this.
Chapter 12 leads from the implications of the Theory of Logical Types for assessment practices to an examination of the distinction between standard and quality. When the standard is seen, realistically, as unable to perform its function, quality is the notion with sufficient mythical, ideological, and intellectual status to replace it. This would produce a very different learning milieu.
Part 4 - Error analysed: In Part 4 the tools developed in Part 3 are used to discriminate particular sources of confusion and error within assessment events designed to categorise students.
In Chapter 13 the meaning of error in each frame of reference for interpreting assessments is considered. As the meaning of error changes with assessment mode, so do the methods designed to reduce such error. Procedures to reduce error in one frame are seen to increase it in another. From a perspective of oversight of the whole assessment field, this is another source of confusion and invalidity, particularly as it is rare for any practical assessment event to remain consistently within one frame of reference.
Chapter 14 addresses the question: What does a test measure? In terms of social consequences the answer is clear. It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named. The person who does the test has already accepted the name of the test and the measure that the test makes by the very act of doing the test. So the mark becomes part of that person's story and with sufficient repetitions becomes true.
My own conclusion is that tests have so many independent sources of invalidity that they do not measure anything in particular, nor do they place people in any particular order of anything. But they do place them in an order, along a single line of "merit," and that is all they are required to do.
Chapter 15 shows some of the ways in which psychometricians fudge; by reducing criteria to those that can be tested; by prejudging validity by prior labelling; by appropriating definitions to statistical models; and by hiding error in individual marks and grades by displaced statistical data, and implying that estimates are true scores. A number of specific examples of fudging are detailed.
In Chapter 16 some of the more recent work on validity is discussed, and its positioning as advocacy demonstrated. I conclude that in practice the very existence of validity is established, validity is indeed made manifest, through the denseness of the arguments about invalidity criteria used to refute such existence, together with the reassurance that the battle continues, and some gains have been made.
Reliability is also discussed as a problematic, rather than as an obvious prerequisite to validity. I conclude that most of the mechanisms designed to increase reliability necessarily decrease validity.
Part 5 - Synthesis: In Chapter 17 the notion of invalidity is reconceptualised, having both discursive and measurable components. Thirteen (overlapping) sources of error are examined, all contributing to the essential invalidity of categorisations of persons.
Part 6 - Application: In Chapter 18 I apply the philosophical and conceptual positioning, tools of analysis, and the reconceptualised sources of error developed in this thesis to the competency based assessment policies and practices of Australia in the 1990s. I show how the notion of competency standards is overtly central to the whole competency movement, the introduction of which is shown to be overtly politically motivated. Thus the crucial links between political power and educational standards that are argued for in Chapters 4 and 5 become transparent. I then go on to examine the invalidity of competency standards in the light of the thirteen sources of error specified in the previous chapter.
Chapter 19 presents two specific applications of invalidity sources; the first relates to national literacy testing, and the second to University grades.

Impact

Assessment practice is permeated with mythology and ideology; with confusions and contradictions; with epistemological and ontological slides; with misrepresentations of frames of reference for different assessment modes; with logical type errors and psychometric fudging, in which the constructs that determine error--labelling, construction, stability, generality, prediction--are either ignored or severely constrained in the determination and communication of error, in those rare cases where personal error and likely miscategorisation is publicly admitted.
I have no expectations for this study, but some hopes. A whistle blowing study is like a joke--its impact is a function of timing. And the best timing can only be determined in retrospect. My hope is that it will lead to a reduction of the violence that is attributable to the suppression of error in the categorisation of people.
