Chapter 19: National tests and university grades

Synopsis

In this chapter I apply the reconceptualised notion of invalidity to national literacy testing, and to the definitions of grades within my own university.

These are presented as specific examples of the potency of the invalidity conceptualisation.
 

National Literacy Testing

    Context

In its edition of 15-16 March, 1997, the newspaper Weekend Australian announced on the front page under the heading "All pupils face tests of literacy" that:

The literacy and numeracy of every Year 3 and 5 student will be tested from next year under a historic agreement between the Commonwealth, States and Territories yesterday.

The Catholic and independent schools sectors have indicated they will support the national testing program, which will be linked to uniform education standards to measure the reading, writing, spelling and mathematical ability of students.

. . . The federal Minister for Schools, Vocational Education and Training, Dr Kemp, described the literacy strategy as a "historic agreement for the children of Australia" because it stresses that every child starting school from next year will be able to read, spell, and add up within four years.

The literacy test is to be based on that developed some years ago by the NSW Education Department, and it is this test to which the following critique is addressed, in terms of the thirteen sources of invalidity.

    Temporal errors

Temporal errors are indicated by the differences in assessment description when the assessment occurs at different times.

No estimates of temporal errors in the national literacy testing program exist. They would, of course, be easy to obtain and would be small compared to some of the other sources mentioned here. Small, that is, for most students. But the same theory that predicts this also predicts that a small percentage of students (randomly placed and unfindable) would have large discrepancies. And even small discrepancies would destroy the notion of infallibility that seems to be necessary for such tests to be publicly acceptable. This is what test administrators call public confidence, and what I have more accurately named a psychometric fudge.

    Contextual errors

Contextual errors include all those differences in performance and its assessment that occur when the context of the assessment event changes.

Literacy is a concept of great educational importance, of diffuse and contested and multi-dimensional meaning. It involves at the very least reading and writing. Yet reading what under what conditions? And writing what under what conditions? A test defines the what and defines the conditions. Tightly specifying the conditions improves the reliability; yet at the same time it obviously disguises and increases the lack of generality, and hence increases the contextual invalidity.

Essentially, the context of test-taking is not the context in which literacy, in most of its forms, is demonstrated.

    Construction errors

Construction errors are indicated by all those differences in assessment description when the same construct is assessed independently by different people in different ways, whilst the broader context of the assessment is held constant.

It would be relatively simple to take samples of children and have teachers and researchers and the children themselves make independent assessments of various aspects of their literacy, and estimate construction errors by comparing the estimates with each other and with the result of the test. This writer has no doubt that such an experiment would presage the immediate cessation of such testing.

    Labelling errors

An assessment must be an indicator of something. It must have a name. Differences in the meaning of the name, both before and after the event, constitute confusion and hence error. Labelling errors are defined by all the differences given to the meaning of the assessment (what it actually measures) by all the participants in the assessment event(s), and by the users of the assessment information.

Literacy tests presume to measure literacy. But which particular aspects? What could any test score tell us about any of those aspects? What meaning is given to those aspects by any particular teacher? How does that meaning compare to that teacher's concept of literacy? And what action could be taken by any such teacher on the basis of those meanings to help any child more than that teacher is currently helping? The extent to which these questions produce diffuse and varied and contradictory answers gives an indication of labelling error. And the meaning of literacy includes such confusion. The problem is not solved by imposing a definition; this enables us to increase reliability, and reduce the apparent error in measurement. But it is a reductionist trick, a semantic scam. The concept of literacy is diffuse, so any attempt to measure it is, at best, extremely imprecise, and, at worst, meaningless and hence impossible.

    Attachment errors

Attachment errors are the ontological slides that occur when a description of a relational event is attached to one of the elements of that event; specifically, when a complex relational event involving the construction of a test, an interaction of the test with a person, and a judgment of an assessor, is described as a property of the assessed person, this is an error in attachment.

The implications of this source of invalidity for literacy testing are immense. Any information about the test cannot be unattached from the particular test and attached to the student as a "trait" or "ability." This involves a demystification of the whole process and its highly suspect theoretical underpinning. Such demystification relates it to the fundamental question "What do we really know about where this literacy score came from?" The answer is clear. A particular group of people selected a particular set of multiple choice test items, which students answered under particular conditions. The students were subsequently given scores which placed them in a rank order, and some of them were then classified as below a standard which did not exist until this group or another group were so classified.

The point to emphasise here is that the score does not belong to the student. It belongs to the experimental event of which the student was a part. Any movement beyond this point requires another experiment - which, of course, produces another event, with concomitant multiplication of confusion and error.

    Frame of reference errors

Practically, frame of reference errors are indicated by specifying the frame in which the assessment is supposedly based, indicating the errors according to its own and other frames, and indicating any slides or confusions that occur during the assessment events.

In testing programs on literacy the tests pretend to be in the Specific frame of reference. The tests are talked about as though there are clearly defined and accepted specific tasks which students must do successfully in order to be considered literate or numerate. And that there is some predefined standard to which appeal may be made. Neither of these claims is true. The test items which are the basis of complex statistical manipulations are subjectively chosen by test constructors from the pool available, which may include some that they themselves specifically construct. And there is no standard other than that defined by the test itself. Some test constructors talk of an absolute scale. They are deluding themselves (Behar, 1983). All test data are based on item statistics which are norm referenced from groups of test takers. So the tests produce a rank order of merit and the test controllers (test makers, educational administrators, or political funders) make arbitrary decisions about adequacy (Glass, 1978). What we can be certain of is that the tests will produce a rank order in which some students will obtain higher scores than others. That is what the tests are designed to produce. Any implications beyond this about adequacy are arbitrary value judgments.

    Instrumental errors

Instrumental error is implicit in the construction of the measuring instrument itself. It is what is conventionally called the standard error of the estimate, or is indicated by the spread of judgments of independent assessors about a particular performance on a particular test.

One assumes that in national literacy tests this relatively small source of error (simple reliability) will be known to test constructors, forgotten by test administrators, and withheld from teachers and students. Regardless, such an estimate of error gives no information about the error of a particular student, and withholding it conceals the statistical fact that only two thirds of actual students will have "true" scores within these limits, and that as the total numbers tested increase, an increased number of individual students will be given completely unacceptable estimates.
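The two-thirds figure can be illustrated numerically. The sketch below assumes, as classical test theory conventionally does, that measurement errors are normally distributed about the true score. The cohort size of 200,000 is a hypothetical number chosen for illustration, not a figure from the testing program.

```python
import math


def fraction_within(k: float) -> float:
    """Fraction of a normal error distribution lying within k standard errors."""
    return math.erf(k / math.sqrt(2))


def expected_outside(n_students: int, k: float) -> float:
    """Expected number of students whose error exceeds k standard errors."""
    return n_students * (1.0 - fraction_within(k))


if __name__ == "__main__":
    cohort = 200_000  # hypothetical cohort size, for illustration only
    print(f"fraction within 1 standard error: {fraction_within(1):.3f}")
    for k in (2, 3):
        count = expected_outside(cohort, k)
        print(f"expected students beyond {k} standard errors: {count:.0f}")
```

Under the normality assumption, about 68 percent of students fall within one standard error, and the absolute number of students with large errors grows in direct proportion to the number tested.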

At a more fundamental level, the instrument (the test) cannot measure anything because there is no Standard, no adequate theory-practice bridging to define the scale, no scale, and thus no measure that the scale may prescribe, that may subsequently be compared to a standard of acceptability.

    Categorisation errors

Categorisation errors derive from confusions about the definition of standard of acceptability, from differences in the meaning of what is being assessed and in the magnitude of its measurement, and in the variability of the judgment process in which the comparison with the standard is made.

Practically, categorisation errors are all those differences in assessment description that occur when particular data is compared with a particular standard to produce a categorisation of the assessed person.

The implications of this for literacy testing are profound. For not only is the meaning of the score highly suspect, but there is in fact no standard of literacy with which such a score may be compared. The standard is an arbitrary point selected after the event by the test makers and is based on the particular test, or on the particular items used in the construction of the test. Such circularity in definition produces a closed system that is the stuff of fantasy, but not of scientific measurement.

    Comparability errors

Comparability errors include all those confusions about meaning and privileging that inhabit the addition of test items, test scores or grades. Practically, comparability errors are indicated by constructing different summaries or summations according to competing models. The differences that these produce indicate the comparability error.

Literacy is a multi-dimensional concept. As such, a single dimensional scale can be used to measure the concept, but such a measure could not be given a meaning. In particular, any categorisation (involving a standard, assuming one exists) cannot be given a meaning, because it could never be certified whether any particular single-dimensional score was above or below that "standard." Because such meaning is central to the notion of validity, such inability to give a meaning makes any uni-dimensional test of literacy constitutionally invalid.
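The practical definition given above, comparing summations built under competing models, can be sketched as follows. The three students, their sub-scores on reading, writing and spelling, and the two weighting schemes are all hypothetical, invented purely to show how the rank order shifts with the model chosen.

```python
# Hypothetical sub-scores (0-100) on three dimensions of literacy.
SCORES = {
    "student_a": {"reading": 80, "writing": 50, "spelling": 60},
    "student_b": {"reading": 55, "writing": 80, "spelling": 65},
    "student_c": {"reading": 65, "writing": 65, "spelling": 70},
}

# Two competing, equally arbitrary, models for collapsing to one dimension.
MODELS = {
    "reading_heavy": {"reading": 0.6, "writing": 0.2, "spelling": 0.2},
    "writing_heavy": {"reading": 0.2, "writing": 0.6, "spelling": 0.2},
}


def composite(sub_scores, weights):
    """Weighted sum of sub-scores: one single-dimensional 'literacy score'."""
    return sum(sub_scores[dim] * w for dim, w in weights.items())


def rank_order(scores, weights):
    """Student names ordered from highest to lowest composite score."""
    return sorted(scores, key=lambda s: composite(scores[s], weights), reverse=True)


def rank_shifts(scores, weights_1, weights_2):
    """Number of rank positions each student moves between the two models."""
    first = rank_order(scores, weights_1)
    second = rank_order(scores, weights_2)
    return {s: abs(first.index(s) - second.index(s)) for s in scores}


if __name__ == "__main__":
    for name, weights in MODELS.items():
        print(name, rank_order(SCORES, weights))
    print(rank_shifts(SCORES, MODELS["reading_heavy"], MODELS["writing_heavy"]))
```

With these invented figures the top and bottom students swap places between the two models, although no sub-score has changed. The size of such shifts is one indicator of comparability error.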

    Prediction errors

Practically, prediction error is indicated by the differences between what is predicted (or more subtly implied) by the assessment data, and what is later assessed as the case in the predicted event.

There is an implication in the national literacy program that the scores show that some children are illiterate, and that without special intervention triggered by this test they will remain illiterate. Such an implication could be empirically tested, assuming there was some satisfactory definition of illiterate. I know of no such definition, or of any program to develop one or otherwise empirically test the effects of the testing.

    Logical type errors

Logical type errors occur whenever there is confusion between statements about a class of events, and statements about individual items of that class. Practically, logical type errors are made explicit when the explicit and implicit truth claims of a particular assessment are examined. Such exposure may invalidate such claims.

In a rare burst of intellectual honesty the earlier versions of the literacy test were headed "Aspects of literacy" (NSW, 1995). Such a test cannot be a test of literacy. Statements about some members of the class do not apply to the whole class. All literacy and numeracy tests have this problem. They are essentially a summation of the specific items that the test comprises, and assumptions cannot be made of implications beyond this. Psychometrics could be defined as a statistical sampling game that produces a fantasy about traits in order to sidestep the contradictions that flow from the reality that all test scores are summations of discrete elements, and that all information about the individual elements is lost in the summation.

    Value errors

Value errors are indicated by making explicit the value positions explicit or implicit in the various phases of the assessment event, including its consequences, and specifying any contradiction or confusion (difference) that is evident.

The National tests purport to give information about individual students that might lead to remedial action. The value appealed to is that of helping students and improving performance. The tests are not diagnostic and so give no information about what particular misconceptions or problems (if any) particular students may have, apart from the extremely error-prone response from one or two items. Even if such diagnosis were available, its usefulness would depend on teachers being able to use it to improve student performance. And since it is not known whether or not teachers have already targeted some children for extra attention, its usefulness would depend on whether the test produces the same group for special attention, and in cases of difference whether the National test produced a more valid selection.

As there is no evidence that the tests will help children, it may be less naive to suggest that the main value behind the test is to help politicians gain prestige by appearing to solve a problem (which may not exist).

    Consequential errors

Consequential errors are indicated by the differential positive and negative effects that individual teachers and students attribute to the assessment process. At a more profound level the test may involve an explication of the very construction of their individuality, and all of the potentially violating consequences of such constructions.

The focus of the testing will be on those who are lower in the rank order. Theoretically these will be identified, and will improve as a result of special instruction. The magical improvement kit has not yet been produced, so such consequences are doubtful, especially as literacy (as most people understand the term) is so dependent on a whole range of experiences outside the school. What is more certain as a consequence is that such students will be classified as "failures" or "remedial" and will, in many cases, construct their individuality accordingly.
 

Summary

Practically, the description (measurement) of a person's literacy is not dependent on any notion of a single truth, but rather on one of differences between multiple truths, all with some claim to legitimacy; these are implicit in the production of the assessment event, in the interpretations of the assessed and the assessor's experience of that event, including categorisations, and in the particular intended and received meaning of the communication of that judgment to others. The error becomes explicit when all of these phases of the assessment event are specified; when genuinely independent events are constructed; when independent categorisations are produced by participants in the event; when the judgments, and the meanings given to those judgments by involved persons, are compared.

When such errors, contradictions, and confusions are acknowledged, the pristine purity of the test score disappears, to be replaced by a wide fuzzy band of possibilities; then rank orders recede, standards evaporate, categorisations are exposed as fantasy, and the whole inane and monstrous structure crumbles.

National literacy tests have thirteen charges (at least) to answer before being considered valid. Many of these are so fundamental that I doubt any reputable educator would take the case.

 
 

University grades

    Context

Just as honesty begins with self, so truthfulness should not ignore the home campus. My own university has announced a new grading system for the categorisation of students (Flinders University of South Australia, 1997). An analysis of the grade descriptions indicates six criteria are used. A summary of the descriptors is given in Table 1.

In the next section I will examine this grading system in terms of the thirteen sources of invalidity.

    Temporal errors

If grades refer to a particular race that students have competed in, then temporal errors need not concern us. Description of the event includes a particular time and place and tomorrow is another day. If, on the other hand, they are presumed to indicate some skill or competency of the student, then they must also be presumed to have some constancy over time. Tomorrow is the same day in terms of traits and capacities and skills and understandings. At an ideological level the whole exercise depends on this. So if "skills" are developing then logically only the most recent performance should count. And if they are not developing then what are the students learning?

Table 1 Grade descriptions

Grade (marks)             | Core work               | Knowledge, competency | Texts    | Wider reading | Debates, approaches                        | Original and creative
--------------------------+-------------------------+-----------------------+----------+---------------+--------------------------------------------+------------------------
pass 2 (50-54)            | undertaken              | adequate              | basic    |               | some familiarity                           |
pass 1 (55-64)            | more                    | sound                 | sound    |               | good general level of familiarity          |
credit (65-74)            | additional              | sound                 | sound    | done          | apply a range                              |
distinction (75-84)       | considerable additional | advanced              | advanced | considerable  | broad familiarity and facility at applying | developing a capacity
high distinction (85-100) | considerable additional | highest level         | in depth | extensive     | highest level of proficiency in applying   | combining knowledge with
fail (0-49)               | fail to complete        | fail to demonstrate   |          |               |                                            |

Further to this, if they have actually learnt through the process of doing the project, or through any subsequent feedback, then the product becomes invalid because the state of the student is now different from that state when the product was produced, and another temporal error has been perpetrated.

In this sense, tests are premised on an assumption of student stasis; the more the student learns during or subsequent to any test information, the more that test information becomes outdated and hence in error.

    Contextual errors

The grade descriptors do not mention context. But they imply a range of possible contexts, assessment modes, media, and processes. In order to make sense of such grades we must infer that the performances on which they are based are independent of the context in which they are produced; that is, they must represent a fixed measurable property of the student rather than a particular response to contextual events. It has been argued in this thesis that to believe this is an ontological error. Regardless, it is obvious that human behaviour, including cognitive behaviour, varies markedly according to context, so to reduce contextual error of the grades it would be necessary to specify the context of the events resulting in students' products, and the events resulting in the assessor's product (the grade).

Without such contextual specification therefore the grades must of necessity be invalid.

    Labelling errors

There are two labels; the label that describes the measure, and the label that describes what is measured. The assumption of these descriptors is that the measure can exist independent of what is measured. That grades have a reality independent of what is being graded. That administrative convenience can become a substantive reality. As indeed it will. But at what cost to professional integrity or student justice?

And even if this assumption is not nonsense, there is still the problem of the meaning of the grade. As I have indicated, the grade demarcations are so vague that errors within each criterion must be immense. Further, once the criteria become combined into a single dimension all information about individual criteria is lost, so all meaning related to the criteria likewise dissolves.

    Attachment errors

As I have reiterated in many places in many ways in this thesis, information gained from tests is information about an event in which an individual student is an element. Any attempt to attach the description or data to the student, rather than to the total event, is an ontological slide. Attempts to attach it not only to the student, but to some particular conceptual entity which the student is fantasised to have, take us even deeper into the ontological bog. Error is reduced as the completeness of the event is recaptured. Such recapturing, of course, nullifies the use of simple numerical and graded categories.

In this case we have, in terms of the definitions of the grades, at least six independent classification events, all of which are supposed to contribute to the final grade. Error is indicated by any differences or confusions of grade within or among such events.

    Frame of reference errors

The criteria would appear to indicate the Specific frame. Within each criterion there are indicators of grade demarcations. However, these are hardly adequate for specifying any standards. What is the difference between basic, sound, advanced, and in-depth? How do you draw fine lines between some familiarity, good general level of familiarity, broad familiarity and facility at applying? And how do you differentiate between developing a capacity for creativity, and combining knowledge with creativity? How else would you know a capacity was being developed than by relating it to knowledge? Obviously within the specific frame the indicators for cut-offs are hopelessly inadequate, and in this frame the system is grossly invalid.

Perhaps though this is unfair. Perhaps it is only political fashion that has forced this appearance of competency. The word "highest" appears twice, and this is obviously a normative term belonging to the General frame. Yet there are no percentiles given for grade boundaries, so standards are not possible to define within this frame. There are of course marks given that are appropriate to each grade. The Calendar makes it clear, or at least implies strongly, that these marks are awarded as subdivisions of the grade, rather than that the grades are based on some previously determined marks. What is done in practice is moot. Regardless, the system is unworkable in the General frame, because there are no guidelines in this frame to decide grade boundaries. Within this frame therefore immense errors of miscategorisation must be expected.

In the Judge's frame, where, as the reader will recall, there is no error by definition, there is no problem. There never is. Judges have no problem differentiating between more core work, additional core work, and considerable additional core work. Even when, as appears to be the general case from the descriptions of courses given in the Calendar, no core is specified. Or even, indeed, between the different "soundness" that differentiates pass level 1 from credit when applied to sound knowledge and competencies, and the sound understanding of texts.

It seems apparent that the criteria here are a competency smoke screen, a vague set of hints that allow assessors to continue to do what they have traditionally done; create a comparative order of merit of doubtful meaning, and at the same time allocate rather arbitrary grade boundaries to the rank order. The specification of criteria, naive and inadequate as they are, nevertheless fortifies the "scientism" of the Judge's frame, armouring its uncertain certainties with a coating of current assessment dogma.

    Instrumental errors

With a plethora of assessment modes (assignments, practical work, observations, tests, examinations), it is sometimes difficult to actually locate the instrument, the "objective" machine that makes the measure. And of course there is no such objective machine. The fantasy that tests of various kinds are measuring instruments unfortunately remains a prevailing myth in the assessment of persons. The assessment modes are merely techniques used to fix a performance in time and space, to give it reality through some semblance of permanency. This allows, at least theoretically, independent judgments to be made of their "quality" or relative merit.

In practice the actual instrument, the place where the standard resides, where the conceptual theory-practice link is established, the mark is produced, the comparisons are made, and the categorisations established: all of these exist inside the mind of the examiner. So there is no objective instrument, and the assessment is clearly in the responsive mode, subject to all the normal variations and anomalies of idiosyncratic subjective judgments. The use of single examiners, which is the norm for university assessments, disguises this reality by nullifying in advance all competing judgments.

    Categorisation errors

Within each criterion the categorisation boundaries are defined by words or phrases of extraordinary vagueness and imprecision, when it is remembered that this purports to be the official description of the categories that determine students' futures.

For example, assuming that the "core work" for a particular course has been precisely defined, then it might indeed be possible to determine whether it had been "undertaken." Or even if "more" than the required work was done, meriting a pass 1 classification. But how to distinguish this "more" from the "additional" core work required for a credit, or the "considerable additional" work required for a distinction or a high distinction, is unspecified. And how does the "sound" knowledge and competency required for a pass 1 differ from the "sound" knowledge and competency required for a credit, and in what way is that different from the "advanced" knowledge and competency required for a distinction or the "highest level" required for a high distinction? Surely it would be easier to be honest and say: "Rank order the students somehow and then draw arbitrary grade boundaries!"

    Comparability errors

How are estimates for different criteria to be summated? The final grade can only have a meaning in relation to the criteria if the loadings for each criterion are transparent, for how can we compare grades if they can mean different things? And how can we compare them anyway? How does "developing a capacity for original and creative work" in Commercial Law B compare with the same description in Human Resource Management or Mathematics 1A or Cognitive Science? What could "developing a capacity" possibly mean in any context, for that matter? And how can you compare the core work between subjects when it isn't specified in most cases? Indeed, if it isn't specified in some detail the whole grade description structure is entirely unworkable within a subject, for how could "additional" be judged without knowing what it was additional to?

    Prediction errors

Whilst there are no overt predictions made in terms of these grades, there are some covert ones of immense significance. Certainly entry to higher degree programs is largely determined by the grades obtained, so there is an implicit prediction that students with lower grades are less suited to such further work. And, of course, students who fail are predicted as unsuited to qualify for work in particular fields.

Performance in academic course work, even if it could be accurately assessed, is very different from performance in professional work contexts. Yet the former is often, and increasingly, a necessary prerequisite for the latter. So the predictive validity of the grades would seem to be of vital importance, especially in those professions that require academic qualifications.

As indicated in Chapter 15, correlations between any selection criteria and later job performance tend to be very low indeed, and correlations of 0.3 are considered very adequate. That this is ten percent better than pure chance indicates the immensity of the predictive error, and the extraordinary extent of the social injustice perpetrated through such mechanisms.
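One way to make the arithmetic behind this claim explicit is sketched below. It gives two conventional readings of a correlation of 0.3: the proportion of variance accounted for (r squared), and, assuming selection score and job performance are jointly normally distributed, the probability that a candidate falls on the same side of the median on both (Sheppard's formula, 1/2 + arcsin(r)/pi). The normality assumption and the choice of these two readings are illustrative assumptions; the text does not specify which reading is intended.

```python
import math


def variance_explained(r: float) -> float:
    """Proportion of variance in one variable accounted for by the other."""
    return r * r


def same_side_of_median(r: float) -> float:
    """Probability that both measures fall on the same side of their medians.

    Assumes a bivariate normal distribution (Sheppard's formula).
    With r = 0 this is 0.5, i.e. pure chance.
    """
    return 0.5 + math.asin(r) / math.pi


if __name__ == "__main__":
    r = 0.3
    print(f"variance explained: {variance_explained(r):.3f}")
    print(f"agreement about the median: {same_side_of_median(r):.3f}")
```

On either reading the gain over chance is roughly one tenth: about 9 percent of the variance, or about 60 percent agreement against a chance rate of 50 percent.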

    Logical type errors

Reference to Table 1 indicates there are six elements to the class of each grade. Are all elements required for the grade to be awarded? Or are five out of six enough? Or is one element enough for a higher grade? Could a person graded pass 2 be at a high distinction in five elements and be categorised pass because they had not done wider reading? How would we know that? If the elements must all be attained for a given level of grade then necessarily the lowest level in any element will alone determine the grade. If individual common sense gives the answers to these questions what can grades mean when common sense is so disparate?
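The ambiguity can be made concrete with a small sketch. It takes the hypothetical student from the questions above (high distinction on five elements, pass 2 on wider reading) and grades them under two rules: a conjunctive rule, where every element must reach the level awarded, and a compensatory rule that takes the median level. Both rules, and the ordinal coding of the grades, are assumptions introduced for illustration; the grade descriptions themselves specify neither.

```python
import statistics

# Grades in ascending order; the list position serves as an ordinal code.
GRADES = ["fail", "pass 2", "pass 1", "credit", "distinction", "high distinction"]

# Hypothetical student: top level on five elements, lowest pass on one.
STUDENT = {
    "core work": "high distinction",
    "knowledge, competency": "high distinction",
    "texts": "high distinction",
    "wider reading": "pass 2",
    "debates, approaches": "high distinction",
    "original and creative": "high distinction",
}


def conjunctive(levels):
    """All elements must be attained: the lowest element alone sets the grade."""
    return GRADES[min(GRADES.index(g) for g in levels.values())]


def compensatory(levels):
    """Strong elements offset weak ones: the (lower) median element sets the grade."""
    codes = [GRADES.index(g) for g in levels.values()]
    return GRADES[statistics.median_low(codes)]


if __name__ == "__main__":
    print("conjunctive rule: ", conjunctive(STUDENT))
    print("compensatory rule:", compensatory(STUDENT))
```

The same six judgments yield a pass 2 under one rule and a high distinction under the other, which is the spread of meaning that an unspecified combination rule leaves open.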

Attention to possible logical type errors of this sort indicates inevitable massive confusion and thus error in the interpretation of these grades.

    Value errors

What are some of the value errors implicit in this system? An obvious one is that "more and less" is synonymous with "better and worse." This shows very clearly in the descriptors for core work, knowledge and competency, and wider reading. The clear implication of these columns is that more is better.

This has considerable social as well as semantic significance. There is a value clearly implied that students should do more work than is specified or required, and that merit is accumulated through such activity. There are uncomfortable parallels here with current work practices in a competitive market, where workers are increasingly expected to work longer hours for no additional remuneration, and this exploitation becomes twisted by ideology to become a symptom of professionalism.

Another value, whose implications influence comparability errors, is that of terribly ordered learning. The six criteria must march along in unison otherwise they are unusable. It seems, for example, that original and creative thinking can only occur after masses of core, and additional conceptual work, has been understood. Is this true? Cannot innovative practical methodologies be constructed with very little specific knowledge? Cannot original and creative practical experiments and equipment design be produced to specifications with almost no knowledge of background theory? The limiting of the terms original and creative to the top two grades involves very prejudicial assumptions.

    Consequential errors

How quickly and how intensely do students accept the judgments of their assessors as to the relative merit and idiosyncratic opinion (disguised as absolute value) of their academic performance? To what extent is the camouflage of error, the appearance of certainty, a predominant factor in this acceptance? To what extent does such acceptance affect later work, either positively or negatively? To what extent is the academic student constructed by the apparently objective measurements of their grades?

Such effects may be consistent within discernible sub-groups of students, or may be individually differentiated. Regardless, the questions indicate a particular category of invalidity, and in fairness to all students demand answers if the extent of invalidity for this criterion is to be explicated.
 

Not a problem

Does the confusion with its attendant error that is evident here create a problem for assessment in academia? It would seem not. Hopeless as the descriptors are, they are probably no better or worse than those they replaced, or than others elsewhere. Academics just do not seem to problematise confusion and error in the measurement of "standards," at least not in academic discourse.

Is validity an issue? I checked the journal Assessment and Evaluation in Higher Education. Of a total of 195 articles in this journal from 1986 to 1996 only nine dealt, directly or by implication, with the problem of error, or inconsistency, or lack of validity in grading or marking. Of these nine there were three articles on validity which did not deal with inconsistency or error as any sort of a problem or issue. Four dealt with marker reliability, and two of these trivialised the notion of error in their conclusions.

Closer to home, Orrel's (1997) examination of the thinking-in-assessment of "everyday academics" sometimes revealed some angst in assigning a grade, but little concern that the "standard" itself might be illusory. And she commented that "A notable silence in the academic's discourse was any reference to the considerable technical measures that exist for assuring validity and reliability in assessment" (p. 397). But then, as they were clearly in the Judge's frame of reference, such comment would have constituted a mind-shattering contradiction.
 

Conclusion

In the vernacular, it's a matter of "no worries, mate, business as usual!"
