Chapter 15: The psychometric fudge

Synopsis

The first part of the chapter details some of the ways in which psychometricians fudge: by reducing criteria to those that can be tested; by prejudging validity by prior labelling; by appropriating definitions to statistical models; and by hiding error in individual marks and grades by displaced statistical data, and implying that estimates are true scores.

In the second part of the chapter a number of specific examples of fudging
are detailed; in particular, the item response theory fudge, selection
and prediction fudges and the great Queensland reliability fudge.
Constraining the definition

Reliability and validity are two concepts dear to the heart of test constructors and others involved in the field of psychological and educational measurement. I'll begin my analysis of the fudge that characterises the field by looking at reliability, or the lesser fudge.

Reliability in classical test theory is (indirectly) an estimate of the error you'd expect if the student did a hypothetical parallel test. And in generalizability theory it's an estimate of the difference between the "universe" score and the score on any particular test. In both cases it's about the reliability of the test, or more accurately of the test-testee interaction, and not of the assessment; of the extent to which two tests give the same score, not the extent to which this particular description of student performance, based on a test, confirms or contradicts other such descriptions, which may or may not include a test (Behar, 1983, p19). Note the way the mathematical model simplifies and constrains the world.
It would be easy to believe the reliability of the test was about the extent
to which the test describes course outcomes or student performance or work
successfully completed. It isn't. It confines itself to the closed world
of the test. It's about its ability to reproduce itself.
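
A minimal sketch, in Python, of how the error implied by a reliability coefficient is usually turned into a band around a single mark. It assumes the textbook classical test theory relation, standard error of measurement = SD × sqrt(1 − reliability). The function names and the figures used (a standard deviation of 15, a reliability of 0.91, a mark of 60) are invented for illustration and refer to no test discussed in this chapter.

```python
import math
from typing import Tuple


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Textbook classical test theory relation: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, z: float = 1.96) -> Tuple[float, float]:
    """Band of z standard errors either side of an observed mark (1.96 is roughly 95%)."""
    return observed - z * sem, observed + z * sem


# Hypothetical figures, chosen only for illustration.
sem = standard_error_of_measurement(score_sd=15.0, reliability=0.91)
low, high = score_band(observed=60.0, sem=sem)
print(f"SEM = {sem:.2f} marks; band = {low:.1f} to {high:.1f}")
```

With these invented figures the standard error is 4.5 marks, so even a test with a reliability of 0.91 locates a mark of 60 only somewhere between about 51 and 69. Note that the band describes how well the test would reproduce itself, which is all that the reliability coefficient addresses.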
Mathematical models and true scores

The concept of the true score or universe score is central to the derivation of the theory. That is, it is a theoretical assumption. That does not mean that it necessarily has any place in the interpretation of the theory, that it corresponds to some measurable property of real people. And even if it does, the theory indicates that we can never know the true or universe score, only an estimate of it. And that estimate is always associated with error.

So in practice, in the world out there, there is no true score that can be attached to a person or an event. There is no thin line beside which a number is placed. Even before the empirical evidence starts to come in, there is only a wide fuzzy band, and all we can say mathematically is that the true score is probably in there somewhere. And if it is only probably in there somewhere, then for all practical purposes, for an individual person it isn't in there at all. In practice there is no true score. There is no stable rank order. And if in practice there is no stable rank order, then there can be no stable practical standard.

The history of achievement testing represents an enormous confusion of theory with practice. A model is not true or false. It is useful inasmuch as its predictions accord with empirical data at some points. It is not necessary that the assumptions of the theory correspond to actual situations in the world in which its predictions are applied. The assumptions of quantum mechanics from which the theory derives cannot be validated empirically. That is why they are assumptions. The metaphor in which the assumptions may be enclosed is useful inasmuch as deductions from the theory are experimentally verifiable. But such assumptions are not considered "true." Nor are they considered as having some "real" existence out there in the "atom."
Psychometricians, on the other hand, assert that their assumptions about a true score or universe score imply that such a score refers to some attribute, some measurable property, of a person. The person can then be classified, because the number is a measure of something called achievement, or ability, or whatever. In criterion-referenced tests it is achievement in a specified "domain" of knowledge, and is called a "trait." Regardless, this achievement is assumed to be some psycho-cognitive state which can be accurately described by finding a corresponding point along a one-dimensional scale.

Why are these very intelligent people wanting to insist that their theoretical assumptions are consistent with empirical reality, when theories in general require no such correspondence? And when the fundamental assumption, the primary axiom of this particular theory, is that such correspondence can never be achieved? Why this enormous urge to represent unidimensionally a variety of human performances which are obviously multidimensional? Why this obsession with numbers, this illusion of numerical accuracy, this delusion of descriptive adequacy?

At this time, let us merely note that all of these activities are related to a psychological ideological assumption about human ability, or skill, or achievement. Some particular quantifiable quality of people that belongs specifically to them, and is thus independent of gender, race and class; that is unsullied by environmental factors; that is a permanent fixture of the person independent of the conditions of its production. That is, indeed, the clinging legacy of the nineteenth century belief that "intelligence was a unitary and immutable trait. It had no kinds or varieties, only ranks." (Wolf, 1991, p36).
As well, these assessment activities are related to an ideological social assumption that this quality may be quantified and be represented along a unidimensional line of almost infinite length, along which each person may now be accurately placed and categorised, their place permanently fixed, and their relative position in the order of things firmly established. And this conception of "ranking, fixedness, and predictability provided the "scientific" basis for two enduring institutional responses to the diversity of styles, cultures and academic backgrounds of students: universal testing and the systems of tracking students." (Wolf, 1991, p38). And, further to Chapter 4, note that

The General frame and the true score

The logic of the General frame does not require any notion of a true score. The true score is a statistical artefact, a mathematical artifice, devised to defend a quite fantastic and monstrous proposition about ordering and classifying with great accuracy large numbers of people. Here is that monstrous proposition spelt out in more detail.

The political proposition that is being rationalised, justified, mystified, constructed and implemented in the notion of a true score is this: that it is possible in any area of human achievement to produce an accurate order of merit of "ability" in that area, and to attach to each person a number, a score, that fixes them firmly in position within that hierarchical order.

What do we actually know empirically? That under certain conditions it is possible to increase the stability of the rank order of merit of people on "test" results, in "test" situations. And that the more we can eliminate personal idiosyncrasies of setters and markers by averaging, and the shorter the time span of repeating the testing, the more the rank order is generalizable to other setters and markers of similar tests constructed by similar people.
We do not know empirically whether there is an asymptotic limit to this stabilisation; theoretically, and practically, there is always an error of measurement. We do know that this fits empirical data quite closely in regard to sampling assessors for marking. That is, when students do very similar tasks and the idiosyncrasies of assessors are "averaged" out. We do not know empirically whether a similar stabilisation occurs when results are averaged over different occasions. There is no a priori reason to believe that they should be, especially for achievement tests with a high memory component. Indeed, there is every reason to believe that the actual performances of particular students would vary considerably, and differentially, when assessed over time, given that their forgetting curves are non linear and of different shapes. Thus sampling across these dimensions could produce an increase in error in the General frame, not a decrease. It would be very dangerous to collect such information, however, for it would contradict the assumption of stability that the notion of skill or ability implies. Empirically the true score is not known, and can never be known. Empirically estimates of the true score can be obtained, and these are always different, because all of the measurements we make contain an error. In practice then, error is indicated by the difference between estimates, not between estimates and some hypothetical "true score." That is why the notion of true score is not necessary for simple and specific and individualised estimates of error, though theoreticians and ideologues may well require the idea for their own particular purposes. The notion of the true score, then, despite its enormous ideological importance, is practically unattainable, irrelevant, and misleading. It is a theoretical input to the mathematical theory of testing, not a practical output. 
The statement that there is a true score is a statement about a theoretical statistical assumption, not about an attainable empirical reality. Further, such assumptions of mathematical models need have no direct links to any properties or aspects or qualities of phenomena "out there" in the real world.

Models and items

There is no doubt that one way to get information about achievement (what a person has done), or skill (what a person can do), or ability (what a person could do given the opportunity), is to get them to answer some questions about what it is they are supposed to have achieved or have the ability in. And one rather contrived way of doing this is to use pencil and paper tests. Further, a particular form of this technique is to use test items of a multiple choice or short answer form.

It requires an enormous suspension of rational thinking to believe that the best way to describe the complexity of any human achievement, any person's skill in a complex field of human endeavour, is with a number that is determined by the number of test items they got correct. Yet so conditioned are we that it takes a few moments of strict logical reflection to appreciate the absurdity of this.

Test items not only determine the form and media of testing as paper and pencil tests, but also specify the type of question as short answer or multiple choice. In other words, talk of test items tends to narrow dramatically the sort of performance situation in which the person being assessed is to be put, and also severely limits the sort of description that might be given.

Why is this important? Because psychometricians have defined reliability and generalizability in terms of test variance, which is in turn determined by the characteristics of test items. Likewise, estimates of construct validity, on the rare occasions they are estimated empirically, are determined by statistical manipulations of item characteristics.
By appropriating terms like reliability and generalizability and validity,
and defining them in terms of the mathematical properties of particular
tests, professional test agencies and examining institutions perpetrate
another grand fudge. These concepts become narrowly construed as properties
of tests, or relations between numbers, rather than as useful criteria
on the basis of which concerned people may judge the whole assessment exercise.
Item response theory and the absolute scale

Item response theory allows us to construct a scale in the same way that classical test theory and generalizability theory enable us to construct a true or universe score.

The magic is in the word "construct." It is theoretically constructible, not empirically constructible. In fact, the theory determines that the scale is absolute but improbable; the actual scale produced measures the probability (or if you prefer, the improbability) that any person to whom the scale is applied actually has that reading on the (theoretically) invariant scale that the theory constructs.

Just as objective tests are highly subjective instruments in which the
marking can be done objectively, but it is implied that the assessment
is objective; and just as the true score can never be measured but it is
implied that the estimated score is that score; so the invariant scale
of the criterion referenced test can never be physically produced, but
it is implied that the test produced contains that scale, rather than its
very error-prone physical manifestation.
Criterion referenced tests

Criterion referencing, as applied by professional test agencies, is not directly referring to course objectives or to student learning. Criterion referencing refers directly to test items. A criterion referenced test is one that is prescribed by tight delineations of the structure of particular tasks to be included in the test.

Advocates of criterion referenced tests often claim that the performance on such a test is judged in relation to an absolute rather than a relative standard. That is, that scores on criterion referenced tests are measures of achievement in a particular domain and do not depend on relative merit, but are informative in their own right.

This claim is another psychometric fudge. Criterion referenced scores are in no way absolute scores. They are norm-referenced. The norm-referencing is done prior to the test construction process at the item level, and not at the total test level during a specific application of the test (Behar, 1983; Glass, 1978).

Criterion referenced tests contain all of the errors of Mastery tests plus one additional labelling error of great ideological significance. A sub-group of tests in this area, sometimes called Domain referenced tests, have developed a whole theory based on test item characteristics, which is very efficient. Efficient in the sense that students can be tested with fewer items than in the random sampling model for the same error (an error which, as usual, is never attached to individual scores). This is achieved by using known levels of difficulty of the items (based on random or other specified population estimates) in computing the student's score. Nothing wrong with this of course. Except the labelling claim that these scores are absolute measures of a "latent trait."

What is a latent trait? It is some "hidden characteristic" which some students have more of than others, and which is measured by the test.
And those who have more of it are more likely to be able to answer correctly the more difficult items. As all of the items in a Domain referenced test relate to some particular area of learning, such as reading comprehension, or computer skills, or simple calculus, or newspaper editing, or social skill, or whatever, then it doesn't really matter what "latent trait" means. The assertion that "it" can be measured absolutely is what constitutes its ideological power. Here is the ultimate rationalisation for intellectual and social stratification. Here is the number that describes each person's place on the continuum of ability or skill or whatever for any label that testing agencies wish to attach to the domain of items.

On the surface, of course, it is the specific label that assumes social importance. The claim being made, or at least strongly implied, is that such a test is an absolute measure of reading comprehension, or computer skill etc. But in focussing on the label, we are likely to miss the frightening significance and ideological sleight of hand that produced the "latent trait" as some substantive property or quality permanently attached to the person tested, somehow magically unrelated to the highly subjective, contrived, interrelational world where a student sits at a desk, reads some questions, and places ticks in computer marked boxes.

Such tests construct current fashionable truths. They are being presented as the latest panacea for testing human ability, or "skills" or "competencies" as they are now called; they are being presented as the theoretical support for an invasion of competency based assessments in all areas of human measurement (in schools, businesses, bureaucracies, or wherever else hierarchies operate).
So we should be clear about three things. The first is that constructing a domain referenced test and naming it produces no evidence that the test measures any sort of trait or ability that can be attached to an individual person (Lord, 1980). The second is that such tests are not absolute, or error free, measures; the scores are related to relative merit, and there is no "standard" performance or score that relates to any minimum or other grade of "competency" that can be theoretically attributed to any score (Glass, 1978).

Which takes us to the third point, which is a logical conclusion from
the previous two. Domain referenced tests can make little contribution
to a field of "competency" assessment which purports to describe (or more
significantly measure) some "standards" of competency in various "skill"
areas of human performance.
Limiting constructs, limiting error

Let's examine briefly how some of the more general criteria of assessment (labelling, construction, stability, generality, prediction) tend to be limited to what can be controlled by test makers.

Labelling is achieved by the simple act of giving a name to the true, or universe, or latent trait, score. Which means, in practice, to the estimated score. The errors implicit in the communication of what that label means, between those who define the course, those who teach it, those who produce the test, those who do it, and those who consume its product, are thus not considered. All of these people will give their various meanings to the label, and make their judgments accordingly. We may be certain that these meanings vary considerably. How much they vary will probably never be known, because it is not in the interests of any institution to uncover yet another source of error. Labelling errors are not currently considered in any estimate of test error. I believe they are immense.

If communication is its effect, then such confusions are, to the student, irrelevant. To the student the meaning of the label is the grade or the mark attached to it. Within the structure that contains the assessment system, the meaning of the label, as distinct from the meaning of the mark, amounts to little more than ideological gossip.

At least some students recognise the meaninglessness of the label. I remember vividly a television program which followed the fortunes of four students through the final months of their preparation for the University Selection Examination in New South Wales. One student in particular, a science student, a paragon, studied hard and reaped the ultimate reward. Straight A's. Just after he received his results he was interviewed for the last time. He was obviously pleased with his success.

"The marks?"

"The understanding. The knowledge."

"Oh that. No, I don't expect that to be of any use to me at all. I'm going to be a lawyer."

One thing is certain though; no course has stated as its major, or even minor objective, the ability to answer a pencil and paper test in a given time under stress conditions. And why not? Surely this is the essential behavioural objective.

Stability becomes narrowed to test reliability, more accurately called internal consistency, an internal test measure that cannot take account of variation over time and place and assessors. Theoretically test-retest reliability is one form of reliability, but in practice such estimates are rarely obtained.

Generality becomes narrowly construed as related to the extent to which the test samples the universe of possible test items, or how well the item specifications cover the domain. Generality becomes a function of test items and is called generalizability. Generalizability ignores previous performance in different contexts, forms and media. It ignores all performance other than the purely cognitive response to simulated experience of a multiple choice or written form. It thus ignores all cooperative and all production modes of expression. It reduces human response to the act of recognising a "best" answer, to conforming adequately to some authority's view of importance, relevance and reality, or to answering someone else's question in a particular way.

And prediction becomes tied to numbers and test scores. In this psychometric
world we are no longer concerned with the extent to which actual people
are helped to function in differential social situations of great complexity.
Prediction does not attempt to describe the relationship between a particular
set of learning experiences for some person, and how helpful that is in
some future situation for that person. Rather it ranks a group of people
on their "success" in the "learning" situation, then ranks them again in
some criterion situation. The correlation between the two rank orders
represents the predictive value of the test. Not of the course, of the
test. And not of its relevance to the quality of their performance, but
to its correlation with some person's or group's ranking of their relative
performance. And note that even if this correlation is high, which is unusual
unless a similar test has been used to measure the criterion, this tells
us nothing about whether the relation is in any way causal.
How the fudge works

The psychometric fudge occurs through the following processes:

Firstly, the criteria by which assessment is determined are chosen so that they are easily adaptable to the construction of tests and to the statistical manipulation of test data. Criterion-referenced tests are just that: only those criteria that are appropriate for referencing test items are chosen.

Secondly, the validity of the test is prejudged by labelling it to describe what it is supposed to measure. Such is the power of labelling that this exercise in wishful thinking, this untenable assertion, is interpreted by most people, including the test constructors who become entranced with their own propaganda, as being an accurate description. At a deeper level still the mathematical theory itself contains such terms as true score, ability, and trait before any empirical information at all is available; that is, before any connection (let alone correspondence) with the world outside mathematics is established.

Thirdly, definitions are appropriated and defined to fit specific statistical models; in particular, by narrowing the universe of possible test situations to a universe of possible test items (random sampling model), or by narrowing the universe of possible test items further to the universe of suitable test items (domain referenced testing). In both cases the performance of students outside of such test situations is disregarded, or downgraded, and the right to appropriate the personalising labels (ability, trait, true score) is assumed.

Fourthly, the data is presented in a way that is misleading at best and deceitful at worst, by hiding error of individual marks and grades with obscure and displaced statistical data, thus implying, to all but the statistically sophisticated, that estimates are "true" scores. Further, the implication is made that such tests are accurate as predictors, a claim that in most cases cannot be substantiated (Reilly, 1982).
Finally, estimates of confusions and errors related to construct validity are ignored, usually theoretically, and almost always practically. We could look at these fudges as things done by individuals, and thus attributable specifically to them. From this psychological frame how could we make sense of this fudging behaviour? At best the fudges can be interpreted as logical or psychological slips propped up by delusions of grandeur. At worst they represent academic chicanery and political manipulation in high degree (Nairn, 1981, p58). If we regard this in a sociological context, however, a different picture
emerges; psychometricians may well be regarded as the moral guardians of
the age of competency, the high priests who hold society stable by propagating,
preaching, and propping up the gospel of the Standard, and the cult of
the linearly determined individual that it constructs and supports.
In the beginning

"What's in a name?" Bill Shakespeare said, "that which we call a rose by any other name would smell as sweet." Maybe so, yet that which we call a trait when it is just a mathematical function takes on a different odour indeed. Names have a magic of their own, and the stickiness of the name is very dependent on the power of the namer.

Lord (1980) produced the seminal work on item response theory, in his book Applications of item response theory to practical testing problems. It is possible here to trace in detail the birth of a fudge. Early on there are some laudably honest statements:

The true score is a mathematical abstraction. A statistician . . . does not try to define the model parameters as if they actually existed in the real world. A statistical model is chosen, expressed in mathematical terms undefined in the real world. The question of whether the real world corresponds to the model is a separate question to be answered as best we can. It is neither necessary or appropriate to define a person's true score or other statistical parameter by real world operational procedures (p6).

In item response theory . . . the expected value of the observed score is still called the true score (p7).

Undeterred we press onwards. Five pages later Lord commences the serious work in developing the theory:

We wait expectantly till page 45 to find out what ø means mathematically. "A person's number right score . . on a test is defined . . . as the expectation of his observed score x. It follows immediately . . that every person at ability level ø has the same number right true score." Then on page 46 the crucial point finally emerges: "true score . . . and ability . . . are the same thing expressed on different scales of measurement." And just in case you missed it, the best estimate of this true score, this ability, is the number of items answered correctly on the test.

Thus on his own admission Lord has done exactly what he claims statisticians do not do.
He defines the parameter as having "real world" status when he calls it ability. (Just as he implies it has some objective or propositional reality when he calls it true.) Its mathematical status is simply the number of items answered correctly under the idealised conditions specified in the theory. Its empirical status is the actual number of items answered correctly, or some statistical manipulation of that number.

There is one more aspect of this fudge that we need to look into. It is the fascinating use of the adjective "latent" in front of trait. Hambleton & Swaminathan (1982) elucidate:

Item response theory doesn't need any assumption about traits at all. The talk of traits and abilities is redundant and gratuitous. After all the terribly refined and elegant statistical manipulations, item response theory simply produces a total score which (given knowledge of the structural characteristics of individual items) allows a prediction of the probability with which any particular item will be answered correctly by a person with that total score. It does require a certain consistency of correct (or incorrect) response for specific items on the part of the examinee. All else, as far as item response theory is concerned, is fantasy.

Incidentally, such prediction is in no way an explanation; to assume that is to invoke the dormative principle; the total score is just a summary of information about a particular person answering the individual items. Such a score cannot now be used to explain why the items were answered correctly.

On page 55 Hambleton and Swaminathan (1982) come clean; rather by accident
than design, I fear. "Ability", we read, "is the label that is used to
describe what it is that the set of test questions measures." Precisely.
And what it measures is an estimate of probabilities of answering certain
test items correctly. To what extent that measure relates to any "characteristic"
or "trait" or "ability" of the examinee may only be known after "construct
validation studies . . . (which) validate the desired interpretations of
the ability scores" (p55). Shouldn't that read "validate or invalidate"?
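
The kind of prediction described above can be sketched in a few lines of Python. This is an illustration only, assuming a two-parameter logistic item response function, which is one common form and not necessarily the model used by Lord or by Hambleton and Swaminathan. The names `p_correct`, `theta`, `difficulty` and `discrimination`, and the three item difficulties, are invented for the example; `theta` stands for the parameter written ø in the text.

```python
import math


def p_correct(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Two-parameter logistic item response function (an assumed model).

    Returns the modelled probability that a person located at `theta`
    answers an item of the given difficulty correctly.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical item difficulties on the same arbitrary scale as theta.
items = {"easy": -1.0, "medium": 0.0, "hard": 1.5}
theta = 0.0

probabilities = {name: p_correct(theta, b) for name, b in items.items()}

# The "number right true score" is just the expected number of items
# correct: the sum of the item probabilities at this theta.
expected_number_right = sum(probabilities.values())

for name, p in probabilities.items():
    print(f"{name:>6}: P(correct) = {p:.2f}")
print(f"expected number right = {expected_number_right:.2f}")
```

Nothing in the calculation refers to a trait. The only inputs are numbers fitted to patterns of item responses, and the only outputs are probabilities of further item responses; the word "ability" enters solely as the name given to `theta`.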
Mistakes: probability, correctness, and checking

Item response theory cannot predict whether a particular person (whose true score we don't know but whose estimated score we do know) will get a particular item (whose characteristics we know) correct or incorrect. The theory will predict the probability of getting it correct. In practice it will either be correct or incorrect (probabilities are only 1 or 0).

So item response theory never even pretends to estimate what people know or can do. It only claims to estimate the probability that they can do certain things. Then the assumption (and that's exactly what it is) is made that this indicates an ability of the person in that area of cognition. It might mean something else. Or it might not.

When I worked as a test constructor I noticed one aspect of answering
tests that was interesting. When groups of year 10 students did the 100
item tests most would finish in about ninety minutes. When groups of year
8 students did the tests most would finish in about 60 minutes. The year
10 students got slightly better results (about 0.3 S.D. better). Conventionally
this would be interpreted as meaning that they had more ability, or simply
more maturation. But given my perceptual data, perhaps it just means that
they did more checking!
Psychometric selection myths and fudges

Hulin, Drasgow & Parsons (1982) complain that the controversy and rhetoric about standardised educational admission tests seem to have developed independently of the psychometric evidence about the usefulness of admission tests in reducing errors in prediction. They claim that Cleary, Humphreys, Kendrick, & Wesman (1975), Rubin (1980), Linn, Harnisch, & Dunbar (1981) among others, have produced summaries of large numbers of studies relating college and professional school admission test scores to performance in post secondary and postgraduate educational institutions:

Cleary's (1975) data involved correlations between verbal and mathematical SAT scores on the one hand and High School grade averages and College grade averages on the other. The correlations ranged from 0.35 to 0.50. But the correlations between the High School and College grades were higher at 0.64. So two points about Cleary's study: firstly, the correlations are at best only 25% better than pure chance. Is this "appreciably large"? Secondly, they were considerably lower than the correlations from grade averages, so why were they necessary at all?

Rubin's (1980) study involved the use of the Law School Admissions test to predict first year grades in 82 law schools. The correlations ranged from 0.03 to 0.5; after corrections for range (Linn, 1981), the correlations ranged from 0.2 to 0.7. In 14 of the schools they were below 0.35, which is 12% better than chance. Is this "appreciably large"?
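
The percentages quoted above appear to be squared correlations, that is, the proportion of variance in one ranking that is accounted for by the other. That reading is an assumption about how the figures were reached, but the arithmetic is easily checked with a short Python sketch (the function name is invented for the example):

```python
def variance_explained(r: float) -> float:
    """Proportion of variance accounted for by a correlation r (r squared)."""
    return r * r


# Correlations quoted in the text for the Cleary and Rubin studies.
for r in (0.35, 0.50, 0.64):
    print(f"r = {r:.2f}  ->  r squared = {variance_explained(r):.0%}")
```

On this reading a correlation of 0.50 accounts for 25% of the variance and one of 0.35 for about 12%, which matches the figures in the text; the grade-to-grade correlation of 0.64 accounts for about 41%.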
It is known that issues of construct validity introduce far more sources of error than are involved in simple predictive correlations of this sort. It is therefore difficult to understand how this sort of justification, which is quite common in the literature, goes on for decades virtually unchallenged within the psychometric community. On the other hand, compared to the abysmally low correlations often obtained in such predictive correlational studies, perhaps they are appreciably large.
However, these studies raise another issue and another fudge: the correction
(always upwards) of predictive correlations.
Fudging the predictive correlations
Correlations between a selection instrument and later performance are often corrected for range restriction and for criterion unreliability. Correcting for range restriction is reasonable in principle: generally some of the people tested were not selected, so had no opportunity to be in the final sample. Statisticians then consider it appropriate to estimate what the correlation would have been had all of those tested actually been appointed. After the correction, of course, it is a correlation about something different; it becomes the estimated correlation between test performance and later performance of all those who sat for the test. Prior to the correction it was the correlation between test performance and later performance of all those who performed later. Different sample, different correlation. Which to use depends on what question you ask. Automatically raising the correlations is a fudge.
Correcting for criterion unreliability is a different matter. Most job tasks are multi-dimensional; that is, they involve many tasks that are only weakly correlated with one another. And college grades are likewise composites based on weakly correlated components. If a single correlation is to be obtained with a multi-dimensional job performance, the various ranks or gradings have to be collapsed into one single rank or grading; and that requires some arbitrary and explicit loading to be applied to each dimension (see Chapter 10 on Comparability). Even when this is done (and it often isn't), there is still the assumption that there is indeed a meaningful rank order to be obtained. If most people in most jobs or in most courses do their work adequately (just as most people drive cars adequately), then we would expect correlations to be low, and ultimately, where training schemes are very adequate, to be zero.
In such situations, the reliabilities would be low not because of rater inadequacy that can be corrected for, but because raters are attempting to separate performance when it cannot be separated, and/or are trying to pretend that a multi-dimensional performance is in fact uni-dimensional. In such cases it is obviously not appropriate to artificially inflate the correlations because of rater unreliability.
The changes are more than trivial. A study by Schmidt, Hunter &
Pearlman (1981) involved 150 000 people and 2000 predictive correlations.
Before correction the average correlations between eight aptitude tests
and job performances in clerical job categories ranged between 0.15 and
0.25. After the statistical corrections, however, they magically rise to
between 0.3 and 0.5. Still not good. In fact, still quite awful. But they
certainly look better than before, and aptitude tests survive again to
live another day.
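To see why both corrections can only push a correlation upwards, here is a minimal sketch using two textbook formulas: the standard correction for direct range restriction (often called Thorndike's Case II) and the correction for attenuation due to criterion unreliability. This is an assumption on my part; the text does not say which formulas the cited studies applied. The standard deviation ratio and criterion reliability below are hypothetical values chosen for illustration, not figures from Schmidt, Hunter & Pearlman.

```python
import math


def correct_for_range_restriction(r: float, sd_ratio: float) -> float:
    """Textbook correction for direct range restriction.

    sd_ratio is the SD of test scores among everyone tested divided by the SD
    among those actually selected (1.0 or more when selection narrows the range).
    """
    return r * sd_ratio / math.sqrt(1.0 - r * r + r * r * sd_ratio * sd_ratio)


def correct_for_criterion_unreliability(r: float, criterion_reliability: float) -> float:
    """Correction for attenuation: divide by the square root of the criterion reliability."""
    return r / math.sqrt(criterion_reliability)


# Hypothetical inputs, for illustration only.
SD_RATIO = 1.5
CRITERION_RELIABILITY = 0.6

for observed in (0.15, 0.25):  # the uncorrected range quoted in the text
    widened = correct_for_range_restriction(observed, SD_RATIO)
    corrected = correct_for_criterion_unreliability(widened, CRITERION_RELIABILITY)
    print(f"observed {observed:.2f} -> corrected {corrected:.2f}")
```

Because the standard deviation ratio is at least 1 and a reliability is at most 1, neither formula can ever lower the observed figure. The lower the assumed criterion reliability, the larger the upward adjustment, which is exactly the situation the text describes where ratings are unreliable because performance cannot meaningfully be ranked.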
The great Queensland reliability fudge
I was talking to the Principal of a secondary school in Queensland. Students in year 12 are assessed internally, with the help of some external monitoring. I suggested that there might be some problem with reliability. "It's 0.95," he replied with confidence. "Excellent," I responded with some scepticism. Then I decided to check the data.
The study is titled Random sampling of student folios: a pilot study (Travers, 1994). In this study
So this is not a blind reliability study:
The astute reader will also doubtless have expected a very large halo effect, and would not be surprised if reliability coefficients, at least in relation to levels of achievement, were very high. As indeed they were. Eighty per cent of achievement levels remained unchanged, most of the aberrant cases being one level lower, indicating, no doubt, the "high standards" of the review panellists. The overall correlation figure obtained for agreement between school exit and review level rung placements, on a fifty point scale, was 0.95. The authors were particularly pleased with the rung placement data:
It follows that acceptance of given levels of achievement (halo effect), combined with random allocation of rung placements, is sufficient to account for the 0.95 correlation that was used to justify the whole procedure, not only of the pilot study, but indeed for the whole examination system, as evidenced by the Principal's comments.
Rather than evidence of precision in rung placements, which determine
tertiary entrance scores, the data generate evidence of randomness, and
another psychometric fudge is perpetrated by well-meaning psychometricians
on a gullible public.
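The claim that agreement on levels plus random rungs is enough to produce a correlation of this size can be checked with a small exact calculation. The sketch below rests on assumptions that are mine, not the study's: five achievement levels of ten rungs each (giving the fifty point scale), students spread evenly across levels, and rungs allotted uniformly at random and independently by school and review panel. The parameter p_drop is the assumed probability that the panel lowers a student by one level.

```python
from itertools import product
import math

LEVELS = 5             # assumed number of achievement levels
RUNGS_PER_LEVEL = 10   # assumed rungs per level, giving a fifty point scale


def rung_correlation(p_drop: float) -> float:
    """Exact Pearson r between school and review placements when rungs are random.

    The review panel keeps the school's level, except that with probability
    p_drop it places the student one level lower (never below the lowest level).
    """
    n = sx = sy = sxx = syy = sxy = 0.0
    rungs = range(1, RUNGS_PER_LEVEL + 1)
    for level, school_rung, review_rung in product(range(1, LEVELS + 1), rungs, rungs):
        school = (level - 1) * RUNGS_PER_LEVEL + school_rung
        if level > 1:
            outcomes = [(level, 1.0 - p_drop), (level - 1, p_drop)]
        else:
            outcomes = [(level, 1.0)]
        for review_level, weight in outcomes:
            review = (review_level - 1) * RUNGS_PER_LEVEL + review_rung
            n += weight
            sx += weight * school
            sy += weight * review
            sxx += weight * school * school
            syy += weight * review * review
            sxy += weight * school * review
    mean_x, mean_y = sx / n, sy / n
    covariance = sxy / n - mean_x * mean_y
    var_x = sxx / n - mean_x * mean_x
    var_y = syy / n - mean_y * mean_y
    return covariance / math.sqrt(var_x * var_y)


print(f"levels always agree:      r = {rung_correlation(0.0):.3f}")
print(f"one in five drops a level: r = {rung_correlation(0.2):.3f}")
```

Under these assumptions the coefficient stays well above 0.9 even though the rung within each level carries no information at all. The exact figure would shift with the real distribution of students across levels, so this illustrates the argument rather than reproducing the study's 0.95.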
The General frame and the true score
The General frame of reference as hijacked by psychometricians contains as an essential element of its assumptions the notion of a true score; a further element of those assumptions is the notion that it is possible in some way or another to approach that true score; to get measures empirically closer to the true score by various procedures implied by the particular model. For example, in classical test theory by increasing the number of items on the test; in generalisability theory by sampling more tasks more randomly from a bigger collection of possibilities; in item response theory by having more items of appropriate characteristics which are uni-dimensional; in domain referenced tests by having the domain of items criterion referenced to a high degree.
Allied to this frame but not tied to it so tightly are the various notions of reliability and validity that have not been developed as part of the mathematical models mentioned in the previous paragraph, but have emerged from more general considerations of the notions of assessment, rather than of tests. In my terminology, these considerations have challenged the artificial constriction of the general frame by psychometricians, and have restored, through notions of construct validity and consequential validity, at least some of the error components previously bypassed. However, this has produced a contradiction with the notion of the true score that has not been made overt. For example, as described in Chapter 16, most achievement tests are not made more valid by increasing their reliability; on the contrary, high reliability is seen to be, in most circumstances, an indicator of low validity. For most achievement areas involve a large number of disparate activities, and there is no a priori, or even post-empirical, reason to believe that these activities are uni-dimensional, or otherwise closely inter-correlated.
I argue in Chapter 15 that generalising the assessment events across contexts,
or time, or media, or even value assumptions or frames of reference, does
not (as does generalising across selection of test items or markers), reduce
the standard error of the estimate; on the contrary, we have every reason
to believe that it will increase such error, to a point where the whole
notion of true score becomes unsustainable. After all it is not by chance
that so much space is given in test manuals to ensuring the conditions
under which the test is given are kept constant. Obviously this indicates
the fragility of the test to contextual shifts. (On second thoughts, it
could be as much a ritual designed to imply scientific accuracy, and sustain
the notion of fairness). Regardless, it is clear that contextual shifts
increase the error term, whilst contextual control artificially reduces
it; artificially because no argument is ever given, nor could it be sustained,
that this particular test context is superior to any other for the measurement
of this "ability." So once again the price of higher reliability is lower
validity.
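The classical test theory claim mentioned at the start of this section, that adding items brings the measure closer to the true score, is usually expressed through the Spearman-Brown formula. The sketch below uses that standard formula as an assumption of mine; the text does not name it. It shows how much depends on the added items being parallel to the existing ones, which is the uni-dimensionality assumption questioned above.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test lengthened by length_factor.

    Assumes every added item is parallel to the existing ones, i.e. that
    all items measure one and the same dimension.
    """
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical starting reliability, for illustration only.
START = 0.5
for factor in (1, 2, 4, 8):
    print(f"{factor}x as many items: predicted reliability = {spearman_brown(START, factor):.2f}")
```

The formula says nothing about what happens when the added tasks sample different contexts or dimensions. In that case the gain in reliability it predicts need not appear, and any gain bought by keeping the items narrowly alike is the price in validity described above.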
Preview
We could go on dealing with the specifics, but it is time to present the greatest fudge of all. Validity. For as will become clear, the very definition of validity creates a discourse around it where every test may be assumed valid until proved otherwise, and as there are no specific descriptions as to how such a proof might be constructed, and no specific standards of acceptability to which such descriptions might be compared, all assessments may claim to be valid.