Education Policy Analysis Archives
Volume 8 Number 23
May 12, 2000
ISSN 1068-2341

Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2000, the EDUCATION POLICY ANALYSIS ARCHIVES. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.
School-based Standard Testing

Craig Bolon
Abstract

School-based standard testing continues to evolve, yet in some ways it remains surprisingly close to its roots in the first two decades of the twentieth century. After use for many years as a diagnostic and as a filter for access to education, in the closing years of the century it has been pressed into service for state-run political accountability programs. In this role, it is generating vehement controversy that recalls protests over intelligence testing in the early 1920s. This background article explores primary characteristics and issues in the development of school-based standard testing, reviews its typical lack of qualification for political accountability programs, and suggests remedies to address major problems. In general, the attitude toward new techniques of assessment is skeptical, in light of the side-effects and unexpected problems that developed during the evolution of current techniques.
Survival of the Fittest

School-based standard testing began a dream decade in the early 1950s, driven by waves of public anxiety over Soviet "dominos," nuclear weapons, Sputnik and the "missile gap." Now, so many years later, it can be hard to imagine the intensity of fears that the Russians were ahead of everybody else, not just in the size of their standing army but in scientific knowledge, inventions and industry. There was widespread agreement that the U. S. needed to identify talented people and train them for critical occupations. (Note 1)

Of course we know more of the dreary facts today: a Russia of gray poverty and workplace spies, burdened with heavy but narrow investment to produce arms, rockets and nuclear bombs. But in those times, who knew? We saw North Korea fortified with MiG-15s, the Hungarian revolt crushed with Russian tanks, and then the Berlin wall built. Russia had been four years behind the U. S. in testing an atomic bomb but only one year behind with its first thermonuclear blast. And although the U. S. employed the Nazi rocket designers from World War II, Soviet Russia had a space satellite first, winking at us and mocking "the American century."

And so it was, into the breach against Godless communism, (Note 2) that we launched our homespun Scholastic Aptitude and Iowa tests. Few questioned the methods or values. In the climate of those days, school-based standard testing was an engine of progress. (Note 3) It would promote technical expertise and fairly chosen leadership to right the balance and put America first again.
Background
Standard Tests

The distinguishing features of a standard test are uniform administration and some form of calibration. Before routine use, standard tests or component items will be tried out with groups intended to represent populations of test-takers. These trials are used to measure distributions of scores and other properties of a test (Rogers, 1995, pp. 256-257 and 734-741). After calibration, test scores are typically reported by using a formula derived from the calibration (to percentile ranks, for example).

Beginning in the 1910s, statistical metrics were developed to characterize test items and report scores (Rogers, 1995, pp. 197-208, 317-325 and 382-388). The IQ score and the SAT scaled score ranging from 200 to 800 are among the well-known metrics. A quantitative approach helped give standard tests the appearance of objectivity and encouraged a test format that is easily adapted to numerical scoring. Multiple choice and short answer questions quickly became the conventional format. Such questions are scored only as right or wrong. While in principle there is nothing to prevent a standard test from using essays, extended reasoning and scales of partial credit, reliable scoring of extended answers and essays requires careful training and monitoring of test evaluators and substantially more effort. Rushed and inept evaluation of extended answers can be at least as troublesome as restricting testing to multiple choice and short answer formats.

Standard tests have long been distinguished as having "speed" or "power" formats, meaning that they are strictly timed or that they are loosely timed or untimed (Rogers, 1995, p. 256, and Goslin, 1963, pp. 148-149). The distribution of scores is deliberately widened by strict timing. Many common school-based standard tests, including the Stanford, California and Iowa achievement tests, claim to measure knowledge and skill but are in fact "speed" tests.
More recent distinctions are proposed between so-called "norm-referenced" and "criterion-referenced" tests (Rogers, 1995, pp. 653-666). Supposedly a "norm-referenced" test has a calibration relative to a population, while a "criterion-referenced" test has an absolute standard (for example, basic competence to drive a motor vehicle). However, for practical purposes nearly all school-based standard tests are "norm-referenced," because critical decisions about how to use the scores are made after score distributions have been measured. We used to call this "grading on the curve." In fact, wild attempts to produce "criterion-referenced" tests, without knowing how many people can actually pass them, generate some of the horror stories of testing.

Another recent and somewhat misleading distinction is so-called "high-stakes testing," meaning the use of test scores to make decisions that critically affect people. Supposedly this is a new practice. Actually it is quite old; parts of the Chinese civil service were closed to applicants who could not pass required examinations more than twenty centuries ago (Reischauer and Fairbank, 1958, p. 106). Beginning in the nineteenth century, standard tests were developed to place students in French schools according to ability. During World War I, U. S. Army recruits were assigned to combat or support missions on the basis of IQ scores.

According to current psychometric standards, it is improper to use a test for some purpose for which it was not "designed." Ninety years ago, however, intelligence tests were quickly appropriated to identify "morons," "imbeciles" and "idiots," who were then to be sexually restricted. Claims were advanced that experienced testers could readily identify "feeble-minded" people by observation (Gould, 1981, p. 165). We are not as far away from those days as some would like to think.
Recent applicants who failed a new, uncalibrated teacher certification test were denounced as "idiots" by a prominent Massachusetts politician. (Note 7)

Although some strong advocates of standard testing were once inspired by egalitarian views (such as Conant, 1940), standard tests have long been instruments for social manipulation and control. In an irony of the late twentieth century, tests like the former Scholastic Aptitude series, once praised as breaking the stranglehold of social elites on access to higher education, became barricades tending to isolate a new, test-conscious elite which, as we will see, largely tracks the social advantages of the old elite.

Aptitude, Achievement and Ability

School-based standard testing is largely a phenomenon of the twentieth century. An early product, the "intelligence scale" published by Alfred Binet and Théodore Simon in 1905, was intended to identify slow learners. By the 1920s, the testing movement had split into two camps which remain distinct today (see Goslin, 1963, pp. 24-33). The Binet-Simon scale and its offspring (such as the IQ test produced by Lewis M. Terman in 1916, the Army Alpha and Beta tests organized by Robert M. Yerkes during World War I, and the Scholastic Aptitude Test designed by Carl C. Brigham in 1925) all claimed to measure "aptitude." The essay exams of the College Entrance Examination Board, founded in 1900, the Stanford Achievement tests, first published in 1923, and Everett F. Lindquist's Iowa Every-Pupil tests, developed in the late 1920s and early 1930s, claimed instead to measure "achievement." Tests of "aptitude" try to measure capacity for learning, while tests of "achievement" aim only to measure developed knowledge and skills.

From their earliest days, standard aptitude tests have been clouded in controversy.
It has never been clearly shown that "aptitude" can be measured separately from knowledge and skills acquired through experience (see Ceci, 1991; also see Neisser, 1998, and Holloway, 1999, on changes over time). Standard achievement tests, while nominally free of these snares, share assumptions about language and cultural proficiency. Performance on almost any test is strongly influenced by language skills. Likewise, all tests rely to some degree on trained and culturally influenced associations and styles of thinking.

Despite longstanding claims of distinct purposes, standard aptitude and standard achievement tests may have more similarities than differences. Standard achievement test scores tend to correlate with standard aptitude test scores, as shown by Cole (1995) and others. To some observers, such as Hunt (1995), this simply shows that bright people learn well, and vice-versa. To others, it suggests that much of what is being tested might be called test-taking ability (see Hayman, 1997, and Culbertson, 1995).

Most content of the widely used school-based standard tests can be viewed as collections of small puzzles to be solved rapidly by choosing options or writing brief statements. Such a pattern of tasks is rarely encountered by most adults in everyday life. By design, the times allowed to complete standard tests are typically too short for a sizeable fraction of test-takers, putting great stress on rapid work and leaving little opportunity for reflection. For some strictly timed tests favoring men it has been shown that the same tests conducted without time limits favor women (see Kessel and Linn, 1996).

Standard test designers may assign high scoring weights to test items written to be ambiguous, so that they will encourage wrong answers (see Owen and Doerr, 1999, pp. 70-72). Right answers are guided in part by trained or culturally acquired associations: intuitions about a test designer's unstated viewpoint.
When ambiguous questions are removed, differences in scores between ethnic groups may be reduced. Test designers sometimes say that ambiguous questions "stretch the scale," differentiating the more skilled from the less skilled. Owen and Doerr (1999, pp. 45 ff.) suggest instead that they raise the scores of test-takers who have the favored patterns of associations and thinking.

The stressful properties of a typical standard test make test-taking into a sort of mental gymnastics, an ability that may well have its uses but does not necessarily predict performance in other situations (see Sacks, 1999, pp. 60-61). We recognize many special skills, such as remembering complex patterns in card games, multiplying numbers in one's head, or solving crossword puzzles. People who do these things deftly may also perform well in other pursuits, or they may not.
Predictive Strengths

Standard tests are promoted on the basis of claims to predict future performance. Their predictive strengths are measured by how well they do this. Despite heavy use of standard tests in circumstances that may critically affect people's lives, there have been remarkably few evaluations of these tests by organizations independent of the test vendors. The underlying substance of predictive evaluations is sometimes shallow. For example, it may be claimed that a standard test required for acceptance to a school program helps to predict the likelihood of graduation, when a key criterion for graduation is the score on a similarly organized standard test.

For a standard test to be useful, it cannot merely predict performance to some degree. It must significantly improve the accuracy of prediction over readily obtained information. Unless it does so, the effort of testing is wasted. (Note 8)

During the last forty years, predictive strengths of the SAT, ACT, GRE and similar aptitude tests have been independently evaluated. Scores from these tests improve predictions of first year grades by at most a few percent of the statistical variance over predictions based solely on previous grades, family income and other personal factors. (Note 9) For later and broader measures of performance, the predictive strengths of these tests evaporate. Sometimes negative correlations have been found: lower performance associated with higher scores. (Note 10) In response to the low predictive strengths of standard aptitude tests, growing numbers of colleges have stopped requiring them as part of applications. (Note 11)

Predictive strengths of standard tests are falsely enhanced when they are used to "track" or group students in schools, providing extra opportunities to some while denying them to others. The favored students stand to gain not only skills and knowledge but also self-esteem, which has been shown to correlate with higher test scores.
(Note 12) Ability grouping based on standard tests is a form of "high-stakes testing" which has been practiced for at least 80 years in U. S. public schools. We can clearly distinguish between the selection procedures of public schools, which have a legal duty to treat every student fairly, and those of taxpaying private institutions, which may not. Of the public schools, we can surely ask, "Why not provide opportunity to everyone?"

Beyond the schoolhouse door, school-based standard tests show hardly any predictive strength for creativity, professional expertise, management ability or financial success. (Note 13) However, these tests stress either generalized test-taking abilities or subjects that are only occasionally relevant to adult life. Tests for competence in specific skills have been used successfully to predict whether workers can perform tasks that require those skills. For example, some temporary employment agencies now administer technical skills tests to new job-seekers before sending them out to interview with potential employers. This practice has increased employer satisfaction with job performance.

Errors of Testing

All measurements are subject to potential error. Compared with physical measurements, the errors in standard test scores are enormous. There are many sources of error. These include:
Vendors and promoters of standard tests do not often discuss errors of testing. When they do, they usually bury information in opaque language, tables and formulas found in "technical reports" that may be hard to obtain. Careful reading of such information often reveals defects in the error evaluation as well as large errors.

Test vendors typically present themselves as diligent in reducing or eliminating mechanical, consistency, computer and systematic errors. There are well developed methods for controlling these gross errors. However, such errors do occur. Advanced Systems, a company used by the Massachusetts Board of Education since 1986, was embarrassed by errors in score reporting in Kentucky and lost its Kentucky contract in 1997 (see Szechenyi, 1998, and "Problems," 1998). Gross errors seem to be more common with smaller and newer test vendors than with larger and longer established ones.

The most common error measurement for a standard test is its "reliability." By convention, this describes the range of scores which a test-taker would receive in taking repeated, comparable versions of a test (Rogers, 1995, pp. 61-62, 368-378 and 741-743). A narrow range means high reliability: a test-taker would be likely to receive about the same score on repeated tests. Because training effects occur when tests of a particular type are actually repeated, indirect methods must be used to estimate reliability, such as mathematical models. Details of these methods can be adjusted to change estimates of reliability.

When mechanical, consistency, computer and systematic errors have been well controlled, reliability mainly measures random errors arising from unpredictable, individual circumstances of test-takers. Such errors are often larger than is generally known. As cited by Owen and Doerr (1999, p.
72), the Educational Testing Service has estimated that, on average, individual differences of less than 70 points for its SAT Verbal scores and 80 points for its SAT Math scores are not significant. These margins increase for high scores. Massachusetts (1999a, p. 86, Table 14-4) has estimated there is only about a 56 percent chance that a fourth-grader who is advanced in English language arts, according to its standards, will receive an "advanced" rating on its MCAS fourth-grade English language arts test.

People who are unfamiliar with the large random errors of standard test scores often assume that the scores can be used reliably to rank-order test-takers according to ability. In fact, random errors of testing are so great that scores can be used at most to classify individuals in a few levels. Using only four levels to classify MCAS scores, Massachusetts (1999a, p. 86, Table 14-4) has estimated substantial likelihoods, ranging from 8 to 46 percent, that an MCAS test-taker will be misclassified.

Many types of bias errors have been discovered in standard tests. For example, if the format of a test is changed from multiple choice to essay, different groups of test-takers are favored. A study performed by the Educational Testing Service found that multiple choice questions on its advanced placement tests favored men and European-Americans, while essay questions favored women and African-Americans (cited by Sacks, 1999, p. 205). Grouping test-takers with high essay and low multiple choice scores and those with the reverse pattern, the study showed comparable college grades for the two groups but a sixty point difference in their average Educational Testing Service SAT scores, in favor of the group with high multiple choice scores (Sacks, 1999, p. 206).

People tested using a language in which they are not fluent are likely to do much worse than native speakers of the language.
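To return briefly to random error: the point that scores can classify individuals into at most a few levels can be illustrated with a small simulation. The sketch below is not drawn from the Massachusetts technical report or from any vendor's method. It assumes normally distributed true scores and errors, defines reliability as the ratio of true-score variance to observed-score variance, and uses hypothetical cut points; the function names and all numbers are illustrative only.

```python
import random


def level(score, cuts):
    """Return the index of the band a score falls in (0 = lowest band)."""
    return sum(score > cut for cut in cuts)


def misclassification_rate(reliability, cuts, n=20000, seed=0):
    """Estimate how often an observed score lands in a different band
    than the true score.

    Assumes true scores and errors are independent and normally
    distributed, with reliability = var(true) / var(observed).
    """
    rng = random.Random(seed)
    # With var(true) = 1, reliability r implies var(error) = (1 - r) / r.
    error_sd = ((1.0 - reliability) / reliability) ** 0.5
    observed_sd = (1.0 + error_sd ** 2) ** 0.5
    wrong = 0
    for _ in range(n):
        true_score = rng.gauss(0.0, 1.0)
        observed = true_score + rng.gauss(0.0, error_sd)
        # Rescale the observed score so the same cut points apply to both.
        if level(true_score, cuts) != level(observed / observed_sd, cuts):
            wrong += 1
    return wrong / n


if __name__ == "__main__":
    # Hypothetical cut points (in standard-deviation units): four bands.
    cuts = (-1.0, 0.0, 1.0)
    for r in (0.95, 0.90, 0.80, 0.70):
        rate = misclassification_rate(r, cuts)
        print(f"reliability {r:.2f}: about {rate:.0%} misclassified")
```

Under these assumptions the misclassification rate rises as reliability falls, and adding more cut points would raise it further, which is the sense in which random error limits scores to a few coarse levels.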
Tests that require reading, in the formats used for most standard testing, assume reading proficiency. Individuals with poor reading proficiency, whatever the cause, are at major disadvantage with respect to others who do not have such limitations. Bias caused by test timing and ambiguous questions has been previously mentioned. Most attempts to compensate for bias involve identifying substantially impaired individuals and providing them extra test time. There is little evidence that test bias is actually corrected with this approach (see Heubert and Hauser, 1999, p. 199).

Perhaps the greatest source of bias and content error in school-based standard testing is the conventional process of standard testing itself, as contrasted with rating actual performance. When an educational assessment should measure success at significant tasks, such as writing a research report or investigating a technical theory, it may be impossible to design a standard test with much accuracy or predictive strength.

In the U. S., there has been a movement toward replacing standard testing with criterion-based "performance assessment" (see Appendix 6). A goal of this movement, also called "authentic assessment," is eventually to integrate educational testing with the ordinary processes of teaching and learning. There have been attempts to use performance assessment as part of state testing programs in Kentucky (1990-1997) and California (1991-1995), reviewed by McDonnell (1997, pp. 5-8 and 62-65).

School Accountability

The performance of public schools became an issue in the U. S. almost as soon as support for public education began. In 1845 the Massachusetts Board of Education printed a voluntary written examination to measure eighth-grade achievement. Most students could not pass the test. Schoolmasters complained that knowledge tested did not match their curricula. After a few years the test was abandoned (see Appendix 2).
In 1874 the Portland, Oregon, school superintendent distributed a curriculum for each of eight school grades. At the end of the school year, he administered written tests on the curriculum. Test scores were published in a newspaper. Based on test scores, less than half the students were promoted that year and the following year. An uprising by parents and teachers then led to dismissal of the superintendent and an end to the practices of publishing scores and denying promotion on the basis of a test score alone. (Note 14) Since those days similar initiatives and reactions have often occurred throughout the U. S.

The U. S. has sponsored a continuing expansion of public education for 350 years. Most people did not expect to graduate from eighth grade until late in the nineteenth century. High-school graduation became a normal expectation only in the 1930s. Today, we are still struggling with rising expectations that include college. At each stage of this growth, critics have condemned the lowering of educational standards and demanded accountability. However, each of these stages can also be seen as intrusion into a formerly elite province of education by large numbers of students who would previously have been excluded. For several years, levels of performance go down as the system adapts to less prepared students. Over a longer period, curricula change, often abandoning cultural traditions for more practical approaches.

School accountability became a public demand during the first two decades of the twentieth century. (Note 15) Over the ten years from 1905 through 1914 the U. S. accepted the largest flow of immigrants in its history, averaging more than a million per year. Immigration, coupled with stronger school attendance laws, raised school enrollments and increased the fraction of students for whom English was not a native language. Declines in student achievement were noticed and became an object of public concern.
At first standard tests were used to document declining student achievement, but they did not provide a method to improve it. By 1920 many urban school systems had started to use the newly available intelligence tests to measure student aptitude; they grouped students in classes by IQ. (Note 16) Educators hoped to improve performance by providing instruction that was adjusted to student aptitudes. In 1925 a U. S. Bureau of Education survey (cited by Feuer et al., 1992, p. 122, footnote 91) showed that 90 percent of urban elementary schools and 65 percent of urban high schools had adopted this approach. As immigration declined and school attendance became more uniform, student achievement tended to stabilize, and public concern relaxed. Despite warnings from progressives such as John Dewey and Walter Lippmann about a "mechanical civilization" run by "pseudo-aristocrats" (Dewey, 1922), IQ testing and the multiple choice test format had acquired prestige as techniques to improve public schools.

Strong U. S. demand for school accountability arose again in the 1970s through the 1990s. This time aptitude testing and finances played significant roles. Acceptance of Scholastic Aptitude Test scores as a measure of merit by highly selective colleges was regarded by many people as sanctioning a measure of merit for public schools. Average SAT scores for schools and communities began to circulate as tokens of prestige or shame. During the period from 1963 through 1982, the Educational Testing Service reported a continued decline in its national average SAT scores, followed by a slower recovery, as shown by the scores in Table 1.

Table 1
| Test / Year | 1963 | 1980 | 1995 |
| SAT Verbal | 478 | 424 | 431 |
| SAT Math | 502 | 466 | 482 |
Table 2
| Reading Scores |
| | 1984 | 1996 | Change | 1984 | 1996 | Change |
| Grade 11 | 292 | 291 | -1 | 285 | 279 | -6 |
| Grade 8 | 260 | 261 | +1 | 256 | 252 | -4 |
| Grade 4 | 216 | 220 | +4 | 204 | 206 | +2 |
| Writing Scores |
| | 1984 | 1996 | Change | 1984 | 1996 | Change |
| Grade 11 | 291 | 290 | -1 | 287 | 273 | -14 |
| Grade 8 | 273 | 264 | -9 | 267 | 260 | -7 |
| Grade 4 | 212 | 213 | +1 | 204 | 200 | -4 |
| Math Scores |
| | 1982 | 1996 | Change | 1982 | 1996 | Change |
| Age 17 | 304 | 309 | +5 | 292 | 303 | +11 |
| Age 13 | 277 | 275 | -2 | 258 | 270 | +12 |
| Age 9 | 226 | 236 | +10 | 210 | 227 | +17 |
| Science Scores |
| | 1982 | 1996 | Change | 1982 | 1996 | Change |
| Age 17 | 284 | 296 | +12 | 276 | 288 | +12 |
| Age 13 | 254 | 255 | +1 | 239 | 251 | +12 |
| Age 9 | 222 | 234 | +12 | 214 | 224 | +10 |
| States under "school reform" | | States without "school reform" | |
| Alabama | 58% | Connecticut | 74% |
| Florida | 58% | Maine | 72% |
| Georgia | 55% | Massachusetts | 76% |
| Louisiana | 58% | New Hampshire | 75% |
| Mississippi | 57% | New Jersey | 83% |
| North Carolina | 62% | New York | 62% |
| South Carolina | 54% | Pennsylvania | 76% |
| Texas | 58% | Rhode Island | 71% |
| Virginia | 76% | Vermont | 90% |
Only one southern or southwestern state with major "school reform" had
a normal graduation rate above two-thirds, while only one of the
northeastern states had a rate below two-thirds. The worst northeastern
state is New York, which has a longstanding Regents examination for high-school
graduation but during the 1992-1996 period was also awarding "local"
diplomas (see Appendix 3).
Reform Schools and Private Interests
By the early 1990s, with reform schools entrenched for ten years or more in
several states, a perverse competition began, which might be called Our
Standards Are "Stiffer" Than Yours:
The Social Context

School-based standard testing does not occur in a social vacuum. It has consequences, and the techniques it uses reflect interests and values. Insight and candor about these consequences, interests and values are rare today; they must often be inferred from behaviors. In previous times, the advocates of standard testing were less guarded about their intents.

It has become well known that early promoters of standard aptitude tests were profoundly racist and sexist. Goddard, Terman, Thorndike, Burt, Yerkes and Brigham all believed that these tests identified African-Americans, native Americans, immigrants from southern and eastern Europe, or women as typically less able than white men whose ancestors came from northern and western Europe. (Note 24) Goddard, Terman and Brigham were advocates of the "eugenics" movement, (Note 25) favoring IQ tests followed by sexual restriction of the "feeble-minded." An echo of their attitudes can be heard in the enthusiasms for standard tests sometimes expressed in the U. S. today, reducing access by African-Americans and Hispanic-Americans to universities and professional schools.

Few of the modern promoters of standard tests flaunt prejudices that were once openly displayed. Relative success on these tests by Jews and by the offspring of Asian immigrants has greatly tempered hubris over "Nordic superiority." The myth of measuring innate talent has been exposed. Multifactor studies link high scores on aptitude tests with advantages in family income, language and cultural exposure, motivation, self-confidence and training (see, for example, Goslin, 1963, pp. 137-147, Duncan and Brooks-Gunn, 1997, pp. 132-189, and Brooks-Gunn et al., 1996). Key research on the inheritance of intelligence, once widely cited, has been probed and found to have been scientific fraud (Gould, 1981, pp. 234-239).
After accounting for measurable influences of environment, studies of multiple factors do leave unexplained residues that might be called aptitudes, but they can only be inferred from comparisons across groups. There are no reliable techniques for measuring aptitudes in an individual which are independent of experience, nor has it been shown how many such aptitudes there might be.

Despite exposures of motive and mythology, use of standard testing continues to grow. A century after their origins, school-based standard testing and its scavenger, test preparation, have become industries sustained by powerful institutions and deeply felt personal interests. Their supporters are now often driven by secondary motives that result from widespread testing programs. At least two generations have been able to profit from test-taking success, entering professions and making connections during their college years that might otherwise have been closed to them. They know how to crack the tests; they make sure their children learn; and they can be angered to think that this useful wedge into income and influence might be removed.

Today's standard test enthusiasts range from right-wing extremists to hard-nosed business people to ambitious young professionals to church schools and home schoolers who are looking for validation of their work; in other words, some of our neighbors. Parents who want to keep young children out of the testing game are now beset with legal mandates in many states and with social pressure almost everywhere. Far too few people are asking whether the public schools are really broken and in need of this kind of a fix (see Berliner, 1993, and Berliner and Biddle, 1995). Among the right-wing, there is a Libertarian perspective from which conventional standard tests are an intrinsic evil because they interfere with local control of schools.
Also, it is worth noting that a number of the business enthusiasts for standard testing actually send their own children to private schools where such testing is not emphasized. Berliner and Biddle (1995) have extended such observations into an argument that some testing promoters have a different agenda: using the embarrassment of low test scores in public schools as a weapon to force governments toward corporate schools, which they will operate at a profit.

Much as in the 1920s, its first great decade, school-based standard testing is still sold as a key to discovering talent and measuring ability objectively. When possible its critics are ignored, or they are dismissed as extremists, dreamers or losers. Test development and scoring procedures are wrapped in mystification. "Validation" of tests is widely touted, but it usually means only that people who do well on one test do well on another. Public enlightenment has made progress, but it struggles upstream against a flow of laundry soap, liver pills and snake oil.

What have all the years of more than 100 million school-based standard tests a year (Note 26) brought us? The "one minute" people, perhaps, who judge anything that takes longer as not worth the bother. Try to make life into a rush of standard questions. The idiot-genius computer programmers, fast as lightning. The ones who saddled us with about $200 billion worth of "year 2000" problems, because they didn't think about a slightly bigger picture. The test prep industry, a scrounger that otherwise has no purpose. The product support staff who don't know what to do when they run to the end of their cheat sheets. The cutback from education to test cramming in the states with standard punishment systems. Don't take chances; teach and learn the test.

Remedies

School-based standard testing has seen more than a century of development in the U. S. (see Appendix 7). No quick or simple remedy can cure the many problems it has caused.
Any remedy will require resolute public action. The following priorities are essential:
These are the key weapons of the state punishment systems. The significance and accuracy of standard test scores do not justify these measures. They are viruses that transform schools from education to test cramming. They are all harm and no benefit. If we do not stop the damage being wreaked by these mistaken "school reforms," no other remedies will matter much.
If the catastrophes from "school reform" can be curtailed, we can tackle the worst problems of current school-based standard testing:
The root of these conflicts is the same: choosing speed and price over effectiveness. If we want accurate and meaningful results, we must reverse these priorities. Good tests will not be quick or cheap. A test to measure basic competence in a skill or subject must cover a broad range of what we believe basic competence should mean. A test to measure high levels of skills and knowledge must include open-ended tasks that can be performed with many different strategies. We will need to weigh costs and benefits carefully. Even when they do not corrupt education, meaningful tests will take time and resources that could have been spent otherwise.
The "authentic assessment" and "performance assessment" movements seek to combine educational assessments with the learning process. Classic models are the "course project" and the "term paper." While the intents of these movements are understandable, Kentucky and California experiences in the 1990s suggested that such techniques were not mature enough to provide reliable comparisons among schools or school districts, much less to create promotion or graduation tests (Sanders and Horn, 1995). Moreover, we have no school-based achievement tests at all that have been proven to predict meaningful accomplishments by students in the world beyond the schoolhouse door.
Schools probably test too much, yet at the same time they may fail to use tests when tests can help. A key example is poor and late diagnosis of reading disorders. A great fraction of adult activities require proficient reading; most school activities and standard tests do also. We know that some young students have much more difficulty reading than others, although they may otherwise have strong skills. Schools need to identify reading disorders as early as possible and help to remedy them before they become deeply ingrained.
Limited and conflict-ridden as it is, current standard testing shows systematic deficits for students from low-income and minority households. Better testing will give a better picture of how serious these problems are, but it will not cure them. We need plans and resources to address the problems which are already clearly understood:
We do not understand all the problems. We do not know how to solve all the problems that we do understand. But we know enough to begin. If not now, then when?
Validity and Relevance
School-based aptitude testing is known to have low predictive strength. Studies have shown that it heavily reflects the income and education levels of students' households and that most of what it can predict is associated with social advantages and disadvantages. If tax-supported or tax-exempt schools use scores on intelligence or other aptitude tests to deny opportunities to some students while providing them to others, they violate the public trust.
For school-based achievement testing, we have few studies of predictive strength (as one example, see Allen, 1996, section IV-B, pp. 118-120). In most circumstances, we simply do not know whether these tests measure anything apart from social privilege that is useful outside a school setting. After adjustment for social factors, can their scores accurately predict future success in occupations, creative achievements, earning levels, family stability, civic responsibility or any of the other outcomes we mean to encourage with public education? Are there alternative assessments that can accomplish these goals?
Given the heavy engagement in "school reforms" and the energy spent on their testing programs, it is amazing to see how little attention these matters receive (see related observations by Broadfoot, 1996, pp. 14-15). Academic and foundation-supported scholars specializing in psychometrics have the greatest opportunities to answer these questions, but they have largely ignored them. Journalists, broadcasters, bureaucrats, politicians, educators and their critics, like most of the public, usually assume that a mathematics test, for example, actually measures some genuinely useful knowledge and skill. Who has shown this to be true, and for which tests? Is there actually a strong and consistent relation, for example, between top scores on a particular high school math achievement test and a successful career as a civil engineer? If there is not, then what does that test measure? Is there a strong and consistent relation between acceptable scores on a social studies test and adult voting participation? If there is not, then how is such a test of use?
Unfortunately, it is far from proven that any method of assessment can escape the biases, the other errors, and the low or unknown predictive strengths outside the schools which plague the current tests. We should take this not as a signal of defeat but as an invitation to humility. The complexities of human behavior are immense, and our current approaches measure them poorly. Rather than try to stretch each student onto a Procrustean bed of so-called "achievement," taking pride in lengthening the beam a bit every few years, we need to promote core competence and recognize the diversity of other skills. If standard tests were to have any useful role, it would most likely be as an aid to help ensure that students can exercise skills which have been clearly proven essential for ordinary occupations. Even such a limited objective as this requires both education and test validation well beyond current educational and psychometric practices.
As we question the validity of testing, we may also question the relevance of the education supposedly being tested. Are we using the irreplaceable years of youth to convey significant skills and knowledge, or are we cultivating fetishes and harping on hide-bound answers to yesterday's questions? Somehow, despite decades of claims that our schools are inferior, we in the U. S. have achieved a stronger economy than most other industrial countries. Yet we also have more crime than most of these countries. Is our education responsible for these situations? We have many such issues to address. They present truly difficult questions. None of them will be found on school-based standard tests.
Notes
Comments and suggestions from several reviewers are gratefully acknowledged. Mistakes or omissions remain, of course, the fault of the author.
References
Associated Press (1999, June 3). Blacks nearly four times more likely to be exempt from TAAS than whites. Capitol Times, Austin, TX.
Berliner, D. C. (1993). Educational reform in an era of disinformation. Education Policy Analysis Archives 1(2), available at http://epaa.asu.edu/epaa/v1n2.html.
Berliner, D. C., & Biddle, B. J. (1995). The Manufactured Crisis: Myths, Fraud, and the Attack on America's Public Schools. Reading, MA: Addison-Wesley.
Brigham, C. C. (1923). A Study of American Intelligence. Princeton, NJ: Princeton University Press.
Broadfoot, P. M. (1996). Education, Assessment and Society: A Sociological Analysis. Philadelphia, PA: Open University Press.
Brooks-Gunn, J., et al. (1996). Ethnic differences in children's intelligence test scores. Child Development 67(2), 396-408.
California Department of Education (2000). Academic Performance Index School Rankings, 1999. Sacramento, CA: Department of Education, Delaine Eastin, State Superintendent.
Ceci, S. J. (1991). How much does schooling influence general intelligence and its cognitive components? A reassessment of the evidence. Developmental Psychology 27(5), 703-722.
Census Bureau (1992). Census of 1990. Washington, DC: U. S. Department of Commerce.
Cole, P. G. (1995). The bell curve: Should intelligence be used as the pivotal explanatory concept of student achievement? Issues In Educational Research 5(1), 11-22.
Committee to Develop Standards for Educational and Psychological Testing, Melvin R. Novick, Chair (1985). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
Conant, J. B. (1940, May). Education for a classless society. Atlantic Monthly 165(5), 593-602.
Cremin, L. A. (1962). The Transformation of the School: Progressivism in American Education, 1876-1957. New York: Alfred A. Knopf.
Crouse, J., & Trusheim, D. (1988). The Case Against the SAT. Chicago: University of Chicago Press.
Culbertson, J. (1995). Race, intelligence and ideology. Education Policy Analysis Archives 3(2), available at http://epaa.asu.edu/epaa/v3n2.html.
Daley, B., & Zernike, K. (2000, January 26). State may change MCAS contractor. Boston Globe.
Dewey, J. (1922, December 13). Individuality, equality and superiority. The New Republic 33(419), 61-63.
Duncan, G. J., & Brooks-Gunn, J., Eds. (1997). Consequences of Growing Up Poor. New York: Russell Sage Foundation.
Feuer, M. L., et al., Eds. (1992). Testing in American Schools: Asking the Right Questions (Publication OTA-SET-519). Washington, DC: U. S. Congress, Office of Technology Assessment.
Goddard, H. H. (1914). Feeble-mindedness; Its Causes and Consequences. New York: Macmillan.
Goslin, D. A. (1963). The Search for Ability. New York: Russell Sage Foundation.
Gould, S. J. (1981). The Mismeasure of Man. New York: W. W. Norton and Co.
Haney, W. M. (1999). Supplementary Report on the Texas Assessment of Academic Skills Exit Test (TAAS-X). Boston: Center for the Study of Testing, Evaluation and Educational Policy, Boston College School of Education.
Hayman, R. L., Jr. (1997). The Smart Culture: Society, Intelligence, and Law. New York: New York University Press.
Heubert, J. P., & Hauser, R. M., Eds. (1999). High Stakes Testing for Tracking, Promotion and Graduation. Washington, DC: National Academy Press.
Holloway, M. (1995, January). Flynn's effect. Scientific American 280(1), 37-38.
Hunt, E. (1995). The role of intelligence in modern society. American Scientist 83(4), 356-369.
IDRA Newsletter (1998, January). Intercultural Development Research Association, San Antonio, TX.
Kessel, C., & Linn, M. C. (1996). Grades or scores: Predicting future college mathematics performance. Educational Measurement: Issues and Practice 15(4), 10-14.
Lehigh, S. (1998, June 28). For teachers, criticisms from many quarters. Boston Globe.
Lemann, N. (1995, September). The great sorting. Atlantic Monthly 276(3), 84-100.
Longitudinal Attrition Rates in Texas Public High Schools, 1985-1986 to 1997-1998 (1999). Intercultural Development Research Association, San Antonio, TX.
Massachusetts Department of Education (1999a). Massachusetts Comprehensive Assessment System 1998 Technical Report. Malden, MA: Department of Education, David P. Driscoll, Commissioner.
Massachusetts Department of Education (1999b). Massachusetts Comprehensive Assessment System, Report of 1999 State Results. Malden, MA: Department of Education, David P. Driscoll, Commissioner.
McDonnell, L. M. (1997). The Politics of State Testing: Implementing New Student Assessments (Publication CSE-424). Los Angeles: National Center for Research on Evaluation, Standards and Student Testing, University of California.
McDonnell, L. M., & Weatherford, M. S. (1999). State Standards-Setting and Public Deliberation: The Case of California (Publication CSE-506). Los Angeles: National Center for Research on Evaluation, Standards and Student Testing, University of California.
Merton, R. K. (1957). Social Theory and Social Structures. Glencoe, IL: Free Press.
Nairn, A., & Associates (1980). The Reign of the ETS: The Corporation that Makes Up Minds. Washington, DC: Center for the Study of Responsive Law.
National Center for Education Statistics (1996). Digest of Education Statistics, 1995. Washington, DC: U. S. Department of Education.
National Center for Education Statistics (1997). NAEP 1996 Trends in Academic Progress (Publication NCES 97-985). Washington, DC: U. S. Department of Education.
National Center for Education Statistics (1999). Digest of Education Statistics, 1998. Washington, DC: U. S. Department of Education.
Neisser, U., Ed. (1998). The Rising Curve: Long-Term Gains in IQ and Related Measures. Washington, DC: American Psychological Association.
New York State Education Department (1998). New York State School Report Card for the School Year 1996-1997. Albany, NY: Education Department, Richard P. Mills, Commissioner.
New York State Education Department (1999). New York State School Report Card for the School Year 1997-1998. Albany, NY: Education Department, Richard P. Mills, Commissioner.
Owen, D., & Doerr, M. (1999). None of the Above (Revised ed.). Lanham, MD: Rowman and Littlefield Publishers.
Problems with KIRIS test erode public's support for reforms (1998, February 2). Lexington Herald-Leader, Lexington, KY.
Ravitch, D. (1996, August 28). Defining literacy downward. New York Times.
Regional Profile, Juarez and Chihuahua (1999). Texas Centers for Border Educational Development, El Paso, TX.
Reischauer, E. O., & Fairbank, J. K. (1958). East Asia: The Great Tradition. Boston: Houghton Mifflin.
Rickover, H. G. (1959). Education and Freedom. New York: E. P. Dutton and Co.
Roderick, M., et al. (1999). Rejoinder to Ending Social Promotion: Results from the First Two Years. Chicago: Consortium on Chicago School Research, Designs for Change.
Rogers, T. B. (1995). The Psychological Testing Enterprise. Pacific Grove, CA: Brooks/Cole Publishing Co.
Sacks, P. (1999). Standardized Minds. Cambridge, MA: Perseus Books.
Sanders, W. L., & Horn, S. P. (1995). Educational assessment reassessed: The usefulness of standardized and alternative measures of student achievement as indicators for the assessment of educational outcomes. Education Policy Analysis Archives 3(6), available at http://epaa.asu.edu/epaa/v3n6.html.
Schultz, S. K. (1973). The Culture Factory: Boston Public Schools, 1789-1860. New York: Oxford University Press.
Szechenyi, C. (1998, March 8). Failing grade? Firm with state's assessment contract has troubled past. Middlesex News, Framingham, MA.
TAAS scandal widens (1999, April 9). Lone Star Report, Austin, TX.
Terman, L. M. (1916). The Measurement of Intelligence. Boston: Houghton Mifflin.
Texas Education Agency (1998). 1998 Comprehensive Biennial Report on Texas Public Schools. Austin, TX: Education Agency, Jim Nelson, Commissioner.
Tyack, D. B. (1974). The One Best System. Cambridge, MA: Harvard University Press.
Tyack, D. B., & Cuban, L. (1995). Tinkering toward Utopia: A Century of Public School Reform. Cambridge, MA: Harvard University Press.
About the Author
Craig Bolon
Planwright Systems Corporation, Inc.
Email: cbolon@planwright.com
Craig Bolon is President of Planwright Systems Corp., a
software development firm located in Brookline,
Massachusetts, USA. After several years in high energy
physics research and then in biomedical instrument
development at M.I.T., he has been an industrial software
developer for the past twenty years. He is author of the
textbook Mastering C (Sybex, 1986) and of several technical
publications. He is an elected Town Meeting Member and has
served as member and Chair of the Finance Committee in
Brookline, Massachusetts.
Appendix 1
| Year | Enrollment (millions) | Spending (billions of 1998 dollars) | Spending per student (1998 dollars) |
| 1850 | 3.4 | | |
| 1860 | 4.8 | | |
| 1870 | 6.9 | | |
| 1880 | 9.9 | | |
| 1890 | 12.7 | | |
| 1900 | 15.5 | 4.2 | 270 |
| 1910 | 17.8 | 7.4 | 420 |
| 1920 | 21.6 | 8.4 | 390 |
| 1930 | 25.7 | 22.6 | 880 |
| 1940 | 25.4 | 27.3 | 1070 |
| 1950 | 25.1 | 39.5 | 1570 |
| 1960 | 35.2 | 86.0 | 2440 |
| 1970 | 45.9 | 170.8 | 3720 |
| 1980 | 40.9 | 189.9 | 4650 |
| 1990 | 41.2 | 265.4 | 6440 |
| 2000 | 47.4 | 338.6 | 7140 |
Sources: U. S. Department of Education, Digest of Education Statistics, 1998 (spending not available in this series before 1900); U. S. Census Bureau, Census of 1850 and Census of 1860; U. S. Bureau of Labor Statistics, Consumer Price Index: All Urban Consumers (annual averages, estimated before 1913).
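The per-student column in the Appendix 1 table appears to be the spending column divided by the enrollment column, rounded to the nearest ten dollars. The short Python sketch below checks that reading against the rows from 1900 onward. It is an illustration only: the names `ROWS`, `per_student` and `max_discrepancy` are hypothetical, and the assumption that both spending columns are expressed in 1998 dollars is an inference from the table header and the Consumer Price Index source, not something the sources state directly.

```python
# Rows copied from the Appendix 1 table, 1900 onward:
# (year, enrollment in millions, spending in billions of dollars,
#  published spending per student in dollars)
ROWS = [
    (1900, 15.5, 4.2, 270),
    (1910, 17.8, 7.4, 420),
    (1920, 21.6, 8.4, 390),
    (1930, 25.7, 22.6, 880),
    (1940, 25.4, 27.3, 1070),
    (1950, 25.1, 39.5, 1570),
    (1960, 35.2, 86.0, 2440),
    (1970, 45.9, 170.8, 3720),
    (1980, 40.9, 189.9, 4650),
    (1990, 41.2, 265.4, 6440),
    (2000, 47.4, 338.6, 7140),
]


def per_student(enrollment_millions, spending_billions):
    """Return spending per enrolled student, in dollars."""
    return (spending_billions * 1e9) / (enrollment_millions * 1e6)


def max_discrepancy(rows):
    """Largest absolute gap, in dollars, between the recomputed and
    the published per-student figures."""
    return max(abs(per_student(e, s) - p) for _, e, s, p in rows)


if __name__ == "__main__":
    for year, enrollment, spending, published in ROWS:
        computed = per_student(enrollment, spending)
        print(f"{year}: computed {computed:8.0f}  published {published:5d}")
    print(f"largest gap: {max_discrepancy(ROWS):.1f} dollars")
```

Because the published enrollment and spending figures are themselves rounded to one decimal place, small gaps between the recomputed and published values are expected and do not indicate an error in the table.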

Source: "The population 6 to 17 years old enrolled below modal grade: 1971 to 1998," Current Population Survey Report School Enrollment Social and Economic Characteristics of Students, U. S. Bureau of the Census, Washington, DC, Supplementary Table A-3, October, 1999.
| School Grade | Average Score | Percent Advanced | Percent Proficient | Percent Needs Improvement | Percent Failing |
| 10 | 229 | 4 | 32 | 34 | 30 |
| 8 | 237 | 3 | 52 | 31 | 14 |
| 4 | 230 | 0 | 20 | 66 | 14 |
| School Grade | Average Score | Percent Advanced | Percent Proficient | Percent Needs Improvement | Percent Failing |
| 10 | 222 | 8 | 16 | 23 | 53 |
| 8 | 226 | 7 | 23 | 29 | 41 |
| 4 | 234 | 12 | 23 | 44 | 21 |
| School Grade | Average Score | Percent Advanced | Percent Proficient | Percent Needs Improvement | Percent Failing |
| 10 | 225 | 2 | 21 | 40 | 37 |
| 8 | 224 | 4 | 24 | 29 | 43 |
| 4 | 239 | 8 | 44 | 38 | 10 |
| School Grade | Average Score | Percent Advanced | Percent Proficient | Percent Needs Improvement | Percent Failing |
| 10 | | | | | |
| 8 | 221 | 1 | 10 | 40 | 49 |
| 4 | | | | | |
Source of data: Massachusetts Department of Education, 1999b.
| | |||
| Examination score | 55 or more | 65 or more | 85 or more |
| Comprehensive English | 63% | 56% | 17% |
| Mathematics I | 66% | 59% | 29% |
| Biology | 51% | 44% | 15% |
| US History | 56% | 48% | 15% |
| Global Studies | 57% | 48% | 14% |
| | |||
| Examination score | 55 or more | 65 or more | 85 or more |
| Comprehensive English | 65% | 57% | 15% |
| Mathematics I | 70% | 62% | 33% |
| Biology | 51% | 44% | 16% |
| US History | 60% | 52% | 17% |
| Global Studies | 65% | 56% | 17% |
Source of data: New York State Education Department, 1998 and 1999.

Source: Longitudinal Attrition Rates in Texas Public High Schools, 1985-1986 to 1998-1999, Intercultural Development Research Association, San Antonio, TX, 1999. By permission. Chart prepared by the author. Not shown are data for Asian/Pacific Islander and Native American students. No data were published for 1991 or 1994.

Source: Academic Performance Index School Rankings, 1999, California Department of Education, Sacramento, CA, January, 2000. Chart prepared by the author. Data were grouped into the API ranges shown. Four schools were unrated.
1900 The College Entrance Examination Board is founded at Columbia College in New York.
1905 [Alfred Binet publishes the first intelligence test, to identify slow learners.]
1908 Edward L. Thorndike, a Columbia professor, begins writing a series of standard achievement tests for use in elementary and high schools, completed in 1916.
1916 First publication of the Stanford-Binet IQ test by Houghton Mifflin, developed by Lewis M. Terman, a Stanford professor.
1916 Arthur S. Otis, a student of Terman and later a test editor for the World Book Company, invents the multiple choice format. It is used in the Army Alpha test.
1917 Robert M. Yerkes, a Harvard professor, organizes the Army Alpha and Beta intelligence tests, given to 1.7 million World War I recruits.
1921 The Psychological Corporation is founded in New York by James M. Cattell, Robert S. Woodworth and Edward L. Thorndike.
1923 First publication of the Stanford Achievement Tests by the World Book Company, developed under the direction of Lewis M. Terman.
1925 Carl C. Brigham, a Princeton professor, develops the Scholastic Aptitude Test for the College Entrance Examination Board.
1927 The California Test Bureau is founded in Los Angeles by Ethel M. Clark and Willis W. Clark, a Los Angeles school teacher.
1928 Everett F. Lindquist, a professor at the University of Iowa, begins the Iowa Testing