Education Policy Analysis Archives
Volume 10 Number 24
May 6, 2002
ISSN 1068-2341
Editor: Gene V Glass College of Education Arizona State University
Copyright 2002, the EDUCATION POLICY ANALYSIS ARCHIVES. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.
Lake Woebeguaranteed:
Abstract

Misuse of test results in Massachusetts largely guarantees woes for both students and schools. Analysis of annual test score averages for close to 1000 Massachusetts schools for four years (1998-2001) shows that test score gains in one testing period tend to be followed by losses in the next. School averages are especially volatile in relatively small schools (with fewer than 150 students tested per grade). One of the reasons why scores fluctuate is that the Massachusetts state test has been developed using norm-referenced test construction procedures, so that items which all students tend to answer correctly (or incorrectly) are excluded from operational versions of the test. This article concludes with a summary of other reasons why results from state tests, like that in Massachusetts, ought not be used in isolation to make high-stakes decisions about students or schools.
Lake Wobegon is the mythical town in Minnesota popularized by Garrison Keillor in his National Public Radio program "A Prairie Home Companion." It is the town where "all the children are above average" (and "all the women are strong, and all the men, good-looking"). In the late 1980s, it became apparent that Lake Wobegon had come to schools nationwide. For according to a 1987 report by John Cannell, the vast majority of school districts and all states were scoring above average on nationally normed standardized tests (Cannell, 1987). Since it is logically impossible for all of any population to be above average on a single measure, it was clear that something was amiss, that something about nationally normed standardized tests or their use had been leading to false inferences about the test scores of students in the nation's schools. As a result, people came to refer to inflated test results as the Lake Wobegon phenomenon. I do not try here to recap the story of Cannell's work on the Lake Wobegon phenomenon and how independent researchers came to verify the phenomenon. (The story is recounted in chapter 7 of Haney, Madaus & Lyons, 1993, for anyone interested). Rather, my purpose is to introduce a place considerably east of Lake Wobegon; namely, Lake Woebeguaranteed. In this place, the use of state test results in isolation to make important decisions about schools and students pretty well guarantees woes will follow. For as I will explain, such uses of results from what is essentially a norm-referenced test constitute ill-conceived misuses of test results. Before proceeding to this larger story, I recap how the work reported here evolved. After reading Kane & Staiger (2001), and Bolon (2001), I undertook an analysis of school average scores on the Massachusetts Comprehensive Assessment System (MCAS) grade 4 mathematics tests for 1998, 1999, 2000 and 2001. 
After summarizing these previous works, I describe the sources of data used in the present analysis, the means by which data were merged from different sources, the analyses undertaken, and the results. The latter confirm the findings by Kane & Staiger (2001) and Bolon (2001); namely, that changes in school average test scores from one year to the next are unreliable indicators of school quality. Next I discuss three reasons this is so, and why misuse of results of the Massachusetts state test virtually guarantees woes for schools and students.

Background

The works that prompted the analyses reported here were Kane & Staiger (2001) and Bolon (2001). The first of these works focused on state test results in North Carolina. North Carolina has an extensive system of testing students, not just with state "competency" tests in grades 3-11, but also with norm-referenced tests in grades 5 and 8 (CCSO, 1998, pp. 19, 21, 22, 24). Students must pass state competency tests in reading and math to graduate from high school. Schools in North Carolina are publicly rated in terms of student test results. However, the paper by Kane and Staiger (2001) from the National Bureau of Economic Research (http://www.nber.org/papers/w8156) shows how misleading these ratings tend to be. Kane and Staiger analyzed six years' worth of student assessment data from the entire state of North Carolina (for nearly 300,000 students in grades 3 through 5). They showed that, regardless of whether results were analyzed in terms of annual results or year-to-year changes, the test results are mainly random noise, resulting from the particular samples of students who are in tested grades in particular years and from the vagaries of annual test content and administration, not meaningful indications of school quality. Kane and Staiger concluded with the following four "lessons":
Incentive systems establishing separate thresholds for each racial/ethnic subgroup present a disadvantage to racially integrated schools. In fact, they can generate perverse incentives for districts to segregate their students.

As a tool for identifying best practice or fastest improvement, annual test scores are generally quite unreliable. There are more efficient ways to pool information across schools and across years to identify those schools that are worth emulating.

When evaluating the impact of policies on changes in test scores over time, one must take into account the fluctuations in test scores that are likely to occur naturally. (Kane & Staiger, April 2001).
The second work prompting the analyses reported below is Bolon's "Significance of test-based ratings for metropolitan Boston schools" (2001, in Education Policy Analysis Archives, http://epaa.asu.edu/epaa/v9n42/; see also Michelson, 2002; Willson & Kellow, 2002; and Bolon, 2002, for discussion of the original Bolon article). In this study Bolon examined 1998, 1999, and 2000 MCAS mathematics scores for 47 academic high schools in 32 metropolitan communities in the greater Boston area (vocational high schools were excluded on the grounds that they have a substantially different mission than academic high schools). Bolon found that school average grade 10 MCAS math scores generally changed little over this interval (+1.3 points from 1998 to 1999, and +5.9 points from 1999 to 2000) relative to the range in school average scores (in 1999, for example, school averages ranged from 203 to 254 on the MCAS scale of 200 to 280). Bolon does note, however, that according to data released by the Massachusetts Department of Education, between 1998 and 2000 grade 10 MCAS math scores rose substantially more than English or science scores (see Bolon's Table 1-1). Bolon then examined the extent to which seven school characteristics, plus community income (1989), might be used to predict school average grade 10 MCAS math scores. He found that three variables (percent Asian or Pacific Islander, percent limited English proficiency, and per-capita community income) were the only ones statistically significantly related to school average scores (Table 2-12). Together these three variables accounted for 80% of the variance in school average scores. After excluding schools in Boston (for which separate community income data were not available), Bolon found that "by far the strongest factor in predicting tenth grade MCAS mathematics scores is 'per capita community income (1989).'
For the schools outside the City of Boston, this factor alone performed nearly as well as all available factors combined, associating 84 percent of the variance compared with 88 percent when all available factors were used." The study reported here builds on both of the works just discussed. For example, an analytical approach applied by Kane and Staiger to data from North Carolina, namely comparing school size with changes in annual score averages, is employed here. Additionally, while Bolon examined average MCAS scores for Massachusetts high schools, this inquiry addresses MCAS averages for elementary schools. There are three broad reasons why elementary school average test scores might be more useful indicators of school quality than test averages for high schools. First is the simple fact that there are more elementary schools than high schools. In his study, Bolon analyzed test scores for less than 50 high schools. In contrast, MCAS scores are available for around 1000 elementary schools in Massachusetts. A larger sample offers greater potential to discern meaningful differences in school quality. The second reason for hypothesizing that grade 4 test scores may be better indicators of school quality than grade 10 test scores is the extent of institutional experience that they may reflect. Children typically enter school in Massachusetts in kindergarten. This means that by spring of grade 4, they have almost five years of education in a particular elementary school (presuming, of course, they did not switch schools). In contrast, grade 10 test scores typically reflect just two years' experience in high school. So on this count, grade 4 test score averages clearly have more potential to reflect differences in school quality than grade 10 score averages. 
The third reason for thinking that grade 4 test scores may be better indicators of school quality than grade 10 test scores is that by grade 10 (roughly age 16) individuals' standardized test scores have become relatively fixed, whereas test scores of young children are relatively malleable. This may be illustrated by reference to Benjamin Bloom's classic (1964) work, Stability and Change in Human Characteristics. In this book, Bloom reviewed a wide range of evidence on how a number of human characteristics, including height, weight and test scores, tend to change as people age. He showed, for example, that height in the early childhood years tends to be a moderately good predictor of height at maturity, with correlations between height at ages 6-10 years and height at age 18 falling in the range of 0.75 to 0.85 for both males and females. Interestingly, height at ages 11-13 for females and 13-15 for males is a poorer predictor of height at maturity. This is, of course, due to variation in the ages at which children experience growth spurts as they go through puberty. In contrast to the physical characteristic of height, mental abilities of young children as measured by standardized tests show relatively little power to predict mental abilities at maturity. Not until around grade 3 or 4 (or age 8-9) do children's test scores become relatively reliable predictors of future performance. To provide one example, reading test scores at age 6 (or grade 1) correlate with reading test scores in grade 8 only about 0.65 (Bloom, 1964, p. 98). As Bloom himself put it, "We may conclude from our results on general achievement, reading comprehension and vocabulary development that by age 9 (grade 3) . . . 50% of the general achievement pattern at age 18 (grade 12) has been developed" (Bloom, 1964, p. 105).
The relative malleability of young children's test scores suggests that there may be more potential for grade 4 test scores to be affected by school quality, as compared with high schools' effects on grade 10 test scores. In sum, while Bolon found that school average scores on the Massachusetts grade 10 state test (MCAS) were not sound indicators of school quality, there are several reasons for hypothesizing that school average scores for grade 4 might be better indicators of school quality. To test this possibility, the data and analyses described below were employed.
The data used in this study were drawn from four sources. MCAS results for 1998, 1999 and 2000 were drawn from CD data disks issued by the Massachusetts Department of Education entitled "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of May 1998," "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of May 1999," and "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of Spring 2000." The MCAS results for 2001 were drawn from an Excel file named "MCAS2001pub_g4sch01.xls" downloaded from http://boston.com/mcas/ on November 9, 2001. The files from these four sources contain MCAS results for all schools and districts in Massachusetts for 1998, 1999, 2000 and 2001. From these results, grade 4 MCAS math averages were extracted for all schools in Massachusetts. Math rather than English Language Arts (ELA) test scores were selected for study for two reasons. First, it is reasonably well established that schools have more influence on math test scores than on English (or at least reading) test scores (Haney, Madaus & Lyons, 1993). Second, it is apparent that there have been a number of problems in past years in the scaling of MCAS grade 4 ELA scores. The numbers of records for which MCAS grade 4 average results are available from each of the sources mentioned above are as follows:
The reason for more records in 1999 and 2000 than in 1998 is the creation of a number of new elementary schools (mostly charter schools). The file for 2001 is smaller than those for previous years because it included only school averages, not district averages. Merging records from these four data files proved more difficult than anticipated. Labels for some variables were changed across the years, and names for some schools are reported inconsistently in these four sets of data. Nonetheless, after examining pairs of records for 1998, 1999, 2000 and 2001, I was able to create a merged data file of MCAS grade 4 math results (and numbers of students tested) for 1998-2001. A copy of this data file is appended to this article for anyone interested in secondary analysis (see Appendix). The merged data file of grade 4 MCAS math school averages, after deletion of district averages, contained records for 977 schools. Table 1 shows summary descriptive statistics for this data set. As can be seen, the numbers of fourth graders tested per school in these three years ranged from just 10 to 328. The school average MCAS scores ranged from a low of 206 in 1998 to a high of 263 in 2001. Over the four years of MCAS testing, on average, there were initially slight increases in average MCAS scores (a 1.5-point increase, on average, between 1998 and 1999, and an increase of 0.5 point between 1999 and 2000), but then level scores between 2000 and 2001. Changes in score averages for individual schools from year to year ranged from a low of -22 to a high of +18 points. As Bolon found with regard to grade 10 MCAS scores, these school average changes in grade 4 MCAS scores are considerably smaller than the range in school average scores, which varied by 50 points or more in all four years of test administration.
Figure 1 shows a scatter plot of how 1999 school averages compared with those from 1998. As can be seen, there is a fairly strong relationship between 1998 and 1999 averages. Schools with higher MCAS grade 4 test score averages in 1998 tended to have higher averages in 1999. The correlation between score averages in 1998 and 1999 was 0.860. The regression relationship between score averages in 1999 and 1998 is:

Gd4MCASAvg99 = 44.03 + 0.81(Gd4MCASAvg98)

For statistically inclined readers, it may be noted that these correlation and regression relationships are statistically significant; that is, they are extremely unlikely to have occurred by chance.
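For readers who wish to reproduce statistics of this kind from the appended data file, the following is a minimal sketch of how a correlation and a least-squares regression line can be computed. The function name `pearson_and_ols` and the six pairs of school averages are hypothetical illustrations, not the MCAS data; only the coefficients 44.03 and 0.81 in `predict_1999` come from the text.

```python
from math import sqrt


def pearson_and_ols(x, y):
    """Return (r, intercept, slope) for the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return r, intercept, slope


def predict_1999(avg_1998):
    """Apply the 1998-to-1999 regression equation reported in the text."""
    return 44.03 + 0.81 * avg_1998


# Hypothetical school averages for two years (NOT the MCAS data).
avg_1998 = [220.0, 228.0, 231.0, 236.0, 244.0, 250.0]
avg_1999 = [224.0, 227.0, 235.0, 236.0, 243.0, 247.0]
r, intercept, slope = pearson_and_ols(avg_1998, avg_1999)
print(f"r = {r:.3f}, intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Because the reported slope is less than 1, the equation pulls predictions toward the middle of the scale: a school averaging 250 in 1998 would be predicted at about 246.5 in 1999, and a school averaging 210 at about 214.1.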
Figure 1. Scatter Plot of School Average Grade 4 MCAS Math Scores 1998 vs. 1999
Figure 2 shows the relationship between grade 4 math MCAS score averages in 1999 and 2000. As can be seen, the relationship between year 2000 MCAS grade 4 math averages and those for 1999 is similar to the 1998-1999 relationship, but even slightly stronger. The correlation between score averages in 2000 and 1999 is 0.875. The regression relationship of average scores in 2000 and 1999 is:

Gd4MCASAvg00 = 29.54 + 0.88(Gd4MCASAvg99)
Figure 2. Scatter Plot of School Average Grade 4 MCAS Math Scores 1999 vs. 2000

Gd4MCASAvg01 = 50.8 + 0.78(Gd4MCASAvg00)
Figure 3. Scatter Plot of School Average Grade 4 MCAS Math Scores 2000 vs. 2001

Next let us consider, à la Kane and Staiger, the relationship between school size and change in score averages from one year to the next. For these analyses school size has been calculated simply as the average number of students tested in the two years across which change is calculated. Figure 4 shows the relationship between change in average MCAS grade 4 scores between 1998 and 1999 and school size. As can be seen, schools with fewer than 100 or so students tested show changes in MCAS average scores of as much as 15-20 points. However, schools with more than 150 students tested per year show much smaller changes, generally less than 5 points.

Figure 4. Change in MCAS Grade 4 Math Average Score 1998 to 1999 v. School Size

Figure 5 shows analogous results for 1999 to 2000 score changes. As can be seen, the pattern shown here is similar to that shown in Figure 4. Schools with smaller numbers of students tested tended to have much more "volatility" (to use Kane and Staiger's phrase) in average scores than schools with larger numbers of students tested.
Figure 5. Change in MCAS Grade 4 Math Average Score 1999 to 2000 v. School Size

Figure 6 shows the relationship between school size and change in average grade 4 MCAS scores between 2000 and 2001. The pattern is very similar to that apparent in the previous two figures. Schools with fewer than 100 students tested showed much larger swings in test score averages than schools with larger numbers of students tested.
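The dependence of volatility on school size can be illustrated with a short sketch. It assumes, purely for illustration, that each year's tested cohort is an independent random draw from the same population and that school quality does not change at all; under those assumptions the standard deviation of the year-to-year change in a school mean is the student-level standard deviation times the square root of 2/n. The student-level standard deviation of 15 scaled points is a hypothetical value, not an MCAS statistic, and the function names are likewise hypothetical.

```python
import random
from math import sqrt
from statistics import pstdev


def sd_of_change(student_sd, n_tested):
    """Theoretical SD of the change in a school mean between two years,
    assuming two independent cohorts of n_tested students each."""
    return student_sd * sqrt(2.0 / n_tested)


def simulate_changes(student_sd, n_tested, n_schools, seed=0):
    """Simulate year-to-year changes in school means with NO real change in quality."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n_schools):
        year1 = sum(rng.gauss(0.0, student_sd) for _ in range(n_tested)) / n_tested
        year2 = sum(rng.gauss(0.0, student_sd) for _ in range(n_tested)) / n_tested
        changes.append(year2 - year1)
    return changes


STUDENT_SD = 15.0  # hypothetical student-level SD in scaled-score points

for n in (25, 50, 100, 200):
    theory = sd_of_change(STUDENT_SD, n)
    observed = pstdev(simulate_changes(STUDENT_SD, n, n_schools=2000))
    print(f"n = {n:3d}: theoretical SD of change = {theory:.2f}, simulated = {observed:.2f}")
```

Under these assumptions, quadrupling the number of students tested halves the typical size of a chance swing, which is the funnel shape visible in the plots of change against school size.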
Figure 6. Change in MCAS Grade 4 Math Average Score 2000 to 2001 v. School Size

Given the political prominence of high stakes testing in Massachusetts (as elsewhere), it is not surprising that various observers have tried to use changes in school MCAS scores from one year to the next to identify high quality or "exemplary" schools. For example, in a high profile ceremony at the Massachusetts State House in December 1999, five school principals were presented with gifts of $10,000 each "for helping their students make significant gains on the MCAS" (http://www.doe.mass.edu/news/archive99/Dec99/122299pr.html, accessed November 15, 2001). Though the cash awards were donated by a private foundation, the ceremony recognizing the five schools was attended by the Massachusetts Governor, Lieutenant Governor and Commissioner of Education. The press release for the event stated: "The schools were recognized as having the highest percentage improvement in overall MCAS scores between 1998 and 1999 in English Language Arts, Mathematics and Science and Technology" (http://www.doe.mass.edu/news/archive99/Dec99/122299pr.html).

Anyone with even a modest knowledge of statistics will note the absurdity of this statement. Since the MCAS scale of 200 to 280 is arbitrary and has no meaningful zero point, it is meaningless to calculate percentage increases in scores. This indicates that whoever in the Massachusetts Department of Education wrote this press release is fundamentally ignorant of statistics or, to be less politically incorrect, in need of improvement in knowledge of statistics. For anyone who has not studied statistics lately and hence may not appreciate the absurdity of calculating percentage increases on arbitrary test score scales, I suggest the following exercise. Calculate the percentage increase in temperature going from 50 degrees Fahrenheit to 68 degrees Fahrenheit.
Next, figure out the equivalent temperatures on the Celsius scale and calculate the percentage increase on the Celsius scale. Finally, ask yourself which "percentage increase" is correct. Four of the five schools receiving the so-called Edgerly awards in 1999 were elementary schools, namely, Riverside Elementary School in Danvers, Franklin D. Roosevelt Elementary School in Boston, Abraham Lincoln Elementary School in Revere, and Kensington Elementary School in Springfield. Figure 7 is a recasting of Figure 4, but with these four 1999 Edgerly award schools shown with circles.
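The temperature exercise suggested above can be worked in a few lines. The sketch below also applies the same arithmetic to a hypothetical school whose average rises from 220 to 231, first on the 200-to-280 MCAS scale and then on the same scale shifted to start at zero; the school and its scores are invented for illustration only.

```python
def percent_increase(start, end):
    """Percentage increase from start to end, relative to start."""
    return (end - start) / start * 100.0


def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


# The same physical change in temperature, expressed on two scales.
pct_fahrenheit = percent_increase(50.0, 68.0)              # about 36 percent
pct_celsius = percent_increase(fahrenheit_to_celsius(50.0),
                               fahrenheit_to_celsius(68.0))  # 10 C to 20 C: 100 percent

# A hypothetical 11-point gain, on the 200-280 scale and on a shifted 0-80 scale.
pct_mcas_scale = percent_increase(220.0, 231.0)                    # about 5 percent
pct_shifted_scale = percent_increase(220.0 - 200.0, 231.0 - 200.0)  # about 55 percent

print(pct_fahrenheit, pct_celsius, pct_mcas_scale, pct_shifted_scale)
```

Neither "percentage increase" is correct in either pair: the answer depends entirely on where the arbitrary zero of the scale happens to sit.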
Figure 7. Change in MCAS Grade 4 Math Average Score 1998 to 1999 v. School Size, with Award Schools Marked

As can be seen, the four award schools share two characteristics. First, they are all relatively small schools, each with fewer than 100 students tested. Second, they showed unusually large score changes from 1998 to 1999. This is not surprising, since large MCAS score gains from 1998 to 1999 served as the basis for their receiving awards. Figure 8 recasts Figure 5, again with the 1999 "award" schools marked with circles.
Figure 8. Change in MCAS Grade 4 Math Average Scores 1999 to 2000 v. School Size, with 1999 Award Schools Marked

As can be seen in Figure 8, three out of the four 1999 award schools showed declines in average grade 4 MCAS math scores from 1999 to 2000. Figure 9 is a variant of Figure 5, showing the relationship between average numbers of students tested in 1999 and 2000 versus the change in average grade 4 math scores between 1999 and 2000. In Figure 9, all of the schools showing a 10 or more point gain in average MCAS scores are marked with circles.
Figure 9. Change in MCAS Grade 4 Math Average Score 1999 to 2000 v. School Size, with Schools showing Gain of 10 Points or More Highlighted with Circles

What happened to these schools the next year? Figure 10 shows change from 2000 to 2001, but with the schools having the largest gains from 1999 to 2000 again marked with circles. As can be seen, a few of the schools showing the largest gains from 1999 to 2000 continued to show gains in 2001. But most of the large-gain schools from 1999 to 2000 showed declines in 2001. Several of them showed declines from 2000 to 2001 that were just about as large (9-10 points) as were the gains from 1999 to 2000.
Figure 10. Change in MCAS Grade 4 Math Average Score 2000 to 2001 v. School Size with Schools showing Gain of 10 Points or More '98 to '99 Highlighted with Circles

Note that almost all of these schools showing large gains in average scores one year, but then large declines the next year, are ones with relatively small numbers of students tested. The relationship between changes in average scores across pairs of years can be seen more clearly in Figures 11 and 12. Figure 11 shows how change in school average grade 4 MCAS scores between 1998 and 1999 compares with the change between 1999 and 2000. Figure 12 shows how the change between 1999 and 2000 compares with the change between 2000 and 2001. As can be seen, there is a negative relationship between change in one interval and change in the next. Schools that show large gains in one interval tend to show losses in the next interval. The correlation between change from 1998 to 1999 and change from 1999 to 2000 is -0.388. The correlation for the next pair of years, that is, change from 1999 to 2000 versus change from 2000 to 2001, is -0.396. These negative correlations are both statistically significant.
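A negative correlation between successive changes is what one would expect if much of the year-to-year movement were noise. As a rough sketch, suppose each school has a fixed underlying level and each year's average adds independent random noise. Then the two successive changes share the middle year's noise with opposite signs, and their expected correlation is -0.5 regardless of the noise variance. The simulation below illustrates this under those stated assumptions; the school level of 230, the spread of 9 points across schools, and the noise standard deviation of 3 points are hypothetical values, not estimates from the MCAS data.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def simulate_successive_changes(n_schools, quality_sd, noise_sd, seed=0):
    """Return (change year1->year2, change year2->year3) for schools whose
    underlying level never changes; only independent yearly noise varies."""
    rng = random.Random(seed)
    first_change, second_change = [], []
    for _ in range(n_schools):
        level = rng.gauss(230.0, quality_sd)  # hypothetical stable school level
        y1, y2, y3 = (level + rng.gauss(0.0, noise_sd) for _ in range(3))
        first_change.append(y2 - y1)
        second_change.append(y3 - y2)
    return first_change, second_change


first, second = simulate_successive_changes(5000, quality_sd=9.0, noise_sd=3.0)
r_changes = pearson(first, second)
print(f"correlation between successive changes: {r_changes:.3f}")
```

The observed correlations of -0.388 and -0.396 are somewhat less negative than the -0.5 of this pure-noise sketch, which would be consistent with year-to-year changes that are mostly, though not entirely, transient.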
Figure 11. Change in MCAS Grade 4 Math Average Score 1998 to 1999 v. Change 1999 to 2000, with Schools showing Gain of 10 Points or More Highlighted with Circles
Figure 12. Change in MCAS Grade 4 Math Average Score 1999 to 2000 v. Change 2000 to 2001, with Schools showing Gain of 10 Points or More '98 to '99 Highlighted with Circles
These results are simply a manifestation of the kind of volatility that Kane and Staiger (2001) found in school average test scores in other states. What they found for North Carolina, we have seen above with MCAS scores: school average test scores are particularly volatile for relatively small schools. Moreover, schools that show relatively large gains in score averages from one year to the next tend to show losses the following year. Thus, it is clear that school average test scores, or changes in averages from one year to the next, represent poor measures of school quality.

Why are MCAS score averages poor indicators of school quality? School average test results fluctuate from year to year for several reasons. The most obvious is that one year's class of students will differ from the next. Especially in relatively small schools, with fewer than 100 students tested per grade, having a few especially test-savvy, or not so savvy, students may skew results from one year to the next.

A second likely cause of volatility in school average scores on the Massachusetts test is that the MCAS is of dubious technical merit. When I first examined the 1998 grade 4 English Language Arts (ELA) test, for example, I was surprised to find many poorly worded questions, as well as reading questions that could be answered without actually reading the passage on which they were ostensibly based (that is, the questions lacked passage dependency). More recently the Massachusetts DOE implicitly acknowledged defects in the 2001 grade 10 ELA and math exams when it dropped one item from each from scoring (http://www.doe.mass.edu/MCAS/01results/threshscore.html). The defective items on the 2001 test were discovered not by the test's developer or state officials but by students (Lindsay, 2001). In addition, Gallagher (2001) undertook a review of grade 10 MCAS math questions from the 2000 and 2001 test administrations.
Gallagher, a professor of Environmental, Coastal and Ocean Sciences at the University of Massachusetts, Boston, concluded that there were serious problems with 10 to 15% of the grade 10 MCAS math questions. He identified some questions as having wrong answers, some as having more than one correct answer and some as misaligned with the Massachusetts curriculum frameworks. "Overall, my review of these tests indicates that there are serious failures in the choice and review of MCAS questions" (Gallagher, 2001, p. 5, http://www.es.umb.edu/edg/MCAS/mcasproblems.pdf, accessed December 3, 2001).

More generally, the MCAS is not a good indicator of school quality because it has been constructed as a norm-referenced test. Many people assume that the MCAS (and other state-sponsored tests) are criterion-referenced tests, that is, tests of well-specified bodies of knowledge and skills. In a paper prepared for an October 2001 conference at the John F. Kennedy Institute at Harvard University, for example, Kurtz wrote:
However, examination of the technical manuals for the MCAS tests reveals that items have been selected for inclusion on MCAS tests by using norm-referenced test construction procedures. Specifically, items are selected for inclusion on the MCAS in terms of item difficulty and discrimination. To explain the implications of this, that is, why the use of norm-referenced test construction procedures means that the MCAS is not a "criterion-referenced exam," "testing knowledge of a set curriculum and giv[ing] students scores based on their level of mastery," let me discuss at some length the original 1998 Technical Report on the MCAS (Massachusetts Department of Education, October 1999; hereafter, MDOE, 1999).

The 1998 Technical Report on the MCAS is a fairly long report, with 15 chapters and several appendices. I suspect there are not exactly legions of people who have waded through the whole document. So for those who may not have, let me try to summarize. Before doing so, I note that the 1998 Technical Report on the MCAS is one of an odd genre of bureaucratic documents that always arouses suspicion, for it bears not a single person's name as author or responsible authority.

The 1998 Technical Report begins with a "Background and Overview" chapter. Chapter 2 then provides an overview of test design. The first page of this chapter recounts that "The [Massachusetts] Department of Education convened committees of educators from around the state to work with the Department and its testing contractor to design and develop assessments of the learning standards contained in the [Massachusetts] curriculum frameworks" (MDOE, 1999, p. 9).
In this chapter it is explained that the MCAS tests were designed to have three different types of items: multiple choice items, short answer items requiring responses from a few words to a few sentences (scored 0 or 1 as incorrect or correct) and open or extended response items requiring responses of up to a half page long (and scored on a 0-4 point scale). Some 20 pages later (in chapter 6) it is explained that after pilot items were developed, reviewed and tried out, they were screened in terms of a number of statistical characteristics, including item difficulty and discrimination. At this point the 1998 Technical Report does not make clear exactly what statistical criteria were used in selecting items. But another 50 or so pages later, in chapter 13, "Item analyses," it becomes considerably clearer how items were selected in terms of difficulty and discrimination. Before offering my own interpretation, let me quote at some length from this chapter.

Difficulty Indices

All multiple-choice, short-answer, and open-response questions were evaluated in terms of difficulty and relationship to overall score according to standard classical test theory practice. Difficulty was measured by averaging the proportion of points received across all students who received the question. Multiple-choice and short-answer questions were scored dichotomously (correct v. incorrect), so for these questions, the difficulty index is simply the proportion of students who correctly answered the question. Open-response questions allowed for scores between 0 and 4. By computing the difficulty index as the average proportion of points received, the indices for multiple-choice, short-answer, and open-response questions are placed on a similar scale; the index ranges from 0 to 1 regardless of the question type.
Although this index is traditionally described as a measure of difficulty (as it is described here), it is properly interpreted as an "easiness index" because larger values indicate easier questions. An index of 0 indicates that no student received credit for the question, and an index of 1 indicates that every student received full credit for the question.

Item-test Correlations

Within classical test theory, these relationships are assessed using correlation coefficients that are typically described as either item-test correlations or, more commonly, discrimination indices. The discrimination index used to analyze MCAS multiple-choice items and short-answer items, which are scored 0 or 1, was the point-biserial correlation between item score and a criterion total score on the test. For open-response items, item discrimination indices were based on the Pearson product-moment correlation. The theoretical range of these statistics is from -1 to 1, with a typical range from .3 to .6. Discrimination indices can be thought of as measures of how closely a question assesses the same knowledge and skills assessed by other questions contributing to the criterion total score. That is, the discrimination index can be interpreted as a measure of construct consistency. In light of this interpretation, the selection of an appropriate criterion total score is crucial to the interpretation of the discrimination index. For MCAS, appropriate criterion scores were selected based on item type and function (common or matrix). The selected criterion scores are provided in Table 13-1. For example, the criterion score for common open-response and short-answer items was the total score on all common multiple-choice, open-response, and short-answer items. (MDOE, 1999, p. 78).

The very next page of the 1998 Technical Report presents a summary table of the average difficulty and discrimination of different question types for each subject and grade tested on the 1998 MCAS.
This table (Table 13-2 from MDOE, 1999, p. 79) is reproduced in Table 2 below. What these results reveal is that almost all items selected for inclusion on the MCAS tests were ones which showed item difficulties in the range of 0.35 to 0.65, meaning that between 35% and 65% of test-takers answered these items correctly. Similarly, almost all items selected for inclusion on the MCAS tests showed item discriminations in the range of 0.30 to 0.60.
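To make the two statistics quoted above concrete, here is a minimal sketch in Python of how a difficulty ("easiness") index and an item-test correlation could be computed. The function names and the six-student data set are invented for illustration and are not taken from the MCAS technical reports. The point-biserial correlation for a 0/1 item is computed as an ordinary Pearson correlation, to which it is algebraically equivalent.

```python
from math import sqrt


def difficulty_index(item_scores, max_points):
    """Average proportion of available points earned on one item (0 to 1).

    Larger values mean an easier item, so this is really an easiness index.
    """
    return sum(item_scores) / (max_points * len(item_scores))


def item_test_correlation(item_scores, criterion_scores):
    """Pearson correlation between item scores and a criterion total score.

    For items scored 0 or 1 this equals the point-biserial correlation.
    """
    n = len(item_scores)
    mean_item = sum(item_scores) / n
    mean_crit = sum(criterion_scores) / n
    cov = sum((i - mean_item) * (c - mean_crit)
              for i, c in zip(item_scores, criterion_scores))
    var_item = sum((i - mean_item) ** 2 for i in item_scores)
    var_crit = sum((c - mean_crit) ** 2 for c in criterion_scores)
    if var_item == 0 or var_crit == 0:
        # An item that everyone gets right (or wrong) has no variance,
        # so its correlation with anything is undefined.
        raise ValueError("correlation undefined when scores do not vary")
    return cov / sqrt(var_item * var_crit)


# Hypothetical data for six students (invented for this sketch).
multiple_choice = [1, 1, 0, 1, 0, 0]      # scored 0 or 1
open_response = [4, 3, 1, 2, 0, 2]        # scored 0 to 4
criterion_total = [30, 27, 12, 22, 9, 15]  # criterion total score

print(difficulty_index(multiple_choice, 1))   # 0.5
print(difficulty_index(open_response, 4))     # 0.5
print(item_test_correlation(multiple_choice, criterion_total))
print(item_test_correlation(open_response, criterion_total))
```

Note that an item answered correctly by every student raises an error in this sketch: with no variation in item scores, no discrimination index can be computed at all.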
These results make it clear that the MCAS has been developed using norm-referenced test construction procedures. Items were selected for inclusion on the MCAS in terms of item difficulty and discrimination. This means that items which all students tended to answer correctly when the items were pilot tested are excluded from operational versions of the MCAS tests. Why? Because items that all students answer correctly (or all answer incorrectly) have no power to discriminate among test takers. Standard textbooks on testing point out that when 90% of test takers answer an item correctly, the maximum index of discrimination it may have is 0.20 (Anastasi, 1982, p. 208). As a result, items that show passing rates of 90% or greater (or discrimination indices less than 0.20) were systematically excluded from the common pool of operational MCAS items from which students' scores are derived. Appendix B to the 1998 Technical Report presents details of item statistics for almost all MCAS items administered in 1998 (for reasons not explained, item statistics for the 1998 grade 4 math test are not included). The Appendix B data tables show that there were at least nine MCAS matrix items that showed difficulties of 0.90 or more (meaning that at least 90% of students taking the items answered them correctly). The 1999 MCAS Technical Report contains no analogous appendix showing details for item statistics for the operational MCAS tests for 1999. Hence we cannot be sure whether all of the easy items pilot tested in 1998 (that is, ones which were answered correctly by 90% or more of students who took them in 1998 on a pilot basis) were excluded from the operational MCAS tests for 1999. Nonetheless, one direct comparison of results from the 1998 and 1999 technical reports clearly shows that pilot items answered correctly by large proportions of students in 1998 tended to be excluded from the 1999 operational tests.
Table 3 shows the average discrimination of 1998 matrix multiple choice items and 1999 common multiple choice items for the MCAS tests administered at grades 4, 8 and 10 in those years. As can be seen, the average discrimination for the 1999 common (operational) tests is consistently higher than the average discrimination for the 1998 matrix items, which consisted mainly of items being pilot tested for future operational use. This contrast shows that developers of the 1999 MCAS tests clearly tended to select items with higher rather than lower discrimination. And this means that they would have systematically discarded items answered correctly by large proportions of students during the 1998 pilot test of matrix items (recall that when 90% of test takers answer an item correctly, the maximum index of discrimination it may have is 0.20).
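The ceiling of 0.20 cited from Anastasi can be checked with a little arithmetic. The sketch below assumes that the index in question is the simple difference between the proportions passing an item in the upper-scoring and lower-scoring halves of test takers; that reading is an assumption made here for illustration, and the function name is invented. The index is largest when as many of the passes as possible fall in the upper half.

```python
def max_upper_lower_discrimination(p):
    """Largest possible (upper half minus lower half) difference in the
    proportion passing an item, given overall proportion passing p.

    Assumes two equal-sized halves of test takers.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    # Put as many passes as possible into the upper half (at most 100%).
    upper = min(1.0, 2.0 * p)
    # Whatever passes remain must fall in the lower half, because the
    # two halves have to average out to p.
    lower = 2.0 * p - upper
    return upper - lower


for p in (0.10, 0.35, 0.50, 0.65, 0.90, 1.00):
    print(f"proportion passing {p:.2f}: "
          f"maximum index {max_upper_lower_discrimination(p):.2f}")
```

Under this assumption an item passed by 90% of students cannot exceed an index of about 0.20, an item passed by half of students can reach 1.00, and an item passed by everyone cannot discriminate at all.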
Selecting items in terms of item difficulty and discrimination is standard practice for norm-referenced tests of aptitude, ability and achievement. The Scholastic Aptitude Test (now called the SAT I), for example, was constructed with item specifications calling for most items to have biserial correlations of item discrimination in the range of 0.30 to 0.70 (Donlon 1984, p. 48). Here is how one authority on standardized testing, Anne Anastasi, described the rationale for selecting items in terms of difficulty:
Anastasi goes on to point out that for different kinds of testing, that is other than norm-referenced standardized testing, different kinds of test selection strategies would be appropriate:
Lake Woebeguaranteed

High stakes testing is politically popular in Massachusetts as in other states. By high stakes testing, I refer to the use of standardized test results in isolation to make decisions about students or schools. Such use is contrary to professional standards regarding test use. (See, for example, the statement of the American Educational Research Association available at http://www.aera.net/about/policy/stakes.htm.) Regarding use of test results to make decisions about individuals, decades of research regarding college admissions testing show that it is far more sound (more valid and with smaller adverse impact on minorities and females) to make decisions flexibly using test scores, grades and other information rather than to make decisions mechanically based on test scores alone (Linn, 1982; Willingham, Lewis, Morgan & Ramist, 1990; Haney, 1993). In the analyses presented above I have shown the folly of using annual MCAS test results in isolation to rate schools. Using data from MCAS testing in Massachusetts, we have seen the "volatility" of annual school average MCAS scores. Next I discussed three broad reasons why MCAS score averages are such poor indicators of school quality; namely, because groups of students tested can vary from one year to the next; because the MCAS tests are of dubious technical quality; and because the MCAS tests have been constructed using norm-referenced test construction techniques whereby items are selected for inclusion in terms of item difficulty and discrimination. It is the latter practice that brings us to Lake Woebeguaranteed. When test items are selected in terms of how well they discriminate among individual test takers, this means that the test results will tend to have little power to differentiate among schools (Madaus, Airasian & Kellaghan, 1980).
And when items are systematically excluded from operational versions of the tests when more than 70% of pilot test students can answer them correctly, this flat out guarantees continuing failure on the tests. It also may help to explain Bolon's finding that grade 10 MCAS math school average scores are so strongly correlated with community income. The MCAS math and English language arts tests have been constructed using the sort of techniques used in building tests of verbal and quantitative aptitude. It has long been established that such aptitude test results are consistently associated with family socio-economic status (see Donlon 1984, p. 183 for just one piece of evidence on this point). In closing, let me mention two additional reasons why the use of results from tests like the MCAS to make high stakes decisions about schools and students is fundamentally ill-conceived. Recent research by Russell and colleagues (Russell & Haney, 2000; Russell & Plati, 2001) has shown that "low-tech" tests (that is, paper-and-pencil tests in which students have to write longhand) in general, and the MCAS in particular, seriously underestimate the skills of students accustomed to working on computers. Finally, if nothing else, the recent exposé in the New York Times of widespread errors in test scoring and reporting in the testing industry (Henriques & Steinberg, 2001; Steinberg & Henriques, 2001) should make clear how unwise it is to make important decisions based on test scores in isolation. In just the last few years, virtually every major test developer has been found to have committed a major blunder, as a result of which, for example, students were wrongly forced to attend summer school, students were mistakenly denied high school diplomas and schools were incorrectly sanctioned for performance.
1 I would like to thank Anne Wheelock, Damian Bebell, Ron Nuttall, Craig Bolon and five members of the EPAA editorial board for comments on a previous version of this article. Nonetheless, as always should be the case, responsibility for the content and conclusions of the article is solely that of the author.

2 Several reviewers of a previous version of this article suggested that I ought to make clear that the issues discussed herein are related to longstanding statistical issues relating to the measurement of change, regression to the mean, and sampling theory. For such suggestions, I am grateful, even though I decided not to go into these more general issues in this article.
Anastasi, A. (1982). Psychological testing (5th Edn.). New York: Macmillan.

Bloom, B. (1964). Stability and change in human characteristics. New York: John Wiley & Sons.

Bolon, C. (2001, October 16). Significance of Test-based Ratings for Metropolitan Boston Schools. Education Policy Analysis Archives, 9(42). Retrieved October 18, 2001 from http://epaa.asu.edu/epaa/v9n42/.

Bolon, C. (2002, January 28). Response to Michelson and to Willson and Kellow. Education Policy Analysis Archives, 10(10). Retrieved February 4, 2002 from http://epaa.asu.edu/epaa/v10n10/.

Cannell, J. J. (1987). Nationally normed achievement testing in America's public schools: How all 50 states are above the national average. Daniels, WV: Friends for Education.

Council of Chief State School Officers (December 1998). Key State Education Policies on K-12 Education: Standards, Graduation, Assessment, Teacher Licensure, Time and Attendance. Washington, DC: CCSSO, State Education Assessment Center.

Donlon, T. (Ed.) (1984). The College Board Technical handbook for the Scholastic Aptitude Test and achievement tests. New York: College Entrance Examination Board.

Gallagher, E. (2001). An analysis of problems on two 10th grade MCAS math tests. http://www.es.umb.edu/edg/MCAS/mcasproblems.pdf, accessed December 3, 2001.

Haney, W., Madaus, G. & Lyons, R. (1993). The Fractured Marketplace for standardized testing. Boston: Kluwer.

Henriques, D. & Steinberg, J. (May 20, 2001). Right answer, wrong score: Test flaws take toll. New York Times, p. 1. http://www.nytimes.com/2001/05/20/business/20EXAM.html.

Kane, T. & Staiger, D. (April 2001). Volatility in school test scores: Implications for test-based accountability systems. Working paper of the National Bureau of Economic Research. (http://www.nber.org/papers/w8156).

Kurtz, M. (September, 2001). The MCAS: High stakes testing in Massachusetts: An overview of the debate. Draft. The Rappaport Institute for Greater Boston, The John F. Kennedy School of Government, Harvard University.

Lindsay, J. (2001). Students find mistake in math portion of MCAS test. The Associated Press, 5/22/01 6:51 PM. (http://www.masslive.com/newsflash/index.ssf?/cgi-free/getstory_ssf.cgi?g0102_BC_MA--MCASMistake&&news&newsflash-massachusetts, accessed 6/10/2001).

Madaus, G., Airasian, P. & Kellaghan, T. (1980). School effectiveness: A reassessment of the evidence. New York: McGraw-Hill.

Massachusetts Department of Education (October, 1999). Massachusetts Comprehensive Assessment System 1998 Technical Report. (MCAS 98tech_report_full.pdf, downloaded February 24, 2000).

Massachusetts Department of Education (November, 2000). Massachusetts Comprehensive Assessment System 1999 Technical Report. (MCAS 99tech_report_full.pdf, downloaded September 13, 2001).

Michelson, S. (2002, January 28). Reactions to Bolon's "Significance of Test-based Ratings for Metropolitan Boston Schools". Education Policy Analysis Archives, 10(8). Retrieved February 4, 2002 from http://epaa.asu.edu/epaa/v10n8/.

Russell, M. & Haney, W. (2000). Bridging the gap between technology and testing. Education Policy Analysis Archives, 8(41). Retrieved May 3, 2002 from http://epaa.asu.edu/epaa/v8n41/.

Russell, M. & Plati, T. (2001). Effects of Computer Versus Paper Administration of a State-Mandated Writing Assessment. Teachers College Record On-line. Available at http://www.tcrecord.org/Content.asp?ContentID=10709.

Steinberg, J. & Henriques, D. (May 21, 2001). When a test fails the schools, careers and reputations suffer. New York Times, p. 1. http://www.nytimes.com/2001/05/21/business/21EXAM.html.

Willson, V. L. & Kellow, T. (2002, January 28). Confusing the messenger with the message: A response to Bolon. Education Policy Analysis Archives, 10(9). Retrieved February 4, 2002 from http://epaa.asu.edu/epaa/v10n9/.
Walt Haney 617-552-4199 Walt Haney, Ed.D., Professor of Education at Boston College and Senior Research Associate in the Center for the Study of Testing Evaluation and Educational Policy (CSTEEP), specializes in educational evaluation and assessment and educational technology. He has published widely on testing and assessment issues in scholarly journals such as the Harvard Educational Review, Review of Educational Research, and Review of Research in Education and in wide-audience periodicals such as Educational Leadership, Phi Delta Kappan, the Chronicle of Higher Education and the "Washington Post." He has served on the editorial boards of Educational Measurement: Issues and Practice and the American Educational Research Journal, the Journal of Technology Learning and Assessment and on the National Advisory Committee of the ERIC Clearinghouse on Assessment and Evaluation. Among other publications he has authored or co-authored a number of previous publications in Education Policy Analysis Archives (http://epaa.asu.edu/epaa/v5n3.html, http://epaa.asu.edu/epaa/v7n4/, http://epaa.asu.edu/epaa/v8n19.html, and http://epaa.asu.edu/epaa/v8n41/)
The MCAS data used in this study were drawn from three CD data disks issued by the Massachusetts Department of Education entitled "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of May 1998", "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of May 1999" and "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of Spring 2000" and an Excel file named "MCAS 2001 ctable.xls" downloaded from http://boston.com/mcas/ on November 9, 2001. For some odd reason, this file was not available on the Massachusetts Department of Education web site when the 2001 MCAS results were publicly released on November 8. The primary variables used in merging the four sets of grade 4 MCAS math results were district and school code numbers. However, there was a small number of cases in which ID codes (or school names) were not the same, but data from the four years were still merged to represent a single school "case." Reasons for this were as follows. In Beverly, the Mckay school has a school ID number of 35 in the 1998 data set, but an ID number of 37 in the 1999, 2000 and 2001 data sets. Similarly, two schools in Easton (the Fl Olmstead and HH Richardson schools) have different school ID code numbers in the 1998 data set than in the data sets for subsequent years. In general, data were merged based on school ID code numbers and names, though across the four data sets, there were numerous variants in school names. In several instances, cases which had both different names and school IDs across the four data sets were treated as a single school. These were four instances in which there was only a single school in a town, and that school had roughly the same number of students tested across the four years of MCAS results. In the 1998 data set, the town of Chesterfield was reported to have one school, named Center School, in which 12 students were tested.
In the 1999, 2000 and 2001 data sets, the town of Chesterfield was also reported to have one school, but one named "NEW HINGHAM REGIONAL ELEM," with 14, 19, and 19 students reported as tested in the latter three years. In cases such as this, that is a town with only a single school with roughly the same number of students tested across the four years of MCAS results, data across the four years were considered to represent a single school case. The one school in Holliston with grade 4 students tested was treated as a single case even though the name of the school was "Flagg Adams Middle" in 1998 and 1999 but "MILLER SCHOOL" in 2000 and 2001. In Dover, the name of the one school with grade 4 students tested was "CARYL SCHOOL" in 1998, 1999 and 2000, but "CHICKERING" in 2001. The one school in Medford was named "GREEN MEADOW SCHOOL" in 1998-2000, but "FOWLER MIDDLE" school in 2001. Also, two schools in Malden were treated as single cases because their records across the four years of data contained identical school names, even though the school code numbers varied by one digit. Similarly, the "KINGSTON ELEMENTARY" school was treated as a single case even though the ID code listed for it changed from 5 in 1998, 1999 and 2000 to 20 in 2001. Because of such issues, in the data set accompanying this article, I have included the original district and school codes for all four years of MCAS results, plus the original variable labels for all four years of MCAS results. This will allow anyone undertaking secondary analysis to make their own decisions about the cases described above.
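For anyone undertaking such secondary analysis, the merging rule described above (match on district and school code, with hand-made exceptions where a code changed) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the data layout, the district code 30 and all score values are hypothetical, and only the change of the Mckay school's ID from 35 to 37 comes from the description above.

```python
def merge_school_years(yearly_records, aliases=None):
    """Combine per-year school results into one record per school.

    yearly_records: {year: {(district_code, school_code): value}}
    aliases: {(year, district_code, school_code): (district_code, school_code)}
             hand-made exceptions mapping a year's code to the canonical one.
    Returns {(district_code, school_code): {year: value}}.
    """
    aliases = aliases or {}
    merged = {}
    for year in sorted(yearly_records):
        for (district, school), value in yearly_records[year].items():
            # Use the canonical code if this year's code is a known variant.
            key = aliases.get((year, district, school), (district, school))
            merged.setdefault(key, {})[year] = value
    return merged


# Hypothetical district code (30) and invented scores, for illustration only.
yearly = {
    1998: {(30, 35): 231.0, (30, 10): 228.0},
    1999: {(30, 37): 233.0, (30, 10): 229.0},
}
# The school coded 35 in 1998 is treated as the school coded 37 afterwards.
known_changes = {(1998, 30, 35): (30, 37)}

merged = merge_school_years(yearly, known_changes)
for school_key, by_year in sorted(merged.items()):
    print(school_key, by_year)
```

Keeping the exceptions in a separate table, rather than editing the source data, leaves the original codes intact so that each merging decision can be reviewed or reversed.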
Copyright 2002 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, glass@asu.edu or reach him at College of Education, Arizona State University, Tempe, AZ 85287-2411. The Commentary Editor is Casey D. Cobb: casey.cobb@unh.edu.

EPAA Editorial Board