Lake Woebeguaranteed: Misuse of Test Scores in Massachusetts, Part I

Misuse of test results in Massachusetts largely guarantees woes for both students and schools. Analysis of annual test score averages for close to 1000 Massachusetts schools for four years (1998–2001) shows that test score gains in one testing period tend to be followed by losses in the next. School averages are especially volatile in relatively small schools (with fewer than 150 students tested per grade). One of the reasons why scores fluctuate is that the Massachusetts state test has been developed using norm-referenced test construction procedures so that items which all students tend to answer correctly (or incorrectly) are excluded from operational versions of the test. This article concludes with a summary of other reasons why results from state tests, like that in Massachusetts, ought not be used in isolation to make high-stakes decisions about students or schools.

Lake Wobegon is the mythical town in Minnesota popularized by Garrison Keillor in his National Public Radio program "A Prairie Home Companion." It is the town where "all the children are above average" (and "all the women are strong, and all the men, good-looking"). In the late 1980s, it became apparent that Lake Wobegon had come to schools nationwide. For according to a 1987 report by John Cannell, the vast majority of school districts and all states were scoring above average on nationally normed standardized tests (Cannell, 1987). Since it is logically impossible for all of any population to be above average on a single measure, it was clear that something was amiss, that something about nationally normed standardized tests or their use had been leading to false inferences about the test scores of students in the nation's schools. As a result, people came to refer to inflated test results as the Lake Wobegon phenomenon. I do not try here to recap the story of Cannell's work on the Lake Wobegon phenomenon and how independent researchers came to verify the phenomenon. (For anyone interested, the story is recounted in chapter 7 of Haney, Madaus & Lyons, 1993.)
Rather, my purpose is to introduce a place considerably east of Lake Wobegon; namely, Lake Woebeguaranteed. In this place, the use of state test results in isolation to make important decisions about schools and students pretty well guarantees woes will follow. For as I will explain, such uses of results from what is essentially a norm-referenced test constitute ill-conceived misuses of test results. Before proceeding to this larger story, I recap how the work reported here evolved.
After reading Kane & Staiger (2001) and Bolon (2001), I undertook an analysis of school average scores on the Massachusetts Comprehensive Assessment System (MCAS) grade 4 mathematics tests for 1998, 1999, 2000 and 2001. After summarizing these previous works, I describe the sources of data used in the present analysis, the means by which data were merged from different sources, the analyses undertaken, and the results. The latter confirm the findings by Kane & Staiger (2001) and Bolon (2001); namely, that changes in school average test scores from one year to the next are unreliable indicators of school quality. Next I discuss three reasons this is so, and why misuse of results of the Massachusetts state test virtually guarantees woes for schools and students.

Background
The works that prompted the analyses reported here were Kane & Staiger (2001) and Bolon (2001). The first of these works focused on state test results in North Carolina. North Carolina has an extensive system of testing students, not just with state "competency" tests in grades 3-11, but also with norm-referenced tests in grades 5 and 8 (CCSO, 1998, pp. 19, 21, 22, 24). Students must pass state competency tests in reading and math to graduate from high school. Schools in North Carolina are publicly rated in terms of student test results. However, the paper by Kane and Staiger (2001) from the National Bureau of Economic Research (http://www.nber.org/papers/w8156) shows how misleading these ratings tend to be.
Kane and Staiger analyzed six years' worth of student assessment data from the entire state of North Carolina (for nearly 300,000 students in grades 3 through 5). They showed that, regardless of whether results were analyzed in terms of annual results or year-to-year changes, the test results are mainly random noise, resulting from the particular samples of students who are in tested grades in particular years and from the vagaries of annual test content and administration, rather than a meaningful indication of school quality.
Kane and Staiger concluded with the following four "lessons":
1. Incentives targeted at schools with test scores at either extreme – rewards for those with very high scores or sanctions for those with very low scores – primarily affect small schools and imply very weak incentives for large schools.
2. Incentive systems establishing separate thresholds for each racial/ethnic subgroup present a disadvantage to racially integrated schools. In fact, they can generate perverse incentives for districts to segregate their students.
3. As a tool for identifying best practice or fastest improvement, annual test scores are generally quite unreliable. There are more efficient ways to pool information across schools and across years to identify those schools that are worth emulating.
4. When evaluating the impact of policies on changes in test scores over time, one must take into account the fluctuations in test scores that are likely to occur naturally. (Kane & Staiger, April 2001).
The second work prompting the analyses reported below is Bolon's "Significance of test-based ratings for metropolitan Boston schools" (2001, in Education Policy Analysis Archives, http://epaa.asu.edu/epaa/v9n42/; see also Michelson, 2002; Willson & Kellow, 2002; and Bolon, 2002, for discussion of the original Bolon article). In this study Bolon examined 1998, 1999, and 2000 MCAS mathematics scores for 47 academic high schools in 32 metropolitan communities in the greater Boston area (vocational high schools were excluded on the grounds that they have a substantially different mission than academic high schools). Bolon found that school average grade 10 MCAS math scores generally changed little over this interval (+1.3 points from 1998 to 1999, and +5.9 points from 1999 to 2000) relative to the range in school average scores (in 1999, for example, school averages ranged from 203 to 254 on the MCAS scale of 200 to 280). Bolon does note, however, that according to data released by the Massachusetts Department of Education, between 1998 and 2000 grade 10 MCAS math scores rose substantially more than English or science scores (see Bolon's Table 1-1).
Bolon then examined the extent to which seven school characteristics, plus community income (1989), might be used to predict school average grade 10 MCAS math scores. He found that three variables (percent Asian or Pacific Islander, percent limited English proficiency, and per-capita community income) were the only ones statistically significantly related to school average scores (Table 2-12). Together these three variables accounted for 80% of the variance in school average scores. After excluding schools in Boston (for which separate community income data were not available), Bolon found that "by far the strongest factor in predicting tenth grade MCAS mathematics scores is 'per capita community income (1989).' For the schools outside the City of Boston, this factor alone performed nearly as well as all available factors combined, associating 84 percent of the variance compared with 88 percent when all available factors were used."
The study reported here builds on both of the works just discussed. For example, an analytical approach applied by Kane and Staiger to data from North Carolina, namely comparing school size with changes in annual score averages, is employed here. Additionally, while Bolon examined average MCAS scores for Massachusetts high schools, this inquiry addresses MCAS averages for elementary schools.
There are three broad reasons why elementary school average test scores might be more useful indicators of school quality than test averages for high schools. First is the simple fact that there are more elementary schools than high schools. In his study, Bolon analyzed test scores for fewer than 50 high schools. In contrast, MCAS scores are available for around 1000 elementary schools in Massachusetts. A larger sample offers greater potential to discern meaningful differences in school quality.
The second reason for hypothesizing that grade 4 test scores may be better indicators of school quality than grade 10 test scores is the extent of institutional experience that they may reflect. Children typically enter school in Massachusetts in kindergarten. This means that by spring of grade 4, they have almost five years of education in a particular elementary school (presuming, of course, they did not switch schools). In contrast, grade 10 test scores typically reflect just two years' experience in high school. So on this count, grade 4 test score averages clearly have more potential to reflect differences in school quality than grade 10 score averages.
The third reason for thinking that grade 4 test scores may be better indicators of school quality than grade 10 test scores is that by grade 10 (roughly age 16) individuals' standardized test scores have become relatively fixed, whereas test scores of young children are relatively malleable. This may be illustrated by reference to Benjamin Bloom's classic (1964) work, Stability and Change in Human Characteristics. In this book, Bloom reviewed a wide range of evidence on how a number of human characteristics, including height, weight and test scores, tend to change as people age. He showed, for example, that height in the early childhood years tends to be a moderately good predictor of height at maturity, with correlations between height at ages 6-10 years and height at age 18 falling in the range of 0.75 to 0.85 for both males and females. Interestingly, height at ages 11-13 for females and 13-15 for males is a poorer predictor of height at maturity. This is, of course, due to variation in the ages at which children experience growth spurts as they go through puberty.
In contrast to the physical characteristic of height, mental abilities of young children as measured by standardized tests show relatively little power to predict mental abilities at maturity. Not until around grade 3 or 4 (or age 8-9) do children's test scores become relatively reliable predictors of future performance. To provide one example, reading test scores at age 6 (or grade 1) correlate only about 0.65 with reading test scores in grade 8 (Bloom, 1964, p. 98). As Bloom himself put it, "We may conclude from our results on general achievement, reading comprehension and vocabulary development that by age 9 (grade 3) . . . 50% of the general achievement pattern at age 18 (grade 12) has been developed" (Bloom, 1964, p. 105). The relative malleability of young children's test scores suggests that there may be more potential for grade 4 test scores to be affected by school quality, as compared with high schools' effects on grade 10 test scores.
In sum, while Bolon found that school average scores on the Massachusetts grade 10 state test (MCAS) were not sound indicators of school quality, there are several reasons for hypothesizing that school average scores for grade 4 might be better indicators of school quality. To test this possibility, the data and analyses described below were employed.

Data Sources
The data used in this study were drawn from four sources. MCAS results for 1998, 1999 and 2000 were drawn from CD data disks issued by the Massachusetts Department of Education entitled "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of May 1998," "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of May 1999," and "School, District and State MCAS Results, Grades 4, 8 and 10, Tests of Spring 2000." The MCAS results for 2001 were drawn from an Excel file named "MCAS2001pub_g4sch01.xls" downloaded from http://boston.com/mcas/ on November 9, 2001. The files from these four sources contain MCAS results for all schools and districts in Massachusetts for 1998, 1999, 2000 and 2001. From these results, grade 4 MCAS math averages were extracted for all schools in Massachusetts.
Math rather than English Language Arts (ELA) test scores were selected for study for two reasons. First, it is reasonably well-established that schools have more influence on math test scores than on English (or at least reading) test scores (Haney, Madaus & Lyons, 1993). Second, it is apparent that there have been a number of problems in past years in the scaling of MCAS grade 4 ELA scores.
The reason for more records in 1999 and 2000 than in 1998 is the creation of a number of new elementary schools (mostly charter schools). The file for 2001 is smaller than those for previous years because it included only school average, but not district average, scores. Merging records from these four data files proved more difficult than anticipated. Labels for some variables were changed across the years and names for some schools are reported inconsistently in these four sets of data. Nonetheless, after examining pairs of records for 1998, 1999, 2000 and 2001, I was able to create a merged data file.
Figure 1 shows a scatter plot of how 1999 school averages compared with those from 1998. As can be seen, there is a fairly strong relationship between 1998 and 1999 averages. Schools with higher MCAS grade 4 test score averages in 1998 tended to have higher averages in 1999. The correlation between score averages in 1998 and 1999 was 0.860. The regression relationship between score averages in 1999 and 1998 is:
For statistically inclined readers, it may be noted that these correlation and regression relationships are statistically significant; that is, they are extremely unlikely to have occurred by chance.
Figure 4 shows the relationship between change in average MCAS grade 4 scores between 1998 and 1999 and school size (defined as the average of the numbers of students tested in the two years). Figure 5 shows analogous results for 1999 to 2000 score changes. As can be seen, the pattern shown here is similar to that shown in Figure 4.
Schools with smaller numbers of students tested tended to have much more "volatility" (to use Kane and Staiger's phrase) in average scores than schools with larger numbers of students tested.
Given the political prominence of high-stakes testing in Massachusetts (as elsewhere), it is not surprising that various observers have tried to use changes in school MCAS scores from one year to the next to identify high-quality or "exemplary" schools. For example, in a high-profile ceremony at the Massachusetts State House in December 1999, five school principals were presented with gifts of $10,000 each "for helping their students make significant gains on the MCAS" (http://www.doe.mass.edu/news/archive99/Dec99/122299pr.html, accessed November 15, 2001). Though the cash awards were donated by a private foundation, the ceremony recognizing the five schools was attended by the Massachusetts Governor, Lieutenant Governor and Commissioner of Education. The press release for the event stated: "The schools were recognized as having the highest percentage improvement in overall MCAS scores between 1998 and 1999 in English Language Arts, Mathematics and Science and Technology" (http://www.doe.mass.edu/news/archive99/Dec99/122299pr.html).
Anyone with even a modest knowledge of statistics will note the absurdity of this statement. Since the MCAS scale of 200 to 280 is arbitrary and has no meaningful zero point, it is meaningless to calculate percentage increases in scores. This indicates that whoever in the Massachusetts Department of Education wrote this press release is fundamentally ignorant of statistics or, to be less politically incorrect, in need of improvement in knowledge of statistics. For anyone who has not studied statistics lately and hence may not appreciate the absurdity of calculating percentage increases on arbitrary test score scales, I suggest the following exercise. Calculate the percentage increase in temperature going from 50 degrees Fahrenheit to 68 degrees Fahrenheit. Next, figure out the equivalent temperatures on the Celsius scale and calculate the percentage increase on the Celsius scale. Finally, ask yourself which "percentage increase" is correct.
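For readers who would like to check the arithmetic, the exercise can be worked through in a few lines of Python. This is only an illustrative sketch: the helper names fahrenheit_to_celsius and percent_increase are invented for the example, and the school averages of 220 and 231 used at the end are hypothetical values, not scores from any school discussed here.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a Fahrenheit temperature to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def percent_increase(old: float, new: float) -> float:
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100.0


# The temperature exercise: the same physical change on two scales.
start_f, end_f = 50.0, 68.0
start_c = fahrenheit_to_celsius(start_f)
end_c = fahrenheit_to_celsius(end_f)

pct_f = percent_increase(start_f, end_f)
pct_c = percent_increase(start_c, end_c)
print(f"Fahrenheit: {start_f:.0f} -> {end_f:.0f} is a {pct_f:.0f}% increase")
print(f"Celsius:    {start_c:.0f} -> {end_c:.0f} is a {pct_c:.0f}% increase")

# The same problem on a test score scale. The averages 220 and 231 are
# hypothetical; subtracting 200 simply relabels the 200-280 scale as 0-80.
old_score, new_score = 220.0, 231.0
pct_reported_scale = percent_increase(old_score, new_score)
pct_shifted_scale = percent_increase(old_score - 200.0, new_score - 200.0)
print(f"On the 200-280 scale: {pct_reported_scale:.0f}% increase")
print(f"On a 0-80 relabeling: {pct_shifted_scale:.0f}% increase")
```

The same change in temperature comes out as a 36% increase in Fahrenheit and a 100% increase in Celsius, and the same 11-point change in a score average comes out as 5% or 55% depending on where the arbitrary zero is placed. Neither figure is "correct," which is the point of the exercise.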
Four of the five schools receiving the so-called Edgerly awards in 1999 were elementary schools, namely, Riverside Elementary School in Danvers, Franklin D. Roosevelt Elementary School in Boston, Abraham Lincoln Elementary School in Revere, and Kensington Elementary School in Springfield. Figure 7 is a recasting of Figure 4, but with these four 1999 Edgerly award schools shown with circles. As can be seen, the four award schools share two characteristics. First, they are all relatively small schools, each with fewer than 100 students tested. Second, they showed unusually large score changes from 1998 to 1999. This is not surprising since large MCAS score gains from 1998 to 1999 served as the basis for their receiving awards.
Figure 8 recasts Figure 5, again with the 1999 "award" schools marked with circles.
More recently, Gallagher (2001) undertook a review of grade 10 MCAS math questions from the 2000 and 2001 test administrations. Gallagher, a professor of Environmental, Coastal and Ocean Sciences at the University of Massachusetts, Boston, concluded that there were serious problems with 10 to 15% of the grade 10 MCAS math questions. He identified some questions as having wrong answers, some as having more than one correct answer and some as misaligned with the Massachusetts curriculum frameworks. "Overall, my review of these tests indicates that there are serious failures in the choice and review of MCAS questions" (Gallagher, 2001, p. 5, http://www.es.umb.edu/edg/MCAS/mcasproblems.pdf, accessed December 3, 2001).
More generally, the MCAS is not a good indicator of school quality because it has been constructed as a norm-referenced test. Many people assume that the MCAS (and other state-sponsored tests) are criterion-referenced tests, that is, tests of well-specified bodies of knowledge and skills. In a paper prepared for an October 2001 conference at the John F. Kennedy Institute at Harvard University, for example, Kurtz wrote:
The MCAS, which is known as a "criterion-referenced exam," tests knowledge of a set curriculum and gives students scores based on their level of mastery, in contrast to national "norm-referenced tests," which grade a student's performance in relation to other students. (Kurtz, 2001, p. 6).
However, examination of the technical manuals for the MCAS tests reveals that items have been selected for inclusion on MCAS tests by using norm-referenced test construction procedures.
The numbers of records for which MCAS grade 4 average results are available from each of the sources mentioned above are as follows:

Figure 4. Change in MCAS Grade 4 Math Average Score 1998 to 1999 v. School Size
As can be seen in Figure 4, schools with fewer than 100 or so students tested show changes in MCAS average scores of as much as 15-20 points. However, schools with more than 150 students tested per year show much smaller changes, generally less than 5 points.

Figure 5. Change in MCAS Grade 4 Math Average Score 1999 to 2000 v. School Size
Figure 6 shows the relationship between school size and change in average grade 4 MCAS scores between 2000 and 2001. The pattern is very similar to that apparent in the previous two figures. Schools with fewer than 100 students tested showed much larger swings in test score averages than schools with larger numbers of students tested.
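The link between school size and the size of year-to-year swings is what sampling theory would lead one to expect, and it can be illustrated with a small simulation. The sketch below is not an analysis of the MCAS data. It assumes, purely for illustration, that every school has identical quality, that individual scores are normally distributed with a mean of 235 and a standard deviation of 15, and that each year's cohort is an independent draw; all of these values, and the function names, are hypothetical.

```python
import math
import random
import statistics

# Hypothetical values chosen only for illustration; they are not MCAS estimates.
STUDENT_MEAN = 235.0   # assumed mean scaled score, identical for every school
STUDENT_SD = 15.0      # assumed spread of individual student scores
N_SCHOOLS = 500        # simulated schools per size category


def school_average(n_students: int, rng: random.Random) -> float:
    """Average score of one cohort drawn from the common score distribution."""
    scores = [rng.gauss(STUDENT_MEAN, STUDENT_SD) for _ in range(n_students)]
    return statistics.fmean(scores)


def sd_of_yearly_change(n_students: int, rng: random.Random) -> float:
    """Spread of year-to-year changes in averages when school quality is constant."""
    changes = [
        school_average(n_students, rng) - school_average(n_students, rng)
        for _ in range(N_SCHOOLS)
    ]
    return statistics.stdev(changes)


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {}
for size in (25, 100, 300):
    simulated = sd_of_yearly_change(size, rng)
    # Standard deviation of a difference between two independent cohort means.
    expected = STUDENT_SD * math.sqrt(2.0 / size)
    results[size] = (simulated, expected)
    print(f"{size:>3} students tested: simulated SD of change {simulated:.2f}, "
          f"theory {expected:.2f}")
```

Under these assumptions the spread of year-to-year changes shrinks with the square root of the number of students tested, so small schools swing considerably more than large ones even though no simulated school is better or worse than any other.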

Figure 6. Change in MCAS Grade 4 Math Average Score 2000 to 2001 v. School Size

Figure 7. Change in MCAS Grade 4 Math Average Score 1998 to 1999 v. School Size, with Award Schools Marked

Figure 8. Change in MCAS Grade 4 Math Average Scores 1999 to 2000 v. School Size, with 1999 Award Schools Marked
As can be seen in Figure 8, three out of the four 1999 award schools showed declines in average grade 4 MCAS math scores from 1999 to 2000.

Figure 9 is a variant of Figure 5, showing the relationship between average numbers of students tested in 1999 and 2000 versus the change in average grade 4 math scores between 1999 and 2000. In Figure 9, all of the schools showing a 10 or more point gain in average MCAS scores are marked with circles.

Figure 9. Change in MCAS Grade 4 Math Average Score 1999 to 2000 v. School Size, with Schools showing Gain of 10 Points or More Highlighted with Circles
What happened to these schools the next year? Figure 10 shows change from 2000 to 2001, but with the schools having the largest gains from 1999 to 2000 again marked with circles. As can be seen, there were a few schools showing the largest gains from 1999 to 2000 that continued to show gains in 2001. But most of the large-gain schools from 1999 to 2000 showed declines in 2001. Several of them showed declines from 2000 to 2001 that were just about as large (9-10 points) as were the gains from 1999 to 2000.
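A reversal of this kind is what one would expect if large one-year gains are mostly noise. The following sketch illustrates the logic with simulated data only. It assumes that each school's underlying quality never changes and that each year's average adds independent random noise; the number of schools, the means and standard deviations are hypothetical, and the 10-point cutoff simply mirrors the threshold used for Figure 9.

```python
import random
import statistics

# Hypothetical parameters for illustration only; not estimated from MCAS data.
N_SCHOOLS = 1000
TRUE_MEAN, TRUE_SD = 235.0, 8.0   # assumed stable differences between schools
NOISE_SD = 4.0                    # assumed year-specific noise in a school average
GAIN_CUTOFF = 10.0                # "large gain" threshold, in scaled-score points

rng = random.Random(1)  # fixed seed so the run is repeatable

first_changes, second_changes = [], []
for _ in range(N_SCHOOLS):
    quality = rng.gauss(TRUE_MEAN, TRUE_SD)  # constant across all three years
    y1, y2, y3 = (quality + rng.gauss(0.0, NOISE_SD) for _ in range(3))
    first_changes.append(y2 - y1)
    second_changes.append(y3 - y2)

# Follow only the schools that showed a large gain in the first interval.
followups = [
    c2 for c1, c2 in zip(first_changes, second_changes) if c1 >= GAIN_CUTOFF
]

mean_followup = statistics.fmean(followups)
share_declining = sum(c < 0 for c in followups) / len(followups)
print(f"{len(followups)} simulated schools gained {GAIN_CUTOFF:.0f}+ points")
print(f"mean change the following year: {mean_followup:.1f} points")
print(f"share showing a decline: {share_declining:.0%}")
```

In this toy setting the two successive changes share the middle year's noise with opposite signs, so a school selected for an unusually large gain tends to give much of it back the next year, even though no simulated school actually improved or deteriorated.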

Figure 11. Change in MCAS Grade 4 Math Average Score 1998 to 1999 v. Change 1999 to 2000, with Schools showing Gain of 10 Points or More Highlighted with Circles
A copy of the merged data file of MCAS grade 4 math results (and numbers of students tested) for 1998-2001 is appended to this article for anyone interested in secondary analysis (see Appendix).
The merged data file of grade 4 MCAS math school averages, after deletion of district averages, contained records for 977 schools. Table 1 shows summary descriptive statistics for this data set. As can be seen, the numbers of fourth graders tested per school in these three years ranged from just 10 to 328. The school average MCAS scores ranged from a low of 206 in 1998 to a high of 263 in 2001. Over the four years of MCAS testing, on average, there were initially slight increases in average MCAS scores: a 1.5 point increase between 1998 and 1999, and an increase of 0.5 of a point between 1999 and 2000, but then level scores between 2000 and 2001. Changes in score averages for individual schools from year to year ranged from a low of -22 to +18 points. As Bolon found with regard to grade 10 MCAS scores, these school average changes in grade 4 MCAS scores are considerably smaller than the range in school average scores, which varied by 50 points or more in all four years of test administration.

Why are MCAS score averages poor indicators of school quality?
These results are simply a manifestation of the kind of volatility that Kane and Staiger (2001) found in school average test scores in other states. As they found for North Carolina, and as we have seen above with MCAS scores, school average test scores are particularly volatile for relatively small schools. Moreover, schools that show relatively large gains in score averages from one year to the next tend to show losses the following year. Thus, it is clear that school average test scores, or changes in averages from one year to the next, represent poor measures of school quality.
School average test results fluctuate from year to year for several reasons. The most obvious is that one year's class of students will differ from the next. Especially in relatively small schools, with fewer than 100 students tested per grade, having a few especially test-savvy, or not so savvy, students may skew results from one year to the next.
A second likely cause of volatility in school average scores on the Massachusetts test is that the MCAS is of dubious technical merit. When I first examined the 1998 grade 4 English Language Arts (ELA) test, for example, I was surprised to find many poorly worded questions and reading questions for which one did not actually have to read the passage on which they were ostensibly based in order to answer the question (that is, the questions lacked passage dependency). More recently the Massachusetts DOE implicitly acknowledged defects in the 2001 grade 10 ELA and math exams when it dropped one item from each from scoring (http://www.doe.mass.edu/MCAS/01results/threshscore.html). The defective items on the 2001 test were discovered not by the test's developer or state officials but by students (Lindsay, 2001).