Education Policy Analysis Archives
Volume 9 Number 8
March 19, 2001
ISSN 1068-2341

Editor: Gene V Glass, College of Education, Arizona State University

Copyright 2001, the EDUCATION POLICY ANALYSIS ARCHIVES. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Another Look at the Florida Data

Abstract
This report re-analyzes test score data from Florida public schools. In response to a recent report from the Manhattan Institute, it offers a different perspective and an alternative explanation for the pattern of test score improvements among low scoring schools in Florida.

Introduction

A recent report from the Manhattan Institute think tank (Greene, 2001) examined test scores of Florida public schools in 1999 and 2000 to determine the effects of vouchers on student performance. The report ends with a conclusion:

"The most plausible interpretation of the evidence is that the Florida A-Plus system relies upon a valid system of testing and produces the desired incentives to failing schools to improve their performance."

My own analyses of the Florida data lead to no such conclusion. Instead, I found the evidence telling a more interesting, and to my mind a more believable, story. I will argue that the evidence suggests that the voucher effect follows different patterns in the three tested subject areas: reading, math, and writing. Moreover, I will show that the most dramatic improvements in failing schools were realized by targeting and achieving a minimum passing score on the writing test, thereby escaping the threat of losing their students to vouchers.

Background

The Florida A-Plus school accountability program is based on tracking schools' performance and progress toward the educational goals set in the Sunshine State Standards. The main source of information on school performance is a series of standardized tests in reading, math, and writing, known collectively by the somewhat redundant name FCAT (Florida Comprehensive Assessment Tests). All elementary, middle, and high school students are tested annually (different subjects in different grades) and the results are used to assign a grade to each school, from A to F, according to a formula that weighs the number of students performing below and above pre-defined markers along the test score scales. An F grade assignment has a variety of consequences and a great deal of attention is directed toward F schools in the Florida system.

One of the most visible and politically contested consequences of failing the State's tests is the voucher provision.
If a school receives another F grade in a four-year period, its students become eligible to take their public funding elsewhere, to a private or better-performing public school. In 1999, 78 schools received an F grade. Greene's report examines the gains these schools made on the FCAT between 1999 and 2000, and the executive summary offers a précis of the evidence:

"The results show that schools receiving a failing grade achieved test score gains more than twice as large as those achieved by other schools. While schools with lower previous test scores across all state-assigned grades improved their test scores, schools with failing grades that faced the prospects of vouchers exhibited especially large gains" (Greene, 2001, p. ii).

The report itself compares the average score gains of higher-scoring F schools to those of lower-scoring D schools, which serve as a control group. Standardized group differences constitute Greene's estimated effect sizes of the voucher effect: 0.12 in reading, 0.30 in math, and 0.41 in writing. Other analyses in the report calculate the correlations between the FCAT and other standardized tests administered in Florida schools, to gauge the validity of the FCAT. These findings lead Greene not only to the conclusions cited above, but also to strong public commentary in the local and national press in favor of Florida's voucher system and similar proposals in President Bush's school reform plan. The moderate voucher effect estimates and relatively cautious language of the report were replaced in the media by strong statements, emphasizing the magnitude of the raw score gains achieved by F schools. In an interview with the St. Petersburg Times (February 16, 2001), after the release of his report, Greene asserted: "The F schools showed tremendous gains because they faced a particularly concrete outcome that they wished to avoid: embarrassment, loss of revenue, vouchers."
Even more boldly, generalizing from the Florida findings, Greene offered the following proclamation in a guest commentary in The New York Post (February 21, 2001):

"So the improvement by Florida's failing schools was real. So, as debate proceeds over President Bush's education proposals, know this: Testing, accountability and choice are powerful tools to improve education - and, in particular, to turn around chronically failing schools. That's not a theory, but proven fact."

My re-analyses of the Florida data suggest that Greene might have over-stated the case for the simple explanation he promoted in his report and in the press. A more careful examination of the patterns of gains reveals that failing schools responded with a more sophisticated strategy than the undifferentiated, gross voucher effect gave them credit for. The key element of the strategy was to achieve a particular score on the writing test, in order to elevate their grades. The strategy was extremely successful and all failing schools were able to escape the threat of vouchers by achieving a grade of D or better in 2000.

Data

The data for the analyses are school mean scores on the FCAT reading, math, and writing tests from 1999 and 2000. They include all curriculum groups in both years (available on-line from the Florida Department of Education web site: http://www.firn.edu/doe/sas/fcat.htm). These data are slightly different from the data Greene used in his analyses, but as he comments (Greene, 2001, Note 10), the difference is inconsequential and similar conclusions will be reached using either dataset. The analyses below address issues that Greene either paid no attention to in his report or dismissed as unimportant. The first example of the latter is regression toward the mean.

An elusive regression artifact

On page 10 of his report, Greene alerts his readers to the potential biasing effect of regression to the mean:

"As another alternative explanation critics might suggest that F schools experienced larger improvements in FCAT scores because of a phenomenon known as regression to the mean. There may be a statistical tendency of very high and very low-scoring schools to report future scores that return to being closer to the average for the whole population. This tendency is created by non-random error in the test scores, which can be especially problematic when scores are 'bumping' against the top or bottom of the scale for measuring results. If a school has a score of 2 on a scale from 0 to 100, it is hard for students to do worse by chance but easier for them to do better by chance. Low-scoring schools that are near the bottom of the scale are very likely to improve, even if it is only a statistical fluke."

He then dismisses the threat because "the scores of those [F] schools were nowhere near the bottoms of the scale of possible scores" (p. 10). Greene seems to confuse regression toward the mean with floor and ceiling effects, which are completely different phenomena.
Scores "'bumping' against the top or bottom of the scale" colorfully characterizes ceiling and floor effects but is an inadequate description of the regression effect. Regression toward the mean operates whenever the correlation between two variables (the 1999 and 2000 test scores, in our case) is less than perfect. It influences the entire range of scores, not just the very extreme, with a force proportional to their distance from the sample mean. Therefore, the fact that F schools were far from the bottom of the score scale is a poor indication that regression effects are absent. The two relevant pieces of information are how far the group is from the sample mean and the magnitude of the correlation between the two variables involved. Knowing these two quantities allows us to forecast the expected magnitude of the pull toward the sample mean. Using standardized scores aids interpretation, as the predicted standardized score on Y is Zy = rZx (where X and Y are the 1999 and 2000 test scores, respectively, and r is their correlation). For example, a school 2 standard deviations below the mean in 1999 will be expected to score only .85(2) = 1.7 standard deviations below the mean in 2000, assuming a correlation of .85 (a value compatible with the typical correlation in the Florida data): an effect size of .3! In 1999, F schools were 1.9 SDs below the mean in reading, 1.7 SDs below the mean in math, and 1.8 SDs below the mean in writing. This simple analysis shows that the expected magnitude of the regression effect warrants serious attention. Using a slightly more complicated formula (see, e.g., Campbell & Kenny, 1999, p. 28, Table 2.1), and the regression coefficient instead of the correlation, one can calculate the expected 2000 score or the expected score gain, given a particular level of performance in 1999.
Table 1 gives the expected score gains for the three FCAT tests, if regression toward the mean were the only factor responsible for these gains, alongside the observed gains for schools with different grades in 1999 [Note 1]. Figure 1 shows the same findings graphically.

Table 1
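The standardized-score arithmetic for regression toward the mean (Zy = rZx) can be made concrete with a short sketch. This is only an illustration under stated assumptions: the function names are hypothetical, and the single correlation of .85 is applied to all three tests purely for illustration, whereas the article's own expected gains rely on test-specific regression coefficients that are not reproduced here.

```python
def expected_z(z_1999: float, r: float) -> float:
    """Predicted standardized 2000 score if regression toward the mean
    were the only force at work: Zy = r * Zx."""
    return r * z_1999


def regression_gain_in_sd(z_1999: float, r: float) -> float:
    """Expected gain, in SD units, attributable to regression alone.
    Equals (r - 1) * Zx, so it is positive for below-average schools."""
    return expected_z(z_1999, r) - z_1999


# ASSUMPTION: one illustrative correlation for all subjects (the text calls
# .85 "compatible with the typical correlation" in the Florida data).
ASSUMED_R = 0.85

# Distances of F schools from the 1999 sample mean, as reported in the text.
F_SCHOOL_Z_1999 = {"reading": -1.9, "math": -1.7, "writing": -1.8}

if __name__ == "__main__":
    for subject, z in F_SCHOOL_Z_1999.items():
        gain = regression_gain_in_sd(z, ASSUMED_R)
        print(f"{subject}: 1999 z = {z:+.2f}, "
              f"expected 2000 z = {expected_z(z, ASSUMED_R):+.3f}, "
              f"regression-only gain = {gain:+.3f} SD")
```

Because the expected gain is (r - 1) times the starting z-score, it grows with distance from the mean and reverses sign for above-average schools, which is why the pull operates across the whole score range rather than only near the floor or ceiling of the scale.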
The writing on the wall

The seemingly curious pattern of gains for writing has, in fact, a simple explanation. If there was a clear mark on the writing score scale that D and F schools set out to reach, no more and no less, then lower scoring schools would have to close a wider gap to reach the mark, giving rise to a strong negative correlation between where they started and how far they had to go (their gain). Figure 4 clearly demonstrates this phenomenon. It shows, for the entire school population, the relationships between 1999 scores and 2000 mean scores and gains. The lines represent the best-fitting nonlinear trend lines (using the "loess" technique, see Chambers & Hastie, 1991, pp. 309-376).
Figure 4. Writing 2000 Scores and Gains as a Function of 1999 Scores
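The logic of the target-score argument can be illustrated with a small simulation. The sketch below uses entirely synthetic data, not the Florida scores: the target mark of 3.0, the range of starting scores, the noise level, and the function names are all hypothetical choices made for illustration. It contrasts schools that aim for a fixed mark with schools that all gain a similar amount regardless of where they started.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def demo(seed=0, n_schools=200):
    """Return (r under a fixed target, r under a uniform gain).

    All numbers are HYPOTHETICAL and chosen only to show the mechanism.
    """
    rng = random.Random(seed)
    target = 3.0      # hypothetical mark on the writing scale
    noise_sd = 0.05   # hypothetical year-to-year noise
    starts = [rng.uniform(2.0, 2.9) for _ in range(n_schools)]

    # Scenario A: every school closes exactly its gap to the mark.
    target_gains = [(target - s) + rng.gauss(0, noise_sd) for s in starts]

    # Scenario B: every school gains about the same amount.
    uniform_gains = [0.5 + rng.gauss(0, noise_sd) for _ in starts]

    return pearson_r(starts, target_gains), pearson_r(starts, uniform_gains)


if __name__ == "__main__":
    r_target, r_uniform = demo()
    print(f"start vs. gain, fixed target:  r = {r_target:.2f}")
    print(f"start vs. gain, uniform gain:  r = {r_uniform:.2f}")
```

Under the fixed-target scenario the correlation between starting score and gain is strongly negative, because the gain is essentially the distance to the mark; under the uniform-gain scenario it hovers near zero.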
Conclusion

On June 21, 2000, long before the release of the Manhattan Institute report, the St. Petersburg Times ran a story entitled "Why are Florida children writing so much better?" Noting the impressive improvement in the writing score, the story offered an explanation:

"How could so many kids suddenly become competent writers? Many educators were not completely surprised at the improvement. Out of fear and necessity, Florida educators have figured out how the state's writing test works and are gearing instruction toward it, with constant writing and, in many cases, a shamelessly formulaic approach. For some struggling schools, the writing test has helped them avoid an F rating."

My findings are consistent with this explanation.

The pattern of score improvements on the FCAT ought to give Florida officials pause and trigger a serious research effort to identify potentially harmful imbalances and deficiencies in the A-Plus program. Until a far better understanding of and experience with the Florida accountability system is at hand, Greene's brave generalization from the Florida data he examined to the desirability of a nation-wide implementation is premature at best. It appears that the program's strong attention to the lower portion of the score distribution and the aggressive efforts to improve test scores in that region have produced substantial unintended consequences. Much more evidence is needed to arrive at a sufficiently detailed account of the program's operations and impact. The short list will include documentation of instructional practices in response to the incentive system in place for high and low scoring schools; an examination of the implementation and utility of school improvement plans; and data on possible program effects on retention, drop-out, and inter-school mobility patterns.
If vouchers were a dominant influence in motivating failing schools to act, the action they produced cannot be considered desirable by anyone who aims to raise the bar for students and schools. A minimum performance level in writing should not be considered a worthy educational goal for an ambitious accountability system such as the Florida A-Plus program. Yet, this appears to be the main achievement of the program in F schools. Coupled with a pattern of stagnation in other grade groups, especially in reading, these findings point to aspects of the program that deserve closer scrutiny. However, the reader of the Manhattan Institute's laudatory report is offered a false sense of a dramatic success. It is, therefore, appropriate to recall Cronbach's advice to the evaluator:

"Disillusion is the bitter aftertaste of saccharine illusion. It is self-defeating to aspire to deliver an evaluative conclusion as precise and as safely beyond dispute as an operational language from the laboratory . When the evaluator aspires only to provide clarification that would not otherwise be available, he has chosen a task he can manage and one that have social benefits." (Cronbach, 1980, p. 318)

Notes
Acknowledgment

The work reported here was supported under the Educational Research and Development Centers Program, PR Award Number R305B600002, as administered by the Office of Educational Research and Improvement, U.S. Department of Education. The findings and opinions expressed in this report do not reflect positions or policies of the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, or the U.S. Department of Education. My thanks go to Greg Camilli, Sherman Dorn, Steve Lang, Bob Linn, Lorrie Shepard, and Kevin Welner for helpful comments.

References

Chambers, J. M., & Hastie, T. J. (Eds.). (1991). Statistical models in S. Pacific Grove, CA: Wadsworth & Brooks/Cole.

Cronbach, L. J., & Associates. (1980). Toward reform of program evaluation. San Francisco, CA: Jossey-Bass.

Greene, J. P. (2001). An evaluation of the Florida A-Plus Accountability and School Choice Program. New York: The Manhattan Institute.

About the Author

Haggai Kupermintz
School of Education
University of Colorado at Boulder
Email: haggai.kupermintz@colorado.edu

Haggai Kupermintz is an Assistant Professor of research and evaluation methodology at the University of Colorado at Boulder, School of Education. His specializations are educational measurement, statistics, and research methodology. His current work examines the structure, implementation, and effects of large-scale educational accountability systems.

Copyright 2001 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, glass@asu.edu or reach him at College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644). The Commentary Editor is Casey D. Cobb: casey.cobb@unh.edu.