Education Policy Analysis Archives
Volume 9 Number 7
March 4, 2001
ISSN 1068-2341
Editor: Gene V Glass, College of Education, Arizona State University
Copyright 2001, the EDUCATION POLICY ANALYSIS ARCHIVES. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.
Critique of
Abstract
In 1999, Florida adopted the "A-Plus" accountability system, which included a provision that allowed students in certain low-performing schools to receive school vouchers. In a recently released report, An Evaluation of the Florida A-Plus Accountability and School Choice Program (Greene, 2001a), the author argued that early evidence from this program strongly implies that the program has led to significant improvement on test scores in schools threatened with vouchers. However, a careful analysis of Greene's findings and the Florida data suggests that these strong effects may be largely due to sample selection, regression to the mean, and problems related to the aggregation of test score results.
|
One of the most closely watched state reforms in recent years is the use of school vouchers as a part of the accountability system for Florida's public schools. This program is of particular interest because of its strong similarities with proposals put forward by President George W. Bush. As a New York Times article noted, "Gov. Jeb Bush's educational program in Florida has been held up as a model for its combination of aggressive testing of schools' performance, backed by taxpayer-financed vouchers, which his brother President Bush is proposing for the nation as a whole" (Schemo, 2001). A recently published report purports to show a convincing link between the threat of school vouchers for students in certain low-performing schools in Florida and achievement gains in those schools. An Evaluation of the Florida A-Plus Accountability and School Choice Program (Greene, 2001a) documents gains in achievement on the Florida Comprehensive Assessment Test (FCAT) in the areas of reading, mathematics, and writing. (This evaluation will be referred to as Evaluation of Florida's A-Plus Program, for short.) These findings, not surprisingly, have received a substantial amount of attention in the popular press (cf. Schemo, 2001; Lopez, 2001; Greene, 2001b). The gains reported are attributed to incentives implemented under Title XVI (section 229.0535 "Authority to enforce school improvement") of the 2000 Florida Statutes: It is the intent of the Legislature that all public schools be held accountable for students performing at acceptable levels. 
A system of school improvement and accountability that assesses student performance by school, identifies schools in which students are not making adequate progress toward state standards, institutes appropriate measures for enforcing improvement, and provides rewards and sanctions based on performance shall be the responsibility of the State Board of Education.
In the A-Plus accountability system, schools are evaluated and assigned one of five grades (A, B, C, D, F) based primarily on FCAT scores, and to a lesser extent, the percent of eligible students tested and dropout rates (Florida Department of Education, 2001). If a school receives two grades of "F" in any four-year period, it becomes eligible for state board action. Contrary to the implication in Greene's title, such action is not limited to school choice; rather, actions may include providing additional resources, implementing a school plan or reorganization, hiring a new principal or staff, and other unspecified remedies designed to improve performance. However, the possibility of public schools losing children to either private schools or higher-performing public schools is clearly the area of most interest and controversy. In the 1999-2000 school year, two Pensacola elementary schools met the eligibility criteria (Note 1), and as a result, lost 53 children to private schools and 85 to other public schools.
Greene argued that his report "shows that the performance of students on academic tests improves when public schools are faced with the prospect that their students will receive vouchers" (p. 2). At the center of his argument is the fact that all 78 schools that received an "F" in 1999 received a higher grade in 2000. His claim that the threat of vouchers was responsible for the improvement of "F" schools (from the 1998-1999 to the 1999-2000 school year) includes several important elements.
First, an attempt was made to show the validity of the FCAT by showing a strong correlation with another test (Stanford-9) given in Florida in 2000. Given this evidence, he then proceeded to show the average gains for the schools receiving each grade. Based on the latter results, it was concluded that:
The most obvious explanation for these findings is that an accountability system with vouchers as the sanction for repeated failure really motivates schools to improve. (p. 9)
However, Greene also wrote:
While the evidence presented in the report supports the claims of advocates of an accountability system and advocates of choice and competition in education, the results cannot be considered definitive. (p. 9)
Greene duly noted that the A-Plus accountability system is relatively new, that the voucher option had been used in only two schools in the state, and that manipulation of FCAT scores was possible, though not likely. An additional alternative explanation that Greene mentions, commonly known as regression to the mean, is one main concern of this report. This paper also examines three other issues: (1) sample selection, (2) the combining of gain scores across grade levels, and (3) the use of the school as the unit of analysis. Below, we subsume the latter two items under the category of "aggregation."
The potential policy importance of the findings Greene reports places a heavy burden on his study to demonstrate that the improved scores in schools that had previously received one "F" are in fact meaningful improvements and a result of school changes linked to the threat of vouchers. We argue here that the evidence does not support this conclusion. We show that there may have been some small achievement gains in Florida from 1999 to 2000, but these effects were vastly overestimated in Greene's analysis. However, even if these modest outcomes withstand further investigation, it is not at all clear that they resulted from the threat of vouchers as opposed to other aspects of the accountability program.
Background
Several recent reforms have components similar to the Florida effort. It is not the purpose of this report to review that literature, but two well-known reforms deserve mention. One of these, which Greene specifically addresses, is the Texas accountability system and its use of the Texas Assessment of Academic Skills (TAAS). Another is the public voucher program in the city of Milwaukee. Comparisons between each of these reforms and Florida's A-Plus accountability system are limited for a variety of reasons. The accountability system in Texas differs in critical ways from the model in Florida, especially in the use of vouchers as a sanction in the latter state but not the former. Greene did, however, address an important methodological concern (discussed below) that arose in a recent study of the TAAS (Klein, Hamilton, McCaffrey, and Stecher, 2000). In the area of publicly funded vouchers, students in Milwaukee who meet certain income requirements are eligible to receive vouchers allowing them to attend local private schools. Several evaluations of this program have been done (e.g., Witte, 1996; Greene, Peterson, and Du, 1998). These evaluations are not comparable to the Florida evaluation because they examined the test scores of individual students who either received vouchers or applied for vouchers but did not receive one; the Greene study focuses on the effect of the threat of vouchers on school-level test scores, not the actual provision of vouchers.
Summary of the Evaluation of Florida's A-Plus Program
In Evaluation of Florida's A-Plus Program (Greene, 2001a, Table 2), the main results were obtained by aggregating across grades for school types A, B, C, D, and F. These results are reproduced in Table 1 below.
Table 1
|
While one cannot anticipate or rule out all plausible alternative explanations for the findings reported in this study, one should follow the general advice to expect horses when one hears hoof beats, not zebras. The most plausible interpretation of the evidence is that the Florida A-Plus system relies upon a valid system of testing and produces the desired incentives to failing schools to improve their performance. (p. 14)
Critique of the Evaluation of Florida's A-Plus Program
Our critique of Greene's evaluation focuses primarily on two problematic issues: aggregation and regression to the mean. We do not examine in detail Greene's validation argument for the FCAT based on its correlations with the Stanford-9 (the latter given in 2000). Greene's correlational analysis was conducted partly in response to concerns raised by Klein and his colleagues (2000) about the validity of the TAAS in Texas. However, it is worth noting that while the two tests have substantial correlations (in the range .85-.95), correlation coefficients computed on aggregate scores typically have much higher values than those computed with student scores. For example, school means on the reading and mathematics sections of the FCAT in 8th grade have a correlation of about .96. This correlation should not be interpreted as meaning that the FCAT reading and mathematics tests are statistically indistinguishable, but rather that correlations on aggregate scores tend to be much higher than those for individual scores.
Sample Selection
Greene (2001a) used the school means of "standard curriculum" students to obtain school-level gain scores. Here "standard" defines a subset of students who tend to score higher on the FCAT (i.e., it does not include certain types of students with disabilities). An alternative method of choosing a sample is to use the results for all curriculum groups, and these data are available on the Florida Department of Education web pages. While there is nothing intrinsically wrong with using standard curriculum students, for the purposes of evaluation it would seem preferable to look at the potential impact of the A-Plus program on all curriculum groups. Florida administrative statutes allow for (or require) nontrivial variation in the populations selected for determining school grades (Note 2).
Aggregation
In the analyses below, we disaggregate results by grade.
This is useful because overall state gains (Florida Department of Education, 2001) vary by grade as shown in Table 2.
Table 2
|
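The earlier observation that correlations computed on aggregate scores run much higher than those computed on student scores can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not an analysis of FCAT data: the number of schools, the number of students per school, and the variance parameters are all hypothetical. Each simulated student has two test scores that share only a school-level component, so the student-level correlation is weak while the correlation between school means is very high.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def student_vs_school_correlation(n_schools=200, n_students=100, seed=0):
    """Return (student-level r, school-mean r) for simulated scores.

    All parameters are hypothetical; nothing here comes from Florida data.
    """
    rng = random.Random(seed)
    all_a, all_b = [], []    # individual student scores on two tests
    mean_a, mean_b = [], []  # school means on the same two tests
    for _ in range(n_schools):
        school_effect = rng.gauss(0, 1)  # shared by both tests within a school
        a = [school_effect + rng.gauss(0, 2) for _ in range(n_students)]
        b = [school_effect + rng.gauss(0, 2) for _ in range(n_students)]
        all_a.extend(a)
        all_b.extend(b)
        mean_a.append(statistics.fmean(a))
        mean_b.append(statistics.fmean(b))
    return pearson(all_a, all_b), pearson(mean_a, mean_b)


if __name__ == "__main__":
    r_student, r_school = student_vs_school_correlation()
    print(f"student-level r = {r_student:.2f}")
    print(f"school-mean r   = {r_school:.2f}")
```

Averaging within schools cancels most of the student-level noise, which is why the school-mean correlation is so much larger even though the two simulated tests are only weakly related for individual students.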
Regression to the mean
Campbell & Stanley (1966), in their classic volume Experimental and Quasi-Experimental Designs for Research, defined the internal validity of an experiment as:
The basic minimum without which any experiment is uninterpretable: Did in fact the experimental treatments make a difference in this specific experimental instance? (p. 5)
In a very simple investigation, there are only two measurements taken: the pretest (O1) and, after the experimental intervention, the posttest (O2). Campbell and Stanley (1966) listed five definite weaknesses of this "One-Group Pretest-Posttest Design" and one potential concern which is of central importance to Greene's evaluation: regression to the mean or, alternatively, regression artifacts. They explained:
If, for example, in a remediation experiment, students are picked for a special experimental treatment because they do particularly poorly on an achievement test (which becomes for them O1), then on a subsequent testing using a parallel form or repeating the same test, O2 for this group will almost surely average higher than O1. This dependable result is not due to any genuine effect of [the intervention], and test-retest practice effect, etc. It is a rather tautological aspect of the imperfect correlation between O1 and O2. (p. 10)
In short, experimental units chosen on the basis of extreme scores tend to drift toward the mean upon posttest: low scores drift upward and high scores drift downward. Campbell and Stanley (1966) then gave an extended treatment to this topic because "errors of inference due to overlooking regression effects have been so troublesome in educational research," and "the fundamental insight into their nature is so frequently missed" (p. 10).
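The regression artifact described by Campbell and Stanley can be reproduced in a simulation with no intervention at all. The following is a minimal sketch under assumed parameters (the number of units, the measurement-noise level, and the 5% selection cutoff are all hypothetical): each unit has a stable true score, O1 and O2 are two noisy measurements of it, and units are selected on extreme O1 values.

```python
import random
import statistics


def simulate_two_testings(n_units=2000, noise_sd=0.6, seed=1):
    """Simulate pretest O1 and posttest O2 with NO treatment effect."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0, 1) for _ in range(n_units)]
    o1 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    o2 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    return o1, o2


def extreme_group_means(o1, o2, fraction=0.05):
    """Mean O1 and mean O2 for the units with the lowest and highest O1."""
    order = sorted(range(len(o1)), key=lambda i: o1[i])
    k = max(1, int(len(o1) * fraction))
    groups = {"lowest": order[:k], "highest": order[-k:]}
    return {
        name: (
            statistics.fmean([o1[i] for i in idx]),
            statistics.fmean([o2[i] for i in idx]),
        )
        for name, idx in groups.items()
    }


if __name__ == "__main__":
    o1, o2 = simulate_two_testings()
    for name, (m1, m2) in extreme_group_means(o1, o2).items():
        print(f"{name:8s} mean O1 = {m1:6.2f}  mean O2 = {m2:6.2f}")
```

Because the two measurements are imperfectly correlated, the group selected for low O1 scores averages higher on O2, and the group selected for high O1 scores averages lower, even though nothing was done to either group between testings.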
The regression phenomenon emerged from Francis Galton's studies of inheritance in biology, and this subject provides the most common phrasing of the regression to the mean effect: tall fathers tend to have tall sons, but not as tall on average as the fathers; while short fathers have short sons, but not as short on average as the fathers. It can be seen in Table 1 for all three FCAT subjects that the trend is for higher achievement schools to gain less and lower achievement schools to gain more. This is a tell-tale sign of statistical regression, that is, scores in the tails of the distribution tend to drift toward the mean. Higher scores drift downward and lower scores drift upward relative to average gains. Greene (2001a) did consider this possibility, but rejected it as a potential explanation, arguing that:
Regression to the mean is not a likely phenomenon for the exceptional improvement made by the F schools because the scores for those schools were nowhere near the bottom of the scale for possible results. The average F school reading score was 254.70 in 1999, far above the lowest possible score of 100.
Likewise, the average FCAT mathematics and writing scores of the F schools were 272.5 on a scale of 100-500 and 2.40 on a scale from 1-6, respectively. Greene thus concluded that regression to the mean was not a problem because the scores of the F schools were not at all extreme. This is an inaccurate notion of regression to the mean because "extremeness" should be evaluated in terms of distance (in standard deviation units) below the overall group mean, rather than relative to the lowest possible score. A good measure of "distance below the mean" can be given in z-score units, which are interpreted as "standard deviations below the mean" in the distribution of school means; z-scores of -3.00 and lower generally indicate substantial distance below the mean.
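The difference between distance from the scale floor and distance from the group mean can be made concrete. The sketch below uses a made-up list of school means on a 100-500 scale; these values are hypothetical and are not FCAT results. It computes, for the lowest school, both its position on the reporting scale and its z-score within the distribution of school means.

```python
import statistics


def z_score(value, school_means):
    """Standard deviations that `value` lies from the mean of school means."""
    mean = statistics.fmean(school_means)
    sd = statistics.pstdev(school_means)
    return (value - mean) / sd


def fraction_of_scale(value, scale_min, scale_max):
    """Position of `value` on the reporting scale: 0 = floor, 1 = ceiling."""
    return (value - scale_min) / (scale_max - scale_min)


# Hypothetical school means, invented purely for illustration.
hypothetical_school_means = [255, 280, 290, 295, 300, 300, 305, 310, 320, 345]

lowest = min(hypothetical_school_means)
# Well above the floor of the scale ...
print(fraction_of_scale(lowest, 100, 500))
# ... yet about two standard deviations below the mean of school means.
print(z_score(lowest, hypothetical_school_means))
```

A school can sit far above the lowest possible score and still be extreme relative to other schools, and it is the latter kind of extremeness that makes a regression artifact likely.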
To check for extremeness, we calculated the z-scores of the lowest performing school in 4th, 8th, and 10th grade reading, and 5th, 8th, and 10th grade mathematics. These z-scores ranged from a high of -3.2 to a low of -4.5, indicating a strong likelihood of obtaining a regression artifact in simple difference scores; however, the writing scores tended to be less extreme for the "F" schools. In North Carolina, it was recognized that "Students who are proficient may grow faster" and "students who score low one year may score higher the next year, partly due to 'regression to the mean'" (Public Schools of North Carolina, 2000, p. 2). Both influences on achievement are explicitly taken into account in the North Carolina system when computing expected growth for schools. As noted by Campbell and Stanley (1966), the incorrect interpretation of regression effects has plagued educational research for decades. To give an example, consider a study by Glass and Robbins (1967) in which the SAT was given to a group of students, and researchers then took the high scorers as the control group and the low scorers as the treatment group. Predictably, the treatment showed a positive effect that disappeared when regression effects were taken into account (Glass & Robbins, 1967).
Methods
Data Sources
The state of Florida has an exceptional policy of granting the public full access to state-, district-, and school-level test scores, and to other variables such as class size, per pupil expenditures, and the like. The data files containing school means for all curriculum students can be downloaded in the form of Excel spreadsheets at the Florida Department of Education website. For the present analysis, reading, mathematics, and writing FCAT scores at the school level were downloaded for both the 1998-1999 and 1999-2000 school years. Department staff provided a spreadsheet containing school grades, with district and school identification numbers, for the 1998-1999 school year.
Residual gain score analysis
Since we strongly suspected that the statistics in Table 1
were affected by at least two sources of error (regression
to the mean and incorrect definition of net effect), we
reanalyzed the data using the technique of residual gain
scores. Glass and Hopkins (1996) described the context
for residual gains:
|
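A residual gain score, in its standard form, is the difference between an observed posttest score and the posttest score predicted from the pretest by a regression fitted across all units. The sketch below illustrates that computation under stated assumptions: it uses simple least-squares regression of the later-year school mean on the earlier-year school mean, and the scores and school grades are hypothetical values invented for the example. It is not the authors' code, and the exact model they fitted is not shown in this section.

```python
import statistics
from collections import defaultdict


def fit_line(pre, post):
    """Least-squares slope and intercept for predicting post from pre."""
    mx, my = statistics.fmean(pre), statistics.fmean(post)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pre, post))
    sxx = sum((x - mx) ** 2 for x in pre)
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_gains(pre, post):
    """Observed posttest minus the posttest predicted from the pretest."""
    slope, intercept = fit_line(pre, post)
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]


def mean_residual_by_grade(grades, residuals):
    """Average residual gain within each school-grade category."""
    buckets = defaultdict(list)
    for grade, resid in zip(grades, residuals):
        buckets[grade].append(resid)
    return {grade: statistics.fmean(rs) for grade, rs in buckets.items()}


# Hypothetical school means for two years and hypothetical school grades.
pre = [250.0, 265.0, 280.0, 295.0, 310.0, 325.0]
post = [262.0, 270.0, 286.0, 296.0, 314.0, 324.0]
grades = ["F", "D", "C", "C", "B", "A"]

residuals = residual_gains(pre, post)
simple_gains = [y - x for x, y in zip(pre, post)]
print(mean_residual_by_grade(grades, residuals))
```

In this made-up example the lowest-scoring school has the largest simple gain, but its residual gain is much smaller, because the regression line already accounts for the tendency of low pretest scores to be followed by higher posttest scores.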
Results
Average residual gains for the FCAT reading, mathematics, and writing tests, disaggregated by grade, are given in Tables 3 (reading), 4 (mathematics), and 5 (writing) below.
Table 3
|
Copyright 2001 by the Education Policy Analysis Archives
The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu
General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, glass@asu.edu, or reach him at College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644). The Commentary Editor is Casey D. Cobb: casey.cobb@unh.edu.
EPAA Editorial Board
|