A peer-reviewed scholarly journal  
Editor: Gene V Glass
College of Education
Arizona State University

Copyright is retained by the first or sole author, who grants right of first publication to the EDUCATION POLICY ANALYSIS ARCHIVES. EPAA is a project of the Education Policy Studies Laboratory.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.


Volume 12 Number 1
January 5, 2004
ISSN 1068-2341

Reconsidering the Impact of High-stakes Testing

Henry Braun
Educational Testing Service

Citation: Braun, H. (2004, January 5). Reconsidering the impact of high-stakes testing, Education Policy Analysis Archives, 12(1). Retrieved [Date] from http://epaa.asu.edu/epaa/v12n1/.

Abstract

Over the last fifteen years, many states have implemented high-stakes tests as part of an effort to strengthen accountability for schools, teachers, and students. Predictably, there has been vigorous disagreement regarding the contributions of such policies to increasing test scores and, more importantly, to improving student learning. A recent study by Amrein and Berliner (2002a) has received a great deal of media attention. Employing various databases covering the period 1990-2000, the authors conclude that there is no evidence that states that implemented high-stakes tests demonstrated improved student achievement on various external measures such as performance on the SAT, ACT, AP, or NAEP. In a subsequent study in which they conducted a more extensive analysis of state policies (Amrein & Berliner, 2002b), they reach a similar conclusion. However, both their methodology and their findings have been challenged by a number of authors. In this article, we undertake an extended reanalysis of one component of Amrein and Berliner (2002a). We focus on the performance of states, over the period 1992 to 2000, on the NAEP mathematics assessments for grades 4 and 8. In particular, we compare the performance of the high-stakes testing states, as designated by Amrein and Berliner, with the performance of the remaining states (conditioning, of course, on a state’s participation in the relevant NAEP assessments). For each grade, when we examine the relative gains of states over the period, we find that the comparisons strongly favor the high-stakes testing states. Moreover, the results cannot be accounted for by differences between the two groups of states with respect to changes in percent of students excluded from NAEP over the same period. 
On the other hand, when we follow a particular cohort (grade 4, 1992 to grade 8, 1996 or grade 4, 1996 to grade 8, 2000), we find the comparisons slightly favor the low-stakes testing states, although the discrepancy can be partially accounted for by changes in the sets of states contributing to each comparison. In addition, we conduct a number of ancillary analyses to establish the robustness of our results, while acknowledging the tentative nature of any conclusions drawn from highly aggregated, observational data.

Introduction

Since its passage in January 2002, the No Child Left Behind (NCLB) Act has already had a substantial influence on state and local education agencies as they develop accountability plans to win the approval of the U.S. Department of Education. In addition to the operational concerns of these agencies, as well as those of principals and teachers, there is considerable debate about the efficacy of externally mandated high-stakes testing in improving learning (Elmore, 2002; Lewis, 2002; Steinberg, 2003; Wolf, 2003). Indeed, most of the education and educational measurement community is doubtful that high-stakes testing will have a generally salutary effect on the quality of student learning (Linn, 2000; Mehrens, 1998), although there are contrasting views (Grissmer, Flanagan, Kawata, & Williamson, 2000). Given the level of disagreement, it is natural for both supporters and opponents to look to extant data to buttress their positions. Inasmuch as a number of states have instituted high-stakes testing policies of various kinds over the last decade or more, there is a record of results that, presumably, can yield some insights into the likely impact of such policies.

A recent, and much cited, example of this approach is the article by Amrein and Berliner (2002a). Employing some general criteria, they identify 18 states as having high-stakes testing policies and examine the achievement of their students on a number of measures, including the SAT and the ACT, NAEP results in mathematics and reading, and Advanced Placement Program. The rationale is that trends on the state tests cannot be relied upon as valid indicators of student learning (Linn, 2000) and, if learning is indeed taking place, then similar trends should be seen in other, related measures. Their overall conclusion was that “At the present time, there is no compelling evidence… that those policies result in transfer to the broader domains of knowledge and skill for which high-stakes test scores must be indicators” (p. 54).

The authors are careful to point out some of the problems with each of the measures as a basis for drawing conclusions about the impact of the states’ policies. NAEP results, for reasons discussed below, are perhaps the least objectionable, as well as being the most relevant to considerations related to the consequences of NCLB for elementary and middle schools. A close reading of the article, however, raises a number of methodological concerns. The purpose of this paper is to examine those concerns through a reanalysis of the NAEP mathematics data and to explore the policy implications of the findings.

Amrein and Berliner also produced a follow-up report (Amrein & Berliner, 2002b). In that paper, they carried out a more extensive policy analysis and identified 28 states as high-stakes states. We defer discussion of this second report, as well as alternative views (e.g., Carnoy & Loeb, 2003; Raymond & Hanushek, 2003) to the Discussion section.

There is always a danger that, in carrying out these analyses, we will forget the very real limitations on the conclusions we can draw. Accordingly, we enumerate them at the outset. First, we are working with observational data so that causal inferences are not warranted. Second, these 18 states (and the other 32) have engaged in a number of education initiatives in addition to their testing policies so that ascribing differences in NAEP results (solely or principally) to the impact of high-stakes testing is problematic. A similar difficulty arises in trying to explain the results of some states in terms of (apparent) attempts to “game the system” by, for example, increasing the proportion of SD/LEP students who are excluded from participating in NAEP. (SD/LEP refers to students with disabilities and/or students with limited English proficiency—with both groups expected to perform below average.)

Perhaps the most important consideration from the policy perspective is that for all the differences among state NAEP scores and state NAEP score changes, there is much greater variability within states—and probably more to be learned from trying to understand the sources of such within state differences. For an example, see Raudenbush, Fotiu, Cheong, and Ziazi (1995). That said, the current controversy about state level results demands that we address the question in as balanced a fashion as possible.

We begin with a review of the methodology of Amrein and Berliner (2002a) and then carry out a reanalysis of the cross-sectional data, followed by a reanalysis of the cohort data. Both reanalyses are repeated using the states’ 25th percentiles (rather than the states’ means) to ascertain the robustness of the results. The final section relates the findings to those in the recent literature and offers some interpretations, as well as some cautions on drawing policy conclusions from data of this type.

Reviewing Amrein and Berliner

Amrein and Berliner (2002a) identified 18 states as having “… the most severe consequences, that is, the highest stakes associated with their K-12 testing policies” (p. 18). All the states had regulations making high school graduation contingent on passing a high school graduation exam. They also had various combinations of other stakes: grade promotion contingent on examination performance, public annual school or district report cards, and rewards and sanctions for schools, teachers and students (see their Table 1, p. 18). One can certainly argue with a classification that, for example, places Kentucky and Massachusetts among the low-stakes states. In the interests of maintaining comparability, however, we have retained their classification.

The rationale for analyzing NAEP data is that NAEP is the only nationally administered achievement test—and one that students do not explicitly prepare for. Since 1990, states have had the option to participate in “State NAEP,” with scores reported on the same scale as National NAEP. Consequently, states can be compared in terms of their performance on NAEP over time. If increases in state test scores are valid indicators of improved skills, then one would expect to see corresponding increases in NAEP scores.

Of course, there are some weaknesses to this approach. Student motivation to perform well on NAEP is likely not as great as it is on a high-stakes exam. (On the other hand, it is not clear why students in different states would experience differential reductions in motivation from the state test to NAEP.) States can have different policies on excluding SD/LEP students from participation in NAEP and also may differ in the extent to which the state assessment is aligned with the NAEP framework. Presumably, students in states with greater alignment might be expected to do better on NAEP than students in states with lesser alignment. On the other hand, use of SAT, ACT, or AP scores is very problematic. Aside from concerns about self-selection, it is difficult to make the case that the performance standards set by states would have had much impact on college-bound students.

With respect to the timing of policies, Amrein and Berliner (2002a) only provide the year at which the high school graduation requirement became operative in each of the 18 states. They state explicitly that “The usefulness of the NAEP analyses that follow rests on the assumption that states’ other K-12 high-stakes testing policies were implemented at or around the same time as each state’s high school graduation exam” (p. 36).

Their approach to the analysis of data can best be illustrated by an example. Using NAEP mathematics results for grade 4, they compute the change for the nation, and for each state, over the period 1992 to 2000. They then calculate the differential gain for each state as:

State Gain = (change for state, ’92 to ’00) – (change for nation, ’92 to ’00).

A positive State Gain means that over this time period the state’s improvement on NAEP exceeded that of the nation. Conversely, a negative value means that the nation’s improvement exceeded that of the state. It is important to recognize that in the latter case, the change for the state could be positive but just not as large as the nation’s.

Using rounded values, Amrein and Berliner (2002a) find (see their Table 8) that for the eighteen high-stakes states they selected, there were 8 positive State Gains, 3 negative and 2 zeroes. There were five states where data were declared “not available.” This is curious, as two states (Indiana and Minnesota) do have NAEP data available and we include them in our reanalysis.

Amrein and Berliner (2002a) acknowledge that this result appears to support the beneficial impact of high-stakes testing. However, they argue that the association between State Gain and the change in the percent of students excluded from NAEP over the same time period (r = 0.39) undercuts the interpretability of the result. When, further, they combine the analyses for 1992 to 1996 and 1996 to 2000 with the one for 1992 to 2000 (ignoring the dependency induced by the overlap in time), they reach the conclusion that “In short, when compared to the nation as a whole, high-stakes testing policies did not usually lead to improvement in the performance of students on the grade 4 NAEP math tests between 1992 and 2000” (p. 40).

In the case of grade 8 NAEP mathematics, they find (see their Table 9) that, over the period 1990–2000, five states posted gains, four losses and one remained the same. Note that eight states are missing, so these results reflect the experiences of only slightly more than half the states of interest. After aggregating results over the periods 1990 to 1992, 1992 to 1996 and 1996 to 2000 (again ignoring the overlap) and pointing to the problem of differential changes in exclusion rates, they conclude again that “there is no compelling evidence that high-stakes testing policies have improved the performance of students on the grade 8 NAEP math tests” (p. 43).

Reanalysis

Our approach to the question differs in a number of ways from Amrein and Berliner (2002a):

  • In addition to carrying out an analysis for the eighteen high-stakes states that were the focus of Amrein and Berliner (2002a), we carry out a parallel analysis for the other 32 states.
  • We augment our analysis by including a more comprehensive measure of states’ educational reform efforts.
  • Our interpretation of the State Gain statistics is informed by consideration of the corresponding estimated standard errors. (Since the State Gain is a “difference of differences,” these standard errors are not negligible, with a typical value of 2.5 points on the NAEP scale.)
  • In the analysis of the grade 8 data, we look at changes over the period 1992 to 2000, rather than 1990 to 2000. Our choice makes the analyses for grades 4 and 8 more comparable, and provides slightly more data.

The data were obtained from the National Center for Education Statistics Web site (2003). The data extracted comprise grade 4 and grade 8 NAEP mathematics results in the years 1992 and 2000 for the states and the nation (public schools only). For each jurisdiction, grade and year, we recorded the average score, the corresponding estimated standard error, and the percent of students excluded. The data are displayed in Table A1 of the appendix. We note that relevant NAEP data are available for 15 of the 18 high-stakes states and 18 of the 32 other, or “low-stakes,” states.

For each state and grade, we compute the State Gain and its estimated standard error. Specifically, let

d4 (state) = [state(’00) – state(’92)] – [nat’l(’00) – nat’l(’92)]

where the quantities on the right hand side of the equation represent the average results for grade 4. Further, for each state let

s.e.(d4) = (estimated) standard error of d4.

Since the four quantities contributing to d4 are derived from independent samples, s.e.(d4) is simply the square root of the sum of the (estimated) variances of the four quantities. We also compute, for each grade and state, the changes in the percent of excluded students over the period, denoted c%ex.
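The computation just described can be sketched in a few lines of Python. This is only an illustration: the function names and the numerical inputs are hypothetical and are not taken from Table A1. The standard error formula relies on the independence of the four samples, as stated above.

```python
import math


def state_gain(state_00, state_92, nation_00, nation_92):
    """Difference of differences: state change minus national change."""
    return (state_00 - state_92) - (nation_00 - nation_92)


def state_gain_se(se_state_00, se_state_92, se_nation_00, se_nation_92):
    """Standard error of the gain: root of the sum of the four variances.

    Valid only if the four averages come from independent samples.
    """
    return math.sqrt(
        se_state_00**2 + se_state_92**2 + se_nation_00**2 + se_nation_92**2
    )


# Hypothetical inputs for illustration only (not actual NAEP results).
d4 = state_gain(state_00=225.0, state_92=215.0, nation_00=226.0, nation_92=219.0)
se_d4 = state_gain_se(1.2, 1.5, 0.9, 0.8)
print(d4, round(se_d4, 2))
```

With these illustrative inputs the state gains 10 points against a national gain of 7 points, so the State Gain is 3 points.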

Now let

D4 = d4/s.e.(d4)

and

V4 = 2 if D4 > 1; V4 = 1 if 1 > D4 > 0; V4 = –1 if 0 > D4 > –1; V4 = –2 if D4 < –1,

with a parallel set of definitions for d8, D8, and V8 for the grade 8 results. Finally, we let

V = V4 + V8.

Table 1
Basic Results for Analysis of NAEP Mathematics Scores: Grades 4 and 8 Trends for 1992 to 2000

State  Policy score  d4  s.e.(d4)  D4  V4  Change in % excluded (Gr. 4)  d8  s.e.(d8)  D8  V8  Change in % excluded (Gr. 8)  V
Hi-stakes states
AL 2.20 1.96 2.45 0.80 1 1.27 2.42 2.75 0.88 1 -0.51 2
GA 0.66 -3.69 2.05 -1.80 -2 1.28 -0.57 2.13 -0.27 -1 2.52 -3
IN 0.90 5.73 1.95 2.94 2 3.46 5.41 2.24 2.41 2 2.72 4
LA -0.03 6.17 2.38 2.59 2 3.68 1.45 2.57 0.56 1 1.49 3
MD 2.46 -2.66 2.20 -1.21 -2 4.88 3.64 2.30 1.58 2 5.88 0
MN -0.40 -0.88 2.04 -0.43 -1 2.44 -2.29 2.15 -1.06 -2 2.03 -3
MS 0.55 1.49 1.97 0.76 1 -0.59 0.03 2.17 0.01 1 0.30 2
NM 0.78 -7.09 2.42 -2.93 -2 4.95 -7.31 2.33 -3.13 -2 6.21 -4
NY 0.09 0.46 2.21 0.21 1 6.24 2.29 3.21 0.71 1 4.64 2
NC 1.60 11.92 1.94 6.14 2 9.48 14.18 2.07 6.86 2 10.59 4
OH 1.15 4.21 2.17 1.94 2 3.91 7.00 2.47 2.83 2 2.61 4
SC 0.90 0.27 2.16 0.12 1 2.65 -1.96 2.12 -0.93 -1 0.94 0
TN 0.32 1.23 2.37 0.52 1 -0.10 -2.94 2.55 -1.15 -2 -0.33 -1
TX -0.66 7.09 2.12 3.34 2 7.86 2.71 2.34 1.16 2 2.99 4
VA 0.55 1.98 2.21 0.89 1 5.55 1.27 2.29 0.55 1 4.69 2
Lo-stakes states
AZ -0.40 -4.14 2.17 -1.91 -2 6.92 -2.20 2.36 -0.93 -1 3.37 -3
AR -0.27 -0.80 1.91 -0.42 -1 1.15 -2.49 2.21 -1.13 -2 1.94 -3
CA 0.09 -2.49 2.72 -0.91 -1 -3.24 -6.27 2.92 -2.14 -2 0.42 -3
CT 1.29 -0.21 2.05 -0.10 -1 3.37 0.62 2.19 0.28 1 3.60 0
HI 0.32 -5.86 2.14 -2.73 -2 4.42 -2.18 2.04 -1.07 -2 2.42 -4
ID -0.27 -2.32 1.99 -1.17 -2 2.46 -4.72 1.97 -2.39 -2 1.59 -4
KY 1.97 -1.71 1.99 -0.86 -1 4.95 1.77 2.20 0.81 1 4.91 0
ME 1.29 -8.73 1.85 -4.72 -2 4.46 -2.54 2.00 -1.27 -2 4.13 -4
MA 0.32 0.71 2.05 0.34 1 3.45 2.79 2.07 1.35 2 3.99 3
MI 0.43 3.35 2.56 1.31 2 3.11 3.55 2.47 1.44 2 0.48 4
MO 1.02 -1.32 2.10 -0.63 -1 5.34 -5.10 2.28 -2.24 -2 4.11 -3
NE -1.61 -7.04 2.46 -2.87 -2 3.45 -4.58 2.02 -2.26 -2 -0.57 -4
ND -0.03 -5.42 1.71 -3.18 -2 3.98 -7.69 2.02 -3.81 -2 1.40 -4
OK 0.43 -2.93 2.03 -1.45 -2 3.18 -4.03 2.27 -1.78 -2 2.31 -4
RI 0.09 1.53 2.32 0.66 1 5.89 -0.03 1.84 -0.01 -1 6.67 0
UT 1.15 -4.40 2.00 -2.20 -2 2.63 -6.45 1.87 -3.45 -2 1.45 -4
WV 0.90 1.92 2.03 0.95 1 5.65 4.15 1.91 2.17 2 5.29 3
WY -0.95 -3.78 2.03 -1.86 -2 2.55 -5.94 1.93 -3.07 -2 0.01 -4

Table 1 displays the relevant quantities. (Note: The policy score will be defined presently.) We observe that for high-stakes states, d4 ranges from –7.09 to 11.92, with a median of 1.49 and a mean of 1.88; d8 ranges from –7.31 to 14.18, with a median of 1.45 and a mean of 1.69. For low-stakes states, d4 ranges from –8.73 to 3.35, with a median of –2.41 and a mean of –2.42; d8 ranges from –7.69 to 4.15, with a median of about –2.5 and a mean of –2.30. Thus, we see that the typical State Gain for high-stakes states is substantially larger than the typical State Gain for low-stakes states in both grades 4 and 8.

At grade 4, the difference in means between the high-stakes and low-stakes states is 4.3 score points and at grade 8 it is 3.99 score points. Note that in computing the difference in means, the gain of the nation over the period 1992 to 2000 is eliminated. Consequently, such differences provide a direct comparison between the typical gains for high-stakes states and low-stakes states. Some might prefer such comparisons because the results for the nation are influenced by all the states we are considering, as well as the states that did not participate in both NAEP administrations. However, we have chosen to follow the approach of Amrein and Berliner (2002a) in order to facilitate comparisons between our results and theirs.
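These group summaries can be checked directly from the d4 and d8 columns of Table 1. The sketch below is one way to do so; the list names are illustrative, and the values are transcribed from the table, so small rounding differences from the unrounded data are possible.

```python
from statistics import mean, median

# State Gains transcribed from Table 1 (high-stakes: AL ... VA; low-stakes: AZ ... WY).
HI_D4 = [1.96, -3.69, 5.73, 6.17, -2.66, -0.88, 1.49, -7.09,
         0.46, 11.92, 4.21, 0.27, 1.23, 7.09, 1.98]
HI_D8 = [2.42, -0.57, 5.41, 1.45, 3.64, -2.29, 0.03, -7.31,
         2.29, 14.18, 7.00, -1.96, -2.94, 2.71, 1.27]
LO_D4 = [-4.14, -0.80, -2.49, -0.21, -5.86, -2.32, -1.71, -8.73, 0.71,
         3.35, -1.32, -7.04, -5.42, -2.93, 1.53, -4.40, 1.92, -3.78]
LO_D8 = [-2.20, -2.49, -6.27, 0.62, -2.18, -4.72, 1.77, -2.54, 2.79,
         3.55, -5.10, -4.58, -7.69, -4.03, -0.03, -6.45, 4.15, -5.94]

for label, hi, lo in (("Grade 4", HI_D4, LO_D4), ("Grade 8", HI_D8, LO_D8)):
    # The national change cancels when one group mean is subtracted from the other.
    diff = mean(hi) - mean(lo)
    print(f"{label}: hi mean {mean(hi):.2f}, lo mean {mean(lo):.2f}, "
          f"difference {diff:.2f}, hi median {median(hi):.2f}")
```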

While there certainly is interest in the State Gains (ds) themselves, we believe there is also value in comparing states in terms of the Vs, which are essentially discretized effect sizes. Specifically, Vk (k = 4 or 8) gives a state 2 “credits” if dk exceeds one standard error in magnitude and 1 credit otherwise, with the credits carrying the sign of dk. The usual criterion for statistical significance (which is not particularly appropriate in this setting) would require exceeding two standard errors. There is nonetheless practical interest in identifying states whose relative gain is at least greater than one standard error, given the magnitude of the standard errors, the level of dispersion in the dks among the states, and the fact that the national gain (although statistically independent of the state gains) is influenced by the educational policies of the various states.

A state that presents what one might term a strongly consistent picture of relative improvement over the nation (i.e., D4 > 1 and D8 > 1) is awarded 4 credits. One that presents a moderately consistent picture (e.g., D4  > 1 and 1 > D8 >0) is awarded 3 credits and one that presents a mildly consistent picture (i.e., 1 > D4 > 0 and 1 > D8 > 0) is awarded 2 credits. Note that this coding scheme limits the influence of outliers and allows us to distinguish most configurations of D4 and D8.
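The coding scheme can be written out as a short Python sketch. The function names are hypothetical, and the treatment of boundary cases (a ratio of exactly 0 or exactly ±1) is an assumption, since the text does not specify it. The example rows use values from Table 1.

```python
def credits(d, se):
    """Credits for one grade: -2, -1, 1, or 2 (0 only if d is exactly 0)."""
    ratio = d / se                          # D_k = d_k / s.e.(d_k)
    magnitude = 2 if abs(ratio) > 1 else 1  # assumption: exactly +/-1 earns 1 credit
    sign = (ratio > 0) - (ratio < 0)        # +1, -1, or 0
    return sign * magnitude


def total_credits(d4, se4, d8, se8):
    """V = V4 + V8."""
    return credits(d4, se4) + credits(d8, se8)


# (d4, s.e.(d4), d8, s.e.(d8)) taken from Table 1.
examples = {
    "LA": (6.17, 2.38, 1.45, 2.57),
    "MD": (-2.66, 2.20, 3.64, 2.30),
    "NM": (-7.09, 2.42, -7.31, 2.33),
}
totals = {state: total_credits(*row) for state, row in examples.items()}
print(totals)  # {'LA': 3, 'MD': 0, 'NM': -4}
```

Maryland illustrates why V = 0 does not imply small gains: a loss of more than one standard error in grade 4 is offset by a gain of more than one standard error in grade 8.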

The distributions of V for the high-stakes testing states and the remaining states are presented in Table 2. For the first group, we have values of V for 15 out of the 18 states, while for the second group we have values of V for 18 out of 32 states. There is a striking difference between the two groups of states: High-stakes states are more likely to show strongly consistent improvement relative to the nation (V = 4) than low-stakes states (4/15 vs. 1/18) and much less likely to show strongly consistent lack of improvement (V = –4) relative to the nation (1/15 vs. 8/18). The story remains qualitatively the same if we compare the groups with less stringent cut-offs.

Table 2
Distribution of V for Hi-stakes and Lo-stakes States

# Lo-stakes states (Total = 18)    V    # Hi-stakes states (Total = 15)
1     4    4
2     3    1
0     2    4
0     1    0
3     0    2
0    -1    1
0    -2    0
4    -3    2
8    -4    1
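The counts in Table 2 follow from the V column of Table 1. A minimal tabulation in Python, with illustrative variable names, is:

```python
from collections import Counter

# V values transcribed from the last column of Table 1.
HI_V = {"AL": 2, "GA": -3, "IN": 4, "LA": 3, "MD": 0, "MN": -3, "MS": 2, "NM": -4,
        "NY": 2, "NC": 4, "OH": 4, "SC": 0, "TN": -1, "TX": 4, "VA": 2}
LO_V = {"AZ": -3, "AR": -3, "CA": -3, "CT": 0, "HI": -4, "ID": -4, "KY": 0,
        "ME": -4, "MA": 3, "MI": 4, "MO": -3, "NE": -4, "ND": -4, "OK": -4,
        "RI": 0, "UT": -4, "WV": 3, "WY": -4}

hi_counts = Counter(HI_V.values())
lo_counts = Counter(LO_V.values())

# Same column order as Table 2: low-stakes count, V, high-stakes count.
for v in range(4, -5, -1):
    print(f"{lo_counts[v]:>3} {v:>3} {hi_counts[v]:>3}")
```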

In summary, high-stakes testing states that participated in the NAEP mathematics assessment in both 1992 and 2000 typically showed improvement relative to the nation while low-stakes testing states that participated in the NAEP mathematics assessment in both 1992 and 2000 typically showed lack of improvement relative to the nation. (We must be careful to condition on participation in NAEP since a large number of low-stakes states did not participate in NAEP in one or both of the years under study.) The question is how to interpret the comparison.

With respect to the results for the high-stakes states, Amrein and Berliner (2002a) discount the finding, in part, because of the empirical association between State Gain and the change in percent excluded. This is a reasonable argument but one that deserves further scrutiny for at least two reasons. First, the observed correlation may be unduly influenced by an outlying observation and, second, there are other, observable and unobservable characteristics of states that may also account for some of the differences among states. (It also should be noted that the 1992 exclusion rates are not strictly comparable to those in 1996 and 2000. The former were calculated as an average over mathematics and reading, while the latter two are reported for mathematics only.)

Figure 1a. Grade 4: d4 vs. Change in % Excluded (1992 to 2000).

Figure 1b. Grade 8: d8 vs. Change in % Excluded (1992 to 2000).

Figure 1a displays a plot of d4 against c%ex and Figure 1b displays a plot of d8 against c%ex. North Carolina is a clear outlier on both plots, while Texas is an outlier in Figure 1a. Table 3 presents the correlations for the two groups of states, including the case for the high-stakes states with North Carolina removed. In the fourth grade, we see that the correlation for the high-stakes states is indeed substantial, but markedly reduced when North Carolina is deleted. For the eighth grade, the reduction is even more dramatic. On the other hand, for low-stakes states in the eighth grade, the correlation is quite high. One might, therefore, plausibly argue that the results for the low-stakes states would be further depressed (relative to those for the high-stakes states) if their apparent relationship to c%ex were somehow taken into account.

Table 3
Correlations Between State Gains and Change in % Excluded for Years 1992 to 2000

  State gains
  Grade 4 (d4) Grade 8 (d8)
Hi-stakes (# = 15) 0.44 0.49
Hi-stakes w/o NC (# = 14) 0.17 -0.01
Lo-stakes (# = 18) 0.02 0.55
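The sensitivity of the grade 4 correlation to North Carolina can be illustrated with the rounded values in Table 1. The sketch below computes an ordinary Pearson correlation by hand; the function and variable names are illustrative, and results from rounded table entries may differ slightly in the second decimal from those in Table 3.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# (state, change in % excluded for grade 4, d4) for the high-stakes states, from Table 1.
HI_GRADE4 = [
    ("AL", 1.27, 1.96), ("GA", 1.28, -3.69), ("IN", 3.46, 5.73),
    ("LA", 3.68, 6.17), ("MD", 4.88, -2.66), ("MN", 2.44, -0.88),
    ("MS", -0.59, 1.49), ("NM", 4.95, -7.09), ("NY", 6.24, 0.46),
    ("NC", 9.48, 11.92), ("OH", 3.91, 4.21), ("SC", 2.65, 0.27),
    ("TN", -0.10, 1.23), ("TX", 7.86, 7.09), ("VA", 5.55, 1.98),
]

r_all = pearson_r([c for _, c, _ in HI_GRADE4], [d for _, _, d in HI_GRADE4])

without_nc = [row for row in HI_GRADE4 if row[0] != "NC"]
r_without_nc = pearson_r([c for _, c, _ in without_nc], [d for _, _, d in without_nc])

print(f"all 15 states: r = {r_all:.2f}; without NC: r = {r_without_nc:.2f}")
```

Dropping a single state removes most of the association, which is the sense in which North Carolina is influential here.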

It is often the case that gain scores are negatively correlated with the base year score. Accordingly, in Figures 2a and 2b we plot d4 against the grade 4 state score (’92) and d8 against the grade 8 state score (’92). In both cases, we observe the expected negative correlations. Plots of d4 and d8 against their standard errors were not informative and are not presented.

Figure 2a. Grade 4: d4 vs. State Scale Score (1992).

Figure 2b. Grade 8: d8 vs. State Scale Score (1992).

As was noted in the Introduction, changes in state NAEP scores over time can be the result of many factors in addition to percent of students excluded and testing policies. In particular, other educational interventions that have been adopted by the state with the intention of raising the academic achievement of its students may well have their intended effect, at least to some degree. It would be helpful, therefore, to have a broader measure of each state’s educational policy efforts to incorporate into an explanatory framework. Fortunately, one such measure has been formulated and quantified as part of a study of the influence of standards-based reform on changes in classroom practice (Swanson & Stevenson, 2002).

Drawing on studies conducted by the Council of Chief State School Officers, Swanson and Stevenson graded each state on each of 22 policy activities organized into four categories: content standards, performance standards, aligned assessments and professional standards. Grades were assigned on a three point scale: does not have such a policy (0), is developing one (1), or has enacted such a policy as of 1996 (2). They then carried out a Rasch analysis using this 50 x 22 data array, yielding a “state (policy) activism score” for each state. They report a low level of item misfit. For more details, consult their article.

In view of the comprehensiveness of the policy information employed and that 1996 falls in the middle of the period of interest, we propose to use the policy activism scale as another possible explanatory variable in our effort to account for differences among states in State Gains. The policy activism scores are located in the second column of Table 1. Figures 3a and 3b display plots of d4 and d8 against activism scores.

Figure 3a. Grade 4: d4 vs. Policy Score.

Figure 3b. Grade 8: d8 vs. Policy Score.

Since the mean policy score over the 50 states is 0.28, we note that the 33 states that we are examining tend to have scores above the mean. The median for the high-stakes states is 0.66 and the median for the low-stakes states is 0.32. Correlations between State Gain and policy scores for the two groups of states are presented in Table 4. We observe the relationship is moderately strong and positive in grade 8, but rather mixed in grade 4. Again, North Carolina exerts considerable leverage on the results for the high-stakes states.

Table 4
Correlations Between State Gains for Years 1992 to 2000 and Policy Score

  State gains
  Grade 4 (d4) Grade 8 (d8)
Hi-stakes (# = 15) -0.07 0.37
Hi-stakes w/o NC (# = 14) -0.29 0.26
Lo-stakes (# = 18) 0.22 0.38

Before proceeding to the next stage of the comparison between the high-stakes and low-stakes states, it might be of interest to compare the V distributions of high activism and low activism states, defined by whether they are above or below the mean policy score of 0.28, respectively. The results are presented in Table 5, which is analogous to Table 2. This comparison involves 21 out of 27 high activism states and 12 out of 23 low activism states. While the comparison favors the high activism states, it is less clear-cut than the one in Table 2. Note that the V values of the high activism states fall about equally above and below zero. On the other hand, the V values of the low activism states are more likely to be negative. Thus, somewhat surprisingly, the categorization employed in Amrein and Berliner (2002a) seems to provide a sharper contrast than the categorization based on the broader policy analysis employed by Swanson and Stevenson (2002).

Table 5
Distribution of V for Hi-policy Score and Lo-policy Score States

# Lo-policy score states (Total = 12)    V    # Hi-policy score states (Total = 21)
1     4    4
1     3    2
1     2    3
0     1    0
1     0    4
0    -1    1
0    -2    0
4    -3    2
4    -4    5

Returning to the main thread of our reanalysis, we carry out a multiple regression of d4 on three explanatory variables: state score (’92), c%ex and activism score, and an analogous regression for d8. In both regressions, we leave out North Carolina and Texas because they are outliers in one or both panels of Figure 1. The essential elements of the regression output are presented in Tables 6a and 6b.

Table 6a
Grade 4: Regression of d4 on Policy Score, 1992 State Scale Score, Change in % Excluded for Years 1992 to 2000

ANOVA

  df SS MS F   p-value
Regression 3 67.1 22.4 1.7 0.2
Residual 27 350.3 13.0    
Total 30 417.4      

  Coefficients Standard error t stat p-value  
Intercept 40.6 20.9 1.9 0.1  
Policy score 0.5 0.8 0.6 0.5  
Score -0.2 0.1 -2.0 0.1  
Change in % excluded 0.1 0.3 0.3 0.8  

 
Summary Output          
Regression statistics          
Multiple R 0.40        
R square 0.16        
Adjusted R square 0.07        
Standard error 3.60        
Observations 31.00        

Table 6b
Grade 8: Regression of d8 on Policy Score, 1992 State Scale Score, Change in % Excluded for Years 1992 to 2000

ANOVA
  df SS MS F p-value
Regression 3 113.3 37.8 2.9 0.1
Residual 27 353.8 13.1    
Total 30 467.2      
           

  Coefficients Standard error t stat p-value
Intercept 21.3 20.2 1.1 0.3
Policy score 1.5 0.9 1.7 0.1
Score -0.1 0.1 -1.2 0.2
Change in % excluded 0.3 0.3 0.9 0.4

Summary Output          
Regression statistics          
Multiple R 0.49        
R square 0.24        
Adjusted R square 0.16        
Standard error 3.62        
Observations 31.00        

For Grade 4, R² = 0.16 (adjusted R² = 0.07), so clearly the three explanatory variables do not account for much of the between-state variation; only state score (’92) is marginally significant. Overall, residual plots against each of the explanatory variables do not reveal any patterns. However, the residuals for the 13 high-stakes states (i.e., not including Texas and North Carolina) tend to be more positive than the residuals for the 18 low-stakes states. This is to be expected given the results in Tables 1, 2 and 6.
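The summary statistics in Tables 6a and 6b are tied to the ANOVA entries by standard identities, and a short Python sketch makes the arithmetic explicit. The function name is hypothetical, and the inputs are the rounded sums of squares from the tables, so agreement is only to the reported precision.

```python
import math


def regression_summary(ss_regression, ss_residual, n_obs, n_predictors):
    """Recover R-squared, adjusted R-squared, F, and residual s.e. from ANOVA sums of squares."""
    ss_total = ss_regression + ss_residual
    df_residual = n_obs - n_predictors - 1
    r2 = ss_regression / ss_total
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_residual
    ms_regression = ss_regression / n_predictors
    ms_residual = ss_residual / df_residual
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "f": ms_regression / ms_residual,
        "std_error": math.sqrt(ms_residual),
    }


# Sums of squares from Table 6a (grade 4) and Table 6b (grade 8); 31 states, 3 predictors.
grade4 = regression_summary(67.1, 350.3, n_obs=31, n_predictors=3)
grade8 = regression_summary(113.3, 353.8, n_obs=31, n_predictors=3)

for label, fit in (("Grade 4", grade4), ("Grade 8", grade8)):
    print(label, {name: round(value, 2) for name, value in fit.items()})
```

The gap between R² and adjusted R² reflects the small number of states relative to the number of explanatory variables.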

Figure 4a presents the residual plot against c%ex. The residuals for Texas and North Carolina were obtained by substituting their values for the three explanatory variables into the regression equation presented in Table 6a (which was estimated using the other 31 states). We note that Texas and North Carolina are outliers in the sense that they have both the largest values on c%ex and the largest positive residuals. On the other hand, for the other states there appears to be no association (linear or otherwise) between c%ex and state gain.

Figure 4a. Grade 4: Plot of residuals vs. Change in % Excluded (1992 to 2000). Residuals obtained from a regression of d4 on state score (’92), c%ex and policy score.

Figure 4b. Grade 8: Plot of residuals vs. Change in % Excluded (1992 to 2000). Residuals obtained from a regression of d8 on state score (’92), c%ex and policy score.

Turning to Grade 8 (Table 6b), we note that R² = 0.24 (adjusted R² = 0.16) and that the only explanatory variable that approaches significance is policy score. Overall, the residual plots again reveal no interesting patterns, except that high-stakes states tend to have more positive residuals than do low-stakes states. Figure 4b presents the residual plot against c%ex, with the residuals for Texas and North Carolina added. North Carolina remains an outlier, but Texas does not. For the other states, there does not appear to be an association between c%ex and state gain.

In view of the above analysis, it is not appropriate to discount the differences in results between the high-stakes and low-stakes states (e.g. Table 2) by arguing they are strongly influenced by differences in changes in percent of students excluded over the period 1992 to 2000. That argument is simply not supported by the data.

One might want to distinguish the results for North Carolina from those of the other states, arguing that the unusually large value of c%ex “explains” the unusually large value of State Gain. If that were the case, then school officials in North Carolina would have been much more adept than officials in other states in excluding SD/LEP students who would have done poorly on NAEP. In particular, school officials in New Mexico, which also experienced a large increase in percent of students excluded (particularly in Grade 8) but large negative State Gains, would have much to learn from their counterparts in North Carolina! A more circumspect statement about North Carolina is that its State Gain may well be a consequence of both its reform policies and the increase in excluded students, but that with the data available we can determine neither the relative contributions of these two factors nor those of other factors.

Cohort Analyses

Amrein and Berliner (2002a) correctly point out that a weakness of the repeated cross-sectional studies described above is that real changes over time in student test performance are confounded with changes in the characteristics of successive cohorts that are unrelated to school effects but associated with performance. For example, in a particular state, grade 4 students in 2000 might be more disadvantaged than were grade 4 students in 1992 and, therefore, perform more poorly on NAEP even if the productivity of the state’s schools remained unchanged.

The structure of the NAEP system makes possible another way of looking at a state’s performance. Since NAEP tested students in mathematics in both grades 4 and 8 every four years, we can determine the gains of the cohort tested in grade 4 in 1992 and again in grade 8 in 1996, as well as the gains of the cohort tested in grade 4 in 1996 and again in grade 8 in 2000. Although the actual students tested four years apart are not the same students (i.e., this is not a true longitudinal study like High School and Beyond), each group is a probability sample of their respective cohorts. Thus, the observed gain is an approximately unbiased estimate of the population gain over the period in question. The word “approximately” is appropriate since there are inflows and outflows over the four years, as well as differential rates of exclusions and non-response at school and student levels. Nonetheless, the results should be sufficiently accurate for our purposes.

Others have also studied cohort gains and obtained results that cast a different light on between-state comparisons. Examining data for 1992 and 1996, Barton and Coley (1998) concluded that “Most of the states are not significantly different from each other in terms of cohort growth from the fourth to the eighth grade.” They point out, for example, that Maine ranks near the top for grade 4 in 1992 and for grade 8 in 1996, while Arkansas ranks near the bottom in both years. Nevertheless, both cohorts gained 52 points over the four-year period.

We now carry out an analysis that parallels the one described in the previous section. The data extracted from the NCES Web site comprise grade 4 NAEP mathematics results for 1992 and 1996 and grade 8 NAEP mathematics results for 1996 and 2000, for the states and the nation (public schools only). For each jurisdiction, for the indicated grade and year, we recorded the average score, the corresponding estimated standard error, and the percent of students excluded. The data are displayed in Table A2 of the appendix.

For each state, we compute the State Cohort Gain (1992 to 1996) as

g1 = [state(grade 8, 1996) – state(grade 4, 1992)] – [national(grade 8, 1996) – national(grade 4, 1992)]

where the quantities on the right hand side of the equation represent the average results for the indicated grade and year. Further, for each state let

s.e. (g1) = (estimated) standard error of g1.

As before, since the four quantities contributing to g1 are derived from independent samples, s.e. (g1) is the square root of the sum of their (estimated) variances. We also computed the changes from 1992 to 1996 in the percent of excluded students in the cohort. Now let

G1 = g1 / s.e. (g1)

and

There is a set of analogous definitions for g2, G2, and W2 based on the cohort gains from grade 4 in 1996 to grade 8 in 2000. Finally, we let

W = W1 + W2
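The definitions of g1, s.e. (g1) and G1 given above can be sketched in a few lines. The function names and the input scores and standard errors are hypothetical placeholders, not values from Table A2; only the final check uses published figures (the Alabama row of Table 7).

```python
import math


def cohort_gain(state_g8, state_g4, nat_g8, nat_g4):
    """State cohort gain net of the national cohort gain."""
    return (state_g8 - state_g4) - (nat_g8 - nat_g4)


def gain_se(*standard_errors):
    """Standard error of a sum or difference of independent estimates:
    the square root of the sum of their variances."""
    return math.sqrt(sum(se ** 2 for se in standard_errors))


def standardized_gain(g, se):
    """G = g / s.e.(g)."""
    return g / se


# Hypothetical average scores and standard errors (NOT from Table A2)
g1 = cohort_gain(state_g8=270.0, state_g4=215.0, nat_g8=271.0, nat_g4=219.0)
se1 = gain_se(1.5, 1.2, 1.0, 0.8)
G1 = standardized_gain(g1, se1)

# Consistency check against Table 7, Alabama, 1992 to 1996 cohort:
# g1 = -3.66 and s.e.(g1) = 3.03 should give the tabled G1 of -1.21.
G1_alabama = standardized_gain(-3.66, 3.03)
```

Subtracting the national cohort gain means that g1 is positive only when a state's cohort gained more than the nation's did over the same four years.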

Table 7 displays the relevant quantities. For high-stakes states, g1 ranges from –5.06 to 3.63, with a median of –1.98 and a mean of –1.18. For low-stakes states, g1 ranges from –3.86 to 5.51, with a median of 0.75 and a mean of 0.68. Turning to the second cohort, for high-stakes states, g2 ranges from –7.81 to 3.73, with a median of –1.20 and a mean of –1.08. For low-stakes states, g2 ranges from –6.56 to 7.00, with a median of 0.12 and a mean of 0.06.

Thus, the difference in means between high-stakes and low-stakes states is –1.18 – 0.68 = –1.86 for the earlier cohort and –1.08 – 0.06 = –1.14 for the later cohort. As before, the growth of the nation over the relevant four-year period is eliminated when we consider these differences in means. Interestingly, the results for low-stakes states are now somewhat better than those for high-stakes states, a reversal of what we found when we looked at change over time in a particular grade.
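The group summaries for the earlier cohort can be recomputed directly from the g1 column of Table 7. The sketch below transcribes those values (states with no g1 entry are omitted) and uses Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# g1 values transcribed from Table 7 (states marked *** are left out)
high_g1 = [-3.66, -1.98, -5.06, 2.56, -3.69, 0.43, 3.63, -3.54,
           -3.26, -0.14, 3.02, -3.65, 0.24, 0.35, -2.94]
low_g1 = [0.69, -0.48, 2.44, 2.66, 0.86, -3.10, -3.86, 2.17, -0.39, 0.49,
          -0.96, 5.06, -0.86, 5.51, 3.63, 1.50, 0.80, -2.33, 2.23, -2.53]

high_mean, low_mean = mean(high_g1), mean(low_g1)
high_median, low_median = median(high_g1), median(low_g1)

# High-stakes minus low-stakes; the national cohort gain cancels out
# because it is subtracted from every state's g1.
diff_in_means = high_mean - low_mean
```

Run as written, the means round to –1.18 and 0.68 and the medians to –1.98 and 0.75, so the difference in means for the earlier cohort is about –1.86.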

Note also that W1 and W2 are based on independent samples, so that W (when it is defined) is a reasonable choice as a summary measure of the state’s relative performance over the period 1992 to 2000. On the other hand, there is value in studying W1 and W2 separately, to see if there are any trends over time and to examine patterns of association with c%ex and policy score.
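A small sketch of how W1, W2 and W relate to the standardized gains follows. The cut points at –1, 0 and 1 are an assumption inferred from the entries in Table 7, not taken from a stated formula, and the treatment of a value of exactly 0 is a guess; W is taken to be the sum W1 + W2, which matches every complete row of the table. The function names are hypothetical.

```python
def code_gain(G):
    """Four-level code for a standardized gain.

    Assumed cut points (inferred from Table 7): G > 1 -> 2, 0 < G <= 1 -> 1,
    -1 <= G <= 0 -> -1, G < -1 -> -2. None stands for a missing entry (***).
    """
    if G is None:
        return None
    if G > 1:
        return 2
    if G > 0:
        return 1
    if G >= -1:
        return -1
    return -2


def summary_w(G1, G2):
    """W = W1 + W2, defined only when both cohort codes are available."""
    W1, W2 = code_gain(G1), code_gain(G2)
    if W1 is None or W2 is None:
        return None
    return W1 + W2


# (G1, G2) pairs transcribed from Table 7
w_alabama = summary_w(-1.21, -0.62)        # tabled W is -3
w_north_carolina = summary_w(1.31, 1.77)   # tabled W is 4
w_new_york = summary_w(-0.06, 0.56)        # tabled W is 0
w_florida = summary_w(-0.71, None)         # second cohort missing
```

Keeping W1 and W2 available separately, rather than only their sum, makes it possible to see cases such as New York, where the two cohorts point in opposite directions and cancel in W.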

Table 7
Basic Results for Cohort Analysis of NAEP Mathematics Scores

Columns, in order: State; Policy score; Cohort 1992 to 1996: g1, s.e. (g1), G1, W1, changes in % excluded; Cohort 1996 to 2000: g2, s.e. (g2), G2, W2, changes in % excluded; W

Hi-stakes states
AL 2.20 -3.66 3.03 -1.21 -2 2.64 -1.56 2.54 -0.62 -1 -1.34 -3
FL -0.27 -1.98 2.78 -0.71 -1 1.56 *** *** *** *** *** ***
GA 0.66 -5.06 2.52 -2.01 -2 1.76 -1.20 2.36 -0.51 -1 -0.03 -3
IN 0.90 2.56 2.29 1.12 2 2.29 1.58 2.23 0.71 1 2.02 3
LA -0.03 -3.69 2.59 -1.42 -2 1.96 -2.11 2.29 -0.92 -1 -1.90 -3
MD 2.46 0.43 2.88 0.15 1 2.62 3.25 2.50 1.30 2 2.88 3
MN -0.40 3.63 2.17 1.67 2 -0.40 3.39 2.24 1.51 2 -0.63 4
MS 0.55 -3.54 2.17 -1.64 -2 1.87 -6.47 2.22 -2.91 -2 1.50 -4
NV 0.32 *** *** *** *** ***  -1.51 2.08 -0.73 -1 1.27 ***
NM 0.78 -3.26 2.38 -1.37 -2 0.45 -6.07 2.80 -2.17 -2 -0.39 -4
NY 0.09 -0.14 2.54 -0.06 -1 2.34 1.56 2.77 0.56 1 5.33 0
NC 1.60 3.02 2.31 1.31 2 0.59 3.73 2.11 1.77 2 6.96 4
SC 0.90 -3.65 2.38 -1.54 -2 0.98 1.09 2.32 0.47 1 1.20 -1
TN 0.32 0.24 2.43 0.10 1 0.46 -7.81 2.58 -3.02 -2 -1.84 -1
TX -0.66 0.35 2.37 0.15 1 1.06 -5.94 2.41 -2.47 -2 -0.75 -1
VA 0.55 -2.94 2.50 -1.18 -2 2.02 1.96 2.42 0.81 1 3.29 -1
Lo-stakes states
AZ -0.40 0.69 2.38 0.29 1 3.59 1.07 2.66 0.40 1 -3.34 2
AR -0.27 -0.48 2.28 -0.21 -1 1.58 -6.56 2.40 -2.73 -2 1.49 -3
CA 0.09 2.44 2.82 0.87 1 -2.21 0.97 3.05 0.32 1 -7.07 2
CO 0.66 2.66 2.06 1.29 2 -0.83 *** *** *** *** *** ***
CT 1.29 0.86 2.15 0.40 1 1.70 -2.20 2.20 -1.00 -1 2.10 0
DE 0.21 -3.10 1.90 -1.63 -2 3.34 *** *** *** *** *** ***
HI 0.32 -3.86 2.18 -1.77 -2 -0.54 -4.27 2.38 -1.79 -2 1.50 -4
IA -1.61 2.17 2.20 0.99 1 1.97 *** *** *** *** *** ***
KY 1.97 -0.39 2.06 -0.19 -1 1.40 -0.50 2.21 -0.23 -1 3.75 -2
ME 1.29 0.49 2.18 0.23 1 -0.95 -0.64 2.06 -0.31 -1 1.04 0
MA 0.32 -0.96 2.55 -0.38 -1 1.03 2.08 2.27 0.92 1 3.11 0
MI 0.43 5.06 2.87 1.76 2 -0.14 0.12 2.44 0.05 1 0.37 3
MO 1.02 -0.86 2.33 -0.37 -1 2.73 -3.23 2.24 -1.44 -2 3.64 -3
MT -1.26 *** ***  *** *** ***  7.00 2.18 3.20 2 0.67 ***
NE -1.61 5.51 2.16 2.55 2 0.20 1.00 2.10 0.48 1 -1.52 3
ND -0.03 3.63 1.88 1.93 2 1.58 0.10 2.10 0.05 1 0.24 3
OR 0.66 *** ***  *** *** ***  5.09 2.51 2.03 2 -2.58 ***
RI 0.09 1.50 2.30 0.65 1 1.37 0.94 2.22 0.42 1 5.61 2
UT 1.15 0.80 2.02 0.40 1 2.03 -3.15 2.10 -1.50 -2 0.03 -1
VT -0.27 *** ***  *** *** ***  6.45 2.11 3.06 2 3.46 ***
WV 0.90 -2.33 2.06 -1.13 -2 4.07 -4.64 1.95 -2.39 -2 2.74 -4
WI -0.40 2.23 2.37 0.94 1 1.99 *** *** *** *** *** ***
WY -0.95 -2.53 1.95 -1.30 -2 -1.73 1.42