Reconsidering the Impact of High-stakes Testing
Henry Braun
Educational Testing Service
Citation: Braun, H. (2004, January 5). Reconsidering the impact of high-stakes testing. Education Policy Analysis Archives, 12(1). Retrieved [Date] from http://epaa.asu.edu/epaa/v12n1/.
Abstract
Over the last fifteen years, many states have implemented
high-stakes tests as part of an effort to strengthen
accountability for schools, teachers, and students. Predictably,
there has been vigorous disagreement regarding the contributions
of such policies to increasing test scores and, more importantly,
to improving student learning. A recent study by Amrein and
Berliner (2002a) has received a great deal of media attention.
Employing various databases covering the period 1990-2000, the
authors conclude that there is no evidence that states that
implemented high-stakes tests demonstrated improved student
achievement on various external measures such as performance on
the SAT, ACT, AP, or NAEP. In a
subsequent study in which they conducted a more extensive analysis
of state policies (Amrein & Berliner, 2002b), they reach a
similar conclusion. However, both their methodology and their
findings have been challenged by a number of authors. In this
article, we undertake an extended reanalysis of one component of
Amrein and Berliner (2002a). We focus on the performance of
states, over the period 1992 to 2000, on the NAEP mathematics
assessments for grades 4 and 8. In particular, we compare the
performance of the high-stakes testing states, as designated by
Amrein and Berliner, with the performance of the remaining states
(conditioning, of course, on a state’s participation in the
relevant NAEP assessments). For each grade, when we examine the
relative gains of states over the period, we find that the
comparisons strongly favor the high-stakes testing states.
Moreover, the results cannot be accounted for by differences
between the two groups of states with respect to changes in
percent of students excluded from NAEP over the same period. On
the other hand, when we follow a particular cohort (grade 4, 1992
to grade 8, 1996 or grade 4, 1996 to grade 8, 2000), we find the
comparisons slightly favor the low-stakes testing states, although
the discrepancy can be partially accounted for by changes in the
sets of states contributing to each comparison. In addition, we
conduct a number of ancillary analyses to establish the robustness
of our results, while acknowledging the tentative nature of any
conclusions drawn from highly aggregated, observational data.
Introduction
Since its passage in January 2002, the No Child Left Behind
(NCLB) Act has already had a substantial influence on state and
local education agencies as they develop accountability plans to
win the approval of the U.S. Department of Education. In addition
to the operational concerns of these agencies, as well as those of
principals and teachers, there is considerable debate about the
efficacy of externally mandated high-stakes testing in improving
learning (Elmore, 2002; Lewis, 2002; Steinberg, 2003; Wolf, 2003).
Indeed, most of the education and educational measurement
community is doubtful that high-stakes testing will have a
generally salutary effect on the quality of student learning
(Linn, 2000; Mehrens, 1998), although there are contrasting views
(Grissmer, Flanagan, Kawata, & Williamson, 2000). Given the
level of disagreement, it is natural for both supporters and
opponents to look to extant data to buttress their positions.
Inasmuch as a number of states have instituted high-stakes testing
policies of various kinds over the last decade or more, there is a
record of results that, presumably, can yield some insights into
the likely impact of such policies.
A recent, and much cited, example of this approach is the
article by Amrein and Berliner (2002a). Employing some general
criteria, they identify 18 states as having high-stakes testing
policies and examine the achievement of their students on a number
of measures, including the SAT and the ACT, NAEP
results in mathematics and reading, and the Advanced Placement
(AP) program. The rationale is that trends on the
state tests cannot be relied upon as valid indicators of student
learning (Linn, 2000) and, if learning is indeed taking place,
then similar trends should be seen in other, related measures.
Their overall conclusion was that “At the present time,
there is no compelling evidence… that those policies result
in transfer to the broader domains of knowledge and skill for
which high-stakes test scores must be indicators” (p.
54).
The authors are careful to point out some of the problems with each
of the measures as a basis for drawing conclusions about the
impact of the states’ policies. NAEP results, for reasons
discussed below, are perhaps the least objectionable, as well as
being the most relevant to considerations related to the
consequences of NCLB for elementary and middle schools. A close
reading of the article, however, raises a number of methodological
concerns and it is the purpose of this paper to examine those
concerns through a reanalysis of the NAEP mathematics data and to
explore the policy implications of the findings.
Amrein and Berliner also produced a follow-up report (Amrein
& Berliner, 2002b). In that paper, they carried out a more
extensive policy analysis and identified 28 states as high-stakes
states. We defer discussion of this second report, as well as
alternative views (e.g., Carnoy & Loeb, 2003; Raymond &
Hanushek, 2003) to the Discussion section.
There is always a danger that, in carrying out these analyses,
we will forget the very real limitations on the conclusions we can
draw. Accordingly, we enumerate them at the outset. First, we are
working with observational data so that causal inferences are not
warranted. Second, these 18 states (and the other 32) have engaged
in a number of education initiatives in addition to their testing
policies so that ascribing differences in NAEP results (solely or
principally) to the impact of high-stakes testing is problematic.
A similar difficulty arises in trying to explain the results of
some states in terms of (apparent) attempts to “game the
system” by, for example, increasing the proportion of SD/LEP
students who are excluded from participating in NAEP. (SD/LEP
refers to students with disabilities and/or students with limited
English proficiency—with both groups expected to perform
below average.)
Perhaps the most important consideration from the policy
perspective is that for all the differences among state NAEP
scores and state NAEP score changes, there is much greater
variability within states—and probably more to be learned
from trying to understand the sources of such within state
differences. For an example, see Raudenbush, Fotiu, Cheong, and
Ziazi (1995). That said, the current controversy about state level
results demands that we address the question in as balanced a
fashion as possible.
We begin with a review of the methodology of Amrein and Berliner
(2002a) and then carry out a reanalysis of the cross-sectional
data, followed by a reanalysis of the cohort data. Both reanalyses
are repeated using the states’ 25th percentiles
(rather than the states’ means) to ascertain the robustness
of the results. The final section relates the findings to those in
the recent literature, offers some interpretations, as well as
some cautions on drawing policy conclusions from data of this
type.
Reviewing Amrein and Berliner
Amrein and Berliner (2002a) identified 18 states as having
“… the most severe consequences, that is, the highest
stakes associated with their K-12 testing policies” (p. 18).
All the states had regulations making high school graduation
contingent on passing a high school graduation exam. They also had
various combinations of other stakes relating to grade promotion
contingent on examination performance, making public annual school
or district report cards, as well as rewards and sanctions for
schools, teachers and students (see their Table 1, p. 18). One can
certainly argue with their classification that, for example,
includes Kentucky and Massachusetts among the low-stakes states.
In the interests of maintaining comparability, however, we have
retained their classification.
The rationale for analyzing NAEP data is that NAEP is the only
nationally administered achievement test—and one that
students do not explicitly prepare for. Since 1990, states have
had the option to participate in “State NAEP,” with
scores reported on the same scale as National NAEP. Consequently,
states can be compared in terms of their performance on NAEP over
time. If increases in state test scores are valid indicators of
improved skills, then one would expect to see corresponding
increases in NAEP scores.
Of course, there are some weaknesses to this approach. Student
motivation to perform well on NAEP is likely not as great as it is
on a high-stakes exam. (On the other hand, it is not clear why
students in different states would experience differential
reductions in motivation from the state test to NAEP.) States can
have different policies on excluding SD/LEP students from
participation in NAEP and also may differ in the extent to which
the state assessment is aligned with the NAEP framework.
Presumably, students in states with greater alignment might be
expected to do better on NAEP than students in states with lesser
alignment. On the other hand, use of SAT, ACT, or AP scores is
very problematic. Aside from concerns about self-selection, it is
difficult to make the case that the performance standards set by
states would have had much impact on college-bound students.
With respect to the timing of policies, Amrein and Berliner
(2002a) only provide the year at which the high school graduation
requirement became operative in each of the 18 states. They state
explicitly that “The usefulness of the NAEP analyses that
follow rests on the assumption that states’ other K-12
high-stakes testing policies were implemented at or around the
same time as each state’s high school graduation exam”
(p. 36).
Their approach to the analysis of data can best be illustrated
by an example. Using NAEP mathematics results for grade 4, they
compute the change for the nation, and for each state, over the
period 1992 to 2000. They then calculate the differential gain for
each state as:
State Gain = (change for state ‘92 to
‘00) - (change for nation ’92 to ‘00).
A positive State Gain means that over this time period the
state’s improvement on NAEP exceeded that of the nation.
Conversely, a negative value means that the nation’s
improvement exceeded that of the state. It is important to
recognize that in the latter case, the change for the state could
be positive but just not as large as the nation’s.
Using rounded values, Amrein and Berliner (2002a) find (see
their Table 8) that for the eighteen high-stakes states they
selected, there were 8 positive State Gains, 3 negative and 2
zeroes. There were five states where data were declared “not
available.” This is curious, as two states (Indiana and
Minnesota) do have NAEP data available and we include them
in our reanalysis.
Amrein and Berliner (2002a) acknowledge that this result appears to
support the beneficial impact of high-stakes testing. However,
they argue that the association between State Gain and the change
in the percent of students excluded from NAEP over the same time
period (r = 0.39) undercuts the interpretability of the result.
When, further, they combine the analyses for 1992 to 1996 and 1996
to 2000 with the one for 1992 to 2000 (ignoring the dependency
induced by the overlap in time), they reach the conclusion that
“In short, when compared to the nation as a whole,
high-stakes testing policies did not usually lead to improvement
in the performance of students on the grade 4 NAEP math tests
between 1992 and 2000” (p. 40).
In the case of grade 8 NAEP mathematics, they find (see their
Table 9) that, over the period 1990–2000, five states posted
gains, four losses and one remained the same. Note that eight
states are missing, so these results reflect the experiences of
only slightly more than half the states of interest. After
aggregating results over the periods 1990 to 1992, 1992 to 1996
and 1996 to 2000 (again ignoring the overlap) and pointing to the
problem of differential changes in exclusion rates, they conclude
again that “there is no compelling evidence that high-stakes
testing policies have improved the performance of students on the
grade 8 NAEP math tests” (p. 43).
Reanalysis
Our approach to the question differs in a number of ways from
Amrein and Berliner (2002a):
- In addition to carrying out an analysis for the eighteen
high-stakes states that were the focus of Amrein and Berliner
(2002a), we carry out a parallel analysis for the other 32
states.
- We augment our analysis by including a more comprehensive
measure of states’ educational reform efforts.
- Our interpretation of the State Gain statistics is informed by
consideration of the corresponding estimated standard errors.
(Since the State Gain is a “difference of
differences,” these standard errors are not negligible, with
typical values of a little over 2 points on the NAEP scale.)
- In the analysis of the grade 8 data, we look at changes over
the period 1992 to 2000, rather than 1990 to 2000. Our choice
makes the analyses for grades 4 and 8 more comparable, and
provides slightly more data.
The data were obtained from the National Center for Education
Statistics Web site (2003). The data extracted comprise grade 4
and grade 8 NAEP mathematics results in the years 1992 and 2000
for the states and the nation (public schools only). For each
jurisdiction, grade and year, we recorded the average score, the
corresponding estimated standard error, and the percent of
students excluded. The data are displayed in Table A1 of the
appendix. We note that relevant NAEP data are available for 15 of
18 high-stakes states and 18 of 32 of the other or
“low-stakes” states.
For each state and grade, we compute the State Gain and its
estimated standard error. Specifically, let
d4 (state) = [state(’00)
– state(’92)] – [nat’l(’00) –
nat’l(’92)]
where the quantities on the right hand side of the equation
represent the average results for grade 4. Further, for each state
let
s.e.(d4) = (estimated)
standard error of d4.
Since the four quantities contributing to d4
are derived from independent samples, s.e.(d4)
is simply the square root of the sum of the (estimated) variances
of the four quantities. We also compute, for each grade and state,
the changes in the percent of excluded students over the period,
denoted c%ex.
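The computation just described can be sketched in a few lines of Python. This is only an illustration: the function name `state_gain` and all numeric inputs are hypothetical placeholders, not values taken from Table A1.

```python
import math

def state_gain(state_00, state_92, nation_00, nation_92):
    """Each argument is a (mean, standard_error) pair.

    Returns (d, se_d): the State Gain and its estimated standard error.
    """
    d = (state_00[0] - state_92[0]) - (nation_00[0] - nation_92[0])
    # The four means come from independent samples, so their variances add.
    variances = [se ** 2 for _, se in (state_00, state_92, nation_00, nation_92)]
    se_d = math.sqrt(sum(variances))
    return d, se_d

# Hypothetical inputs, for illustration only.
d, se_d = state_gain((225.0, 1.2), (218.0, 1.3), (226.0, 0.9), (221.0, 0.8))
print(f"d = {d:.2f}, s.e.(d) = {se_d:.2f}")
```

Because four sampling variances are summed, the standard error of a State Gain is necessarily larger than that of any single state or national mean.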
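Now let
$$D_4 = \frac{d_4}{\mathrm{s.e.}(d_4)}$$
and
$$V_4 = \begin{cases} 2 & \text{if } D_4 > 1 \\ 1 & \text{if } 0 < D_4 < 1 \\ -1 & \text{if } -1 < D_4 < 0 \\ -2 & \text{if } D_4 < -1, \end{cases}$$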
with a parallel set of definitions for d8,
D8, and V8 for the grade 8
results. Finally, we let
V = V4 + V8.
Table 1 Basic Results for Analysis of NAEP Mathematics Scores: Grades 4 and 8 Trends for 1992 to 2000

| State | Policy score | d4 | s.e.(d4) | D4 | V4 | Change in % excluded Gr. 4 | d8 | s.e.(d8) | D8 | V8 | Change in % excluded Gr. 8 | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hi-stakes states | | | | | | | | | | | | |
| AL | 2.20 | 1.96 | 2.45 | 0.80 | 1 | 1.27 | 2.42 | 2.75 | 0.88 | 1 | -0.51 | 2 |
| GA | 0.66 | -3.69 | 2.05 | -1.80 | -2 | 1.28 | -0.57 | 2.13 | -0.27 | -1 | 2.52 | -3 |
| IN | 0.90 | 5.73 | 1.95 | 2.94 | 2 | 3.46 | 5.41 | 2.24 | 2.41 | 2 | 2.72 | 4 |
| LA | -0.03 | 6.17 | 2.38 | 2.59 | 2 | 3.68 | 1.45 | 2.57 | 0.56 | 1 | 1.49 | 3 |
| MD | 2.46 | -2.66 | 2.20 | -1.21 | -2 | 4.88 | 3.64 | 2.30 | 1.58 | 2 | 5.88 | 0 |
| MN | -0.40 | -0.88 | 2.04 | -0.43 | -1 | 2.44 | -2.29 | 2.15 | -1.06 | -2 | 2.03 | -3 |
| MS | 0.55 | 1.49 | 1.97 | 0.76 | 1 | -0.59 | 0.03 | 2.17 | 0.01 | 1 | 0.30 | 2 |
| NM | 0.78 | -7.09 | 2.42 | -2.93 | -2 | 4.95 | -7.31 | 2.33 | -3.13 | -2 | 6.21 | -4 |
| NY | 0.09 | 0.46 | 2.21 | 0.21 | 1 | 6.24 | 2.29 | 3.21 | 0.71 | 1 | 4.64 | 2 |
| NC | 1.60 | 11.92 | 1.94 | 6.14 | 2 | 9.48 | 14.18 | 2.07 | 6.86 | 2 | 10.59 | 4 |
| OH | 1.15 | 4.21 | 2.17 | 1.94 | 2 | 3.91 | 7.00 | 2.47 | 2.83 | 2 | 2.61 | 4 |
| SC | 0.90 | 0.27 | 2.16 | 0.12 | 1 | 2.65 | -1.96 | 2.12 | -0.93 | -1 | 0.94 | 0 |
| TN | 0.32 | 1.23 | 2.37 | 0.52 | 1 | -0.10 | -2.94 | 2.55 | -1.15 | -2 | -0.33 | -1 |
| TX | -0.66 | 7.09 | 2.12 | 3.34 | 2 | 7.86 | 2.71 | 2.34 | 1.16 | 2 | 2.99 | 4 |
| VA | 0.55 | 1.98 | 2.21 | 0.89 | 1 | 5.55 | 1.27 | 2.29 | 0.55 | 1 | 4.69 | 2 |
| Lo-stakes states | | | | | | | | | | | | |
| AZ | -0.40 | -4.14 | 2.17 | -1.91 | -2 | 6.92 | -2.20 | 2.36 | -0.93 | -1 | 3.37 | -3 |
| AR | -0.27 | -0.80 | 1.91 | -0.42 | -1 | 1.15 | -2.49 | 2.21 | -1.13 | -2 | 1.94 | -3 |
| CA | 0.09 | -2.49 | 2.72 | -0.91 | -1 | -3.24 | -6.27 | 2.92 | -2.14 | -2 | 0.42 | -3 |
| CT | 1.29 | -0.21 | 2.05 | -0.10 | -1 | 3.37 | 0.62 | 2.19 | 0.28 | 1 | 3.60 | 0 |
| HI | 0.32 | -5.86 | 2.14 | -2.73 | -2 | 4.42 | -2.18 | 2.04 | -1.07 | -2 | 2.42 | -4 |
| ID | -0.27 | -2.32 | 1.99 | -1.17 | -2 | 2.46 | -4.72 | 1.97 | -2.39 | -2 | 1.59 | -4 |
| KY | 1.97 | -1.71 | 1.99 | -0.86 | -1 | 4.95 | 1.77 | 2.20 | 0.81 | 1 | 4.91 | 0 |
| ME | 1.29 | -8.73 | 1.85 | -4.72 | -2 | 4.46 | -2.54 | 2.00 | -1.27 | -2 | 4.13 | -4 |
| MA | 0.32 | 0.71 | 2.05 | 0.34 | 1 | 3.45 | 2.79 | 2.07 | 1.35 | 2 | 3.99 | 3 |
| MI | 0.43 | 3.35 | 2.56 | 1.31 | 2 | 3.11 | 3.55 | 2.47 | 1.44 | 2 | 0.48 | 4 |
| MO | 1.02 | -1.32 | 2.10 | -0.63 | -1 | 5.34 | -5.10 | 2.28 | -2.24 | -2 | 4.11 | -3 |
| NE | -1.61 | -7.04 | 2.46 | -2.87 | -2 | 3.45 | -4.58 | 2.02 | -2.26 | -2 | -0.57 | -4 |
| ND | -0.03 | -5.42 | 1.71 | -3.18 | -2 | 3.98 | -7.69 | 2.02 | -3.81 | -2 | 1.40 | -4 |
| OK | 0.43 | -2.93 | 2.03 | -1.45 | -2 | 3.18 | -4.03 | 2.27 | -1.78 | -2 | 2.31 | -4 |
| RI | 0.09 | 1.53 | 2.32 | 0.66 | 1 | 5.89 | -0.03 | 1.84 | -0.01 | -1 | 6.67 | 0 |
| UT | 1.15 | -4.40 | 2.00 | -2.20 | -2 | 2.63 | -6.45 | 1.87 | -3.45 | -2 | 1.45 | -4 |
| WV | 0.90 | 1.92 | 2.03 | 0.95 | 1 | 5.65 | 4.15 | 1.91 | 2.17 | 2 | 5.29 | 3 |
| WY | -0.95 | -3.78 | 2.03 | -1.86 | -2 | 2.55 | -5.94 | 1.93 | -3.07 | -2 | 0.01 | -4 |

Table 1 displays the relevant quantities. (Note: The policy
score will be defined presently.) We observe that for high-stakes
states, d4 ranges from –7.09 to 11.92, with a
median of 1.49 and a mean of 1.88; d8 ranges
from –7.31 to 14.18, with a median of 1.45 and a mean of
1.69. For low-stakes states, d4 ranges from
–8.73 to 3.35, with a median of –2.41 and a mean
of –2.42; d8 ranges from –7.69 to 4.15, with a
median of about –2.5 and a mean of –2.30. Thus, we see that the
typical State Gain
for high-stakes states is substantially larger than the typical
State Gain for low-stakes states in both grades 4 and 8.
At grade 4, the difference in means between the high-stakes and
low-stakes states is 4.30 score points and at grade 8 it is 3.99
score points. Note that
in computing the difference in means, the gain of the nation over
the period 1992 to 2000 is eliminated. Consequently, such
differences provide a direct comparison between the typical gains
for high-stakes states and low-stakes states. Some might prefer
such comparisons because the results for the nation are influenced
by all the states we are considering, as well as the states that
did not participate in both NAEP administrations. However, we have
chosen to follow the approach of Amrein and Berliner (2002a) in
order to facilitate comparisons between our results and
theirs.
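The group summaries quoted above can be reproduced from the State Gains listed in Table 1. The following Python sketch does so; the variable names are illustrative, and small discrepancies in the last digit are possible because Table 1 reports rounded values.

```python
from statistics import mean, median

# State Gains transcribed from Table 1 (high-stakes: AL..VA, low-stakes: AZ..WY).
hi_d4 = [1.96, -3.69, 5.73, 6.17, -2.66, -0.88, 1.49, -7.09, 0.46,
         11.92, 4.21, 0.27, 1.23, 7.09, 1.98]
lo_d4 = [-4.14, -0.80, -2.49, -0.21, -5.86, -2.32, -1.71, -8.73, 0.71,
         3.35, -1.32, -7.04, -5.42, -2.93, 1.53, -4.40, 1.92, -3.78]
hi_d8 = [2.42, -0.57, 5.41, 1.45, 3.64, -2.29, 0.03, -7.31, 2.29,
         14.18, 7.00, -1.96, -2.94, 2.71, 1.27]
lo_d8 = [-2.20, -2.49, -6.27, 0.62, -2.18, -4.72, 1.77, -2.54, 2.79,
         3.55, -5.10, -4.58, -7.69, -4.03, -0.03, -6.45, 4.15, -5.94]

for grade, hi, lo in (("4", hi_d4, lo_d4), ("8", hi_d8, lo_d8)):
    # The national gain is subtracted from every State Gain, so it cancels
    # in the difference between the two group means.
    diff = mean(hi) - mean(lo)
    print(f"Grade {grade}: hi mean={mean(hi):.2f} median={median(hi):.2f}; "
          f"lo mean={mean(lo):.2f} median={median(lo):.2f}; diff={diff:.2f}")
```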
While there certainly is interest in the State Gains
(ds) themselves, we believe there is also value in
comparing states in terms of the Vs, which are essentially
discretized effect sizes. Specifically, Vk
(k = 4 or 8) gives a state 2 “credits” if
dk differs from zero by more than one standard error and 1 credit
otherwise, with the sign of the credits matching the sign of
dk. While the usual criterion for statistical
significance (which is not particularly appropriate in this
setting) would require exceeding two standard errors, there is
practical interest in identifying states whose relative gain is at
least greater than one standard error—given the magnitude of
the standard errors, the level of dispersion in the
dks among the states, and the fact that the
national gain (although statistically independent of the state
gains) is influenced by the educational policies of the various
states.
A state that presents what one might term a strongly consistent
picture of relative improvement over the nation (i.e.,
D4 > 1 and D8 > 1) is
awarded 4 credits. One that presents a moderately consistent
picture (e.g., D4 > 1 and 1 >
D8 >0) is awarded 3 credits and one that
presents a mildly consistent picture (i.e., 1 >
D4 > 0 and 1 > D8 >
0) is awarded 2 credits. Note that this coding scheme limits the
influence of outliers and allows us to distinguish most
configurations of D4 and
D8.
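The coding scheme can be sketched in Python as follows. The function names are hypothetical, and the handling of boundary values (D exactly equal to 0 or ±1), which the text does not specify, is an assumption made only so that the function is total.

```python
def credits(d, se):
    """Discretized effect size for one grade: +/-2 if the gain differs from
    zero by more than one standard error, +/-1 otherwise."""
    D = d / se
    if D > 1:
        return 2
    if D > 0:
        return 1
    if D == 0:
        return 0  # assumption: the text does not cover an exactly zero gain
    if D >= -1:
        return -1
    return -2

def total_credits(d4, se4, d8, se8):
    """V = V4 + V8, ranging from -4 to 4."""
    return credits(d4, se4) + credits(d8, se8)

# Three rows of Table 1: Alabama, Maryland, New Mexico.
print(total_credits(1.96, 2.45, 2.42, 2.75))    # AL
print(total_credits(-2.66, 2.20, 3.64, 2.30))   # MD
print(total_credits(-7.09, 2.42, -7.31, 2.33))  # NM
```

Maryland illustrates how a state with a clear relative loss in one grade and a clear relative gain in the other nets out to V = 0.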
The distributions of V for the high-stakes testing
states and the remaining states are presented in Table 2. For the
first group, we have values of V for 15 out of the 18
states, while for the second group we have values of V for
18 out of 32 states. There is a striking difference between the
two groups of states: High-stakes states are more likely to show
strongly consistent improvement relative to the nation (V =
4) than low-stakes states (4/15 vs. 1/18) and much less likely to
show strongly consistent lack of improvement (V = –4)
relative to the nation (1/15 vs. 8/18). The story remains
qualitatively the same if we compare the groups with less
stringent cut-offs.
Table 2 Distribution of V for Hi-stakes and Lo-stakes States

| # Lo-stakes states (Total = 18) | V | # Hi-stakes states (Total = 15) |
|---|---|---|
| 1 | 4 | 4 |
| 2 | 3 | 1 |
| 0 | 2 | 4 |
| 0 | 1 | 0 |
| 3 | 0 | 2 |
| 0 | -1 | 1 |
| 0 | -2 | 0 |
| 4 | -3 | 2 |
| 8 | -4 | 1 |

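As a cross-check, the distribution in Table 2 can be tallied directly from the V column of Table 1. The sketch below uses Python's `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# V values transcribed from Table 1 (high-stakes: AL..VA, low-stakes: AZ..WY).
hi_V = [2, -3, 4, 3, 0, -3, 2, -4, 2, 4, 4, 0, -1, 4, 2]
lo_V = [-3, -3, -3, 0, -4, -4, 0, -4, 3, 4, -3, -4, -4, -4, 0, -4, 3, -4]

hi_counts, lo_counts = Counter(hi_V), Counter(lo_V)

print("lo   V   hi")
for v in range(4, -5, -1):
    # Counter returns 0 for values of V that never occur.
    print(f"{lo_counts[v]:>2} {v:>3} {hi_counts[v]:>4}")

# Shares of states at the two extremes, as quoted in the text.
print(f"V =  4: hi {hi_counts[4]}/{len(hi_V)}, lo {lo_counts[4]}/{len(lo_V)}")
print(f"V = -4: hi {hi_counts[-4]}/{len(hi_V)}, lo {lo_counts[-4]}/{len(lo_V)}")
```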
In summary, high-stakes testing states that participated in the
NAEP mathematics assessment in both 1992 and 2000 typically showed
improvement relative to the nation while low-stakes testing states
that participated in the NAEP mathematics assessment in both 1992
and 2000 typically showed lack of improvement relative to the
nation. (We must be careful to condition on participation in NAEP
since a large number of low-stakes states did not participate in
NAEP in one or both of the years under study.) The question is how
to interpret the comparison.
With respect to the results for the high-stakes states, Amrein
and Berliner (2002a) discount the finding, in part, because of the
empirical association between State Gain and the change in percent
excluded. This is a reasonable argument but one that deserves
further scrutiny for at least two reasons. First, the observed
correlation may be unduly influenced by an outlying observation
and, second, there are other, observable and unobservable
characteristics of states that may also account for some of the
differences among states. (It also should be noted that the 1992
exclusion rates are not strictly comparable to those in 1996 and
2000. The former were calculated as an average over mathematics
and reading, while the latter two are reported for mathematics
only.)

Figure 1a. Grade 4: d4 vs. Change in % Excluded
(1992 to 2000).

Figure 1b. Grade 8: d8 vs. Change in % Excluded
(1992 to 2000).
Figure 1a displays a plot of d4 against c%ex and Figure 1b
displays a plot of d8 against c%ex. North
Carolina is a clear outlier on both plots, while Texas is an
outlier in Figure 1a. Table 3 presents the correlations for the
two groups of states, including the case for the high-stakes
states with North Carolina removed. In the fourth grade, we see
that the correlation for the high-stakes states is indeed
substantial, but markedly reduced when North Carolina is deleted.
For the eighth grade, the reduction is even more dramatic. On the
other hand, for low stakes states in the eighth grade, the
correlation is quite high. One might, therefore, plausibly argue
that the results for the low-stakes states would be further
depressed (relative to those for the high-stakes states) if their
apparent relationship to c%ex were somehow taken into account.
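The leverage exerted by North Carolina can be illustrated with the grade 4 values for the high-stakes states in Table 1. The sketch below computes an ordinary Pearson correlation with and without that state; the helper `pearson_r` is written out by hand for illustration, and the results may differ slightly from the published figures because Table 1 reports rounded values.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# (state, change in % excluded Gr. 4, d4) for the 15 high-stakes states.
hi = [("AL", 1.27, 1.96), ("GA", 1.28, -3.69), ("IN", 3.46, 5.73),
      ("LA", 3.68, 6.17), ("MD", 4.88, -2.66), ("MN", 2.44, -0.88),
      ("MS", -0.59, 1.49), ("NM", 4.95, -7.09), ("NY", 6.24, 0.46),
      ("NC", 9.48, 11.92), ("OH", 3.91, 4.21), ("SC", 2.65, 0.27),
      ("TN", -0.10, 1.23), ("TX", 7.86, 7.09), ("VA", 5.55, 1.98)]

r_all = pearson_r([c for _, c, _ in hi], [d for _, _, d in hi])
no_nc = [row for row in hi if row[0] != "NC"]
r_no_nc = pearson_r([c for _, c, _ in no_nc], [d for _, _, d in no_nc])

print(f"all 15 states: r = {r_all:.2f}; without NC: r = {r_no_nc:.2f}")
```

With only 14 or 15 points, a single state with extreme values on both variables can account for most of an apparent linear association.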
Table 3 Correlations Between State Gains and Change in % Excluded for Years 1992 to 2000

| | State gains: Grade 4 (d4) | State gains: Grade 8 (d8) |
|---|---|---|
| Hi-stakes (# = 15) | 0.44 | 0.49 |
| Hi-stakes w/o NC (# = 14) | 0.17 | -0.01 |
| Lo-stakes (# = 18) | 0.02 | 0.55 |

It is often the case that gain scores are negatively correlated
with the base year score. Accordingly, in Figures 2a and 2b we
plot d4 against the grade 4 state score
(’92) and d8 against the grade 8 state
score (’92). In both cases, we observe the expected negative
correlations. Plots of d4 and
d8 against their standard errors were not
informative and are not presented.

Figure 2a. Grade 4: d4 vs. State Scale Score
(1992).

Figure 2b. Grade 8: d8 vs. State Scale Score
(1992).
As was noted in the Introduction, changes in state NAEP scores
over time can be the result of many factors in addition to percent
of students excluded and testing policies. In particular, other
educational interventions that have been adopted by the state with
the intention of raising the academic achievement of its students
may well have their intended effect, at least to some degree. It
would be helpful, therefore, to have a broader measure of each
state’s educational policy efforts to incorporate into an
explanatory framework. Fortunately, one such measure has been
formulated and quantified as part of a study of the influence of
standards-based reform on changes in classroom practice (Swanson
& Stevenson, 2002).
Drawing on studies conducted by the Council of Chief State
School Officers, Swanson and Stevenson graded each state on each
of 22 policy activities organized into four categories: content
standards, performance standards, aligned assessments and
professional standards. Grades were assigned on a three point
scale: does not have such a policy (0), is developing one (1), or
has enacted such a policy as of 1996 (2). They then carried out a
Rasch analysis using this 50 x 22 data array, yielding a
“state (policy) activism score” for each state. They
report a low level of item misfit. For more details, consult their
article.
In view of the comprehensiveness of the policy information
employed and that 1996 falls in the middle of the period of
interest, we propose to use the policy activism scale as another
possible explanatory variable in our effort to account for
differences among states in State Gains. The policy activism
scores are located in the second column of Table 1. Figures 3a and
3b display plots of d4 and d8 against
activism scores.

Figure 3a. Grade 4: d4 vs. Policy Score.

Figure 3b. Grade 8: d8 vs. Policy Score.
The mean policy score over the 50 states is 0.28, and the 33
states that we are examining tend to have scores above
that mean. The median for the high-stakes states is 0.66 and the
median for the low-stakes states is 0.32. Correlations between
State Gain and policy scores for the two groups of states are
presented in Table 4. We observe the relationship is moderately
strong and positive in grade 8, but rather mixed in grade 4.
Again, North Carolina exerts considerable leverage on the results
for the high-stakes states.
Table 4 Correlations Between State Gains for Years 1992 to 2000 and Policy Score

| | State gains: Grade 4 (d4) | State gains: Grade 8 (d8) |
|---|---|---|
| Hi-stakes (# = 15) | -0.07 | 0.37 |
| Hi-stakes w/o NC (# = 14) | -0.29 | 0.26 |
| Lo-stakes (# = 18) | 0.22 | 0.38 |

Before proceeding to the next stage of the comparison between
the high-stakes and low-stakes states, it might be of interest to
compare the V distributions of high activism and low activism
states, defined by whether they are above or below the mean policy
score of 0.28, respectively. The results are presented in Table 5,
which is analogous to Table 2. This comparison involves 21 out of 27
high activism states and 12 out of 23 low activism states. While
the comparison favors the high activism states, it is less
clear-cut than the one in Table 2. Note that the V values of the
high activism states fall about equally above and below zero. On
the other hand, the V values of the low activism states are more
likely to be negative. Thus, somewhat surprisingly, the
categorization employed in Amrein and Berliner (2002a) seems to
provide a sharper contrast than the categorization based on the
broader policy analysis employed by Swanson and Stevenson
(2002).
Table 5 Distribution of V for Hi-policy Score and Lo-policy Score States

| # Lo-policy score states (Total = 12) | V | # Hi-policy score states (Total = 21) |
|---|---|---|
| 1 | 4 | 4 |
| 1 | 3 | 2 |
| 1 | 2 | 3 |
| 0 | 1 | 0 |
| 1 | 0 | 4 |
| 0 | -1 | 1 |
| 0 | -2 | 0 |
| 4 | -3 | 2 |
| 4 | -4 | 5 |

Returning to the main thread of our reanalysis, we carry out a
multiple regression of d4 on three explanatory
variables: state score (’92), c%ex and activism score, and
an analogous regression for d8. In both regressions, we
leave out North Carolina and Texas because they are outliers in
one or both panels of Figure 1. The essential elements of the
regression output are presented in Tables 6a and 6b.
Table 6a Grade 4: Regression of d4 on Policy Score, 1992 State Scale Score, Change in % Excluded for Years 1992 to 2000

ANOVA

| | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Regression | 3 | 67.1 | 22.4 | 1.7 | 0.2 |
| Residual | 27 | 350.3 | 13.0 | | |
| Total | 30 | 417.4 | | | |

| | Coefficients | Standard error | t stat | p-value |
|---|---|---|---|---|
| Intercept | 40.6 | 20.9 | 1.9 | 0.1 |
| Policy score | 0.5 | 0.8 | 0.6 | 0.5 |
| Score | -0.2 | 0.1 | -2.0 | 0.1 |
| Change in % excluded | 0.1 | 0.3 | 0.3 | 0.8 |

Summary Output

| Regression statistics | |
|---|---|
| Multiple R | 0.40 |
| R square | 0.16 |
| Adjusted R square | 0.07 |
| Standard error | 3.60 |
| Observations | 31 |

Table 6b Grade 8: Regression of d8 on Policy Score, 1992 State Scale Score, Change in % Excluded for Years 1992 to 2000

ANOVA

| | df | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Regression | 3 | 113.3 | 37.8 | 2.9 | 0.1 |
| Residual | 27 | 353.8 | 13.1 | | |
| Total | 30 | 467.2 | | | |

| | Coefficients | Standard error | t stat | p-value |
|---|---|---|---|---|
| Intercept | 21.3 | 20.2 | 1.1 | 0.3 |
| Policy score | 1.5 | 0.9 | 1.7 | 0.1 |
| Score | -0.1 | 0.1 | -1.2 | 0.2 |
| Change in % excluded | 0.3 | 0.3 | 0.9 | 0.4 |

Summary Output

| Regression statistics | |
|---|---|
| Multiple R | 0.49 |
| R square | 0.24 |
| Adjusted R square | 0.16 |
| Standard error | 3.62 |
| Observations | 31 |

For Grade 4, the R2 = 0.16 (adjusted
R2 = 0.07) so clearly the three explanatory
variables do not account for very much of between-state variation;
only state score (’92) is marginally significant. Overall,
residual plots against each of the explanatory variables do not
reveal any patterns. However, the residuals for the 13 high-stakes
states (i.e., not including Texas and North Carolina) tend to be more
positive than the residuals for the 18 low-stakes states. This is
to be expected given the results in Tables 1, 2 and 6.
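The summary statistics reported for the two regressions follow from the ANOVA rows of Tables 6a and 6b. The sketch below recomputes them from the published sums of squares; the function name `summary_from_anova` is illustrative, and the inputs are the rounded values printed in the tables, so agreement is only to the precision shown there.

```python
import math

def summary_from_anova(ss_reg, ss_res, n_obs, n_predictors):
    """Return (R^2, adjusted R^2, F, residual standard error)."""
    ss_total = ss_reg + ss_res
    df_res = n_obs - n_predictors - 1
    r2 = ss_reg / ss_total
    # Adjusted R^2 penalizes R^2 for the number of explanatory variables.
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_res
    f_stat = (ss_reg / n_predictors) / (ss_res / df_res)
    resid_se = math.sqrt(ss_res / df_res)
    return r2, adj_r2, f_stat, resid_se

# 31 states (North Carolina and Texas omitted), 3 explanatory variables.
grade4 = summary_from_anova(67.1, 350.3, n_obs=31, n_predictors=3)
grade8 = summary_from_anova(113.3, 353.8, n_obs=31, n_predictors=3)

for label, (r2, adj, f, se) in (("Grade 4", grade4), ("Grade 8", grade8)):
    print(f"{label}: R2={r2:.2f}, adj R2={adj:.2f}, F={f:.1f}, s.e.={se:.2f}")
```

With 27 residual degrees of freedom, the gap between R square and adjusted R square is a reminder of how little of the between-state variation the three variables explain.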
Figure 4a presents the residual plot against c%ex. The
residuals for Texas and North Carolina were obtained by
substituting their values for the three explanatory variables into
the regression equation presented in Table 6a (which was estimated
using the other 31 states). We note that Texas and North Carolina
are outliers in the sense that they have both the largest values
on c%ex and the largest positive residuals. On the other hand, for
the other states there appears to be no association (linear or
otherwise) between c%ex and state gain.

Figure 4a. Grade 4: Plot of residuals vs. Change in %
Excluded (1992 to 2000). Residuals obtained from a regression of
d4 on state score ('92), c%ex and
policy score.

Figure 4b. Grade 8: Plot of residuals vs. Change in % Excluded
(1992 to 2000). Residuals obtained from a regression of
d8 on state score ('92), c%ex and
policy score.
Turning to Grade 8 (Table 6b), we note that the
R2 = 0.24 (adjusted R2 = 0.16)
and that the only explanatory variable that approaches
significance is policy score. Overall, the residual plots again
reveal no interesting patterns, except that high-stakes states
tend to have more positive residuals than do low-stakes states.
Figure 4b presents the residual plot against c%ex, with the
residuals for Texas and North Carolina added. North Carolina
remains an outlier, but not Texas. For the other states, there
does not appear to be an association between c%ex and state
gain.
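As a quick consistency check on the reported fit statistics, the standard adjustment formula, adjusted R² = 1 – (1 – R²)(n – 1)/(n – p – 1), can be applied with n = 31 observations and p = 3 explanatory variables. The function name below is ours, and the inputs are the rounded R² values quoted in the text, so agreement is only expected to two decimal places.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R-squared for a regression with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


grade4 = adjusted_r2(0.16, 31, 3)  # Grade 4 regression (Table 6a)
grade8 = adjusted_r2(0.24, 31, 3)  # Grade 8 regression (Table 6b)
print(round(grade4, 2), round(grade8, 2))  # 0.07 0.16
```

Both values agree with the adjusted R² figures reported for the two regressions.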
In view of the above analysis, it is not appropriate to
discount the differences in results between the high-stakes and
low-stakes states (e.g. Table 2) by arguing they are strongly
influenced by differences in changes in percent of students
excluded over the period 1992 to 2000. That argument is simply not
supported by the data.
One might want to distinguish the results for North Carolina
from those of the other states, arguing that the unusually large
value of c%ex “explains” the unusually large value of
State Gain. If that were the case, then school officials in North
Carolina would have been much more adept than officials in other
states in excluding SD/LEP students who would have done poorly on
NAEP. In particular, school officials in New Mexico, which also
experienced a large increase in percent of students excluded
(particularly in Grade 8) but large negative State Gains, would
have much to learn from their counterparts in North Carolina! A
more circumspect statement about North Carolina is that its State
Gain may well be a consequence of both its reform policies and the
increase in excluded students, but that with the data
available we can determine neither the relative
contributions of these two factors nor those of other factors.
Cohort Analyses
Amrein and Berliner (2002a) correctly point out that a weakness
of the repeated cross-sectional studies described above is that
real changes over time in student test performance are confounded
with changes in the characteristics of successive cohorts that are
unrelated to school effects but associated with performance. For
example, in a particular state, grade 4 students in 2000 might be
more disadvantaged than were grade 4 students in 1992 and,
therefore, perform more poorly on NAEP even if the productivity of
the state’s schools remained unchanged.
The structure of the NAEP system makes possible another way of
looking at a state’s performance. Since NAEP tested students
in mathematics in both grades 4 and 8 every four years, we can
determine the gains of the cohort tested in grade 4 in 1992 and
again in grade 8 in 1996, as well as the gains of the cohort
tested in grade 4 in 1996 and again in grade 8 in 2000. Although
the actual students tested four years apart are not the same
students (i.e., this is not a true longitudinal study like High
School and Beyond), each group is a probability sample of its
respective cohort. Thus, the observed gain is an approximately
unbiased estimate of the population gain over the period in
question. The word “approximately” is appropriate
since there are inflows and outflows over the four years, as well
as differential rates of exclusions and non-response at school and
student levels. Nonetheless, the results should be sufficiently
accurate for our purposes.
Others have also studied cohort gains and obtained results that
cast a different light on between-state comparisons. Examining
data for 1992 and 1996, Barton and Coley (1998) concluded that
“Most of the states are not significantly different from
each other in terms of cohort growth from the fourth to the eighth
grade.” They point out, for example, that Maine ranks near the
top for grade 4 in 1992 and for grade 8 in 1996, while Arkansas
ranks near the bottom in both years. Nevertheless, the cohorts in
both states gained 52 points over the four-year period.
We now carry out an analysis that parallels the one described
in the previous section. The data extracted from the NCES Web site
comprise grade 4 NAEP mathematics results for 1992 and 1996 and
grade 8 NAEP mathematics results for 1996 and 2000, for the states
and the nation (public schools only). For each jurisdiction, for
the indicated grade and year, we recorded the average score, the
corresponding estimated standard error, and the percent of
students excluded. The data are displayed in Table A2 of the
appendix.
For each state, we compute the State Cohort Gain
(1992 to 1996) as
g1 = [state(grade 8, 1996) – state(grade 4, 1992)] – [national(grade 8, 1996) – national(grade 4, 1992)]
where the quantities on the right hand side of the equation
represent the average results for the indicated grade and year.
Further, for each state let
s.e. (g1) = (estimated) standard error of
g1.
As before, since the four quantities contributing to
g1 are derived from independent samples, s.e.
(g1) is the square root of the sum of
their (estimated) variances. We also computed the changes from
1992 to 1996 in the percent of excluded students in the cohort.
Now let
G1 = g1 / s.e.
(g1)
and
W1 = 2 if G1 > 1; 1 if 0 < G1 ≤ 1; –1 if –1 ≤ G1 < 0; –2 if G1 < –1.
There is a set of analogous definitions for
g2, G2, and
W2 based on the cohort gains from grade 4 in
1996 to grade 8 in 2000. Finally, we let
W = W1 + W2, which is defined only when both W1 and W2 are available.
Table 7 displays the relevant quantities. For high-stakes
states, g1 ranges from –5.06 to 3.63, with a
median of –1.98 and a mean of –1.18. For low-stakes states,
g1 ranges from –3.86 to 5.51, with a median of
0.75 and a mean of 0.68. Turning to the second cohort, for
high-stakes states, g2 ranges from –7.81
to 3.73, with a median of –1.20 and a mean of –1.08.
For low-stakes states, g2 ranges from
–6.56 to 7.00, with a median of 0.12 and a mean of 0.06.
Thus, the difference in means for the earlier cohort between
high-stakes and low-stakes states is
–1.86, and for the later cohort the
difference is –1.14. As before, the growth of the nation
over the relevant four-year period is eliminated when we consider
these differences in means. Interestingly, the results for
low-stakes states are now somewhat better than those for
high-stakes states, a reversal of what we found when we
looked at change over time in a particular grade.
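The summary statistics for the earlier cohort can be reproduced directly from the g1 column of Table 7. The following Python sketch does so, with values transcribed from the table and states without a 1992 to 1996 entry omitted. The variable and function names are ours, and because the tabled values are already rounded to two decimals, results are compared at that precision only.

```python
from statistics import mean, median

# g1 values from Table 7, cohort 1992 to 1996.
HI_G1 = [-3.66, -1.98, -5.06, 2.56, -3.69, 0.43, 3.63, -3.54,
         -3.26, -0.14, 3.02, -3.65, 0.24, 0.35, -2.94]
LO_G1 = [0.69, -0.48, 2.44, 2.66, 0.86, -3.10, -3.86, 2.17, -0.39, 0.49,
         -0.96, 5.06, -0.86, 5.51, 3.63, 1.50, 0.80, -2.33, 2.23, -2.53]


def summarize(values):
    """Range, median and mean of a list of state cohort gains."""
    return {"min": min(values), "max": max(values),
            "median": median(values), "mean": mean(values)}


hi = summarize(HI_G1)
lo = summarize(LO_G1)
diff = hi["mean"] - lo["mean"]  # high-stakes minus low-stakes

print(f"{hi['mean']:.2f} {lo['mean']:.2f} {diff:.2f}")  # -1.18 0.68 -1.86
```

Since every g1 already has the national cohort gain subtracted, the difference in group means is unaffected by national growth over the period.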
Note also that W1 and W2
are based on independent samples, so that W (when it is defined)
is a reasonable choice as a summary measure of the state’s
relative performance over the period 1992 to 2000. On the other
hand, there is value in studying W1 and W2
separately, to see if there are any trends over time and to
examine patterns of association with c%ex and policy score.
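The chain of definitions from cohort gain to the summary measure W can be sketched in a few lines of Python. The function names are ours. The cut points at 0 and ±1 used to turn G into W are an assumption inferred from the values tabulated in Table 7 (the treatment of G exactly equal to 0 is arbitrary here), and the average scores and standard errors passed to the first two functions are hypothetical. Only the Indiana and Nevada figures come from Table 7.

```python
import math


def cohort_gain(state_g8, state_g4, nation_g8, nation_g4):
    """State cohort gain net of the national cohort gain."""
    return (state_g8 - state_g4) - (nation_g8 - nation_g4)


def gain_se(*standard_errors):
    """Standard error of a sum or difference of independent estimates."""
    return math.sqrt(sum(se ** 2 for se in standard_errors))


def standardized_gain(g, se):
    return g / se


def category(G):
    """Map a standardized gain to -2, -1, 1 or 2.

    ASSUMPTION: cut points inferred from Table 7, not quoted from the text.
    """
    if G > 1:
        return 2
    if G > 0:
        return 1
    if G >= -1:
        return -1
    return -2


def combined(w1, w2):
    """W = W1 + W2; undefined (None) if either cohort is missing."""
    return None if w1 is None or w2 is None else w1 + w2


# Hypothetical scores: the state gains 50 points, the nation gains 52.
g_example = cohort_gain(270.0, 220.0, 271.0, 219.0)   # -2.0
se_example = gain_se(1.0, 1.0, 1.0, 1.0)              # 2.0

# Indiana, from Table 7: g1 = 2.56, s.e. = 2.29; g2 = 1.58, s.e. = 2.23.
w1_in = category(standardized_gain(2.56, 2.29))  # 2
w2_in = category(standardized_gain(1.58, 2.23))  # 1
w_in = combined(w1_in, w2_in)                    # 3

# Nevada has no entry for the first cohort, so W is undefined.
w_nv = combined(None, category(standardized_gain(-1.51, 2.08)))  # None
```

Coding G into four categories discards some information but keeps W1 and W2 on a common scale, so that their sum is easy to interpret.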
Table 7 Basic Results for Cohort Analysis of NAEP Mathematics Scores
| | | Cohort 1992 to 1996 | | | | | Cohort 1996 to 2000 | | | | | |
| State | Policy score | g1 | s.e.(g1) | G1 | W1 | Change in % excluded | g2 | s.e.(g2) | G2 | W2 | Change in % excluded | W |
| Hi-stakes states |
| AL | 2.20 | -3.66 | 3.03 | -1.21 | -2 | 2.64 | -1.56 | 2.54 | -0.62 | -1 | -1.34 | -3 |
| FL | -0.27 | -1.98 | 2.78 | -0.71 | -1 | 1.56 | *** | *** | *** | *** | *** | *** |
| GA | 0.66 | -5.06 | 2.52 | -2.01 | -2 | 1.76 | -1.20 | 2.36 | -0.51 | -1 | -0.03 | -3 |
| IN | 0.90 | 2.56 | 2.29 | 1.12 | 2 | 2.29 | 1.58 | 2.23 | 0.71 | 1 | 2.02 | 3 |
| LA | -0.03 | -3.69 | 2.59 | -1.42 | -2 | 1.96 | -2.11 | 2.29 | -0.92 | -1 | -1.90 | -3 |
| MD | 2.46 | 0.43 | 2.88 | 0.15 | 1 | 2.62 | 3.25 | 2.50 | 1.30 | 2 | 2.88 | 3 |
| MN | -0.40 | 3.63 | 2.17 | 1.67 | 2 | -0.40 | 3.39 | 2.24 | 1.51 | 2 | -0.63 | 4 |
| MS | 0.55 | -3.54 | 2.17 | -1.64 | -2 | 1.87 | -6.47 | 2.22 | -2.91 | -2 | 1.50 | -4 |
| NV | 0.32 | *** | *** | *** | *** | *** | -1.51 | 2.08 | -0.73 | -1 | 1.27 | *** |
| NM | 0.78 | -3.26 | 2.38 | -1.37 | -2 | 0.45 | -6.07 | 2.80 | -2.17 | -2 | -0.39 | -4 |
| NY | 0.09 | -0.14 | 2.54 | -0.06 | -1 | 2.34 | 1.56 | 2.77 | 0.56 | 1 | 5.33 | 0 |
| NC | 1.60 | 3.02 | 2.31 | 1.31 | 2 | 0.59 | 3.73 | 2.11 | 1.77 | 2 | 6.96 | 4 |
| SC | 0.90 | -3.65 | 2.38 | -1.54 | -2 | 0.98 | 1.09 | 2.32 | 0.47 | 1 | 1.20 | -1 |
| TN | 0.32 | 0.24 | 2.43 | 0.10 | 1 | 0.46 | -7.81 | 2.58 | -3.02 | -2 | -1.84 | -1 |
| TX | -0.66 | 0.35 | 2.37 | 0.15 | 1 | 1.06 | -5.94 | 2.41 | -2.47 | -2 | -0.75 | -1 |
| VA | 0.55 | -2.94 | 2.50 | -1.18 | -2 | 2.02 | 1.96 | 2.42 | 0.81 | 1 | 3.29 | -1 |
| Lo-stakes states |
| AZ | -0.40 | 0.69 | 2.38 | 0.29 | 1 | 3.59 | 1.07 | 2.66 | 0.40 | 1 | -3.34 | 2 |
| AR | -0.27 | -0.48 | 2.28 | -0.21 | -1 | 1.58 | -6.56 | 2.40 | -2.73 | -2 | 1.49 | -3 |
| CA | 0.09 | 2.44 | 2.82 | 0.87 | 1 | -2.21 | 0.97 | 3.05 | 0.32 | 1 | -7.07 | 2 |
| CO | 0.66 | 2.66 | 2.06 | 1.29 | 2 | -0.83 | *** | *** | *** | *** | *** | *** |
| CT | 1.29 | 0.86 | 2.15 | 0.40 | 1 | 1.70 | -2.20 | 2.20 | -1.00 | -1 | 2.10 | 0 |
| DE | 0.21 | -3.10 | 1.90 | -1.63 | -2 | 3.34 | *** | *** | *** | *** | *** | *** |
| HI | 0.32 | -3.86 | 2.18 | -1.77 | -2 | -0.54 | -4.27 | 2.38 | -1.79 | -2 | 1.50 | -4 |
| IA | -1.61 | 2.17 | 2.20 | 0.99 | 1 | 1.97 | *** | *** | *** | *** | *** | *** |
| KY | 1.97 | -0.39 | 2.06 | -0.19 | -1 | 1.40 | -0.50 | 2.21 | -0.23 | -1 | 3.75 | -2 |
| ME | 1.29 | 0.49 | 2.18 | 0.23 | 1 | -0.95 | -0.64 | 2.06 | -0.31 | -1 | 1.04 | 0 |
| MA | 0.32 | -0.96 | 2.55 | -0.38 | -1 | 1.03 | 2.08 | 2.27 | 0.92 | 1 | 3.11 | 0 |
| MI | 0.43 | 5.06 | 2.87 | 1.76 | 2 | -0.14 | 0.12 | 2.44 | 0.05 | 1 | 0.37 | 3 |
| MO | 1.02 | -0.86 | 2.33 | -0.37 | -1 | 2.73 | -3.23 | 2.24 | -1.44 | -2 | 3.64 | -3 |
| MT | -1.26 | *** | *** | *** | *** | *** | 7.00 | 2.18 | 3.20 | 2 | 0.67 | *** |
| NE | -1.61 | 5.51 | 2.16 | 2.55 | 2 | 0.20 | 1.00 | 2.10 | 0.48 | 1 | -1.52 | 3 |
| ND | -0.03 | 3.63 | 1.88 | 1.93 | 2 | 1.58 | 0.10 | 2.10 | 0.05 | 1 | 0.24 | 3 |
| OR | 0.66 | *** | *** | *** | *** | *** | 5.09 | 2.51 | 2.03 | 2 | -2.58 | *** |
| RI | 0.09 | 1.50 | 2.30 | 0.65 | 1 | 1.37 | 0.94 | 2.22 | 0.42 | 1 | 5.61 | 2 |
| UT | 1.15 | 0.80 | 2.02 | 0.40 | 1 | 2.03 | -3.15 | 2.10 | -1.50 | -2 | 0.03 | -1 |
| VT | -0.27 | *** | *** | *** | *** | *** | 6.45 | 2.11 | 3.06 | 2 | 3.46 | *** |
| WV | 0.90 | -2.33 | 2.06 | -1.13 | -2 | 4.07 | -4.64 | 1.95 | -2.39 | -2 | 2.74 | -4 |
| WI | -0.40 | 2.23 | 2.37 | 0.94 | 1 | 1.99 | *** | *** | *** | *** | *** | *** |
| WY | -0.95 | -2.53 | 1.95 | -1.30 | -2 | -1.73 | 1.42 |