High-stakes Testing and Student Achievement: Updated Analyses with NAEP Data

The present research is a follow-up to earlier published analyses that examined the relationship between high-stakes testing pressure and student achievement in 25 states. Using the previously derived Accountability Pressure Rating (APR) as a measure of state-level policy pressure for performance on standardized tests, a series of correlation analyses was conducted to explore relationships between high-stakes testing accountability pressure and student achievement as measured by the National Assessment of Educational Progress (NAEP) in reading and math. Consistent with earlier work, we found stronger positive correlations between the pressure index and NAEP performance in fourth grade math and weaker connections between pressure and fourth and eighth grade reading performance. Policy implications and future directions for research are discussed.


Introduction
The present study adds to a growing literature on the relationship of high-stakes testing accountability and student achievement. The major goal of federal and state high-stakes testing policies is to improve schools. The theory of action undergirding this approach suggests that by tying negative consequences (e.g., public exposure, external takeover) to standardized test performance, teachers and students in low-performing schools will work harder and more effectively, thereby increasing what students learn. Although the practice of high-stakes testing dates back several decades in various districts (e.g., the Chicago public schools) and states (Texas, New York, Florida), the passage of the No Child Left Behind Act in 2002 mandated high-stakes testing nationwide and at many more grade levels than was customary. The extant literature on high-stakes testing and student achievement can be organized into three types. In the first type, researchers use two-group designs to compare achievement patterns in states with accountability practices versus those without such practices, or in states with a long history of accountability versus those with shorter histories (Amrein & Berliner, 2002a; Amrein-Beardsley & Berliner, 2003; Braun, 2004; Dee & Jacob, 2009). A second approach has analysts ranking states according to some measure of accountability and then using correlation or regression techniques to ascertain the form and significance of the relationship between accountability measures and student achievement (Carnoy & Loeb, 2002; Hanushek & Raymond, 2005). A third type of research focuses on specific aspects of high-stakes testing practice and impact as they affect particular districts, regions, or states (Clarke, Haney, & Madaus, 2000; Jacob, 2001; Winters, Trivitt, & Greene, 2010).
Each approach has methodological limitations, making it difficult to determine with confidence the effects of high-stakes testing. Still, a pattern seems to have emerged suggesting that high-stakes testing has little or no relationship to reading achievement, and a weak to moderate relationship to math, especially in fourth grade but only for certain student groups (Braun, Wang, Jenkins, & Weinbaum, 2006; Braun, Chapman, & Vezzu, 2010; Figlio & Ladd, 2008; Nichols, Glass, & Berliner, 2006). This particular pattern of results (only affecting fourth grade math) raises serious questions about whether high-stakes testing increases learning or merely stimulates more vigorous test preparation practices (i.e., teaching to the test).
This study is a follow-up to our earlier work in which we used an empirically derived measure of state-level high-stakes testing policy to examine the relationship between accountability policy implementation and student achievement (as measured by the National Assessment of Educational Progress, NAEP). In contrast to other research that measures high-stakes testing accountability according to the number of accountability-related laws passed (Clarke et al., 2003; Pedulla et al., 2003), or by estimating the acceptance of accountability based on state-level variables such as funding and student demographic characteristics (Braun et al., 2006; Carnoy & Loeb, 2002), our measure was derived from legislative efforts as well as "on-the-ground" implementation, response, and reaction (see Nichols et al., 2006). In our earlier analyses, we used this unique measure of accountability pressure with NAEP 4th and 8th grade data from 1992-2003. The purpose of this follow-up study is to examine state policy accountability (measured by our state-level accountability pressure index) as it relates to more recent (i.e., 2005, 2007, and 2009) NAEP data available for 4th and 8th grade math and reading.

Review of Literature
High-stakes testing is the process of attaching significant consequences to standardized test performance with the goal of incentivizing teacher effectiveness and student achievement (Herman & Haertel, 2005; Ryan, 2004). The rationale is that by attaching significant rewards or serious threats to changes in student test scores, teachers and their students will inevitably be prompted to work harder, work better, and learn more. Although most tests students take are arguably "high-stakes" to them (i.e., failing a teacher-made test could result in failing a class or not passing to the next grade), "high-stakes" here refers to standardized tests developed specifically for the purpose of evaluating teachers and students. Performance on these tests may result in important consequences for schools, administrators, teachers, and students. Passing could bring rewards to teachers (bonuses) and schools (positive reviews in local newspapers), whereas failure could bring severe penalties to teachers and principals (termination), schools (closure or "take-over"), and students (denied diploma or retention in grade).
Although the practice of high-stakes testing gained a prominent position in educational reform with the passage of the No Child Left Behind Act (NCLB) of 2002, its use as a lever for school change preceded NCLB. Tests have been used to distribute rewards and sanctions to teachers in urban schools since the mid-1800s (Tyack, 1974) and for most schools throughout the United States since at least the 1970s (Haertel & Herman, 2005). New York state in particular has led the United States in test-based accountability efforts, "implementing state-developed (1965) and mandated minimal competency testing (MCT) before most other states (1978) and disseminating information to the media about local district performance on the state assessments before it became routinely popular (1985)" (Allington & McGill-Franzen, 1992, p. 398).
Standardized achievement test invention, development, and use paralleled these reform efforts (Giordano, 2005). The evolution of valid and reliable measurement techniques influenced views of how one might gauge educational quality (McDonnell, 2005). The passage of NCLB in 2002 mandated the most intrusive use of tests for influencing how and what teachers would teach and how and what students would learn. In spite of a growing literature indicating that high-stakes testing has had deleterious effects on teaching practices and student motivation, policymakers continue to argue for its effectiveness in increasing student learning, as evidenced in newer proposals (e.g., U.S. Department of Education, 2009) and recommendations for the reauthorization of NCLB (U.S. Department of Education, 2010).

High-Stakes Testing and Student Outcomes
The gradual adoption of accountability-based practices in the years leading up to NCLB provided a context in which to study their effects on achievement at the state level. By the late 1990s, increasing numbers of states had begun adopting test-based accountability plans; however, their form and function varied widely. Some states had developed criterion-based standardized tests; others were just starting the process. Some states were actively using tests to hold teachers and students accountable; others were in the process of developing such mechanisms. Around this time it was discovered that scores on virtually all of these tests went up, probably as a function of test preparation and familiarity with the tests rather than because students were learning much more (Linn, Graue, & Sanders, 1990; Shepard, 1990). This should have been a lesson for those who designed NCLB accountability, but this pervasive finding and explanations of it were ignored.
Most of the research conducted around the time of NCLB provides scant support for the effectiveness of high-stakes tests in increasing student achievement (Amrein & Berliner, 2002a, b; Braun, 2004; Rosenshine, 2003) or graduation rates (Haney et al., 2004; Heubert & Hauser, 1999; Marchant & Paulson, 2005). Since our initial study, no data have emerged to contradict the findings that accountability pressure has some relationship to fourth grade math, virtually no influence on reading (Dee & Jacob, 2009), and only negative influence on student graduation rates (Holme, Richards, Jimerson, & Cohen, 2010; Orfield, Losen, Wald, & Swanson, 2004). Studies focusing on both high- and low-stakes exit exams repeatedly reveal that these types of incentives/threats have little to no impact on student achievement over time (e.g., Bishop, Mane, Bishop, & Moriarty, 2001; Grodsky, Warren, & Kalogrides, 2009; Reardon, Arshan, Atteberry, & Kurlaender, 2008; Reardon, Atteberry, Arshan, & Kurlaender, 2009). In addition, the reduction of the achievement gap between income groups and between racial and ethnic groups, a major goal of the high-stakes accountability movement, either did not occur or was only marginal in the years these policies have been in place (Reardon, 2011; Timar & Maxwell-Jolly, 2012).

The Initial Study
Our initial study (Nichols et al., 2006) was prompted by our view that existing approaches to measuring test-based accountability policies and practice at the state level were largely inadequate because of their reliance on inspection of state-level legislation, as opposed to actual practices. That is, most researchers measured testing "pressure" by examining the number of laws that states had passed prior to or up to the enactment of NCLB. Although reliable, the validity of such approaches for capturing the on-the-ground feeling of pressure was doubtful. In our initial study, we addressed this problem by spending considerable time and effort conceptualizing and building a measure that would more closely represent high-stakes testing policy implementation, which we labeled the Accountability Pressure Rating (APR).
Our method for deriving APR values for our 25 study states was guided by the method of "comparative judgments" used for ordering complex and abstract psychological data (Torgerson, 1960). This approach seemed ideal for our purposes, since our goal was to transform complex qualitative data (state-level policy legislation enactment and implementation) into a quantitative indicator that could be used in subsequent analyses. Our work involved three steps. First, we created state-level portfolios that included a range of legislative documentation, state-generated accountability reports, and newspaper articles documenting the range of ways policy changes both impacted and were viewed by the public (see Nichols et al., 2006 for a complete description of portfolio contents). One unique aspect of our approach was the inclusion of newspaper references (both leading stories and editorials) that were used to capture the on-the-ground effects of and reactions to local and statewide test-based accountability practices. In contrast to other studies that relied on quantitative estimations of policies (e.g., Braun et al., 2006; Carnoy & Loeb, 2002), our measure included evidence that described how policies played out in local school systems.
Next, we asked 300 graduate students each to view two states' portfolios and to make two judgments: which state exerted more pressure, and by about how much (on a scale of 1-7). Last, we took the ratings provided by our students and applied the least-squares solution for unidimensional scale values due to Mosteller (as outlined in Torgerson, 1960, pp. 170-173). The result was a scale ranging from .54 to 4.78. This rating served as our measure of state-level testing pressure as of 2004 (see Table 1). As can be seen in Table 1, Kentucky's policies and practices were consistently rated below other states in terms of test-based pressure (APR = .54), whereas Texas's policies and practices were consistently viewed as exerting the highest test-based accountability pressure (APR = 4.78). Using our APR, we performed correlation and regression analyses to examine patterns in the relationships between APR and fourth and eighth grade reading and math NAEP through 2003. Our findings revealed that APR was connected most consistently with gains in fourth grade math performance, only slightly connected to gains in eighth grade math, and not correlated with gains in reading at either the fourth or eighth grade level (Nichols et al., 2006).
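The scaling step above can be sketched computationally. The following is a minimal illustration of the least-squares (Mosteller) solution for Thurstone Case V scaling, not the authors' actual computation: the proportion matrix, the clamping bounds, and the three-state toy example are all hypothetical.

```python
from statistics import NormalDist

def case_v_scale(prop_matrix):
    """Least-squares (Mosteller) solution for Thurstone Case V scaling.

    prop_matrix[i][j] = proportion of judges who rated state j as
    exerting MORE pressure than state i.  Returns one scale value per
    state (the column mean of the unit-normal deviates), shifted so the
    minimum value is zero.
    """
    nd = NormalDist()
    n = len(prop_matrix)
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Clamp proportions away from 0/1 so the inverse CDF is finite.
                p = min(max(prop_matrix[i][j], 0.01), 0.99)
                z[i][j] = nd.inv_cdf(p)
    scale = [sum(z[i][j] for i in range(n)) / n for j in range(n)]
    low = min(scale)
    return [s - low for s in scale]

# Toy example with three hypothetical states: A is judged least
# pressured, C most pressured.
props = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]
values = case_v_scale(props)
```

Anchoring the minimum at zero is one common convention; the published APR scale instead ran from .54 to 4.78, so the units here are not comparable to Table 1.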

Study Goals and Research Questions
To date there is no consistent, and therefore no convincing, evidence that high-stakes testing works to increase student achievement, except weakly in certain areas of the math curriculum. Thus, in spite of the claims of some (Raymond & Hanushek, 2003) who argue that the benefits of high-stakes testing are well established, it appears that most research fails to support the contention that high-stakes testing increases student learning. Further, the continued emphasis on test-based accountability as a panacea for school reform (e.g., the 2010 Race to the Top initiative) prompts us to reconsider the relationship of high-stakes testing policies with student achievement. Therefore, our goal in this study is to re-examine the relationship between high-stakes testing pressure, using our APR measure, and NAEP data emanating from later years of NCLB enactment.
The primary research question guiding this study is: What is the relationship between state-level high-stakes testing pressure and student achievement? More specifically, we want to know: What is the pattern of correlations between APR and fourth and eighth grade NAEP scores in reading and math from 2005-2009 (a) over time, (b) when disaggregated by student ethnicity, and (c) when disaggregated by student socioeconomic status? Additionally, what is the relationship between APR and four-year NAEP gains (both cohort and non-cohort) in math and reading?

Method
Our analyses are organized into two parts. First, we use descriptive statistics to analyze fourth and eighth grade NAEP data in reading and math during the period 2000-2009. Next, we conduct a series of partial, part, and simple bivariate correlations to examine relationships among state-level demographic characteristics, APR, and NAEP indicators.

Data
For the achievement data, we used state-level NAEP scale scores in fourth and eighth grade math and reading for all students and disaggregated by student socioeconomic status and ethnicity. Students were characterized according to two categories: eligible for free and reduced-price lunch (low SES) and not eligible for free and reduced-price lunch (high SES). Students self-identified as African American, Hispanic, White, Asian/Pacific Islander, or Other; we focus on the African American, Hispanic, and White student subgroups in this study. Although the focus of this study is on NAEP 2005, 2007, and 2009, we also include the earlier available data (2000 and 2002) in order to look at trends over time from before NCLB and in the year just as it was passed. These years are also important benchmarks since our APR was derived around the time NCLB was just getting started (i.e., 2004). State-level demographic data were drawn from a variety of online databases and include characteristics of students in the state (e.g., percent who are African American), the percent of the state population living in poverty, school enrollment characteristics, and total per-student revenues. This study focuses on 25 study states, some of which did not have large enough African American or Hispanic student populations to generate NAEP data for each group. Thus, throughout our analysis, we report specific sample sizes when data involve NAEP data disaggregated by student ethnicity.

Results
Part I: Descriptive Analysis
Fourth and eighth grade math.
Means and standard deviations in fourth and eighth grade math across time, disaggregated by student ethnicity and socioeconomic status, are presented in Table 2. All subgroup averages fall below the level of "proficiency" set by NAEP. Interestingly, except for Hispanic students in 2000 and 2005, fourth graders' math performance demonstrated less variability than eighth graders'. Average NAEP performance across time, disaggregated by student ethnicity and socioeconomic status, is also displayed in Figures 1 and 2. Across each administration of NAEP, high SES students on average scored consistently higher than any other subgroup. By contrast, African American students consistently posted the lowest average NAEP scores in both fourth and eighth grade math. We were also interested in average achievement gap patterns over time between White and Black (WB: calculated as the White subgroup average score in state i minus the Black subgroup average score in state i), White and Hispanic (WH: calculated as the White subgroup average score in state i minus the Hispanic subgroup average score in state i), and Hispanic and Black (HB: calculated as the Hispanic subgroup average score in state i minus the Black subgroup average score in state i) student subgroups for both fourth and eighth grade math. Average NAEP standard score differences and standard deviations for each subgroup are displayed in Table 3 and in Figures 3 (fourth grade) and 4 (eighth grade). Figure 3 suggests that the WB and WH achievement gaps dropped relatively steeply in the period 2000-2003, leveling off but still declining slowly from 2003 to 2009. By contrast, the HB gap seemed to increase over time. An examination of averages in fourth grade math suggests that Hispanic students consistently outperform African American students. In eighth grade, average math achievement gaps between WB and WH narrowed relatively steadily over time (Figure 4). By contrast, the HB gap stayed flat over time, followed by a steep increase in the period 2007-2009. As with fourth grade, in eighth grade Hispanic students on average consistently score better than their Black peers.

Fourth and eighth grade reading.
NAEP means and standard deviations for all subgroups over time in fourth and eighth grade reading are displayed in Table 4. An examination of the pattern of standard deviations suggests that eighth graders showed less variability on average than fourth graders.

Part II: Correlation Analysis
In this section, bivariate, part, and partial correlation coefficients are reported in an examination of the relationships between our APR indicator and NAEP achievement. We began by running correlations to see which state-level demographic variables are associated with our APR. Extant research has consistently shown that state poverty rates and the racial composition of students are associated with state accountability practices, in that poorer states and those with greater numbers of students of color tend to adopt more punitive accountability policies than states with largely White and more affluent student populations (e.g., Carnoy & Loeb, 2002; Nichols et al., 2006). We wanted to see whether APR shared variance with these and/or other state demographic variables and how these relationships may change over time. Because our APR was constructed in 2004, we were also interested to see whether correlations with state poverty and other characteristics remained stable over time. As shown in Table 6, APR is consistently associated with the percent of the state population living in poverty as well as the percent of students in the state who are African American. APR is not associated with the percent of Hispanic students or with state-level revenue/expenditure patterns.

Fourth and eighth grade math.
In Table 7, we correlate APR with fourth and eighth grade math NAEP disaggregated by student SES. There is no significant correlation between APR and student performance when divided into high and low socioeconomic status at either grade level. In the next analysis (and a similar one conducted for the reading data, Tables 17-20), we used three slightly different approaches to examine the relationship between APR and math NAEP disaggregated by student ethnicity. In the first column of results in Table 8 (Column 1), we report bivariate correlations between APR and NAEP scale scores. Next, we remove the effects of state poverty from APR, using regression techniques to generate standardized residuals of APR from state poverty data as of 2004, and use those residuals in correlation with NAEP scale scores. Lastly, we use partial correlation techniques to look at the relationships of the APR residuals and NAEP while partialing out the effects of exclusion rates.
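The first two columns of this three-step logic can be illustrated with a small sketch. This is not the authors' code or data; the state-level vectors below are invented for illustration, and the partial-correlation step involving exclusion rates is omitted for brevity.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residuals(y, x):
    """Standardized residuals of y regressed on x (simple OLS)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    res = [c - (my + b * (a - mx)) for a, c in zip(x, y)]
    sd = math.sqrt(sum(r ** 2 for r in res) / (n - 1))
    return [r / sd for r in res]

# Hypothetical state-level vectors (illustrative values only).
apr     = [0.54, 1.2, 2.1, 3.0, 3.6, 4.78]   # pressure ratings
poverty = [10.0, 12.5, 14.0, 16.5, 18.0, 20.0]  # percent in poverty
naep    = [238, 236, 239, 241, 240, 243]     # NAEP scale scores

# Column 1: plain bivariate correlation of APR and NAEP.
r_bivariate = pearson(apr, naep)

# Column 2: poverty removed from APR only, i.e., a part correlation --
# the APR residuals (poverty regressed out) correlated with NAEP.
apr_resid = residuals(apr, poverty)
r_part = pearson(apr_resid, naep)
```

Removing poverty from only one of the two variables, as here, yields a part (semipartial) correlation; partialing a covariate out of both sides, as in Column 3 of Table 8, would be a full partial correlation.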
As displayed in Table 8, the correlations of APR and NAEP suggest that the correlation of accountability pressure and math achievement ranges from a low of .038 (African American, 2009, 4th grade) to a high of .463 (Hispanic, 2005, 8th grade). For some groups in some years, the relationship is barely evident, whereas for other groups in other years, there seems to be a relatively strong correlation. A closer look at the pattern of correlations shows that the within-group strength of the APR-NAEP relationship for Hispanic students increases from 2003 to 2005/2007, followed by a significant decrease in 2009 in both fourth and eighth grade (the same pattern holds for African American students in eighth grade). However, when state poverty is removed from APR, the pattern of results changes slightly. Although correlations still seem to decrease over time, absolute values across the board are greater when state poverty is removed. An examination of Column 2 suggests that pressure and NAEP are connected more strongly for 4th and 8th grade White and Black students but less connected for Hispanic students. This pattern changes little when exclusion rates are removed (Column 3).

Fourth and eighth grade math: Gain and Cohort Analysis.
We were particularly interested in the relationship between APR and NAEP gains across 2003-2007 (Table 9) and 2005-2009 (Table 10). These correlations reflect the degree to which test-related pressures in various states, as measured in 2004, are related to subsequent NAEP gains. To address the regression effects associated with gain score analyses, we used linear regression techniques to generate standardized residuals of NAEP 2007 (2009) from NAEP 2003 (2005) as the estimate of NAEP gain for this and all subsequent analyses involving NAEP standard score gains over time. As displayed in Tables 9 and 10, we estimated the relationship between our APR indicator and NAEP gain in two ways. First, we correlated APR with NAEP gain measured as a regression residual. Second, we again accounted for the shared variance of APR and state poverty by using the APR residuals as the correlate with the NAEP gain residuals. Lastly, we used partial correlation techniques to examine these relationships while partialing out exclusion rates. When it comes to relationships between our APR and 2003-2007 NAEP gains, bivariate correlations suggest moderate to low connections for all student groups in fourth grade but stronger connections for African American and Hispanic eighth graders. When poverty is removed, this pattern changes little; however, the connection between high-stakes testing pressure and both fourth and eighth grade gains diminishes significantly once exclusion rates are partialed out (last column, Table 9). With the exception of African American eighth grade achievement, the relationship between high-stakes testing pressure (APR) and NAEP gains in math is relatively absent in both fourth and eighth grades. When it comes to 2005-2009 NAEP gains (Table 10), the pattern reverses significantly.
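The residualized gain score described above can be sketched as follows. The NAEP values are invented for illustration; the point is only that the "gain" entering the correlations is the standardized residual from regressing the later score on the earlier one, rather than a raw difference.

```python
import math

def gain_residuals(later, earlier):
    """Standardized residuals of a later NAEP score regressed on an
    earlier one -- a simple guard against regression-to-the-mean that
    raw difference scores (later - earlier) do not provide."""
    n = len(later)
    mx, my = sum(earlier) / n, sum(later) / n
    b = (sum((e - mx) * (l - my) for e, l in zip(earlier, later))
         / sum((e - mx) ** 2 for e in earlier))
    res = [l - (my + b * (e - mx)) for e, l in zip(earlier, later)]
    sd = math.sqrt(sum(r ** 2 for r in res) / (n - 1))
    return [r / sd for r in res]

# Illustrative 4th-grade math scale scores for six hypothetical states.
naep_2003 = [233, 236, 238, 240, 242, 245]
naep_2007 = [237, 238, 243, 241, 246, 247]

# Each state's residualized 2003-2007 gain, in standard-deviation units;
# positive values mean a state gained more than its 2003 score predicts.
gain = gain_residuals(naep_2007, naep_2003)
```

These residuals would then stand in for "NAEP gain" when correlated with APR (or with the poverty-adjusted APR residuals), as in Tables 9 and 10.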
As shown in Table 10, our APR is negatively associated with NAEP gains for virtually all student groups in both fourth and eighth grade. Greater pressure in 2004 is associated with decreasing NAEP gains in both fourth and eighth grade math for the 2005-2009 years.
We wanted to see whether high-stakes testing pressure was related to changes in achievement among cohorts of students (i.e., "cohort" analyses follow the achievement trends of students as they progress from fourth to eighth grade). For these, and all subsequent cohort analyses, cohort NAEP gains are calculated as: [eighth-grade achievement in year i] - [fourth-grade achievement in year i - 4]. To account for regression effects in the gain analysis, we generate standardized residuals of 2009 (2007) eighth grade math achievement from 2005 (2003) fourth grade math achievement. We then correlate these residuals with APR and with the APR residuals (removing state poverty). Results displayed in Table 11 suggest that for the 2003-2007 cohort, pressure is related to achievement for African American students, but not for White or Hispanic students. For the 2005-2009 cohort, however, all relationships disappear, and for African Americans the relationship inverts such that greater pressure is associated with declines in cohort NAEP performance.
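The cohort pairing described above can be illustrated with a small sketch; the scores and the keying scheme are hypothetical, and the regression-residual adjustment the authors apply is omitted here for brevity.

```python
# Hypothetical per-state NAEP averages keyed by (grade, year); each list
# holds one average per state, in the same state order throughout.
scores = {
    ("g4", 2003): [233, 238, 241],
    ("g4", 2005): [235, 239, 242],
    ("g8", 2007): [279, 281, 284],
    ("g8", 2009): [280, 282, 283],
}

def cohort_gain(scores, year8):
    """Raw cohort gain: 8th-grade achievement in year i minus 4th-grade
    achievement in year i - 4, pairing the same students as they move
    from fourth to eighth grade, state by state."""
    g8 = scores[("g8", year8)]
    g4 = scores[("g4", year8 - 4)]
    return [e - f for e, f in zip(g8, g4)]

gains_2003_2007 = cohort_gain(scores, 2007)  # cohort tested in 4th grade in 2003
gains_2005_2009 = cohort_gain(scores, 2009)  # cohort tested in 4th grade in 2005
```

In the study itself these raw cohort differences are replaced by standardized residuals of the eighth-grade score regressed on the fourth-grade score, for the same regression-effect reasons as in the non-cohort gain analysis.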

Fourth and eighth grade reading
Correlations of APR and NAEP for fourth and eighth grade reading disaggregated by SES are displayed in Table 12. APR and reading achievement are most strongly and negatively related for low-income eighth graders (2003, 2005, 2009). Similar to the math analysis, we looked at APR and reading achievement disaggregated by student ethnicity in three ways. As displayed in Table 13, bivariate correlations of APR and NAEP reading show no relationship in fourth or eighth grade. However, when poverty is removed from APR, the pattern of associations shifts such that APR and reading achievement in fourth (2003, 2005) and eighth grade (2007) are positively linked for White students. When exclusion rates are partialed out of the relationships, many of the correlations diminish, except for White students in fourth and eighth grade (2003-2007), African American students in fourth grade (2009), and Hispanic students in eighth grade (2003).

Fourth and eighth grade reading: Gain and Cohort Analysis.
As shown in Tables 14 and 15, correlations between APR and NAEP gains in reading are positive for Hispanic students in fourth grade (2003-2007) and eighth grade (both 2003-2007 and 2005-2009). By contrast, the correlation between APR and NAEP gains for Hispanic students in fourth grade for 2005-2009 is negative. Correlation analysis of NAEP cohort gains in reading and APR suggests there are few meaningful relationships for the 2003-2007 time span; however, the relationships between APR and cohort gains among African American and Hispanic students in reading are slightly stronger for the 2005-2009 time frame (Table 16).

Discussion
In this study, we used correlational techniques to look at the relationship of high-stakes testing pressure and student achievement in 25 states. Using our empirically derived measure of state-level high-stakes testing pressure, the Accountability Pressure Rating (APR), developed in an earlier study (Nichols et al., 2006), we looked at the ways in which state-level pressure was associated with state-level achievement as measured by NAEP in reading and math in fourth and eighth grades since the inception of NCLB. The data tell a very familiar story.

Descriptives
Math and reading NAEP data reveal a few interesting patterns. In math, pre-NCLB achievement gains were greater than post-NCLB gains. Thus, students were progressing in math at a much faster rate before the national high-stakes testing movement spawned by NCLB. By comparison, fourth and eighth grade reading achievement remained relatively stable over time, with the exception of small increases for fourth graders (2005-2007) and small decreases for eighth graders (2003-2005) after NCLB. When it comes to NAEP achievement from 2002 to 2009, the institution of NCLB was followed by varied achievement patterns in fourth and eighth grade math.
When disaggregated by ethnicity and SES, White students consistently outscored African American and Hispanic students, and richer students consistently outperformed poorer students. Based on these descriptive data, it appears as if the achievement gaps are narrowing, although very slowly. Elsewhere, Braun et al. (2010), using more sophisticated analytic techniques in an attempt to isolate the effects of NCLB on these trends in 10 states, conclude the following regarding the achievement gap problem: "Although the ten states certainly differed in their outcomes, the general picture is quite clear: The introduction of high stakes test-based accountability through NCLB has had, at best, a very modest impact on the rates of improvement for Black students and on the pace of reductions in the achievement gaps between Black and White students" (pp. 41-42). Our data here cannot explain the nature of these gap trends over time; however, our analysis seems to reiterate the point that achievement gaps have changed only insignificantly as a result of the policies emanating from NCLB.

Correlation Analysis
Our correlation analysis revealed several notable patterns. For example, our data suggest that test-related pressure is significantly and positively correlated with the state poverty index (percent of the state population living in poverty). That is, states with greater numbers of individuals living in poverty also tended to employ test-related practices that exerted greater amounts of pressure. The nation's poorest children, and the teachers who teach them, tend to feel more pressure when it comes to high-stakes tests than their more privileged contemporaries. When disaggregated by SES and race, the data suggest that the relationship between APR and NAEP performance is mixed. In terms of SES, high-stakes testing pressure has no connection to NAEP performance in math. By contrast, APR is more strongly and negatively connected with NAEP performance in reading, especially for low-income students. Thus, high-stakes testing pressure seems to have no measurable connection to NAEP math performance for either rich or poor students, but pressure is deleterious for poor students' NAEP reading performance.
Our data also show that APR is positively correlated with fourth and eighth grade math performance among all groups of students at different points in time (Table 8). Notably, when exclusion rates are removed from the relationship, the APR-NAEP connection diminishes for Hispanic students, raising questions about the role exclusion-rate practices play in facilitating the connection between pressure and test performance. By contrast, when it comes to reading, the only substantive connection between pressure and NAEP performance emerges for White students in both fourth and eighth grades (and in particular in the earlier years of implementation, 2003 and 2005; see Table 13). Overall, these correlations suggest that test-related pressure connects more strongly with increases in math performance than in reading (in both 4th and 8th grades), a pattern that seems more prevalent for White students than for African American or Hispanic students.
We looked at the relationship between APR and NAEP gain scores across two time periods (2003-2007 and 2005-2009) for math and reading. Starting with math, an interesting pattern emerged. For the 2003-2007 gain years, pressure emerged as a more positive correlate with student math gains, especially among eighth graders (Table 9). By contrast, for the later 2005-2009 span, these correlations virtually all transformed into negative relationships. Thus, as time went by, it seemed as if earlier levels of pressure in state policy enactments led to later decreases in math gain achievement. For reading, a mixed picture emerges. Pressure is positively related to Hispanic student performance in both fourth and eighth grade and for both time spans examined. However, the relationships are more mixed for other groups: APR is weakly connected to White student reading performance across both time spans and inconsistently related to African American performance (i.e., sometimes positive, sometimes negative, sometimes strongly, sometimes faintly; see Tables 14 and 15).
In terms of cohort achievement in math, APR is positively related to African American cohort achievement in 2003-2007 and negatively related to it in 2005-2009. Over time, pressure has diminishing returns for African American students. For reading, APR has a positive connection with Hispanic reading performance in both 2003-2007 and 2005-2009. From these results, it is very difficult to reach any simple conclusion regarding the relationship between pressure and student achievement. In some cases there are positive connections, whereas in others there are negative connections. In our earlier study, we rank ordered all of our correlation coefficients to try to ascertain a meaningful pattern. There, the pattern of correlations revealed that the strongest positive associations between APR and NAEP gains were in fourth grade math (Nichols et al., 2006). In Table 17, we rank order by absolute value all correlations emerging from analyses in which student achievement was disaggregated by ethnicity. These 48 correlations reveal an interesting pattern. Among the top 24 (or half) of these correlations, 19 come from math and 5 from reading. By contrast, among the bottom 24 correlations, 19 are in reading and 5 in math. Positive relationships between pressure and NAEP performance exist primarily in math across all subgroups. In contrast to our previous work, in which we found significant relationships between APR and NAEP in fourth grade math only, these correlations appear evenly spread across fourth and eighth grade performance.
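The Table 17 procedure, ranking correlations by absolute value and then counting how many in each half come from math versus reading, can be sketched as follows. The coefficient records here are invented for illustration; they are not the study's 48 actual coefficients.

```python
# Hypothetical correlation records: (subject, grade, subgroup, r).
corrs = [
    ("math", 4, "White", 0.52), ("math", 8, "Hispanic", 0.47),
    ("reading", 4, "White", 0.21), ("reading", 8, "African American", -0.08),
    ("math", 4, "African American", 0.39), ("reading", 4, "Hispanic", 0.12),
]

# Rank by absolute value, largest first, as in Table 17.
ranked = sorted(corrs, key=lambda rec: abs(rec[3]), reverse=True)

# Split into halves and count the subject composition of each.
half = len(ranked) // 2
top, bottom = ranked[:half], ranked[half:]
top_math = sum(1 for rec in top if rec[0] == "math")
bottom_math = sum(1 for rec in bottom if rec[0] == "math")
print(top_math, bottom_math)  # → 3 0
```

With the study's 48 coefficients in place of this toy list, the same counts yield the 19/5 versus 5/19 split reported above.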

Implications
The research on the impact of accountability-based policies on student achievement is varied, limited, and relatively inconclusive. One explanation for this state of affairs is that it is very difficult to isolate cause-effect relationships between complex policy implementation and subsequent student achievement. Still, our data here and elsewhere, as well as work by others, reiterate a familiar story: increased testing pressure is related to increases in achievement in math more consistently than in reading. Differences in the nature of the mathematics and reading curricula, and/or differences in the ways one can prepare for assessments in these two areas, may help explain why state-level pressure to perform well on high-stakes tests is more strongly and positively related to math achievement and negatively related to reading achievement.
Although our overall correlations reveal that pressure is more connected with math achievement than with reading, our gain and cohort analyses tell a slightly different story. When it comes to math, pressure has no relationship to NAEP changes over time (for either cohorts of students or cross-sectional groups of students). By contrast, pressure is positively associated with gain scores in reading for some student groups. This reversal is perplexing and difficult to interpret. If our APR holds up over time (which is questionable; see the limitations section next), then these data suggest that pressure has diminishing returns for math achievement over time, but slightly positive returns for reading achievement (for some student groups only). Some of these trends may be explained by the fact that correlations in both math and reading diminish when exclusion rates are partialed out: schools may be excluding lower scoring students at greater rates in later years.
We still contend from these data, as we did in our earlier study (Nichols et al., 2006), that the overall pattern of correlations (math more strongly connected to pressure than reading) points to the likelihood that, under pressure, teachers grow more efficient at training students for the test. The math curriculum (versus the reading/language arts curriculum) is structured in a way that makes it much more amenable to teaching to the test. However, as our data suggest, as time passes, pressure seems to play more of a role in increasing reading scores, raising the question of how increasing pressure translates into practice in reading classrooms. What are teachers and students doing differently when preparing for reading assessments as a result of this increasing pressure? Although more difficult, it is possible that as time progresses, teachers become more skilled at deconstructing the reading curriculum to help students prepare for test questions. Of course, as NCLB persists, it becomes increasingly important, yet more difficult, to disentangle the effects of pressure on students' ability to take tests from pressure that genuinely affects student learning. This pattern of results is too varied to support any definitive claims regarding how test-based practices in the classroom may connect with these achievement outcomes.

Limitations of Study
There are a few limitations to this study. First, correlational data reveal nothing about the causal nature of relationships. Therefore, although we detect certain consistent patterns in the relationship of APR and NAEP, we cannot make claims regarding causal direction. Further, we recognize that as time passes, it becomes more difficult to ascertain the meaningfulness of correlations between our APR, derived in 2004, and subsequent NAEP data. States' test-based accountability practices have likely changed since then, such that our 2004-derived index may be less relevant. Since we have no measure of accountability practices beyond 2004, we have no way of knowing how accurately the APR captures these changes over time. Still, because no new federal laws mandated sweeping accountability changes until Race to the Top was signed into law in 2010, we believe that states' test-based policies as they stood in 2004 likely changed very little over the course of our study years (2005-2009). Although we have no reason to think states have changed so dramatically that their pressure-based rankings would have shifted in any significant way (e.g., Guisbond, Neill, & Schaeffer, 2012), the question left unexamined is to what degree states' policies under NCLB have evened out over time.

Future Directions
Given that high-stakes testing mandates are not going away anytime soon, it seems important that more research be done to understand how the pressures of testing influence classroom-based teaching practices in different curriculum areas. The areas most affected by cultural practices at home (reading, for example) may require a different policy approach than school subjects more dependent on in-school learning, such as mathematics or science. It is clear that the policy positions of the Obama administration support test-based accountability and that the government is pleased with the pressure those policies exert on the functioning of our schools. It therefore seems incumbent upon policy researchers to continue work that sheds light on the ways in which test-based instructional practices, shaped by accountability pressures, affect students' motivation, development, and achievement.
Fourth and eighth grade math average scores rose more dramatically from 2000 to 2003 (pre-NCLB) than from 2003 to 2009 (post-NCLB).

Figure 4. Eighth grade average math NAEP scale score differences: 2000-2009

Table 4
Fourth and Eighth Grade Reading NAEP: Means and Standard Deviations Disaggregated by Student Ethnicity and Student Socioeconomic Status (25 Study States: 2002-2009)

Average reading performance over time, disaggregated by student ethnicity and socioeconomic status, is also displayed in Figures 5 (fourth grade) and 6 (eighth grade). At both the fourth and eighth grade levels, high SES students and White students outperformed low SES, African American, and Hispanic student groups consistently over time. Average fourth grade reading NAEP scores for all student subgroups (except Hispanics) were relatively flat over 2002-2005, followed by a rise over 2005-2007. High SES and White students leveled off over 2007-2009, whereas Hispanic, low SES, and African American students' performance continued to rise moderately. In eighth grade, NAEP reading performance for all subgroups (except low SES) dropped steadily over 2002-2005, followed by a steady rise over 2005-2009 (Figure 6).

Table 5
Average Reading NAEP Achievement Gap: 2000-2009

In eighth grade, the gap trends vary. The WH gap trends downward over time, whereas the WB and BH gaps increase over 2002-2005, followed by declines over 2005-2007 and then another increase in the BH gap from 2007 to 2009.

Table 6
Correlations of APR and State Level Variables over Time. Note: * = p < .05, n = 25; + = data not available at the time of analysis.

Table 7
Correlations of APR and Math NAEP Disaggregated by Student SES

Table 8
Bivariate, Part, and Partial Correlations: APR and Math NAEP Disaggregated by Student Ethnicity

Table 9
Correlations of APR and 2003-2007 Math NAEP Gain Scores Disaggregated by Student Ethnicity. Note: Partial correlation between APR indicator and NAEP gain indicator, holding 2007 exclusion rates constant (for fourth and eighth grade, respectively).

Table 10
Correlations of APR and 2005-2009 Math NAEP Gain Scores Disaggregated by Student Ethnicity. Note: Partial correlation between APR and NAEP gain, holding 2009 exclusion rates constant (for fourth and eighth grade, respectively).

Table 11
Note: Partial correlations represent the association of APR and NAEP cohort gain indicator while holding 2007 (2009) eighth grade math exclusion rates constant.

Table 13
Correlations: Fourth and Eighth Grade Reading Disaggregated by Student Ethnicity, 2003-2009

Table 14
Correlations and Partial Correlations of APR and 2003-2007 Reading NAEP Gain Scores Disaggregated by Student Ethnicity. Note: Partial correlations represent the association of APR and the NAEP cohort gain indicator while holding 2007 eighth grade reading exclusion rates constant. * = p < .05, ** = p < .01 (two-tailed).

Table 16
Correlations and Partial Correlations: Cohort Analysis for Reading. Note: Partial correlations represent the association of APR and the NAEP cohort gain indicator while holding 2007 (2009) eighth grade reading exclusion rates constant. * = p < .05, ** = p < .01 (two-tailed).

Table 17
Rank Ordering of APR-NAEP Correlations Disaggregated by Student Ethnicity