Over the last fifteen years, many states have implemented high-stakes tests as part of an effort to strengthen accountability for schools, teachers, and students. Predictably, there has been vigorous disagreement regarding the contributions of such policies to increasing test scores and, more importantly, to improving student learning. A recent study by Amrein and Berliner (2002a) has received a great deal of media attention. Employing various databases covering the period 1990-2000, the authors conclude that there is no evidence that states that implemented high-stakes tests demonstrated improved student achievement on various external measures, such as performance on the SAT, ACT, AP, or NAEP. In a subsequent study, in which they conducted a more extensive analysis of state policies (Amrein & Berliner, 2002b), they reach a similar conclusion. However, both their methodology and their findings have been challenged by a number of authors. In this article, we undertake an extended reanalysis of one component of Amrein and Berliner (2002a). We focus on the performance of states, over the period 1992 to 2000, on the NAEP mathematics assessments for grades 4 and 8. In particular, we compare the performance of the high-stakes testing states, as designated by Amrein and Berliner, with the performance of the remaining states (conditioning, of course, on a state's participation in the relevant NAEP assessments). For each grade, when we examine the relative gains of states over the period, we find that the comparisons strongly favor the high-stakes testing states. Moreover, the results cannot be accounted for by differences between the two groups of states with respect to changes in the percentage of students excluded from NAEP over the same period.
On the other hand, when we follow a particular cohort (grade 4, 1992 to grade 8, 1996; or grade 4, 1996 to grade 8, 2000), we find that the comparisons slightly favor the low-stakes testing states, although the discrepancy can be partially accounted for by changes in the sets of states contributing to each comparison. In addition, we conduct a number of ancillary analyses to establish the robustness of our results, while acknowledging the tentative nature of any conclusions drawn from highly aggregated, observational data.