Education Policy Analysis Archives
Volume 9 Number 42
October 16, 2001
ISSN 1068-2341
Editor: Gene V Glass, College of Education, Arizona State University
Copyright 2001, the EDUCATION POLICY ANALYSIS ARCHIVES. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.
Significance of Test-based Ratings
Abstract
In 1998 Massachusetts began state-sponsored, annual achievement testing of all students in three public school grades. It has created a school and district rating system for which scores on these tests are the sole factor. It proposes to use tenth-grade test scores as a sole criterion for high school graduation, beginning with the class of 2003. The state is treating scores and ratings as though they were precise educational measures of high significance. A review of tenth-grade mathematics test scores from academic high schools in metropolitan Boston showed that statistically they are not. Community income is strongly correlated with test scores and accounted for more than 80 percent of the variance in average scores for a sample of Boston-area communities. Once community income was included in models, other factors--including percentages of students in disadvantaged populations, (Note 1) percentages receiving special education, percentages eligible for free or reduced price lunch, percentages with limited English proficiency, school sizes, school spending levels, and property values--all failed to associate substantial additional variance. Large uncertainties in residuals of school-averaged scores, after subtracting predictions based on community income, tend to make the scores ineffective for rating performance of schools. Large uncertainties in year-to-year score changes tend to make the score changes ineffective for measuring performance trends.
B. Schools in the Boston Metropolitan Area
Metropolitan Boston is diverse. Besides the City of Boston it includes many smaller municipalities, all operating their own school systems. These studies consider communities inside Route 128, a highway designed in the late 1940s (now an Interstate), enclosing areas within about 9-12 miles from Boston's government center--that is, Boston and its inner and middle suburbs. They share a public transit system, several public and private utilities, and an economy dominated by service industries. They include poverty areas, concentrations of wealth, middle-income communities, prosperous suburban towns, a few medium-sized cities and one large city. The areas are bounded by the Massachusetts Bay and Atlantic Ocean to the east, Salem and Peabody to the north, Waltham and Newton to the west, and Braintree and Quincy to the south (see Metropolitan Area, 1997).
Schools in the Boston metropolitan area are also diverse. These studies, focusing on testing for graduation, consider only high schools. While a majority of the area's population of high-school age attends public schools, (Note 2) a substantial proportion attends parochial schools that began to be established by the Roman Catholic Church more than 150 years ago. A smaller fraction is taught in other private schools or through home schooling. The Metropolitan Council for Educational Opportunities (METCO), founded in 1963, uses state funding to help send over 3,000 Boston minority students to suburban schools (Orfield, et al., 1997).
Within public school systems there is also substantial diversity. All communities must support regional "vocational," "technical" and "agricultural" high schools. Some such schools began as "manual training" schools in the 1800s. Some communities have closed their local vocational schools; some have merged them with their academic schools.
These studies look in detail only at academic schools, because the curriculum of vocational schools is substantially different and is not designed to prepare students for MCAS tests, an issue of controversy (Nicodemus, 2000). For purposes of these studies there are difficulties with a few communities, including Cambridge, Quincy, Revere and Waltham, which provide vocational education in the same facilities as academic programs (Mass. DoE, 2000f). I chose to include such schools in these studies while noting their special characteristics. Several communities also operate experimental schools, including "pilot schools" in Boston and "charter schools" in several communities (Partee, 1997, and Wood, 1999), as regulated under the Massachusetts Education Reform Act of 1993. All that offer ninth-grade curriculum and above are smaller than the regular academic schools. These schools provide motivational environments and may exercise indirect forms of student selection that differentiate them from other public schools. Primarily because of concerns about small sample sizes, schools with fewer than 100 students per grade are excluded from these studies. So far no experimental school is that large. The City of Boston presents a unique situation. Of its large academic high schools, three are exam schools: the Boston Latin School (founded in 1635) and the more recent Latin Academy (formerly Girls Latin) and O'Bryant School of Mathematics and Science (formerly Boston Technical). These draw away many Boston students who tend to score well on achievement tests, promoting a longstanding social stratification in Boston schools. Over half the students at Boston Latin come to it from parochial and other private schools (Daley, 1997); some say those students would not otherwise attend Boston schools. However, other public school students who are not admitted leave the district for high school. 
Starting in 1975, because of federal court orders to desegregate, exam school admission policies included a 35 percent set-aside for African American and Latino students, maintained voluntarily after 1987. As a result of another federal court decision (McLaughlin, 1996), this approach was weakened in 1997. As with academic schools that provide vocational education, the Boston exam schools are included in these studies, but their special characteristics are noted.
C. Statewide MCAS Test Results
Table 1-1 shows that, statewide, tenth-grade MCAS test scores have remained nearly constant in English language arts and in science and technology for the years 1998-2000, while scores in mathematics have risen substantially (Mass. DoE, 2000h). (Tenth-grade tests were not given in history and social science.)
Table 1-1 reflects the Massachusetts Department of Education practice of recording students absent for a test section as scoring 200 and in Level 1, the lowest level (Mass. DoE, 2000h, Table 11 footnote). An undisclosed fraction of students were excluded from testing because of special conditions and are not counted in this report; others may have been provided with an alternative assessment. As currently planned, students in Level 1 will be ineligible to graduate from high school as of 2003. Based on this record of scores, about half of all Massachusetts public school students are at risk of being denied graduation.
The labels of the four "performance levels" designated by the 1998 Board of Education (Note 3) for reporting MCAS results are:
Although these levels have qualitative descriptions (Mass. DoE, 1998b), there are no quantitative links to levels of achievement specified in academic standards; content of standards has not been prioritized; nor have standards been promulgated through state regulations, as anticipated by law. (Note 4) Although Massachusetts law requires "competency determination" in mathematics, science and technology, history and social science, foreign languages and English, (Note 5) Massachusetts laws and regulations continue to require only US history and physical education as subjects of instruction. Massachusetts tries to set legal standards for learning indirectly (Note 6) through MCAS tests, procedures to set scale factors, and regulations for minimum scaled scores. It lacks corresponding legal commitments for instruction. It has made major changes to "curriculum frameworks" every few years (Mass. DoE, 2000k) and has not provided reasonable spans of time for instruction to catch up before using revised "curriculum frameworks" as a basis for revised MCAS tests. Its teachers, parents and students cannot find out exactly what must be learned in order to meet minimum standards for high school graduation. The 1993 Education Reform Act left several such problems; few have been addressed yet by the Massachusetts legislature or Board of Education. Students with disabilities (also called special education students) and students with limited English proficiency (LEP students) tend to receive drastically lower MCAS scores than other students, although some students with disabilities are soon to be provided alternate assessments (Mass. DoE, 2000l), and some LEP students have been able to take tests in Spanish (Mass. DoE, 2000d). The Department of Education has not disclosed the fractions of students who are eligible for or have utilized its special accommodations, although it has published statewide summary data using these student categories (Mass. DoE, 2000i, Table 14.5). 
Most minority students also receive lower scores than other students. The Department of Education has published 1999 statewide and district summary data for students categorized as "African American / Black," "Asian or Pacific Islander," "Hispanic / Latino," "Native American," "White" and "Mixed" (Mass. DoE, 2000c, Tables 5-10). As previously noted, most students in vocational programs receive lower MCAS scores than students in academic programs; this can readily be shown for the state's more than 30 vocational, technical and agricultural high schools (Appendix 2). Based on the sources of information cited, Table 1-2 shows statewide impacts of these known risk factors on average 1999 tenth-grade mathematics scores and rates of failure.
The Department of Education has not reported scores classified by other potential risk factors on which it collects information. These include: gender of students; tests taken in Spanish or as alternate assessments; free or reduced price lunches, as indicators of poverty; schools with large class sizes, especially in early grades; students retained below grade or placed below grade level; and teachers who lack certification in their subjects of instruction. There is also little published information about combinations of risk factors. However, since the Department of Education lists regional vocational, technical and agricultural schools as separate districts in its reports of MCAS results, it is possible to use their categories of minority students (Appendix 2). For those schools for which categories are reported, results are shown in Table 1-3.
While the results in Table 1-3 are not strictly comparable with Table 1-2, because not all the schools and categories can be found in published data, they indicate that factors can combine to worsen the scores of students with more than one risk factor. Recent studies question assumptions that "high-stakes" tests like MCAS can provide valid measures of either student achievement or school performance, showing gains on them that are not matched by gains on other tests for closely related educational content (Haney, 2000, and Klein, et al., 2000). Political environments of "high-stakes" tests create heavy pressure to improve scores, regardless of underlying educational progress. For "low-stakes" tests aimed at measuring long-term trends, like those of the federal NAEP, it has been shown that "family variables explain most of the variance across scores in states" (Grissmer, et al., 2000, Chapter 9). Individual and longitudinal studies demonstrate strong influences of parenting practices, family structure, parent education and degrees of poverty on cognitive development (for example, Smith, et al., 1997). Other longitudinal and cross-sectional studies show cumulative responses of test scores to educational environments (for example, Phillips, et al., 1998, and Ferguson, 1998). However, the data generally available for test score research fail to capture much of the critical information needed to understand development of cognitive abilities and educational achievement in the settings of public schools. MCAS test scores have already been the subject of several attempts to explain, predict or interpret them (Mass. DoE, 2001, Gaudet, 2001, Tuerck, 2001a, and Tuerck, 2001b). These prior MCAS test score studies fall into three main categories: 1) Trends studies of year-to-year and multi-year changes; 2) Effects studies involving social factors for the population; 3) Effects studies involving operating factors for the schools. 
Research on scores from school-based standard tests suggests that many such studies are likely to yield results of low significance. Grissmer, et al., 2000, among others, show that:
The MCAS test score studies cited use scores and statistical data to estimate the performance of schools or districts according to simple formulas, unsupported by other evidence. They frequently present results in a table that is ranked or can be ranked like the teams in a sports league. The "league table" approach to presenting such results begs the question of whether the ordering of schools or districts and the differences in performance estimates have educational significance, that is, whether such rankings may instead be largely matters of chance or be associations with factors other than school performance. This article presents a trends study and an effects study I conducted to explore the significance that can be associated with such results. The school characteristics used in these studies are taken from information reported by public schools to the Massachusetts Department of Education for 1999 and published by the Department (Mass. DoE, 2000f). MCAS test scores summarized by schools are from 1998-2000 Department reports (Mass. DoE, 2000h). Other information is published by the Department for school districts, including program budgets and percentages of special education students. Information for census tracts and communities is available from the US Bureau of the Census and other sources. Data analysis for these studies focuses on information associated with individual schools because aggregate information for school districts or general populations can mask school characteristics. Data used in these studies are reproduced in Appendix 3 and Appendix 4; interested readers can confirm them at the sources and can repeat these studies or perform other analysis with them. The Department of Education and the school districts collect other potentially useful information that is not currently published. Of particular interest are data on class size and teacher preparation. 
Recent research has shown significant association of educational achievement as measured by "low-stakes" tests with small class size in elementary schools (Nye, et al., 1999, and Krueger, 1999) and with teacher certification and education (Darling-Hammond, 2000), after adjustments for student backgrounds. Studies of the development of cognitive abilities cast doubt on whether other information currently published by government sources about population and economic characteristics in large geographical areas would substantially improve the understanding of test scores.
A. Trends Study of Variability
This study considers 47 academic high schools in 32 metropolitan Boston communities through the average tenth-grade MCAS mathematics test scores recorded for years 1998-2000. Achievement tests in mathematics typically require substantial skill at language interpretation (see, for example, Gipps and Murphy, 1994, Chapter 6, p. 183). Haney, 2000, in a study of another state, found stronger correlations of state mathematics test scores with grades in English than with grades in math. As previously noted, the tenth-grade mathematics test is used in this study of significance because it sets a graduation threshold for most students. Test boycotts have been organized by students in several schools each year (Steinberg, 2000), involving 10 to 31 percent of students in 19 cases out of the 141 test samples. To be able to compare average scores of schools more accurately, the average scores reported by the Department of Education have been adjusted by removing the scores of 200 that were assigned to students who did not take the test, averaging only scores of students who participated. Table 2-1 shows changes in schools' average scores (Appendix 3) between 1998 and 1999 and between 1999 and 2000, expressed in units of scale points and of standard deviations.
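The adjustment described above, removing the placeholder scores of 200 assigned to non-participants, can be sketched as follows. This is a minimal illustration, not the author's own procedure: the function name and the example numbers are hypothetical, and it assumes that the reported average counts every absent student at exactly 200.

```python
def adjusted_average(reported_avg, n_reported, n_absent, absent_score=200.0):
    """Average over participants only.

    Assumes (hypothetically) that reported_avg was computed over n_reported
    students, of whom n_absent were recorded at absent_score.
    """
    n_participants = n_reported - n_absent
    if n_participants <= 0:
        raise ValueError("no test participants remain after adjustment")
    # Remove the placeholder scores from the total, then re-average.
    total_points = reported_avg * n_reported - absent_score * n_absent
    return total_points / n_participants


# Hypothetical school: 200 students counted, 40 of them absent,
# reported average of 220 scale points.
print(adjusted_average(220.0, 200, 40))  # 225.0
```

With a boycott of this size the reported average understates the participants' average by five scale points, which is why unadjusted averages are hard to compare across schools.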
A "delta" is a change in scale points between two years, minus the average change for the year span, divided by a standard error of the change that is estimated from test reliability and number of participants. The uncertainty for one test score is based on an average standard error of 6.7 scale points, from reliability estimated by the Department of Education for the tenth-grade mathematics test of 1998, using randomized split-half comparisons (Mass. DoE, 1999c). The variance of an average score for a school is estimated by the square of this quantity, plus variance contributed by roundings, divided by the number of test participants. A delta expresses the change for a particular school beyond the average change, in units of estimated standard errors. When standard errors are accurately estimated, deltas outside +/- 2 are statistically significant at the p<.05 level, but here about half of the schools in both year spans have deltas well outside this range. Especially large deltas were those shown in Table 2-2.
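The definition of a delta can be written out as a short calculation. The sketch below is an illustration under stated assumptions rather than the article's exact computation: it assumes the two years' averages are independent, so that their variances add, and it leaves the rounding variance as a parameter (defaulting to zero) because the article does not give its value. All function and parameter names are hypothetical.

```python
import math

SE_SINGLE_SCORE = 6.7  # scale points, the reliability-based figure quoted in the text


def variance_of_school_average(n_participants, rounding_variance=0.0):
    """Variance of a school average: (6.7^2 + rounding variance) / participants."""
    return (SE_SINGLE_SCORE ** 2 + rounding_variance) / n_participants


def delta(score_y1, n_y1, score_y2, n_y2, avg_change, rounding_variance=0.0):
    """Change beyond the all-school average change, in standard-error units.

    Assumption: the standard error of the change combines the two yearly
    variances in quadrature (independent errors).
    """
    change = score_y2 - score_y1
    se_change = math.sqrt(
        variance_of_school_average(n_y1, rounding_variance)
        + variance_of_school_average(n_y2, rounding_variance)
    )
    return (change - avg_change) / se_change


# Hypothetical school: 200 participants each year, average rising from 225 to 231
# while the all-school average change is 2 scale points.
print(round(delta(225.0, 200, 231.0, 200, avg_change=2.0), 2))
```

For this hypothetical school a gain of only four points beyond the average produces a delta of about 6, which shows how small the reliability-based standard errors are compared with ordinary year-to-year movement.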
Statistically large deltas occur about ten times as frequently as they would by chance, were test reliability the only factor in standard errors. Since deltas adjust for changes in test difficulty and in average efforts at teaching, learning and test preparation, there appear to be other factors affecting test score changes that differ from school to school and from year to year. However, the Department of Education has not published any studies on other variability factors. Table 2-3 reflects data for all 47 schools included in this study, showing the average score changes in scale points for the 1998-2000 years, weighted by numbers of test participants.
The average score change for all schools would be highly significant for both year spans, at the p<.0001 level or better, if test reliability were the only factor in variability. Scatterplots of score changes in Figure 2-1 and Figure 2-2, showing deltas for a span versus average scores in 1998, do not indicate strong relationships but show possible outliers; these are the schools noted in Table 2-2. These plots of year-to-year changes provide a picture of the variability in school-averaged test scores, which is much greater than estimates based on test reliability and number of test participants. Evaluating year-to-year point score changes, excluding three cases that have deltas beyond +/- 10 as outliers, yields a practical measurement of variability for school-averaged MCAS test scores.
Figure 2-1: Changes in MCAS Grade 10 Math Test Scores, 1998-1999
Figure 2-2: Changes in MCAS Grade 10 Math Test Scores, 1999-2000
Standard deviations of school-averaged score change, less average score change for a year span for all schools, are 2.9 scale points for 1998-1999, 3.1 scale points for 1999-2000 and 3.0 scale points for both spans combined. These are about five times larger than the uncertainty estimated on the basis of test reliability. As one might expect, scatterplots of score changes in scale points versus school sizes show greater variability for smaller schools. An analysis found that an equivalent standard error in scale points for school averages of grade 10 MCAS mathematics test scores can be estimated as the constant 33 divided by the square root of the number of test participants. This estimate of standard error combines contributions from test reliability with random variations in student mental alertness, student backgrounds and school performance. For typical metropolitan Boston schools from these studies, school-averaged test score gains, losses or differences of less than about five scale points are not statistically significant at the p<.05 level. Average mathematics score increases from 1998 to 2000 for all schools in these studies combined substantially exceed the rates of change that other studies have found to reflect genuine and sustainable improvements in learning. Based on test reliability as a measure of variance, the change is more than 20 standard deviations; based on calculated variance, it is more than four standard deviations. As noted, other tenth-grade MCAS tests showed little score change. Statistical magnitudes of mathematics test score changes strongly suggest causes other than or in addition to improvements in learning. Anecdotal accounts report heavy efforts at test preparation in some schools, but the general upsweep in scores indicates that the 2000 mathematics test may also have been significantly easier than corresponding 1998 and 1999 tests.
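The empirical standard error estimate given above, 33 divided by the square root of the number of test participants, is easy to apply to a school of any size. The sketch below is illustrative only: the function names are hypothetical, the 1.96 multiplier is the usual two-sided normal value for p<.05, and treating two averages as independent when comparing them is an assumption, not something the article states.

```python
import math

SE_CONSTANT = 33.0  # scale points, the constant quoted in the text


def school_average_se(n_participants):
    """Equivalent standard error of a school-averaged score, in scale points."""
    return SE_CONSTANT / math.sqrt(n_participants)


def two_average_threshold(n_a, n_b, z=1.96):
    """Smallest difference between two averages that reaches |z| standard errors.

    Assumption: the two averages have independent errors, so variances add.
    """
    se_difference = math.sqrt(school_average_se(n_a) ** 2 + school_average_se(n_b) ** 2)
    return z * se_difference


# Hypothetical school sizes, to show how the uncertainty shrinks with size.
for n in (100, 250, 400):
    print(n, round(school_average_se(n), 2))
```

Because the standard error falls only with the square root of school size, quadrupling the number of participants halves the uncertainty; even the largest schools in a sample like this one keep uncertainties of more than a scale point.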
Sparse published information from the Department of Education about test calibration makes this issue difficult to trace.
B. Effects Study Involving Social Factors
This study, like the trends study in Section 2A, considers 47 academic high schools in 32 metropolitan Boston communities, through average tenth-grade MCAS mathematics test scores as recorded for 1999. School-specific social factors considered in this study (Appendix 4), as of 1999, are listed in Table 2-4.
The factors in Table 2-4 were used as independent variables in linear models for a dependent variable of 1999 school-averaged tenth-grade MCAS mathematics test scores. (Note 7) Residuals from the models are considered as estimators of school performance. Variances and error estimates are calculated by conventional multivariate methods (Bevington and Robinson, 1992). For 1999, MCAS tests were, in the parlance of US public schools, a "medium-stakes" enterprise, associated with some indirect social pressures but no hazardous consequences for students. No current high school students stood to be denied graduation because of test scores, although summaries of test scores by districts and schools were being published. This was the second year of regular testing. It had been preceded by a year of trial testing, following three years of curriculum specification and test development. Note that grade size reduction, calculated as a percent decrease in grades 11 and 12 school population as compared with grades 9 and 10, is not identical with dropout rate. While dropout statistics are available, as in other states they are compromised by lack of consistent longitudinal data for educational outcomes. Grade size reduction simply indicates that, for whatever reasons, later year high school grades are smaller than earlier year grades. When substantial, it suggests that many students do not graduate in a normal time sequence. (Note 8) Values in about +/- 5 to +/- 8 percent ranges will be typical of fluctuations for schools of these sizes with stable area boundaries and populations and very low transiency, retention and dropout rates, according to Poisson statistics. The Department of Education publishes only district information about budgets and special education. Without an accurate means to apportion such measurements to high schools (and in four communities to specific high schools), they have been excluded from consideration. There is considerable variation. 
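The remark above that grade size reductions of about +/- 5 to +/- 8 percent are typical chance fluctuations under Poisson statistics can be illustrated with a rough calculation. The article does not show its own computation, so the sketch below is one plausible reading: it assumes that the enrollments of grades 9-10 and of grades 11-12 are independent Poisson counts and uses a first-order error propagation for their ratio. The function names and enrollment figures are hypothetical.

```python
import math


def grade_size_reduction(lower_grades, upper_grades):
    """Percent decrease of grades 11-12 enrollment relative to grades 9-10."""
    return 100.0 * (lower_grades - upper_grades) / lower_grades


def poisson_fluctuation_percent(lower_grades, upper_grades):
    """Approximate one-standard-deviation chance fluctuation, in percent.

    Assumption: both enrollments are independent Poisson counts, so the
    relative variance of their ratio is about 1/lower + 1/upper.
    """
    ratio = upper_grades / lower_grades
    return 100.0 * ratio * math.sqrt(1.0 / lower_grades + 1.0 / upper_grades)


# Hypothetical stable schools with equal lower and upper enrollments.
for enrollment in (300, 400, 800):
    print(enrollment, round(poisson_fluctuation_percent(enrollment, enrollment), 1))
```

Under these assumptions, two-grade enrollments of roughly 300 to 800 students give fluctuations of about 8 to 5 percent, in line with the range quoted in the text.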
District percentages of special education students ranged from 11.1 to 25.5 percent of all students and grades in 1999. Annual district spending reported for all regular education programs ranged from $3,986 to $9,251 per student in 1999. Even within this small group of communities, the Education Reform Act of 1993 failed to equalize school spending. Factor distributions and correlations for this model are shown in Figure 2-3, a matrix of histograms and scatterplots with unweighted, unconstrained lines of best fit. Figure 2-3: Factor Distributions for MCAS Grade 10 Math Test Scores
Although there are strong associations in Figure 2-3, such as "Percent Hispanic / Latino" with "Percent limited English proficiency," there are no factors so highly correlated as to be entirely redundant. The numerical correlation matrix is presented in Table 2-5, corresponding to the matrix of plots in Figure 2-3, with all values beyond about +/- 0.5 significant at the p<.05 level.
Some of the correlations in Table 2-5 are strong enough that multiple regression coefficients are likely to be unstable. Therefore a model was developed in stages, examining factors for significance. The full model from the factors in Table 2-4 was first evaluated with weights proportional to numbers of test participants. It yielded two strong factors with low correlation (C and E): "Percent Asian or Pacific Islander" at p<.02, with a positive coefficient, and "Percent limited English proficiency" at p<.002, with a negative coefficient. Factors for school population and percent grade reduction had particularly small coefficients and low significance. They were removed, and a model with the remaining five factors then associated 67 percent of the variance and produced the factor weights shown in Table 2-6.
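A linear model with weights proportional to numbers of test participants, of the kind described above, can be sketched in plain Python. This is a generic weighted least-squares fit through the normal equations, offered as an illustration and not as the software or exact procedure the author used; the function names and the small data set are hypothetical. It returns the coefficients, the residuals and the weighted fraction of variance associated by the model.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: some factors may be redundant")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def weighted_linear_fit(factors, scores, weights):
    """Weighted least squares with an intercept term.

    Returns (coefficients, residuals, fraction of weighted variance associated).
    """
    rows = [[1.0] + list(f) for f in factors]  # leading 1.0 is the intercept
    k = len(rows[0])
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, y, w in zip(rows, scores, weights))
            for i in range(k)]
    coef = solve_linear_system(xtwx, xtwy)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    residuals = [y - f for y, f in zip(scores, fitted)]
    mean = sum(w * y for w, y in zip(weights, scores)) / sum(weights)
    ss_res = sum(w * e * e for w, e in zip(weights, residuals))
    ss_tot = sum(w * (y - mean) ** 2 for w, y in zip(weights, scores))
    return coef, residuals, 1.0 - ss_res / ss_tot


# Hypothetical data: two percentage factors for five schools, with
# average scores and numbers of test participants as weights.
factors = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 2.0]]
scores = [230.0, 226.0, 232.0, 229.0, 228.5]
participants = [100, 200, 150, 300, 250]
coef, residuals, variance_associated = weighted_linear_fit(factors, scores, participants)
print([round(c, 3) for c in coef], round(variance_associated, 3))
```

Removing a factor and refitting, as the staged approach in the text does, amounts to dropping a column from `factors` and comparing the fraction of variance associated before and after.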
With model factors in Table 2-6, high factor weight and significance found in other studies for percentages of African American or Latino students disappear. Both factors have small coefficients and low significance. Statistical weight that might have been attached to these factors instead follows cultural and economic factors: "Percent limited English proficiency" and "Percent free or reduced price lunch." As an experiment, the model was rerun with the latter factors removed; only 57 percent of the variance was associated, and factor weights became those shown in Table 2-7.
In Table 2-7, two "racial" or "ethnic" factors (marked *) have become significant at a p<.05 level. The coefficient for "Percent African American" has turned from positive to negative, and the coefficient for "Percent Hispanic / Latino" has become strongly negative. It seems likely that these two factors are acting as proxies for cultural and economic factors with more predictive power. Residuals from the five-factor model of Table 2-6 are shown in Table 2-8. These include standard error estimates based on results from the trends study of Section 2A.
At first glance, some residuals in Table 2-8 look substantial, several scale points of difference from the average scores predicted by the model. However, residual ratios for most schools are within +/- 2 standard errors, not significant at a p<.05 level. Someone familiar with metropolitan Boston will recognize that schools with high and low residual ratios tend to be in high-income and low-income communities, respectively. It therefore seems likely that adding a factor for incomes can increase the predictive power of the model. The most recent community income data were from the US Census of 1990, for 1989 per-capita income. Comparable 1999 income statistics were not yet available. The Massachusetts Department of Revenue could produce current community income statistics but has not done so; the state continues to use 1989 federal census data on incomes to apportion aid to public schools. After adding 1989 per-capita community income in $1,000s as a factor (Mass. DoR, 1999), without any attempt to adjust incomes so as to reflect school districts or student households, the model associated 80 percent of the statistical variance, and factor weights became those shown in Table 2-9.
Three factors in Table 2-9 (marked *) have substantial significance, at a p<.005 level or better, and three have very low significance. Factor weight has shifted from "Percent free or reduced price lunch" to "Per-capita community income (1989)," while "Percent limited English proficiency" retains a large coefficient and high significance. Dropping low-significance factors, the resulting three-factor model is shown in Table 2-10.
The three-factor model of Table 2-10 also associates 80 percent of the statistical variance. All of its factors are statistically significant at a p<.001 level. For each school included in these studies, Table 2-11 presents adjusted average 1999 tenth-grade MCAS mathematics test scores and residuals from the three-factor statistical model of Table 2-10, with the uncertainties in average scores and residuals expressed as standard errors, based on the variance estimate calculated in the trends study of Section 2A.