Education Policy Analysis Archives
Volume 10 Number 9
January 28, 2002
ISSN 1068-2341
Editor: Gene V Glass College of Education Arizona State University
Copyright 2002, the EDUCATION POLICY ANALYSIS ARCHIVES. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.
Confusing the Messenger with the Message:
Abstract
In an analysis of the Massachusetts graduation examination, Bolon (2001) examined the aggregate grade 10 mathematics test scores for 47 high schools and the demographic characteristics of the communities in which they were situated. From several data analyses, Bolon determined that since the best single predictor of mean high school score was community per capita income, "The state is treating scores and ratings as though they were precise educational measures of high significance. A review of tenth-grade mathematics test scores from academic high schools in metropolitan Boston showed that statistically they are not." Further, when removing the variability due to per capita income, "Large uncertainties in residuals of school-averaged scores, after subtracting predictions based on community income, tend to make the scores ineffective for rating performance of schools. Large uncertainties in year-to-year score changes tend to make the score changes ineffective for measuring performance trends."

While we agree with Bolon's concerns on the whole, we find little support for them in the evidence he presents. Our discussion below details our concerns.

Predicting aggregate test scores

One of the problems with regression analysis is that without reasonable theoretical support, all sorts of predictors can be found that produce high correlations. In examining aggregate scores, such as high school test means, it is no secret that for many decades, as Bolon himself pointed out (Bolon, 2000), achievement has been associated with socioeconomic conditions in communities. In earlier eras, when school spending was much more unequal, these differences were more indicative of opportunity to learn for students.
In a judicial climate that has tended to minimize, although not eliminate, such disparities, this interpretation is much less persuasive, although it remains an important area for study.

The difficulty with using a community aggregate measure as a predictor is that it is a surrogate for many other indicators, some of which are absurd at face value but interpretable. Variables such as driver's-license passing rate or per capita champagne consumption may predict student achievement as well as community per capita income. We can construct meaningful arguments why they might. None of these predictors invalidates the test under accepted standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999).

In other areas of research such aggregation has produced fundamentally misleading conclusions. For example, the literature on intelligence and income is directly parallel to the discussion here. White (1982) demonstrated the difference between using an aggregate measure of SES (school or community) and an individual measure in relating SES to intellectual functioning. Since Bolon used school as his unit of analysis, he eliminated proximate measures more appropriate to his analysis. The school-level variables Bolon eliminated are more appropriate than community per capita income on this basis, if in fact they were school-based and not district-based. Measures such as free and reduced lunch (FRL) are better indicators for elementary school than for secondary school analyses, however, because of the social undesirability of either participating or reporting among secondary students, who tend to have independent means for buying lunches. The principle of proximity in selecting variables should be carefully considered and invoked. Mixing levels of analysis produces uninterpretable results, as hierarchical linear modeling advocates have pointed out. Bolon erred in this way, we argue.
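The difference between aggregate and individual measures discussed above can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not a reanalysis of Bolon's or White's data: the number of students per school, the income range, the score scale, and every variance below are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_schools=47, n_students=200, seed=0):
    """Return (student-level r, school-level r) for synthetic data."""
    rng = random.Random(seed)
    incomes, scores = [], []                  # one entry per student
    community_income, school_mean = [], []    # one entry per school
    for _ in range(n_schools):
        community = rng.uniform(20.0, 60.0)   # assumed community income, $1000s
        school_shift = rng.gauss(0.0, 3.0)    # assumed school-specific effect
        school_scores = []
        for _ in range(n_students):
            # Family income scatters widely around the community figure.
            family = community + rng.gauss(0.0, 15.0)
            # Individual scores vary far more than school means do.
            score = 200.0 + community + school_shift + rng.gauss(0.0, 30.0)
            incomes.append(family)
            scores.append(score)
            school_scores.append(score)
        community_income.append(community)
        school_mean.append(sum(school_scores) / n_students)
    return pearson(incomes, scores), pearson(community_income, school_mean)


r_student, r_school = simulate()
print(f"student-level r = {r_student:.2f}, school-level r = {r_school:.2f}")
```

Under these assumptions, averaging within schools removes most of the student-level variation, so the school-level correlation comes out much larger than the student-level one even though both are computed from the same simulated students. The size of the gap depends entirely on the assumed variances.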
Test validity: AERA/APA/NCME Standards

The Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999) list 24 points related to validity. We will review those we believe to be relevant to Bolon's argument and attempt to show that his representation is irrelevant to any of them. Standard 1.1 requires a rationale for each recommended interpretation and use of test scores with a summary of evidence and theory. Standard 1.6 requires content validity procedures to be described and justified. While we do not pretend to know the Massachusetts tests in detail, we have a great deal of familiarity with those in our own state, and with the arguments focused on such high stakes tests. The foremost rationale presented in all such state testing programs is the content of the state curricula or guidelines.

Challenges to content validity have been consistently thrown out by courts, including a recent case here in Texas (Mehrens, 2000, citing GI Forum et al. vs. TEA et al., CA No. SA-97-1278-EP, U.S. District Court, Western District of Texas, San Antonio, TX). Mehrens invoked the 1985 Standards to review the Texas statewide assessment in a process we follow more briefly here. The congruence of test content with intended instruction is a central focus of test development. Nothing about test content appropriateness was evident in the analysis of income prediction of performance by Bolon. A more comprehensive and focused analysis might ask if schools in lower income communities do not adhere to the state guidelines, or if their teachers are unprepared to teach the mathematics required, or suitable textbooks are not available, so that students do not have an opportunity to learn. These representations might make a case for the relevance of income in dismissing the mathematics test as a precise educational measure of high significance. The per capita income disparities in our state are much greater than those shown by Bolon.
Our experience with our own Texas Assessment of Academic Skills (TAAS) at all levels and content areas has convinced us that income inequities, while important, are not the most useful explanatory variable in school performance. With much larger databases available to us, such as multiyear summaries of all schools in Texas by grade, we see much greater variation in school performance than is shown in the 47 schools Bolon selected. Although the correlation is much weaker than the .9+ Bolon presented, it is nevertheless substantial and meaningful. When looking at scatterplots of performance, however, we are struck by the existence of very high-poverty community schools that manage to score very high on the TAAS. For example, Fig. 1 shows the scatter of TAAS reading against percent economically disadvantaged students (a school-level measurement) for about 3000 schools with 3rd grade classrooms; we would call the latter measure a surrogate for per capita income. What is of interest is the top of the graph, and the many schools that perform in a manner the state defines as excellent. The correlation data are reported in Table 1 (the approximately 800 schools not reporting economic disadvantage had somewhat higher TAAS scores than those that did report).
Per capita income is an uninterpretable predictor; its relatedness, or not, to school achievement tells us nothing about the stakes being tested, high or not. It fails the theory criterion of Standard 1.1.

Instructional effects

While income is related to achievement, whether in Boston or Texas, the central issue is what students enter a school year knowing, what the school teaches them, and what part of the content taught is assessed by the end-of-year test. Standard 1.15 is most relevant: "When it is asserted that a certain level of test performance predicts adequate or inadequate criterion performance, information about the levels of criterion performance associated with given levels of test scores should be provided." Per capita income does not provide any insight into this, nor, unfortunately, does the year-to-year change score.

We are unaware of any state that has actually conducted an instructional effect study with pilot versions of its tests to examine the sensitivity of its high stakes tests to instruction. The first author was a member of a committee formed by the legislature of the state of Texas to recommend the structure of the current accountability system (College of Education, Texas A&M University; LBJ School of Public Affairs, The University of Texas at Austin; College of Business, University of Houston, 1993). In the course of committee discussion, the first author suggested that only with some form of pre-post within-year assessment at the student level would there be even minimal evidence for instructional change. This suggestion was ultimately rejected by politicians as too costly to consider. Instead, year-to-year student (and school) change was later made into a methodologically suspect statistic, the Texas Learning Index. Bolon has made the same error in considering longitudinal change within test. The alternative explanations for yearly change negate any interpretation about large uncertainties.
Student composition, student mobility, curricular emphases, teacher stability, administrative upheavals, and historical internal validity threats all may explain the variation in a school. Unless and until those are explored and discounted, Bolon's analysis does not support any particular validity threat to the test. We agree that schools, before being held accountable, must be examined carefully for the alternatives listed above. Year-to-year comparisons are inherently flawed due to internal validity threats; the connection between instruction and student performance is weak. It is only because of unwillingness to investigate the actual productivity of the school that a year-to-year comparison is made.

Content limitations

Another major limitation in any interpretation of either static (between school) or dynamic (within school) variation in performance lies in the test items and the sampling of the curriculum. Most high stakes tests are too brief to represent the curriculum adequately. Bolon does not discuss the characteristics of the 10th grade mathematics test. Our experience with the exit level math exam in Texas is that its content (typically grade 8 arithmetic and pre-algebra) is unrelated to what students studied in the last 2-3 years, and while it possesses reasonable internal consistency (.90+), with only about 40 items it is too brief to span the domain. As Mehrens (2000) pointed out for the Texas graduation test, states such as Massachusetts conduct the technical aspects adequately. The standards (either for 1985 or for 1999) will be met. Nevertheless, although important concepts are sampled, the tests are brief, certainly briefer than one would wish to generate a score representing 10 or 11 years' schooling.

The Texas released 10th grade mathematics examination (Texas Education Agency, 2000) has 40 items.
From a review of the content, it appears that at best only one or two items assess topics not covered in grade 8 or below, while one item (19) appears to be a spatial rotation task more appropriate to an intelligence test. The inadequacy of such a test to evaluate 9 or 10 years' mathematics learning is, if not self-evident, at least empirically testable. One can conceive of various research investigations involving interviews with teachers and students and performance demonstrations by students on the full range of TEKS objectives to evaluate how well a short form such as the TAAS estimates actual mathematics declarative and procedural knowledge. In the 1993 discussions in Texas cited above, the introductory letter by Charles Miller (1993) made clear that the committee proposed to eliminate the 10th grade test in favor of specific grade 10 subject matter tests such as Algebra and Biology. While there was an obvious concern for creating a set of hurdles, the committee's recommendation was based on testing students over content more proximate to their instruction.

Year-to-year stability

Table 1 presents correlations within and across year for grades 3-5 for 1999-2000.

Table 1
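The kind of within-year and across-year correlation matrix described above could be computed along the following lines. This is a minimal sketch under stated assumptions and does not reproduce the values of Table 1: the school means are synthetic, the column labels, the number of schools, and both variance components are hypothetical, and the function names are illustrative only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def synthetic_school_means(n_schools=300, seed=1):
    """Synthetic school means for grades 3-5 in two years (assumed values)."""
    rng = random.Random(seed)
    labels = [f"grade{g}_{year}" for year in (1999, 2000) for g in (3, 4, 5)]
    columns = {label: [] for label in labels}
    for _ in range(n_schools):
        stable = rng.gauss(80.0, 8.0)   # assumed persistent school level
        for label in labels:
            # Assumed cohort- and year-specific fluctuation around that level.
            columns[label].append(stable + rng.gauss(0.0, 5.0))
    return columns


def correlation_table(columns):
    """Pairwise correlations for a dict mapping column name to school means."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


table = correlation_table(synthetic_school_means())
print(f"grade 3, 1999 vs 2000: {table[('grade3_1999', 'grade3_2000')]:.2f}")
print(f"1999, grade 3 vs grade 4: {table[('grade3_1999', 'grade4_1999')]:.2f}")
```

Because every synthetic column shares the same persistent school component, the within-year and across-year correlations here are similar by construction. With real school data, differences between the two would reflect changes in student composition, mobility, staffing, and the other alternative explanations listed earlier.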
Copyright 2002 by the Education Policy Analysis Archives

The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, glass@asu.edu or reach him at College of Education, Arizona State University, Tempe, AZ 85287-2411. The Commentary Editor is Casey D. Cobb: casey.cobb@unh.edu.

EPAA Editorial Board