

 

Education Policy Analysis Archives

Volume 10 Number 9

January 28, 2002

ISSN 1068-2341


A peer-reviewed scholarly journal
Editor: Gene V Glass
College of Education
Arizona State University

Copyright 2002, the EDUCATION POLICY ANALYSIS ARCHIVES .
Permission is hereby granted to copy any article
if EPAA is credited and copies are not sold.

Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.

Confusing the Messenger with the Message:
A Response to Bolon

Victor L. Willson
Texas A&M University

Thomas Kellow
University of Houston

Citation: Willson, V.L. & Kellow, T. (2002, January 28). Confusing the messenger with the message: A response to Bolon. Education Policy Analysis Archives, 10(9). Retrieved [date] from http://epaa.asu.edu/epaa/v10n9/.

Abstract
We criticize the conclusions that Bolon (2001) based on the relationship between per capita income and school mean grade 10 mathematics scores in Massachusetts and on instability in year-to-year mean school scores. Our concerns focus on the uninterpretable covariation of economic condition with test performance and the limitations in interpreting cross-time variability. We agree with Bolon's conclusions but consider the methodology employed inadequate to support them. We suggest alternative requirements and discuss our own previous efforts in this area.


 

In an analysis of the Massachusetts graduation examination, Bolon (2001) examined the aggregate grade 10 mathematics test scores for 47 high schools and the demographic characteristics of the communities in which they were situated. From several data analyses, Bolon determined that since the best single predictor of mean high school score was community per capita income,

"The state is treating scores and ratings as though they were precise educational measures of high significance. A review of tenth-grade mathematics test scores from academic high schools in metropolitan Boston showed that statistically they are not."

Further, when removing the variability due to per capita income,

"Large uncertainties in residuals of school-averaged scores, after subtracting predictions based on community income, tend to make the scores ineffective for rating performance of schools. Large uncertainties in year-to-year score changes tend to make the score changes ineffective for measuring performance trends."

While we agree with Bolon's concerns on the whole, we find little in the evidence he presents to support them. Our discussion below details our concerns.

Predicting aggregate test scores

One of the problems with regression analysis is that without reasonable theoretical support, all sorts of predictors can be found that produce high correlations. In examining aggregate scores, such as high school test means, it is no secret that for many decades, as Bolon himself pointed out (Bolon, 2000), achievement has been associated with socioeconomic conditions in communities. In earlier eras, when school spending was much more unequal, these differences were more indicative of opportunity to learn for students. In a judicial climate that has tended to minimize, although not eliminate, such disparities, that interpretation is much less persuasive, although it remains an important area for study.

The difficulty with using a community aggregate measure as a predictor is that it is a surrogate for many other indicators, some of which are absurd at face value but interpretable. Variables such as driver's-license passing rate or per capita champagne consumption may predict student achievement as well as community per capita income does. We can construct meaningful arguments why they might. For none of these predictors is the test invalidated under accepted standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999).

In other areas of research such aggregation has produced fundamentally misleading conclusions. For example, the literature on intelligence and income is directly parallel to the discussion here. White (1982) demonstrated the difference between using an aggregate measure of SES (school or community) and individual measure in relating SES to intellectual functioning. Since Bolon used school as his unit of analysis, he eliminated proximate measures more appropriate to his analysis. The school-level variables Bolon eliminated are more appropriate than community per capita income on this basis if in fact they were school-based and not district-based. Measures such as free and reduced lunch (FRL) are better indicators for elementary school than for secondary school analyses, however, because of social undesirability of either participating or reporting among secondary students, who tend to have independent means for buying lunches.

The principle of proximity in selecting variables should be carefully considered and invoked. Mixing levels of analysis produces uninterpretable results, as hierarchical linear modeling advocates have pointed out. Bolon erred in this way, we argue.
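The gap between aggregate and individual-level relationships can be illustrated with a small simulation. The sketch below is a minimal illustration on synthetic data, not a reanalysis of the Massachusetts or Texas records: the number of schools (47, echoing Bolon's sample size), the students per school, and every effect size and variance are assumptions chosen only to show the mechanism. Each student's score depends weakly on that student's own SES; averaging over students removes most of the individual noise, so the school-level correlation is far higher than the student-level one.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_schools=47, n_students=200, seed=0):
    """Return (student-level r, school-level r) for synthetic data.

    All parameters are hypothetical and chosen for illustration only.
    """
    rng = random.Random(seed)
    ses_all, score_all = [], []
    ses_means, score_means = [], []
    for _ in range(n_schools):
        community = rng.gauss(0.0, 1.0)  # assumed community wealth level
        ses = [community + rng.gauss(0.0, 2.0) for _ in range(n_students)]
        # Weak dependence on the student's own SES plus large individual noise.
        score = [0.5 * s + rng.gauss(0.0, 3.0) for s in ses]
        ses_all.extend(ses)
        score_all.extend(score)
        ses_means.append(sum(ses) / n_students)
        score_means.append(sum(score) / n_students)
    return pearson(ses_all, score_all), pearson(ses_means, score_means)


r_student, r_school = simulate()
print(f"student-level r = {r_student:.2f}, school-level r = {r_school:.2f}")
```

Under these assumptions the school-level correlation is large even though SES explains only a modest share of the variation among individual students, which is the pattern White (1982) described.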

Test validity: AERA/APA/NCME Standards

The Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999) list 24 points related to validity. We will review those we believe to be relevant to Bolon's argument and attempt to show that his representation is irrelevant to any of them. Standard 1.1 requires a rationale for each recommended interpretation and use of test scores with a summary of evidence and theory. Standard 1.6 requires content validity procedures to be described and justified. While we do not pretend to know in detail the Massachusetts tests, we have a great deal of familiarity with those in our own state, and with the arguments focused on such high stakes tests. The foremost rationale presented in all such state testing programs is content of the state curricula or guidelines.

Challenges to content validity have been consistently thrown out by courts, including a recent case here in Texas (Mehrens, 2000, citing GI Forum et al vs. TEA et al., CA No. SA-97-1278-EP, U.S. District Court, Western District of Texas, San Antonio, TX). Mehrens invoked the 1985 Standards to review the Texas statewide assessment in a process we follow more briefly here.

The congruence of test content with intended instruction is a central focus of test development. Nothing about test content appropriateness was evident in the analysis of income prediction of performance by Bolon. A more comprehensive and focused analysis might ask if schools in lower income communities do not adhere to the state guidelines, or if their teachers are unprepared to teach the mathematics required, or suitable textbooks are not available, so that students do not have an opportunity to learn. These representations might make a case for the relevance of income in dismissing the mathematics test as a precise educational measure of high significance.

The per capita income disparities in our state are much greater than those shown by Bolon. Our experience with our own Texas Assessment of Academic Skills (TAAS) at all levels and content areas has convinced us that income inequities, while important, are not the most useful explanatory variable in school performance. With much larger databases available to us, such as multiyear summaries of all schools in Texas by grade, we see much greater variation in school performance than is shown in the 47 schools Bolon selected. While the correlation is much weaker than the .9+ Bolon presented, nevertheless it is substantial and meaningful. When looking at scatterplots of performance, however, we are struck by the existence of very high-poverty community schools that manage to score very high on the TAAS. For example, Fig. 1 shows the scatter, for about 3000 schools with 3rd grade classrooms, of TAAS reading against percent economically disadvantaged students (a school-level measurement), which we would call a surrogate for per capita income. What is of interest is the top of the graph, and the many schools that perform in a manner the state defines as excellent. The correlation data are reported in Table 1 (the approximately 800 schools not reporting economic disadvantage had somewhat higher TAAS scores than those that did report).

Per capita income is an uninterpretable predictor; its relatedness, or not, to school achievement tells us nothing about the stakes being tested, high or not. It fails the theory criterion of standard 1.1.

Instructional effects

While income is related to achievement, whether in Boston or Texas, the central issue is what students enter a school year knowing, what the school teaches them, and what part of the content taught is assessed by the end-of-year test. Standard 1.15 is most relevant: "When it is asserted that a certain level of test performance predicts adequate or inadequate criterion performance, information about the levels of criterion performance associated with given levels of test scores should be provided." Per capita income does not provide any insight into this, nor, unfortunately, does the year-to-year change score.

We are unaware of any state that has actually conducted an instructional effect study with pilot versions of its tests to examine the sensitivity of its high stakes tests to instruction. The first author was a member of a committee formed by the legislature of the state of Texas to recommend the structure of the current accountability system (College of Education, Texas A&M University, LBJ School of Public Affairs, The University of Texas at Austin, College of Business, University of Houston, 1993). In the course of committee discussion, the suggestion was raised by the first author that only with some form of pre-post within year assessment at the student level would there be even minimal evidence for instructional change. This suggestion was ultimately rejected by politicians as too costly to consider. Instead, year-to-year student (and school) change was later made into a methodologically suspect statistic, the Texas Learning Index. Bolon has made the same error in considering longitudinal change within test. The alternative explanations for yearly change negate any interpretation about large uncertainties. Student composition, student mobility, curricular emphases, teacher stability, administrative upheavals, and historical internal validity threats all may explain the variation in a school. Unless and until those are explored and discounted, Bolon's analysis does not support any particular validity threat to the test. We agree that schools, before being held accountable, must be examined carefully for the alternatives listed above. Year-to-year comparisons are inherently flawed due to internal validity threats; the connection between instruction and student performance is weak. It is only because of unwillingness to investigate the actual productivity of the school that a year-to-year comparison is made.

Content limitations

Another major limitation in any interpretation of either static (between school) or dynamic (within school) variation in performance lies in the test items and the sampling of the curriculum. Most high stakes tests are too brief to represent the curriculum adequately. Bolon does not discuss the characteristics of the 10th grade mathematics test. Our experience with the exit level math exam in Texas is that its content (typically grade 8 arithmetic and pre-algebra) is unrelated to what students studied in the last 2-3 years, and while it possesses reasonable internal consistency (.90+), with only about 40 items it is too brief to span the domain. As Mehrens (2000) pointed out for the Texas graduation test, states such as Massachusetts conduct the technical aspects adequately. The standards (either for 1985 or for 1999) will be met. Nevertheless, although important concepts are sampled, the tests are brief, certainly briefer than one would wish to generate a score representing 10 or 11 years' schooling.

The Texas released 10th grade mathematics examination (Texas Education Agency, 2000) has 40 items. From a review of the content, it appears that at best only one or two assess topics not covered in grades 8 or below, while one item (19) appears to be a spatial rotation task more appropriate to an intelligence test. The inadequacy of such a test to evaluate 9 or 10 years' mathematics learning is, if not self-evident, at least empirically testable. One can conceive of various research investigations involving interview with teachers and students and performance demonstrations by students on the full range of TEKS objectives to evaluate how well a short form such as the TAAS estimates actual mathematics declarative and procedural knowledge. In the 1993 discussions in Texas cited above, the introductory letter by Charles Miller (1993) made clear that the committee proposed to eliminate the 10th grade test in favor of specific grade 10 subject matter tests such as Algebra and Biology. While there was an obvious concern for creating a set of hurdles, the committee's recommendation was based on testing students over content more proximate to their instruction.

Year-to-year stability

Table 1 presents correlations within and across year for grades 3-5 for 1999-2000.

Table 1
Correlations within and across year for grades 3-5 for 1999-2000

Note that within year, cross-grade correlations are higher than between-year, within grade correlations or between-year, cross-grade correlations, generally. The cohort effect, however (grade 3 in 1999 to 4 in 2000, grade 4 in 1999 to grade 5 in 2000) appears supported since these correlations are higher than the cross-year, within-grade correlations for grades 3-5. That at least appears consistent with what might be expected: between-student correlations are lower than within-student correlations, however attenuated they might be for school averages. If these correlations are to be used as part of a school-level assessment, however, they appear woefully inadequate psychometrically.

A different approach can be taken by treating the within-year cross-grade TAAS scores as scale variables. Then coefficient alpha for 1999 is .8539 and for 2000 is .8637. If one is evaluating schools, these are reasonably good values.
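For readers who wish to reproduce this kind of calculation, the sketch below computes coefficient alpha with the three grade-level school means treated as the components of a scale. It assumes the standard (Cronbach) formula, alpha = k/(k-1) * (1 - sum of component variances / variance of the total); the text does not state which software or variant produced the reported values. The five schools and their scores are invented for illustration.

```python
from statistics import variance


def coefficient_alpha(components):
    """Cronbach's coefficient alpha.

    components: list of k lists, one per component (here, one per grade),
    each holding that component's score for the same n units (schools).
    """
    k = len(components)
    if k < 2:
        raise ValueError("alpha needs at least two components")
    # Total score for each school, summed across components.
    totals = [sum(unit) for unit in zip(*components)]
    sum_component_var = sum(variance(c) for c in components)
    return (k / (k - 1)) * (1 - sum_component_var / variance(totals))


# Hypothetical school mean scores for one year (not actual TAAS data).
grade3 = [78.0, 85.5, 90.1, 70.2, 88.0]
grade4 = [80.2, 83.0, 91.5, 72.8, 84.9]
grade5 = [79.5, 86.1, 89.0, 75.0, 87.3]

print(round(coefficient_alpha([grade3, grade4, grade5]), 4))
```

Alpha is high when schools that score well at one grade also tend to score well at the others, so it summarizes the within-year consistency of a school's standing rather than any change over time.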

Table 2 presents change correlations. While cautions abound about interpreting change scores, if change is the currency to be used, descriptive and correlational characteristics must be considered. Obviously, the change measures that include the same score, such as D33 with D43, are inflated by the self-covariation. The other correlations, that are not self-inflated, are generally positive but modest, almost all in the .1 to .2 range. The D43 and D54 correlation of .165, for example, supports a conclusion that schools' cohorts improve together (or fall behind together).
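The size of the self-covariation can be seen in a short derivation. The notation is introduced here for illustration: X is a school's grade 3 mean in 1999, A its grade 3 mean in 2000, and B its grade 4 mean in 2000, so that D33 = A - X and D43 = B - X share the term X. The final line makes the deliberately extreme assumption that A, B, and X are mutually uncorrelated with a common variance, which is not true of the Texas data but isolates the artifact.

```latex
\begin{align*}
D_{33} &= A - X, \qquad D_{43} = B - X \\
\operatorname{Cov}(D_{33}, D_{43})
  &= \operatorname{Cov}(A, B) - \operatorname{Cov}(A, X)
     - \operatorname{Cov}(X, B) + \operatorname{Var}(X) \\
\operatorname{Var}(D_{33}) &= \operatorname{Var}(A) + \operatorname{Var}(X)
     - 2\operatorname{Cov}(A, X) \\
\operatorname{Var}(D_{43}) &= \operatorname{Var}(B) + \operatorname{Var}(X)
     - 2\operatorname{Cov}(B, X) \\
\operatorname{Corr}(D_{33}, D_{43})
  &= \frac{\sigma^2}{\sqrt{2\sigma^2}\,\sqrt{2\sigma^2}} = \frac{1}{2}
  \quad \text{(uncorrelated scores, common variance } \sigma^2\text{)}
\end{align*}
```

Under that assumption two change scores sharing a baseline correlate .5 even though none of the underlying scores are related, which is why only the change correlations without a shared score are informative.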

Table 2
Correlations for yearly change within and between for grade 3-5 and changes for grades 3 to 4 and 4 to 5 for Texas schools 1999-2000


Note: D33, D44, and D55 are the stability coefficients in Table 1.

Table 3 presents correlations between change measures and available school characteristics such as size of school (ENROLL 2000), percent of school on free-and-reduced lunch, percent of school with Limited English Proficient (LEP) students, and percentage of the ethnic groups that in Texas are the focus of civil rights enforcement (African-American and Hispanic). LEP students were exempted from the TAAS in these years, so high performing schools with high LEP percentages can be based on very small samples of white students. The correlations are of interest insofar as they provide different ways to examine school performance at the individual school level rather than using the block method employed by Bolon.

Conclusions

Our concern with the Bolon (2001) study was that it focused on a relationship between school performance on a high stakes test and community wealth that is not informative about the characteristics of the test. The emphasis on wealth and its relationship to schooling has been highlighted in educational thought since Coleman's (Coleman, Campbell, Hobson, McPartland, Mood, Weinfeld, & York, 1966) conclusions about the efficacy of schooling. These aggregated analyses have not, we contend, illuminated much about why schools succeed or fail. Studies about schools focusing on leadership and its relationship to school performance, for example, provide meaningful, interpretable, and actionable conclusions for school level policy. Unfortunately, barring exchanges of cash between communities, Bolon's work does not.

Table 3
Correlations between TAAS 1999-2000 grade changes and selected school characteristics


Note: D33 = grade 3 change from 1999 to 2000; D43 = change from grade 3 in 1999 to grade 4 in 2000, etc. School characteristics are the percentage of students on free and reduced lunch (% ECON DISADV) and the percentages of the targeted minority groups in Texas, African-Americans (% AF AM) and Hispanics (% HISP), as well as majority whites (% WHITE).

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

Bolon, C. (2000). School-based standard testing. Education Policy Analysis Archives 8(23), May, 2000, available at http://epaa.asu.edu/epaa/v8n23.

Bolon, C. (2001). Significance of test-based ratings for metropolitan Boston schools. Education Policy Analysis Archives 9(42). October 16, 2001, available at http://epaa.asu.edu/epaa/v9n42.

Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfeld, F. D., & York, R. L. (1966). Equality of educational opportunity. Washington, DC: U.S. Government Printing Office.

College of Education, Texas A&M University, LBJ School of Public Affairs, The University of Texas at Austin, College of Business, University of Houston (1993). A New Accountability System for Texas Public Schools, Vol. 1. Austin: Educational Economic Policy Center, The University of Texas at Austin.

Mehrens, W. A. (2000). Defending a state graduation test: GI Forum et al. vs. Texas Education Agency. Measurement perspectives from an external evaluator. Applied Measurement in Education, 13(4), 387-401.

Miller, C. (1993). Introductory letter. In College of Education, Texas A&M University, LBJ School of Public Affairs, The University of Texas at Austin, College of Business, University of Houston, A New Accountability System for Texas Public Schools, Vol. 1. Austin: Educational Economic Policy Center, The University of Texas at Austin.

Texas Education Agency (2000). Texas Assessment of Academic Skills Exit Level. Austin: Texas Education Agency, February 2000. Available at http://www.tea.state.tx.us/student.assessment/resources/release/taas/release01/xl.pdf

White, Karl R. (1982). The relation between socioeconomic status and academic achievement. Psychological Bulletin, 91(3), 461-81.

About the Authors

Victor L. Willson
Professor
Department of Educational Psychology
Texas A&M University
College Station, Texas

Email: v-willson@tamu.edu

Victor L. Willson is Professor of Educational Psychology and Teaching, Learning and Culture, and Director of the Cognition and Instructional Technologies Laboratory, Texas A&M University. His current research interests focus on the intersections of psychometrics, children's reading development, and individual differences in cognitive measurement.

J. Thomas Kellow
Visiting Assistant Professor
College of Education
University of Houston

Email: tkellow@pdq.net

J. Thomas Kellow is currently Visiting Assistant Professor in the College of Education at the University of Houston. He received his Ph.D. from Texas A&M in Educational Psychology, and has research interests in high-stakes testing, applied statistics, and disability studies with an emphasis on the community integration of persons with mental retardation.



The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu

General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, glass@asu.edu or reach him at College of Education, Arizona State University, Tempe, AZ 85287-2411. The Commentary Editor is Casey D. Cobb: casey.cobb@unh.edu .


