The Multidimensionality of School Performance: Using Multiple Measures for School Accountability and Improvement

The Every Student Succeeds Act of 2015 grants states and districts the flexibility to use multiple measures to assess school performance and strategically manage public schools for improvement in the United States. However, there is a lack of systematic, evidence-based guidance for practitioners on how to interpret the complex relationships between these multiple measures. Drawing on the organizational management literature on the multidimensionality of organizational effectiveness, along with longitudinal data from Washington State, we illustrate the multidimensionality of school performance and different measurement properties of school performance data. We also find that schools that are higher performing in terms of students' average scale scores and average growth percentiles in some cases have larger disparities in these same measures between historically underserved students of color and their peers than lower performing schools do. Moreover, these performance measures have time-series properties. The complexity of school performance measurement systems calls for continuous support for local educators to appropriately use school performance data to promote student success.

Footnote 1. Min Sun's efforts on this study are supported by a grant from the National Science Foundation under Grant No. DRL-1506494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Education Policy Analysis Archives Vol. 28 No. 89

Keywords: school performance; school accountability; school improvement; multidimensionality; multiple measures

Introduction
The school performance measurement system established by the Every Student Succeeds Act (ESSA) of 2015 broadened the definition of school performance in the United States. Under the legislation that preceded it, No Child Left Behind (NCLB), school performance for accountability purposes was primarily assessed in terms of output-oriented measures, including student achievement, academic growth, and graduation rates. Under ESSA, in addition to these output-oriented measures, state school accountability systems can include input- and process-related measures associated with school efficiency or effectiveness, such as the availability of advanced coursework, teacher quality and its equitable distribution, teacher collaboration and school leadership, and student engagement (attendance) and behavior (discipline). Including such measures addresses a problem often discussed in the organizational management literature: the performance measures used to evaluate the success of nonprofit organizations are often not linked to strategies for improvement (Sawhill & Williamson, 2001). With ESSA's introduction of a broadened measurement system for school performance, the intention was to provide districts and schools with more information that would help them identify strategies for improvement (Bae, 2018).
However, research has provided states and school districts with little guidance on how to incorporate multiple measures into one accountability system and how to use them for school improvement. One basic challenge is understanding the relationships between performance measures and what they mean for schools. Individual measures capture different aspects of school performance, and combining them into an overall summative score may mask which schools are lowest performing on certain measures and how schools should target those areas with their improvement efforts (Hough, Penner, & Witte, 2016). Furthermore, process-related measures (such as student attendance, student disciplinary infractions, and teacher collaboration and school leadership) on one hand may provide useful information for districts and schools as they make improvement plans but on the other hand may be less standardized, reliable, or valid for conducting cross-school or cross-district comparisons (Duckworth & Yeager, 2015; West, 2016). ESSA grants states and districts the flexibility to use different measures to develop local theories of change for school improvement, but to fully leverage this flexibility, policymakers and practitioners need to understand the complex relationships between these measures and the tradeoffs of using each.
Another challenge is understanding the relationship between schools' overall performance and their differential contributions to subgroups of students. Consistent with NCLB, ESSA legislation requires states to report measures disaggregated by student subgroup, including by race/ethnicity, income, English proficiency level, and disability status. Prior studies have highlighted disparities in learning opportunities and outcomes for historically underserved students of color or low-income students and provided basic methods for gauging educational inequality (e.g., Reardon, 2019; Reardon & Ho, 2015; Reardon & Robinson, 2008; Skiba et al., 2014). However, less is known about the relationship between a school's average performance and its students' differential performance (that is, within-school disparities in student achievement and opportunities to learn) in an accountability context.
To shed light on the complexity of using multiple measures for school accountability and improvement, in the current study, we use data from Washington State and its largest school district, Seattle Public Schools, to explore the following research questions:

1. What are the relationships among several student behavioral and academic measures of school performance, including schools' achievement levels, academic growth, graduation rates, chronic absenteeism rates, and disciplinary infraction rates? Some combination of these measures has been used in virtually every approved ESSA state accountability plan, and by examining the relationships between them, we are able to illustrate why and how states and school districts need to understand these measures and use them appropriately for accountability and continuous improvement.
2. How are disparities in student behavioral and academic measures for racial/ethnic subgroups related to schools' average performance? Do historically underserved students of color benefit as much as their peers do from schools that are high-performing on average?
3. How do schools' internal process measures (such as student and staff perceptions of school climate) explain variations in school performance and equity? Perceptions of school climate have been widely used by states and districts to measure school quality (e.g., in Alaska, Vermont, Chicago Public Schools, and New York City) or to inform school improvement (e.g., in Montana and the CORE school districts in California).
By examining these questions, our study makes several contributions to the literature on school accountability. First, although prior studies have illustrated several potential issues related to implementing multiple measures of school performance (Hough et al., 2016), researchers have not encapsulated these issues in a coherent framework. This study is the first, to our knowledge, to draw on the framework of multidimensional organizational effectiveness from the management literature to inform the understanding and measurement of school performance. Second, although prior literature has foreshadowed several dimensions of the complexity of the school measurement system under ESSA, it has not empirically illustrated them in a school accountability and improvement system. For example, discussions of using multiple measures of teacher effectiveness in teacher evaluations have some bearing on measuring schools using multiple measures (Kane, McCaffrey, Miller, & Staiger, 2013). Another example can be found in methodological discussions of school value-added and student growth percentiles (e.g., Ehlert, Koedel, Parsons, & Podgursky, 2014; Raudenbush, 2009; Reardon & Raudenbush, 2009). Another example includes studies on disparities in student achievement and disciplinary infractions between racial/ethnic groups that provide basic knowledge and methods for gauging educational inequality across subgroups of students (e.g., Reardon & Ho, 2015; Skiba et al., 2014). However, none of these studies was conducted in the context of informing high-stakes school accountability measures such as those used under ESSA. The current study is unique in its use of real data from one state and one district to illustrate issues similar to those reflected in prior studies and synthesize recommendations into a coherent framework that practitioners can use to inform continuous improvement in schools.
Third, this paper is cowritten by researchers and a practitioner and adds a unique, reflective lens to the issues around implementing ESSA's school performance measurement system at the state and district levels. As described in a special issue of Education Policy Analysis Archives, ESSA makes state and local leaders fundamentally rethink how they approach school performance and accountability (Stosich, Bae, & Snyder, 2018). Our discussions of the premises and pitfalls of multiple measures accountability systems provide actionable knowledge that can inform states' and districts' use of evidence for school improvement under ESSA.

The Multidimensionality of School Effectiveness and Its Measurement
To understand the ESSA school accountability measurement system from a conceptual standpoint, we draw on the organizational management literature on the multidimensionality of organizational effectiveness (Bentes, Carneiro, da Silva, & Kimura, 2012; Cameron, 1986; Chakravarthy, 1986; Herman & Renz, 2008; Richard, Devinney, Yip, & Johnson, 2009; Venkatraman & Ramanujam, 1986). This literature emphasizes that organizations, both profit and nonprofit, are inherently complex and should not be measured by a unidimensional metric. Strategic performance management requires organizations to use diverse measurement systems to track and manage local initiatives, connect these initiatives with organizational missions and goals, and efficiently meet the needs of their stakeholders.
The multidimensionality of organizational effectiveness means that different theoretical and empirical components of organizational performance may or may not be related, and yet they collectively reflect the health of an organization (Callen, 1991; Devinney, Yip, & Johnson, 2010; Henri, 2004). To inform organizational improvement, then, the measurement of organizational performance needs multiple discipline-specific measures that address the relationships between organizational inputs, processes, and outputs. The use of multiple measures is critical because research shows that firms with more extensive measurement systems (especially those that include objective financial and subjective stakeholder perception measures) perform better in terms of stock market returns (Ittner, Larcker, & Randall, 2003; Van der Stede, Chow, & Lin, 2006). The literature continues to suggest that to preserve the underlying dimensionality of performance measures, it is more effective to operate at the disaggregated level and not to impose relationships between measures unless there is a sound theoretical model for how the measures are related to one another.
Before illustrating measurement issues in operation, it is critical to understand the sources of the multidimensionality of organizational effectiveness. Richard and his colleagues (2009) conceptualized three sources of multidimensionality, all of which need to be accounted for to validly measure performance: (a) Who are the stakeholders for whom a performance measure is relevant? (b) What is the landscape over which performance is being determined? Organizations are heterogeneous in their resources, capabilities, and how and where they choose to use them, and measurements of performance must account for this heterogeneity; and (c) What timeframe is relevant in measuring performance? Drawing on this framework and relevant studies in education, then, we discuss four aspects of the multidimensionality of school performance.
First, the stakeholders of public schools have diverse values and interests. Scholars have elaborated the diverse and often conflicting values and ideals among stakeholders (Green, 1983). For example, state policymakers represent interests at the aggregate level in society. They may value promoting social and economic efficiency and equity as important outcomes of schooling. In contrast, individual parents may care more about their own child's academic performance and behavioral development and how schools can give their child an advantage in future employment and life (Green, 1983). Even among parents, some may place a higher value on academic performance, while others may care more about their children's behavioral development. Moreover, community stakeholders may expect schools to serve local needs, boosting economic development and solving societal problems. Recognizing students' need to master skills beyond basic competencies in tested subjects (e.g., math and English language arts), such as creativity, problem-solving, and the ability to manage their behaviors and emotions, states and districts broaden their educational priorities and the associated metrics (Bae, 2018). For schools that aim to serve diverse interests and goals, it is only logical to assume that instruction and school practices serve multiple ends.
Second, school performance varies depending on school resources, strategic choices, and policy environments. The fiscal resources available to schools affect the services and reform strategies they adopt (Dragoset et al., 2017). The quantity and quality of intellectual resources that are accessible (such as instructional and leadership coaches who can serve as critical sources of knowledge and information) may also affect school practices. Schools' strategic choices subsequently affect school performance (Rumberger & Palardy, 2005; Sun, Saultz, & Ye, 2017). For example, schools that adopt positive behavioral intervention support for at-risk students may expect to see reductions in disciplinary infractions and absenteeism (e.g., Bohanon et al., 2006; Flannery, Fenning, Kato, & McIntosh, 2014; Freeman et al., 2015; Horner et al., 2009; Sun, Liu, Zhu, & LeClair, 2019). Other schools that adopt ambitious instructional approaches or expand the advanced course options available to their students (such as International Baccalaureate or Advanced Placement courses) may expect to see an uptick in student test scores, graduation rates, and college readiness (e.g., Burris, Heubert, & Levin, 2006; Kolluri, 2018). Moreover, the political context in which a school is situated also matters. Schools that are under federal or state accountability pressure to improve student achievement may focus their resources strategically on preparing students for standardized tests (Darling-Hammond & Adamson, 2014; Hamilton, Berends, & Stecher, 2005; Lauen & Gaddis, 2016; Stecher, Barron, Kaganoff, & Goodwin, 1998). Other schools that are audited for racial/ethnic disparities in student outcomes may regard educational equity as a central goal of their reform efforts and take strategic actions to reduce achievement gaps (Sparks, 2015).
Therefore, to account for variations in school resources, strategic choices, and organizational contexts, a diverse set of measures is needed to provide actionable, local, and useful information to support school improvement.
Third, schools may have heterogeneous contributions to different groups of students. Prior literature in education provides ample evidence on disparities in learning opportunities and educational attainment among students from different racial/ethnic subgroups. Historically underserved students of color (including American Indian/Alaska Native, Black/African American, Hispanic/Latino, and Pacific Islander students) tend to have lower achievement and higher disciplinary infraction rates. Students from these groups are more likely to enroll in special education and less likely to enroll in gifted programs and Advanced Placement courses (e.g., Hosp & Reschly, 2004; Reardon, Kalogrides, & Shores, 2016; Reardon & Robinson, 2008; Shores, Kim, & Still, 2017; Wallace, Goodkind, Wallace, & Bachman, 2008). What is more, several school components that are critical to student learning may not be distributed equally. Even within schools, students who are members of historically underserved minority groups, are eligible for subsidized lunch, or have lower prior achievement are more likely to have teachers with fewer years of experience, lower licensure exam scores, and lower value-added scores (Goldhaber, Lavery, & Theobald, 2015; Kalogrides, Loeb, & Béteille, 2013). Further, school practices such as tracking, patterned participation in extracurricular activities, and segregated peer networks can generate unequal learning opportunities for different groups of students (Carter, 2012; Crosnoe, 2009; Hallinan & Williams, 1989; Mickelson, 2003; Moody, 2002). Such differences in access to learning opportunities may explain why achievement gaps within schools account for more of the overall disparities in achievement than do gaps between schools (Bohrnstedt, Kitmitto, Ogut, Sherman, & Chan, 2015). Systems for measuring school performance need to attend to the possibility that schools may have differential contributions to students within schools.
Fourth, the influence of school practices on student learning can be incremental. Some measures used in states' ESSA accountability systems pertain to student outcomes (e.g., achievement), and others pertain to the processes or conditions for learning (e.g., attendance, positive learning environment in the classroom, effective teachers). The latter are often considered "leading indicators" of change, occurring before substantial changes in student achievement can be observed. Policymakers and researchers have leveraged this temporal relationship between measures to diagnose issues within schools and develop interventions. For instance, Sun et al.'s prior study in the San Francisco Unified School District found that reform efforts aimed at transforming underperforming schools showed early improvements in many process measures, including a reduction in unexcused student absences and an increase in the retention of high value-added teachers in math and English language arts. These improvements in process measures provide evidence on the gradual improvements in distal outcomes (such as test scores). However, it should be noted that short- or medium-term success may not always translate to long-run changes in outcomes. These short-term successes can be heavily biased by random fluctuations. Accordingly, researchers and school leaders should interrogate any process measures used by examining the degree to which they predict long-term outcomes.
In sum, as with organizational performance in other industries, the multidimensionality of school performance is rooted in several factors, including the diversity of stakeholder interests, strategic choices regarding initiatives and resource allocation, interactions between school services and student backgrounds, and the temporal relationship between organizational activities and outcomes. This multidimensionality requires that a diverse set of measures be used in assessing school performance. The ESSA measurement system intends to meet this requirement, yet our knowledge of the complex process of integrating multiple measures into one coherent system has been insufficient to fully guide practitioners in the design and implementation of school accountability plans. The current study helps address this gap in knowledge by using empirical data to illustrate the mutual consistency and predictive relationships between several measures of school performance that are typically included in states' ESSA plans.

Method

Data and Sample
To examine the relationships between several measures of school effectiveness and the disparities between racial/ethnic student groups, we use data from all public K-12 schools in the state of Washington from 2010 (the spring of the 2009-10 school year) to 2016. Since Washington State did not collect statewide school and teacher surveys longitudinally (and most states in the country do not), in our exploration of the role of malleable school processes in student outcomes, we use student and staff survey data from Seattle Public Schools from 2014 to 2016.
Our sample only includes one state and one district, so the findings may be limited to the context of this state or this district. Moreover, the measures we developed do not completely align with Washington State's school performance measures under ESSA, nor do they reflect exactly how Seattle Public Schools uses its data for improvement. Because our intent with this study is not to discuss one state's policy or one district's data use, in determining which measures to focus on, we draw inspiration from multiple states' school accountability designs and districts' possible ways of using data. Our aim is to identify patterns that inform state policymakers, district administrators, and researchers about issues to consider in measuring school performance for accountability and improvement.

As shown in Table 1, Washington State's public schools serve nearly 1.1 million students annually in more than 1,800 elementary and middle schools and 500 high schools. From 2010 to 2016, about 28% of students in the state were identified as historically underserved students of color (i.e., Hispanic/Latino, Black/African American, American Indian/Alaska Native, or Pacific Islander). Additionally, about half of students in the state were eligible for free or reduced-price lunch, almost 20% reported that their primary language was not English, and about 3% were identified as homeless.
Seattle Public Schools, the largest school district in Washington, enrolls almost 55,000 students and is similarly diverse. Between 2010 and 2016, 32% of Seattle Public Schools students were identified as historically underserved students of color, about 45% were eligible for free or reduced-price lunch, about 24% reported that their primary language was not English, and about 4% were identified as homeless. Table 1 provides summary statistics of school-level demographic characteristics.

Measures
We look at four broad categories of performance measures: student academic performance, student behavior, racial/ethnic disparities in school performance, and school internal processes. Each of these categories includes a number of sub-measures.

Student academic performance.
Measures of student academic performance include average math and reading scores from state assessments, standardized within a given grade, year, and test, along with the proportion of students who achieve proficiency in math and reading. School growth in math and reading is calculated in terms of school value-added and median student growth percentile. School value-added captures the degree to which learning gains during the year are greater than the estimated gains predicted from students' prior achievement, student characteristics, and school demographics and prior performance. Median student growth percentile is a more commonly used measure for assessing growth in states' ESSA plans. Appendix A describes our model specifications for both value-added and median student growth percentile. Finally, we look at the five-year adjusted cohort graduation rate for high schools, which is the percentage of students from a ninth-grade cohort who graduate by their fifth year at the school, adjusting for transfers in and out of state public schools.
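The standardization step described above can be illustrated with a minimal sketch. It assumes student-level records with hypothetical field names (`school`, `grade`, `year`, `test`, `score`) and uses the population standard deviation; the source does not state which standard deviation was used, and this is not the authors' code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_cells(records):
    """Add a z-score to each record, computed within its (grade, year, test) cell."""
    cells = defaultdict(list)
    for r in records:
        cells[(r["grade"], r["year"], r["test"])].append(r["score"])
    # Assumption: population SD within each cell (not specified in the source).
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in cells.items()}
    out = []
    for r in records:
        mu, sd = stats[(r["grade"], r["year"], r["test"])]
        z = (r["score"] - mu) / sd if sd > 0 else 0.0
        out.append({**r, "z": z})
    return out


def school_mean_z(records):
    """Average the standardized scores of the students in each school."""
    by_school = defaultdict(list)
    for r in records:
        by_school[r["school"]].append(r["z"])
    return {school: mean(zs) for school, zs in by_school.items()}


# Hypothetical toy data: two schools sharing one grade-year-test cell.
toy = [
    {"school": "A", "grade": 3, "year": 2015, "test": "math", "score": 400},
    {"school": "A", "grade": 3, "year": 2015, "test": "math", "score": 420},
    {"school": "B", "grade": 3, "year": 2015, "test": "math", "score": 440},
    {"school": "B", "grade": 3, "year": 2015, "test": "math", "score": 460},
]
scored = standardize_within_cells(toy)
averages = school_mean_z(scored)
```

Standardizing within each cell puts scores from different grades, years, and tests on a common scale before they are averaged to the school level.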

Student behavior.
We look at two measures of student behavior. Chronic absenteeism rate is the percentage of students at a school who have ≥ 18 days of excused or unexcused absences during the school year, which is equivalent to 10% of a school year. Disciplinary infraction rate is the percentage of students at a school with at least one formal disciplinary infraction during the school year.
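As a minimal sketch of these two definitions, the functions below compute each rate from per-student counts. The function names and input layout are hypothetical, and the 180-day school year is an assumption inferred from the statement that 18 days equals 10% of a school year.

```python
ABSENCE_THRESHOLD_DAYS = 18   # from the text: >= 18 excused or unexcused absences
ASSUMED_SCHOOL_DAYS = 180     # assumption: 18 days is 10% of a 180-day year


def chronic_absenteeism_rate(days_absent):
    """Percentage of students with at least ABSENCE_THRESHOLD_DAYS absences."""
    if not days_absent:
        raise ValueError("need at least one student")
    flagged = sum(1 for d in days_absent if d >= ABSENCE_THRESHOLD_DAYS)
    return 100.0 * flagged / len(days_absent)


def disciplinary_infraction_rate(infraction_counts):
    """Percentage of students with at least one formal disciplinary infraction."""
    if not infraction_counts:
        raise ValueError("need at least one student")
    flagged = sum(1 for c in infraction_counts if c >= 1)
    return 100.0 * flagged / len(infraction_counts)


# Hypothetical school with five students.
absence_rate = chronic_absenteeism_rate([0, 5, 18, 30, 17])
infraction_rate = disciplinary_infraction_rate([0, 0, 1, 3, 0])
```

Both rates count students rather than events, so a student with several infractions contributes to the numerator only once.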
Racial/ethnic disparities.
We measure performance disparities for racial and ethnic groups of students in two categories:

1. Historically underserved students of color (hereafter referred to as HUSC), which includes American Indian/Alaska Native, Black/African American, Hispanic/Latino, and Pacific Islander students; and
2. Non-HUSC, which includes all remaining groups within the seven federally defined racial/ethnic categories: White, Asian, and multiracial students.
Although our definition of HUSC may not be consistent with that used in some prior studies, it is derived from achievement data from Washington State. The four groups we identify as HUSC all performed below the state average on both math and reading tests, while the three remaining groups performed above average. In most cases, we define racial/ethnic disparity as the difference in means (or medians for student growth percentile) for these two groups of students. However, we use a different measure when comparing their scale score performance. For these performance measures, we calculate the V-disparity separately for math and reading. The V-disparity (or V-gap, as Ho and Reardon [2012] call it) can be understood as the difference in mean test scores between two groups, both with standard normal test score distributions. This measure allows us to compare test score disparities across different test scales. We calculate disparities when there are at least 10 students in each of the two groups (HUSC and non-HUSC) (see footnote 2).

Footnote 2. We set this minimum subgroup size to ensure that the measures are reliable and accurate representations of school disparities without excluding schools that have smaller yet still important numbers of students in particular subgroups.

For chronic absenteeism and disciplinary infraction rates, the difference in means is a risk difference, that is, the difference between the HUSC and non-HUSC rates. Given that chronic absenteeism and disciplinary infractions are relatively rare, the risk differences are relatively small. Compared with another measure of disparity in the literature, the risk ratio (i.e., the HUSC rate divided by the non-HUSC rate), risk difference might underestimate disparities. However, the risk ratio has an inherent problem: its maximum possible value shrinks as the denominator group's rate rises, so that as that rate approaches 0.5, the maximum risk ratio is 2. This generates statistical artifacts, prohibiting us from observing the underlying relationship between average performance and disparities. The risk difference measure is also more consistent with the way in which many states display chronic absenteeism rates on their websites. States often show the rates among subgroups of students side-by-side; our approach of taking the differences between HUSC and non-HUSC peers conveys a similar meaning.
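The disparity measures discussed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: `v_gap` uses a common nonparametric form of the V statistic, V = sqrt(2) * Φ⁻¹(P(a > b)), estimated from all pairwise comparisons with ties counted as one half; the clipping of extreme probabilities and all function names are choices made for this example.

```python
from math import sqrt
from statistics import NormalDist

MIN_GROUP_SIZE = 10  # from the text: at least 10 students in each group


def v_gap(scores_a, scores_b):
    """V = sqrt(2) * inverse-normal of P(score from a > score from b)."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < MIN_GROUP_SIZE or n_b < MIN_GROUP_SIZE:
        return None  # too few students to report a disparity
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half a win
    p = wins / (n_a * n_b)
    # Assumption: clip p away from 0 and 1, where the inverse CDF is undefined.
    eps = 0.5 / (n_a * n_b)
    p = min(max(p, eps), 1.0 - eps)
    return sqrt(2) * NormalDist().inv_cdf(p)


def risk_difference(rate_a, rate_b):
    """Difference between two group rates (proportions between 0 and 1)."""
    return rate_a - rate_b


def risk_ratio(rate_a, rate_b):
    """Ratio of two group rates; rate_b must be positive."""
    return rate_a / rate_b


def max_risk_ratio(denominator_rate):
    """Largest possible risk ratio, reached when the numerator rate is 1."""
    return 1.0 / denominator_rate


# Hypothetical groups of 10 students each.
group_low = list(range(10))
group_high = [s + 5 for s in group_low]
same = v_gap(group_low, group_low)
shifted = v_gap(group_high, group_low)
ceiling = max_risk_ratio(0.5)
```

Because V depends only on the ordering of scores, it is unchanged by any monotonic rescaling of the test, which is what allows comparisons across test scales. The `max_risk_ratio` helper makes the ceiling on the risk ratio explicit: with a denominator rate of 0.5, the ratio cannot exceed 2.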

School internal processes.
To capture school conditions and supports for learning, we identified common items in Seattle Public Schools' student and staff surveys that best capture educational practices and conditions that are malleable to school policies and practices. Using exploratory principal component factor analysis and orthogonal varimax (Kaiser off) rotation, we developed four composite constructs: positive classroom peer environment, or whether teachers manage their classrooms to create an inclusive and friendly environment focused on learning; safe and welcoming school environment, or whether the school creates a sense of pride, respect, support, and safety among students; effective school management and procedures, or whether the school engages teachers in decision-making, promptly and effectively resolves conflicts among staff, and uses consistent processes for supporting students who struggle; and teaching and teacher supports, or whether the school supports collaboration among faculty members and provides other supports for effective instruction. A full list of survey items can be found in Appendix B.

Analysis and Results
Below, for each type of measure we look at in relation to school performance, we introduce the analytic methods we used and summarize our key findings.

Student Behavioral and Academic Measures and School Performance
We examine the relationships between measures of school performance by calculating correlation coefficients for each pair of performance measures. We separate the measures into two groups, elementary and middle schools (grades 3-8) and high schools (grades 10-11), based on the accountability system designs that most states and districts use and the availability of measures at different educational levels. Because the range of some measures (e.g., chronic absenteeism rates and disciplinary infraction rates) is relatively restricted, which can attenuate correlation coefficients, we use scatterplots to visually inspect the relationships between these measures. The scatterplots suggest patterns that are consistent with those reflected by the correlation coefficients. Since coefficients provide a more succinct way of summarizing the patterns than scatterplots do, we present mainly coefficients in this section and include the scatterplots in Appendix B. We then present an exploratory factor analysis of these performance measures to assess the dimensionality of the data.

Elementary and middle schools. As shown in Panel A of Table 2, for elementary and middle schools, average test scores for math and reading are highly correlated with one another (ρ = .91). The proficiency levels for math and reading (i.e., the percentage of students meeting standards on math and reading exams) are also highly correlated (ρ = .87). When looking across measurement types for the same subject (e.g., the average test score for math and the proficiency rate for math), we observe similarly high correlation coefficients (e.g., ρ = .93 or .90).
In addition to performance levels, we look at two types of growth measures: school value-added and median student growth percentile. The growth measures captured by school value-added are positively correlated with achievement levels. However, the correlation coefficients are relatively small, ranging from .10 to .28. In contrast, median student growth percentile has considerably larger correlations with average school performance levels in terms of both test scores and proficiency (ρ = .34-.56). The reason for this difference in correlations may be that schools' average performance levels are influenced by student demographics and socioeconomic status, whereas the value-added growth measures deliberately account for the influence of these student characteristics and school contextual factors. In addition, while schools' average performance and growth (value-added or student growth percentile) are positively correlated (i.e., schools with high average scores tend to have high growth), the low to moderate size of the coefficients suggests they do not always align with one another. Moreover, the two behavioral measures, chronic absenteeism rates and disciplinary infraction rates, are closely related (ρ = .53). These two measures are also negatively correlated with the measures of academic performance (i.e., math and reading performance levels and growth) as expected, although the correlation coefficients are low to moderate (ρ = -.08 to -.41).

[Table 3. Factor loadings of school performance measures; the assignment of loadings to factor columns is not recoverable.]
Elementary and middle schools: (1) Average math score .94; (2) Average reading score .94; (3) % proficient in math .93; (4) % proficient in reading .90; (5) Growth in math (VA) .87; (6) Growth in reading (VA) .82; (7) Growth in math (SGP) ...
High schools: ... (2) Average reading score .55; (3) % proficient in math .89; (4) % proficient in reading .42; (5) Graduation rate .45; (6) Chronic absenteeism rate .57; (7) Disciplinary infraction rate .56
Note. Factor loadings < .4 are omitted. VA = value-added. SGP = student growth percentile.
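The attenuation that range restriction introduces into correlation coefficients, noted at the start of this section, can be illustrated with a small simulation. The sketch below is not the study's analysis: the population correlation of .6, the sample size, and the rule of keeping only the lower half of one measure are all hypothetical choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
true_rho = 0.6  # hypothetical population correlation between two measures

# Simulate pairs of standardized school-level measures with correlation true_rho.
pairs = []
for _ in range(5000):
    x = rng.gauss(0, 1)
    y = true_rho * x + math.sqrt(1 - true_rho ** 2) * rng.gauss(0, 1)
    pairs.append((x, y))

r_full = pearson([p[0] for p in pairs], [p[1] for p in pairs])

# Restrict the range of x (hypothetical rule: keep only the lower half).
restricted = [p for p in pairs if p[0] < 0]
r_restricted = pearson([p[0] for p in restricted], [p[1] for p in restricted])

print(f"full range: r = {r_full:.2f}; restricted range: r = {r_restricted:.2f}")
```

The correlation computed on the restricted sample is smaller than the one computed on the full sample even though the underlying relationship is unchanged, which is why a visual check with scatterplots is a useful complement for measures with a narrow range.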
High schools. As shown in Panel B of Table 2, the correlation between test scores in math and reading for high schools is lower than the same correlation for elementary and middle schools (ρ = .66 vs. .91). We observe similar positive correlations between scale scores and proficiency levels, although again with coefficients lower than the corresponding ones for elementary and middle schools. We also see that all test score measures are positively correlated with five-year graduation rates (ρ = .41-.58). The two behavioral measures, chronic absenteeism rates and disciplinary infraction rates, are negatively correlated with academic performance as measured by test scores and graduation rates, with correlation coefficients comparable to those calculated for elementary and middle schools. The two behavioral measures are also positively correlated with one another (ρ = .37).
To further explore the multidimensionality of these measures, we use exploratory principal component factor analysis and orthogonal varimax (Kaiser off) rotation. As shown in Table 3, four factors emerge for elementary and middle schools, with Factor 1 focusing on student achievement levels, Factor 2 on math growth, Factor 3 on reading growth, and Factor 4 on student behavior. We observe three factors for high schools, with Factor 1 focusing on student achievement levels, Factor 2 on graduation rates, and Factor 3 on student behavior. These factor loading patterns appear largely consistent with the original intent of the policy design, namely, to include measures that go beyond achievement levels and capture other school outputs that stakeholders value.
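The factor-analytic step can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it extracts principal component loadings from a correlation matrix and applies a raw varimax rotation (no Kaiser normalization, in the spirit of "Kaiser off"). The simulated data, the function names, and the choice to fix the number of factors at two are hypothetical.

```python
import numpy as np


def principal_component_loadings(data, n_factors):
    """Loadings of the first n_factors principal components of the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]
    return eigvecs[:, order] * np.sqrt(eigvals[order])


def varimax(loadings, max_iter=100, tol=1e-8):
    """Raw (non-normalized) orthogonal varimax rotation via iterated SVD."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):  # no meaningful improvement
            break
        criterion = new_criterion
    return loadings @ rotation, rotation


# Hypothetical data: three achievement measures and two behavioral measures
# for 500 simulated schools, driven by two independent latent dimensions.
rng = np.random.default_rng(0)
n_schools = 500
achievement = rng.normal(size=n_schools)
behavior = rng.normal(size=n_schools)
data = np.column_stack([
    achievement + rng.normal(scale=0.5, size=n_schools),  # average math score
    achievement + rng.normal(scale=0.5, size=n_schools),  # average reading score
    achievement + rng.normal(scale=0.5, size=n_schools),  # % proficient in math
    behavior + rng.normal(scale=0.5, size=n_schools),     # chronic absenteeism rate
    behavior + rng.normal(scale=0.5, size=n_schools),     # disciplinary infraction rate
])

unrotated = principal_component_loadings(data, n_factors=2)
rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each measure's communality (the sum of its squared loadings) is unchanged; only the distribution of loadings across factors shifts, which is what makes the rotated pattern easier to interpret.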

Racial/Ethnic Disparities in Student Behavioral and Academic Measures and School Performance
Under ESSA, states are required to assess school performance using the whole school as the unit of analysis and by student subgroup. Thus, we ask: Are schools that are higher performing on average equally effective for HUSC and non-HUSC? To answer this question, we compare measures of average school performance with measures of the disparities in schools' contributions to HUSC and non-HUSC. We start the analysis by presenting simple scatterplots showing the performance of HUSC and non-HUSC in schools with varying average school performance. We further explore the relationship between average performance and HUSC disparities by running regression models that control for other school characteristics (e.g., percentage of HUSC, percentage eligible for free or reduced-price lunch, log enrollment, percentage whose primary language is English, percentage of homeless students, percentage of female students) and Washington State time trends. Standard errors are clustered at the school level.

Elementary and middle schools. Panels A and B of Figure 1 show math and reading scale scores for elementary and middle schools. As expected, we observe a positive relationship between average school performance and the average performance of both HUSC and non-HUSC. However, the slopes of the lines appear to diverge between these two groups in the scatterplot for math scores, with a steeper positive slope for non-HUSC than for HUSC. In other words, while HUSC and non-HUSC both perform better in higher performing schools on average, HUSC tend to fall further behind non-HUSC in math as school performance increases. Therefore, a disparity measure would be larger in high-performing schools than in low-performing schools. It is not necessarily the case that high-performing schools on average are less able to support the learning of HUSC than other schools are, because HUSC in these schools, on average, still tend to be higher performing than HUSC in other schools; rather, this correlation suggests that HUSC do not reach the same level as their non-HUSC peers in the same high-performing schools.
Panels C and D present the relationships for math and reading proficiency rates. For both subjects, we see positive trends for both HUSC and non-HUSC, and the schools with the highest proficiency rates (at the far right of the graph) tend to have narrower gaps between the two groups.
Given that attaining proficiency requires a student to reach a certain set bar, it is not that surprising that the disparities in high-proficiency schools are smaller. Once a school reaches a certain overall proficiency level, there is less room for disparities in proficiency rates to exist. However, given the wider disparities in average scale scores at higher performing schools, it seems that proficiency, which is a coarse measure relative to scale score, may mask some of the differences in achievement between student subgroups within schools (Polikoff, Duque, & Wrabel, 2016).
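The way a proficiency threshold can mask a scale-score gap is easy to see with a toy calculation. The scores and the cut score of 400 below are hypothetical and chosen only to make the arithmetic transparent; they are not drawn from the Washington State data.

```python
PROFICIENCY_CUT = 400  # hypothetical cut score


def summarize(scores, cut=PROFICIENCY_CUT):
    """Return (mean scale score, percent of students at or above the cut)."""
    mean_score = sum(scores) / len(scores)
    pct_proficient = 100 * sum(s >= cut for s in scores) / len(scores)
    return mean_score, pct_proficient


# Two hypothetical subgroups in the same high-performing school.
group_a = [405, 410, 420, 430, 435]
group_b = [460, 480, 500, 520, 540]

mean_a, prof_a = summarize(group_a)
mean_b, prof_b = summarize(group_b)

score_gap = mean_a - mean_b        # gap in average scale scores
proficiency_gap = prof_a - prof_b  # gap in proficiency rates

print(f"scale-score gap: {score_gap}, proficiency gap: {proficiency_gap}")
```

Every student in both groups clears the bar, so the proficiency gap is zero even though the groups differ by 80 scale-score points on average.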
The regression results in Table 4 formalize the patterns described above. V-disparity, which indicates HUSC minus non-HUSC test scores, increases (becomes more negative) as average school test scores increase. For a one standard deviation increase in average school test scores, HUSC fall about 10% or 13% of a standard deviation further behind their non-HUSC peers in the same school. Moreover, the associations between average school proficiency rate and proficiency rate differences are statistically nonsignificant.

Note (Table 4). SGP = student growth percentile. Standard errors clustered by school are shown in parentheses. V-disparity measures the gap in scale scores between HUSC and non-HUSC. Difference in proficiency is the difference in proficiency rates between HUSC and non-HUSC. Value-added disparity is equal to the difference between school value-added measures for HUSC and non-HUSC. SGP disparity is the difference between median SGP for HUSC and non-HUSC at a given school. Risk differences in chronic absenteeism and disciplinary infractions are calculated using HUSC risks minus non-HUSC risks. We set a minimum n of 10 (i.e., at least 10 students need to have observed data for each group at the school). Value-added is estimated starting in 2010-11 and only for grades 3-8. All models include school-level covariates and year fixed effects. Full model results are presented in the Appendix.

Panels A and B of Figure 2 plot trends for HUSC and non-HUSC in math and reading value-added. Value-added controls not only for students' prior test scores but also for several student and school characteristics, such as whether the student is eligible for free or reduced-price lunch, disabled, or homeless, as well as similar student population characteristics at the school level. Given that these controls are likely correlated with race/ethnicity, it is unsurprising that the gaps in value-added are very narrow. However, we find an interesting pattern in reading value-added.
The trend lines seem to indicate that the disparities between HUSC and non-HUSC in value-added decline as the school's overall value-added increases. This finding suggests, at least in reading, that higher value-added schools are more effective than lower value-added schools at reducing equity gaps.
Panels C and D present the scatterplots for another measure of student growth: median student growth percentile. This measure does not control for student and school characteristics as value-added does and is therefore still correlated with these characteristics. Most likely due to this difference, the patterns differ from the ones we see in Panels A and B. For both math and reading, schools with lower median student growth percentiles have smaller differences between HUSC and non-HUSC, while schools with higher median student growth percentiles have larger disparities.
The relationships observed in Figure 2 are further confirmed in our regression analyses. Table 4 shows a statistically nonsignificant relationship between math value-added and racial disparities in math value-added, but it shows a significant positive relationship between reading value-added and racial disparities in reading value-added. A one standard deviation increase in average value-added at a school is associated with a decrease of 3% of one standard deviation in the difference in value-added, suggesting that high-growth schools in reading tend to be more equal schools. In terms of median student growth percentile, however, the relationship between school average and disparities is reversed. With a 1 percentile increase in a school's median student growth percentile, there is about a 3% increase in the differences in median math student growth percentiles between HUSC and non-HUSC groups in the same school. The relationship between reading value-added and racial disparities in reading value-added is nonsignificant.

We next explore how schools' performance in behavioral outcomes compares with HUSC disparities in these same measures. Panels E and F of Figure 2 reveal a positive relationship between the average school levels and risk differences for chronic absenteeism and disciplinary infractions (i.e., the HUSC rate compared with the non-HUSC rate). Schools with higher average chronic absenteeism or disciplinary infraction rates tend to show higher disparities, with greater risk among HUSC than among their non-HUSC peers. The last two columns in Table 4 confirm the pattern shown in the scatterplots, even after controlling for school characteristics and state time trends. A 10 percentage point increase in chronic absenteeism rates at a school is associated with an increase of 1 percentage point in the risk difference at the average elementary or middle school (2 percentage points for disciplinary infraction rates).
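The regression setup used in this section (a disparity measure regressed on the school average, with year fixed effects and standard errors clustered by school) can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' code: the data are simulated, the school covariates are omitted for brevity, the true slope of -0.10 is a hypothetical value, and the sandwich estimator applies no small-sample correction.

```python
import numpy as np


def ols_cluster(X, y, clusters):
    """OLS coefficients with cluster-robust (sandwich) standard errors."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        idx = clusters == g
        score = X[idx].T @ resid[idx]  # sum of x_i * residual_i within the cluster
        meat += np.outer(score, score)
    cov = xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(cov))


# Hypothetical school-by-year panel: 50 schools observed for 6 years.
rng = np.random.default_rng(0)
n_schools, n_years = 50, 6
school = np.repeat(np.arange(n_schools), n_years)
year = np.tile(np.arange(n_years), n_schools)

school_mean = rng.normal(size=n_schools)
avg_score = school_mean[school] + rng.normal(scale=0.2, size=school.size)
school_effect = rng.normal(scale=0.05, size=n_schools)
disparity = (-0.10 * avg_score + school_effect[school]
             + rng.normal(scale=0.05, size=school.size))

# Design matrix: intercept, school average, and year dummies (first year omitted).
year_dummies = (year[:, None] == np.arange(1, n_years)).astype(float)
X = np.column_stack([np.ones(school.size), avg_score, year_dummies])

beta, se = ols_cluster(X, disparity, school)
print(f"slope on school average: {beta[1]:.3f} (clustered SE {se[1]:.3f})")
```

Clustering by school matters here because the same school contributes several yearly observations whose errors are correlated; treating them as independent would tend to understate the uncertainty of the slope.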
High schools. As shown in Panel A of Figure 3, which plots HUSC and non-HUSC average math test scores (for 10th and 11th graders), we see a more pronounced divergence in the trend lines than we do for elementary and middle schools (Figure 1, Panel A). That is, we see a more obvious difference in the disparities between high- and low-performing schools for math test scores. Again, the trend lines for reading scores show no patterns of divergence or convergence (Panel B). Table 5 presents the estimated coefficients for the regression of high school disparity measures on average school performance and shows that a one standard deviation increase in average test scores is associated with an increase in disparities of 26% of one standard deviation in math and 16% of one standard deviation in reading.
Panels C and D of Figure 3 present the analogues to the scatterplots shown in Panels C and D of Figure 1 for proficiency levels. For math, the lines diverge greatly in the highest performing schools, with HUSC falling further behind their non-HUSC peers in the same school as average school performance increases. This is in contrast with Panel C of Figure 1, which shows almost parallel lines with a slight narrowing of disparities in the highest performing schools. We see a similar pattern for reading proficiency in high schools (10th or 11th grades) as we do in elementary and middle schools. The lines for these two groups are almost parallel, although in the highest performing high schools, the lines converge, suggesting a reduction in gaps. The regression results in Table 5 show similar patterns.
Panel E of Figure 3 plots the graduation rates for HUSC and non-HUSC. The disparities in graduation rates appear to narrow as the school's overall graduation rate increases. This makes sense given that graduation provides a set bar for students to cross. In high-performing schools overall, both HUSC and non-HUSC are likely to reach this bar; we thus see a converging trend between HUSC and non-HUSC when measuring their performance in terms of graduation rates. However, after controlling for school-level factors, the regression results in Table 5 show a statistically nonsignificant narrowing of the graduation rate disparity.
Lastly, Panels F and G of Figure 3 reveal a largely positive relationship between average levels and risk differences for chronic absenteeism and disciplinary infractions, similar to the patterns seen in elementary and middle schools. Schools with higher average chronic absenteeism rates (or disciplinary infraction rates) tend to show higher disparities between HUSC and non-HUSC. The last two columns in Table 5 confirm these observations even after controlling for school characteristics and state time trends. A 10 percentage point increase in chronic absenteeism rates (or disciplinary infraction rates) at an average high school is associated with an increase in risk differences of about 3 percentage points.

Note (Table 5). Standard errors clustered by school are shown in parentheses. V-disparity measures the achievement gap between HUSC and non-HUSC. Difference in proficiency/graduation is the difference in proficiency/graduation rates between HUSC and non-HUSC. We set a minimum n of 10 (i.e., at least 10 students need to have observed data for each group at the school). Graduation rates are available starting in 2010-11. Risk differences in chronic absenteeism and disciplinary infractions are calculated using HUSC risks minus non-HUSC risks. All models include school-level covariates and year fixed effects. Full model results are presented in Appendix Table B3 (academic measures) and Appendix Table B5 (behavioral measures). *p < .05. **p < .01.
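The risk difference used for the behavioral measures (HUSC risk minus non-HUSC risk, subject to a minimum group size of 10) amounts to a short calculation. The function name and the example counts below are hypothetical; only the definition and the minimum n come from the table notes.

```python
MIN_N = 10  # minimum number of students with observed data in each group


def risk_difference(husc_events, husc_n, non_husc_events, non_husc_n, min_n=MIN_N):
    """HUSC risk minus non-HUSC risk, or None if either group is too small."""
    if husc_n < min_n or non_husc_n < min_n:
        return None
    return husc_events / husc_n - non_husc_events / non_husc_n


# Hypothetical school: 12 of 60 HUSC and 10 of 100 non-HUSC are chronically absent.
example = risk_difference(12, 60, 10, 100)

# Hypothetical school with only 8 HUSC students: the measure is not reported.
too_small = risk_difference(2, 8, 10, 100)

print(example, too_small)
```

A positive value indicates that HUSC face a higher risk of the event than their non-HUSC peers in the same school.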

School Process Measures and School Performance
Our final research question relates to how well school processes, as captured by student and staff satisfaction surveys, predict schools' academic performance and disparities between HUSC and non-HUSC. Using data from Seattle Public Schools, we regress each measure of schools' average performance and disparities on the composites of process measures, controlling for a lagged dependent variable in each regression, school covariates, and year fixed effects. Table 6 presents the main results, with Panel A focusing on average school performance and Panel B focusing on disparities. Coefficients for all but two of the measures are statistically nonsignificant. First, more positive staff perceptions of instructional supports appear to be positively associated with higher average test scores and proficiency rates in both math and reading. Second, more positive student perceptions of the school environment (i.e., whether it is safe and welcoming) appear to be associated with smaller differences in reading growth (student growth percentile) between HUSC and non-HUSC. The overall weak correlations between these process measures and student achievement outcomes may be indicative of the multidimensionality of school performance, as many school actions may not result in immediate impacts on test scores. Alternatively, these weak relationships may be due to this study's data limitations. We have a relatively small sample size, and survey measures may contain measurement error, both of which reduce the study's analytical power.
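The point that measurement error in survey composites can weaken estimated relationships can be illustrated with a simulation of classical errors-in-variables attenuation. The sketch below is not based on the Seattle data: the true slope of 1.0, the reliability of 0.8, and the sample size are hypothetical values chosen for illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple bivariate least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
n = 20000
true_slope = 1.0   # hypothetical effect of the true process measure
reliability = 0.8  # hypothetical share of observed variance that is true signal

# With var(true_x) = 1, reliability = 1 / (1 + var(error)).
error_sd = ((1 - reliability) / reliability) ** 0.5

true_x = [rng.gauss(0, 1) for _ in range(n)]
observed_x = [x + rng.gauss(0, error_sd) for x in true_x]  # noisy survey composite
outcome = [true_slope * x + rng.gauss(0, 1) for x in true_x]

slope_true = ols_slope(true_x, outcome)
slope_observed = ols_slope(observed_x, outcome)

print(f"slope using true measure: {slope_true:.2f}")
print(f"slope using noisy measure: {slope_observed:.2f}")
```

Under these assumptions the slope estimated from the noisy measure shrinks toward zero by roughly the reliability factor, so a process measure with a real association can still yield a small and statistically nonsignificant coefficient in a small sample.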

Contributions to the Literature
Although prior studies have illustrated the difficulty of using multiple measures to consistently identify schools in need of improvement (Hough et al., 2016), they have not conceptualized these issues in a coherent framework. This study, to our knowledge, is the first to conceptualize the sources of the multidimensionality of school performance by drawing on both management science and education literature. These sources include stakeholders' values and interests in education, strategic choices regarding initiatives and resource allocation, interactions between school services and student backgrounds, and the temporal relationship between organizational activities and outcomes. Moreover, we empirically illustrate three key features of school performance measures that manifest these inherent sources of organizational multidimensionality: first, that measures have different associations with one another and load onto different factors; second, that school performance can be heterogeneous for different subgroups of students; and third, that the nature of the measures and the temporal relationships between them add another layer of complexity to the measurement of school performance.
In particular, our analyses of HUSC disparity measures are a key contribution to the literature on organizational multidimensionality and are not discussed in Richard et al.'s (2019) original framework. Richard et al.'s framework was developed for private companies; our conceptualization, meanwhile, foregrounds schools' mission of promoting educational equity and broader societal good. Our study also adds to the literature on educational inequality by inquiring into disparities in achievement and growth within schools. The negative relationships between average school performance and HUSC disparities in scale scores and median student growth percentiles suggest that HUSC in schools with high average performance were further behind their non-HUSC peers in the same school. In schools with higher rates of chronic absenteeism and disciplinary infractions, the relative risk for HUSC increases as well. In contrast, schools that promoted higher average growth for all students as captured by value-added measures shrank the gap between HUSC and non-HUSC in reading. Moreover, the relationship between average graduation rates and graduation disparities seems to suggest that schools that perform higher on average are more equal schools.
Although school production may explain some of the complex relationships between average school performance and within-school disparities, which suggests all schools need to do more to reduce disparities, the inconsistent patterns among measures may be more a reflection of the nature of the measures themselves. For example, measures of average performance based on scale scores and student growth percentile are often influenced by student socioeconomic and school contextual backgrounds, while value-added measures deliberately account for these factors. In other words, value-added measures level the playing field between HUSC and non-HUSC by eliminating the associations between student growth and student background characteristics (Ehlert et al., 2014). In addition, proficiency rates and graduation rates each have a set of criteria for all students to meet statewide, and these criteria tend to be basic. Once a school's average performance reaches a certain level, it is not hard for most students in the school, if not all students, to meet these criteria. We thus see smaller disparities in these measures between student subgroups in schools that are higher achieving on average. In sum, the psychometric attributes of the measures themselves add another layer of complexity to the measurement of school performance.

Practical Implications
At this early stage of ESSA implementation, virtually all states use a weighting system to combine scores from multiple measures into one index to differentiate between schools and identify underperforming schools for additional supports. Our findings suggest that besides this composite score, states should also inform the public of schools' performance on each measure individually. For example, Washington State has detailed weighting mechanisms that vary by grade level (Washington State Office of Superintendent of Public Instruction, 2018, p. 42). For K-12 schools with all indicators, English language arts and math proficiency are each worth 15%, growth is worth 25%, graduation 25%, English language progress 5%, and school quality or student success indicators (averaged) 15%. Delaware aggregates five indicators (academic achievement, academic progress, school quality/student success, graduation rate, and progress toward English language proficiency) to create a summative index score for schools, which is then translated into an overall categorical identification (e.g., exceeds, meets, or meets few expectations) (Delaware Department of Education, 2019, p. 49). New York State gives the greatest weight to academic achievement and growth in elementary and middle schools and academic achievement and graduation rates in high schools, followed by progress toward English language proficiency and then chronic absenteeism and a college and career readiness index (New York State Education Department, 2017, p. 70).
Although a composite index may provide a more holistic portrait of school performance than each single measure, it may mask important differences between schools on individual dimensions. Moreover, some measures (disciplinary infraction rates and chronic absenteeism rates) may be noisier than others (such as test scores or graduation rates). More complex school data profiles generated by multiple measures contain more noise or can fluctuate more dramatically from year to year. Another challenge for combining different measures into a single framework is to connect these measures to a coherent framework for student and school success. In the absence of a conceptual framework, school systems will simply be adding more measures to their accountability system. Given these concerns, it is important for states to publish not only the composite but also the individual measures. For example, Washington State school report cards display each measure for a given school or district and allow users to view the breakdown for each by student subgroup (Washington State Office of Superintendent of Public Instruction, 2019). This display allows users to interpret and select the information that is most consistent with their theory of action.
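A small calculation shows how a weighted composite can hide differences on individual dimensions. The weights below are the Washington State K-12 weights quoted above; the assumption that every indicator is scored on a common 0-100 scale, the function name, and the two school profiles are hypothetical and do not reproduce any state's actual scoring procedure.

```python
# Washington State K-12 weights as quoted in the text (they sum to 1).
WA_K12_WEIGHTS = {
    "ela_proficiency": 0.15,
    "math_proficiency": 0.15,
    "growth": 0.25,
    "graduation": 0.25,
    "english_language_progress": 0.05,
    "school_quality_student_success": 0.15,
}


def composite_index(scores, weights=WA_K12_WEIGHTS):
    """Weighted sum of indicator scores (assumed to share a 0-100 scale)."""
    return sum(weights[name] * scores[name] for name in weights)


# Two hypothetical schools with very different profiles.
school_a = {
    "ela_proficiency": 60, "math_proficiency": 50, "growth": 70,
    "graduation": 80, "english_language_progress": 40,
    "school_quality_student_success": 90,
}
school_b = {
    "ela_proficiency": 80, "math_proficiency": 80, "growth": 60,
    "graduation": 60, "english_language_progress": 100,
    "school_quality_student_success": 70,
}

composite_a = composite_index(school_a)
composite_b = composite_index(school_b)
print(round(composite_a, 2), round(composite_b, 2))
```

Both hypothetical schools receive the same composite even though one is stronger on growth and graduation and the other on proficiency, which is the kind of difference that only the individual measures reveal.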
Second, our study highlights the importance of having a coherent school improvement framework to guide local districts and schools to carry out data-driven, evidence-based reform strategies. The appropriate use of data demands both technical expertise and a systemic understanding of the practical issues involved. For example, improving process measures such as collaborative teaching may predict student achievement gains, and policymakers and researchers can leverage the temporal relationship between these measures to diagnose issues and develop interventions to improve student outcomes. However, these process measures, often gathered through surveys or observations, contain measurement error, which may significantly attenuate the estimated relationships between school processes and outcome measures. What is more, the relationships between school processes and outcomes are inherently complex. School district leaders need a deep conceptual understanding of the mechanisms of change. This evidence-based practice demands high levels of both technical and theoretical expertise, which practitioners often may not have. Partnerships between researchers and practitioners that aim to build school districts' capacity for using data and evidence-based practices can be instrumental in this regard (Booker, Conaway, & Schwartz, 2019; Coburn & Penuel, 2016; Kane, 2017).

Limitations
We acknowledge several limitations of our work. First, this study by no means includes all measures that are used in school performance measurement systems across states under ESSA. Further research is needed to examine other popular measures (e.g., access to advanced coursework). Additionally, several measures, including the student behavioral and school process measures, may be noisy due to measurement error, and the analyses that use them may be underpowered due to small sample sizes. These issues limit this study's capacity to identify true relationships between these measures.

Conclusion
Despite these limitations, this paper provides useful empirical illustrations of the complex relationships between multiple measures used under the ESSA school performance measurement system. States and local practitioners will need consistent support as they select, collect data on, and use these measures to remedy underperformance and promote equity in schools.

Figure B2. Scatterplots of relationships between multiple measures in high schools

• Adults at school care about me.
• I enjoy going to school most days.
• I feel safe in my classroom.
• I am treated with as much respect as other students.
• I feel proud of my school.
• In my school, I feel that I belong to a group of friends.

Effective school management and procedures (Eigenvalue = 5.1, Cronbach's α = 0.84)
• This school has an effective process for making group decisions and solving problems.
• Conflict among staff is resolved in a timely and effective manner.
• I feel included in the decision-making process at this school.
• This school has a consistent process for identifying students who struggle academically.
• This school implements a clear plan of action when a student struggles academically.
• I receive the support I need to address my students' behavior and discipline problems.

Teaching and teacher supports (Eigenvalue = 1.31, Cronbach's α = 0.78)
• This school meets regularly and often to discuss student data.
• We share a common understanding of instructional best practices.
• We have adequate time to plan collaboratively.
• I have access to strategies and materials to support all learners in our classes.
• I receive the support I need to differentiate and modify instruction for my students.

Note. SGP = student growth percentile. Standard errors clustered by school are shown in parentheses. V-disparity measures the achievement gap between HUSC and non-HUSC. Difference in proficiency is the difference in proficiency rates between HUSC and non-HUSC. Value-added disparity is equal to the difference between school value-added measures for HUSC and non-HUSC. SGP disparity is the difference between median SGP for HUSC and non-HUSC at a given school. We set a minimum cell size of 10 (i.e., at least 10 students need to have observed data for each group at the school). Value-added is estimated starting in 2010-11 and only for grades 3-8. All models also include year fixed effects. *p < .05. **p < .01.

Note. Standard errors clustered by school are shown in parentheses. V-disparity measures the achievement gap between HUSC and non-HUSC. Difference in proficiency/graduation is the difference in proficiency/graduation rates between HUSC and non-HUSC. We set a minimum cell size of 10 (i.e., at least 10 students need to have observed data for each group at the school). Graduation rates are available starting in 2010-11. All models include year fixed effects. *p < .05. **p < .01.

Note. Standard errors clustered by school are shown in parentheses. We set a minimum "cell" size of 10 (i.e., at least 10 students need to have a positive value for the event for each group at the school). Attendance and discipline data are only available starting in 2012-13. All models include year fixed effects. *p < .05. **p < .01.

Education Policy Analysis Archives, Volume 28, Number 89, June 8, 2020. ISSN 1068-2341. Readers are free to copy, display, distribute, and adapt this article, as long as the work is attributed to the author(s) and Education Policy Analysis Archives, the changes are identified, and the same license applies to the derivative work. More details of this Creative Commons license are available at https://creativecommons.