education policy analysis Using Global Observation Protocols to Inform Research on Teaching Effectiveness and School Improvement: Strengths and Emerging Limitations

Abstract: An essential feature of many modern teacher observation protocols is their “global” approach to measuring instruction. Global protocols provide a summary evaluation of multiple domains of instruction from observers’ overall review of classroom processes. Although these protocols have demonstrated strengths, including their comprehensiveness and advanced state of development, in this analysis we argue that global protocols also have inherent limitations affecting both research use and applied school improvement efforts. Analyzing the Measures of Effective Teaching study data, we interrogate a set of five potential limitations of global protocols. We conclude by discussing fine-grained measures of instruction, including tools that rely on automated methods of observation, as an alternative with the potential to overcome many of the fundamental limitations of global protocols.


Introduction
Teacher observation protocols are a central element of efforts to evaluate teachers and inform instructional improvement. 1 An essential feature of many modern protocols is their "global" approach to measuring instruction. Global protocols provide a summary evaluation of multiple domains of instruction from observers' overall review of classroom processes, for example by scoring features of classroom discourse over the entire class session or interval of class time observed. Such protocols enable an observational approach to studying classroom instruction that offers both a powerful lens on the distribution of opportunity to learn between and within schools and a framework for instructional improvement grounded in knowledge and information about the teaching process. Yet recent evidence suggests it is often difficult to obtain a robust portrait of classroom talk and other instructional features using existing protocols, both for research purposes and in applied school improvement efforts (Gitomer et al., 2014).
In this analysis we argue that both the strengths and limitations of existing protocols stem from their ambitious design parameters; the comprehensiveness and broad intended use of these protocols come with tradeoffs. Empirically, we demonstrate these concerns with reference to specific elements of each protocol and features of their use in the Measures of Effective Teaching (MET) study. Along with the Understanding Teaching Quality (UTQ) study and other research since approximately 2010, the MET study represents an unprecedented effort to understand the limits and possibilities of teacher observation using a suite of well-developed observational protocols that are most commonly used in school districts. Analyzing the MET data, we hypothesize that global protocols have several basic limitations that constrain their potential to inform both practice and research, posing and answering the following research questions: (RQ1) Do global protocols offer precise discriminations in lesson quality? (RQ2) How independent are protocol sub-domains that are designed to capture different aspects of effective teaching? (RQ3) How sensitive is measurement reliability to rater training? (RQ4) Do protocols identify the teacher's own contribution to instructional quality beyond what students themselves may bring to the classroom? (RQ5) Do protocols exclusively evaluate a continuum of effective practice, making it difficult or impossible to detect tradeoffs and instructional adaptation?
Treating these hypothesized limitations as a set of five research questions, we investigate the extent to which each is present in the MET data. Some of these concerns have been discussed with regularity in the literature (#3; e.g. Bell et al., 2014; White, 2018); others became especially salient to us in our own analyses seeking to document interactions between instructional practices and classroom composition in the MET data (Aucejo et al., 2018), an investigation relevant to understanding the effects of between- and within-school sorting of students as well as teacher evaluation policies. In particular, concerns #4 and #5 make it very difficult to understand how teachers might adapt instruction to match the needs of students.
We conclude by discussing how the development of fine-grained observational tools (for example, ones that record and carefully analyze individual utterances, questions, turns at talk, etc.) offers the potential for exceptionally reliable and precise teacher assessment and feedback and might prove an important complement to traditional teacher observation, overcoming the limitations we highlight.

Teacher Observation, Learning, and Instructional Growth
The use of classroom observation to drive school improvement is situated within both accountability and process models of school improvement. In this analysis we focus on the utility of teacher observation within the latter conceptual framework, which views teacher observation as providing a critical external source of information to spur teacher learning and instructional growth (Clarke & Hollingworth, 2002; Goe et al., 2012).
At the end of the NCLB reform era, there is an increasing consensus that process models of improvement, including evidence-based innovations, rather than staffing, accountability, or incentive systems, hold the most promise for improving teachers' practice (Gamoran, 2012; Gamoran, 2013; Kelly, 2012). For example, in their 2011 NRC report on test-based accountability, Hout and Elliott conclude, "…overall effects on achievement tend to be small and are effectively zero for a number of programs." (National Research Council, 2011). Similarly, in their evaluation of three randomized trials of incentive pay programs for teachers, Yuan et al. (2013) report that teachers "...did not report their program as motivating," and "…none of the three programs changed teachers' instruction." In contrast, a variety of reform approaches more closely related to the process of instruction, from comprehensive school reform models (see e.g. Borman et al., 2007), to professional development/coaching (Biancarosa et al., 2010), to one-on-one tutoring (Farkas & Durham, 2007), have proven effective in improving the quality of instruction and achievement growth.
Observational information may greatly enhance capacity for instructional improvement in two ways. First, teacher learning entails growth in pedagogically relevant knowledge (Shulman, 1987) and the ability to evaluate classroom processes and outcomes in light of that knowledge. Classroom observations provide an external viewpoint into classroom processes and outcomes, and an opportunity to structure evaluation around core concepts of teaching and learning. Relatedly, observational protocols can be used to link pre-service teacher education to in-service teacher professional development, generating continuity in teacher learning by carrying forward concepts and concerns. Second, teacher learning may often occur when teachers make improvements through experimenting with classroom practice and then reflecting on outcomes (Clarke & Hollingworth, 2002; Goldsmith et al., 2014). Observational systems can provide structure, motivation, and feedback in the process of "experimentation," reflection, and evaluation by peers, mentors, and teachers themselves.
Overall, instructional reform requires professional development (PD) that helps teachers build and use new knowledge and pedagogical approaches (Darling-Hammond & McLaughlin, 1995). While the professional development that teachers have traditionally received has been criticized as intellectually superficial (e.g. Ball & Cohen, 1999), when PD efforts are substantial, and founded on scientifically-based research (as in the case of university-district partnerships), they are in many cases highly effective (Yoon et al., 2007). Observation systems can potentially facilitate a wide array of process interventions with broad application to job-embedded teacher professional development (Camburn, 2010; Camburn & Han, 2015; Croft et al., 2010; Desimone et al., 2002; Putnam & Borko, 2000), including activities that strengthen social ties among teachers in a professional learning community (Coburn & Russell, 2008; Darling-Hammond et al., 2009; Penuel et al., 2009).

An Overview of Global Teacher Observation Protocols
The use of teacher observation protocols by districts and states to evaluate teachers and inform practice is now well established. Prompted by the federal Race to the Top initiative (RttT), many states (e.g. New York, Ohio, Tennessee, North Carolina, Colorado, Michigan) adopted composite systems of teacher evaluation that included systematic teacher observations along with other components. For example, the New Jersey Department of Education's Teacher Evaluation plan, Achieve NJ, rates teachers on a four-category scale (highly effective to ineffective), where teacher practice on a state-approved observation instrument accounts for 70-85% of the total evaluation score. 2 New Jersey allows individual districts to choose from a wide array of protocols in conducting teacher evaluations, including versions of the Danielson Framework for Teaching or FFT (The Danielson Group, 2013), the Classroom Assessment Scoring System or CLASS (see e.g. Hamre et al., 2013), Marzano's Causal Teacher Evaluation Model (Marzano Research Laboratory), and others (New Jersey Department of Education, 2019). In the case of New Jersey, teacher evaluations weighted heavily toward classroom observation scores serve multiple monitoring and staff improvement purposes: "All teachers receive individual professional development plans based on their ratings. Teachers rated Ineffective or Partially Effective work with their principals to create a Corrective Action Plan with targeted professional development for the subsequent year. To maintain tenure, all teachers (regardless of hire date) have to continue to earn a rating of Effective or Highly Effective" (New Jersey Department of Education, 2019).
In contrast to the simple subjective ratings of teachers' overall performance sometimes used in employment decisions (see Brandt et al., 2007), the global observation protocols encouraged by RttT generally have several properties: (1) they allow trained raters to score observations of teaching on multiple dimensions and sub-domains, often totaling several dozen or more specific elements of teaching; (2) they are based on rigorous research and/or aligned with established teaching standards; and (3) they are designed not only for evaluation but also to enhance professional development. Relative to simple survey reports used in large-scale research (e.g. Gamoran & Carbonaro, 2002; Kelly & Majerus, 2011; Newmann et al., 1996; Raudenbush et al., 1993), these systems are a great leap forward in offering the opportunity for independent, occasion-specific measures of teaching.
In addition to FFT and CLASS referenced above as approved by New Jersey for use in teacher evaluation, numerous observation systems have been developed including the Mathematical Quality of Instruction or MQI (Hill et al., 2008), the Protocol for Language Arts Teaching Observation or PLATO (Grossman et al., 2014), the Classroom Strategies Assessment System or CSAS (Reddy & Dudek, 2014), The Thoughtful Classroom (Silver Strong & Associates), The Five Dimensions of Teaching and Learning (The University of Washington, Center for Educational Leadership), and the TAP Rubric (National Institute for Excellence in Teaching). Although the 2015 Every Student Succeeds Act gave states greater autonomy over teacher accountability, Close et al. (2018) report most states are still using "the same or slightly different versions of the previously required systems." In general these protocols provide a comprehensive assessment of instruction that follows a formative (in the statistical sense) conception of instruction insofar as distinct components of instruction collectively constitute rather than reflect effective instruction (See Jarvis et al., 2003, for discussion of formative construct indicators). While some are clearly used for evaluative purposes, they are also well suited for informing teacher learning within the context of a variety of instructional leadership practices, mentoring, induction, and other organizational improvement efforts. Importantly, the original intent of some global protocols was explicitly for research and teacher development. For example, in one of their early working papers, the developers of PLATO stated, "Ultimately, we hope to create a tool that is not only useful for research on teaching, but can be used for teacher development as well." (Grossman et al., 2010). Here we briefly provide further information on the specific protocols used in the MET study and what we see as their particular strengths and special features.

Protocols Used in MET
The Danielson Framework for Teaching or FFT (The Danielson Group, 2013), first designed in 1996, is a comprehensive observational instrument designed to apply to all disciplines and a wide array of grade levels. FFT includes four domains: (1) planning and preparation, (2) the classroom environment, (3) instruction, and (4) professional responsibilities, with domains Two and Three pertaining to in-class observation, and domains One and Four entailing additional materials and out-of-class interaction with the teacher. Within domains Two and Three, in the MET study, classroom instruction was scored on a total of eight components (sub-domains) on a four-point scale: unsatisfactory, basic, proficient, or distinguished. To structure scoring, components are further composed of elements, and raters look for indicators and critical attributes of performance at a given level. A set of possible examples for each component aids in assigning a score. In our view, FFT succeeds in offering an exceptionally comprehensive and broadly applicable observational protocol (like CLASS, it can be used for a wide variety of subject matter areas and grade levels but is less explicitly focused on teacher-student interactions; Goe et al., 2012). In addition, while the framework is intentionally engagement-focused and student-centered (The Danielson Group, 2013, p. 5), it is a well-balanced protocol with emphasis on challenge and content coverage.
The Classroom Assessment Scoring System or CLASS (Hamre et al., 2013; La Paro et al., 2004) is a standardized observational system that focuses in particular on the quality of teacher-student interactions. CLASS is organized into three domains (emotional support, classroom organization, and instructional support), each having several subdomains further defined by multiple indicators. Several versions of CLASS are now available that offer a tailored system sensitive to the developmental and pedagogical context of students at different grade levels. In MET, raters scored 15-minute intervals of classroom instruction at the dimension level using a 7-point scale labeled simply from low to high. While the constructs it measures overlap with those of FFT, CLASS has an especially strong focus on emotional support, which it defines at the domain level. Hamre et al. (2013, p. 466) describe CLASS's emphasis on social and emotional supports as targeting "key elements" of instruction, which is well motivated by developmental theories of self-determination and attachment. At the same time, we view it as a well-balanced protocol, with, for example, essential indicators of challenge within the dimensions of instructional support (e.g. concept development, analysis and problem solving).
The Protocol for Language Arts Teaching Observation (PLATO) is a classroom observation tool designed for 4th-9th grade English/Language Arts (ELA) instruction (Grossman et al., 2014). PLATO focuses on 13 elements of instruction related to four underlying domains of instruction: the disciplinary demand of classroom activity and discourse, instructional scaffolding of ELA content, representations and use of content, and the classroom environment. In MET, 8 of the 13 elements were scored during 15-minute instructional segments on a 4-point scale (almost no evidence, limited evidence, evidence with some weaknesses, consistent strong evidence). While there is conceptual overlap with other observational protocols (e.g. with CLASS's instructional support domain), the PLATO elements have their origin specifically in research on English/language arts instruction and are closely linked with literacy learning (e.g. the text-based instruction element). Early research with PLATO found that the choice of instructional activity (e.g. whole class literature instruction vs. small group or paired writing instruction) affected PLATO scores; as such, PLATO is most reliable when aggregated over several lessons that capture a range of instructional activity (Cor, 2011; Grossman et al., 2010).
The Mathematical Quality of Instruction (MQI) protocol was developed by Heather Hill and colleagues as a subject-specific observational system between 2003 and 2010 (Hill et al., 2008; Hill et al., 2011). MQI considers 6 elements (richness of the mathematics; errors and imprecision; working with students and mathematics; student participation in meaning-making and reasoning; explicitness and thoroughness; and connections between classroom work and mathematics) that concern the relationships between teachers, students, and content. 3 In MET, raters scored observations with lessons broken into 4-7.5 minute segments, as well as the whole observation, on a 3-point scale (dimension not present, partially present, or predominantly present) using scoring anchors provided for each score on each element. 4 The data include both the holistic ratings based on the whole observation and the segment-level ratings. As with PLATO, even though MQI is subject-specific, there is conceptual overlap between MQI and the other protocols (e.g. student participation in meaning-making captures constructivist principles that align with the FFT protocol). However, MQI stands out as being the most heavily focused on teachers' exhibition of pedagogical content knowledge in mathematics. Indeed, in MET, the MQI raters also provided a "lesson-based guess" at mathematical knowledge for teaching.

Data and Methods
The Measures of Effective Teaching Study collected data on teachers' instructional practices in six school districts over a two-year period from 2009-2010 to 2010-2011. Participating school districts were generally very large districts, encompassing but not limited to urban central-city schools in all cases: Charlotte-Mecklenburg (NC) Schools, Dallas (TX) Independent School District, Denver (CO) Public Schools, Hillsborough County (FL) Public Schools, Memphis (TN) City Schools, and the New York City (NY) Department of Education. As general indicators of sociodemographic context, MET teachers ranged from a low of 21.8% white in Memphis to a high of 92.2% in Denver (56.8% overall), while 34% of MET students were white and 54% received a subsidized school lunch (Kane et al., 2012, pp. 16-17).
While the random assignment of teachers to classrooms in Year 2 and other important features of MET are described elsewhere, here we provide a brief overview of relevant features of the teacher sample and observation process (see also the MET user guide, ICPSR document 34771). The MET study scored video observations of instruction using the high-quality observational protocols previously described. In contrast to typical use in evaluation, there were no stakes attached to the observational measures in MET; they were collected purely for research purposes. Also unlike in typical use, the raters were not local school administrators, curriculum coordinators, or lead teachers in the teachers' own schools, but impartial expert raters. While we will concentrate on analyzing the limitations of these protocols here, they were state-of-the-art measures at the time of the MET data collection.
Within each of the six MET districts, teachers were voluntarily recruited from "traditional" public elementary and middle schools; alternative schools, vocational schools, special education schools, and small schools with fewer than three teachers per grade/subject combination were excluded (this last criterion precluded many charter schools from participating). 5 Participating teachers who met a few basic eligibility parameters (e.g. they were not team teaching/looping, planned to remain at the school for the following year, and were part of an eligible group of teachers that could be randomized in Year 2) received a $1500 incentive. Overall, MET included a diverse sample of teachers (only 56.8% were white), generally representative of their districts (Kane et al., 2012).
Sample sizes vary substantially across outcome measures, both because some observational protocols pertain to both English and math (FFT, CLASS), while others are subject specific (e.g. PLATO), and because some observational protocols were utilized to code a smaller subset of lessons/sections. In Year 1, approximately 1570 teachers contributed videos for CLASS and FFT scoring and 940 teachers contributed videos for MQI and PLATO scoring. The number of videos also varied by observation protocol with roughly 7800 for CLASS and FFT and 3500 for PLATO/MQI. Finally, some protocols have fewer ratings per video; for example FFT has about one rating per video while CLASS has closer to two (sample sizes in the first two tables report the number of ratings for each observation protocol).

Observation and Measurement Process
In Year 1 of the study, lessons were video recorded during the spring semester (February-June) and spread out in an effort to increase representativeness. The recorded lessons were balanced between "focal lessons" requested by the MET researchers and lessons of the teacher's choice. Teachers were trained to operate the video and audio recording equipment, which consisted of a camera focused on the board and one providing a 360-degree view of the room (excluding nonparticipating students), and two microphones, one for teacher audio and one for overall classroom audio. These were later combined into a single video/audio channel for lesson scoring.
The observation rating process included 902 current and former teachers using an online platform to score video observations (in addition to the MET user guide, see the MET Observations Measure Report, ICPSR 34771). Videos were scored in four-hour shifts, where raters used a single protocol to score the first 30-35 minutes of each video, often divided into smaller segments of time for a given protocol (the CLASS protocol uses 15-minute segments). Raters were trained over a 17-25 hour period, using a combination of MET-developed websites and existing ones associated with a given protocol. Rating quality was further enhanced with calibration videos at the beginning of each rating session, by interspersing "validity videos" into each rater's workload, and by consultation with scoring leaders who "back-scored" a sub-sample of videos to identify raters who needed additional training. 6

Analytic Plan
To assess the extent of the five postulated limitations of global observation protocols, we perform descriptive and inferential analyses across multiple levels of analysis, including ratings of the same classroom observations across different observers, features of observation scores across lessons of the same teacher and in the sample as a whole, and observation scores in relation to features of teachers and classrooms. While analyses related to some of the questions can be found in the existing literature, with the exception of Table 3, we provide novel analytic content in our calculations and comparisons. 7 Although the sample varies somewhat as noted, we generally rely on the Year 1 data, for grades 4-9, when teacher placement in classrooms occurred naturally. In Table 4 we include both Year 1 and Year 2 data (teacher assignment in Year 2 is random rather than naturalistic), because we seek to make inferences about potentially small but meaningful compositional effects where statistical power is a major concern.
Research Question One: Discriminations in lesson quality. To answer our first research question we report observation-level descriptive statistics for the four protocols. To measure the extent to which the ratings discriminate between lessons of different quality, we consider standard deviations of each subdomain and the number of scores that cluster at the modal rating categories (there is generally a sharp drop off between the prevalence rate in the modal categories and the remaining categories). For example, FFT has four response categories, with the middle two categories representing a strong mode. We also report skewness of the distribution (positive values indicating the mean exceeds the median), and the standardized kurtosis (values greater than 3 are more peaky than the standard normal distribution).
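To make the descriptive measures above concrete, the following is a minimal Python sketch of how the standard deviation, skewness, standardized kurtosis, and share of ratings in adjacent modal categories could be computed for one sub-domain. The function name describe_ratings, the width argument, and the toy scores are hypothetical illustrations rather than the MET data or the analysis code used here. The sketch assumes population (not sample-adjusted) moments and defines the modal share as the largest proportion of ratings falling in any run of adjacent categories.

```python
import math
from collections import Counter


def describe_ratings(scores, categories, width=2):
    """Summarize the spread of ordinal ratings for one sub-domain.

    scores: list of numeric ratings (one per observation).
    categories: ordered list of all possible rating values.
    width: number of adjacent categories to pool (assumption: 2 for a
           four-point scale, 3 for a seven-point scale).
    Note: constant scores (zero variance) are not handled in this sketch.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Population central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    counts = Counter(scores)
    # Largest number of ratings found in any `width` adjacent categories.
    best = max(
        sum(counts[c] for c in categories[i:i + width])
        for i in range(len(categories) - width + 1)
    )
    return {
        "sd": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,   # positive: long right tail
        "kurtosis": m4 / m2 ** 2,     # equals 3 for a normal distribution
        "modal_share": best / n,
    }


# Toy four-point ratings that cluster heavily in the middle two categories.
toy_scores = [1] * 5 + [2] * 45 + [3] * 45 + [4] * 5
toy_summary = describe_ratings(toy_scores, [1, 2, 3, 4])
print(toy_summary)
```

In this toy example 90% of the ratings fall in the two middle categories, which is the kind of clustering the modal-share statistic is meant to expose.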

Research Question Two: Independence in sub-domains.
To answer our second question, we perform a basic exploratory factor analysis at the level of the subdomains within protocols to consider whether there is evidence of the subdomains being independent enough to pick up different aspects of teaching, or factors. The metrics we use include several often used rules of thumb for construct independence: the number of factors with eigenvalues greater than 1 (Kaiser criterion), the number of factors needed to explain 90% of the variance in the covariance structure (eigenvalues), and the cumulative proportion of variance explained by adding each additional factor (Richards et al., 2013).
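The rules of thumb listed above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it works from the eigenvalues of a sub-domain correlation matrix (a principal-components-style shortcut) rather than fitting a full exploratory factor model, and the function names, the hand-written Jacobi eigenvalue routine, and the toy correlation matrix are all hypothetical. The Jacobi routine is included only to keep the example dependency-free; a numerical library routine such as numpy.linalg.eigvalsh would normally be preferable.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations (descending)."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle chosen so that the (p, q) entry becomes zero.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def factor_summary(corr, variance_target=0.90):
    """Kaiser count, cumulative variance shares, and factors needed for the target."""
    eig = symmetric_eigenvalues(corr)
    total = sum(eig)
    cumulative, running = [], 0.0
    for value in eig:
        running += value
        cumulative.append(running / total)
    needed = next(
        i + 1 for i, share in enumerate(cumulative) if share >= variance_target - 1e-12
    )
    return {
        "eigenvalues": eig,
        "kaiser_count": sum(1 for value in eig if value > 1.0),
        "cumulative": cumulative,
        "needed_for_target": needed,
    }


# Toy example: four sub-domains that all correlate 0.7 with one another.
toy_corr = [[1.0 if i == j else 0.7 for j in range(4)] for i in range(4)]
toy_factors = factor_summary(toy_corr)
print(toy_factors["kaiser_count"], toy_factors["needed_for_target"])
```

With uniformly high correlations only one eigenvalue exceeds 1, which is the single-factor pattern that would indicate weak independence among sub-domains.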
Research Question Three: Reliability and training effects. While Questions One and Two rely on observation-level data (a unique observation for each class session as in real-world use), Question Three relies on a given class session being scored by multiple raters, which occurred as part of the MET's inter-rater reliability investigation (a phase of analysis commonly done early in a study before the rating process occurs at scale). For each of the four protocols, we compare three measures of inter-rater reliability across domains (percent exact agreement, simple kappa, and quadratic-weighted kappa). These measures are reported using two different sets of rater pairs: pairs where both raters were trained normally and pairs where one rater was an expert coder. The comparison across rater types is a rough indication of the importance of coder expertise and training.
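The three agreement measures named above can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook formulas for Cohen's kappa and its quadratic-weighted variant; the function name agreement_stats and the toy ratings are hypothetical and do not reproduce the MET scoring data.

```python
def agreement_stats(rater_a, rater_b, categories):
    """Percent exact agreement, simple kappa, and quadratic-weighted kappa.

    rater_a, rater_b: equal-length lists of ratings of the same sessions.
    categories: ordered list of all possible rating values.
    Note: if both raters use a single category throughout, the kappa
    denominators are zero; that edge case is not handled in this sketch.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and equal in length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Cross-tabulate the two raters' scores.
    table = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        table[index[a]][index[b]] += 1
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    exact = sum(table[i][i] for i in range(k)) / n
    chance = sum(row[i] * col[i] for i in range(k))  # agreement expected by chance
    simple_kappa = (exact - chance) / (1 - chance)

    def weight(i, j):
        # Quadratic disagreement weight: far-apart categories cost more.
        return ((i - j) / (k - 1)) ** 2

    observed = sum(weight(i, j) * table[i][j] / n for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    weighted_kappa = 1 - observed / expected
    return {"exact": exact, "kappa": simple_kappa, "weighted_kappa": weighted_kappa}


# Toy ratings on a four-point scale; the two disagreements are one category apart.
toy_a = [1, 2, 3, 4, 2, 3, 2, 3]
toy_b = [1, 2, 3, 4, 3, 2, 2, 3]
print(agreement_stats(toy_a, toy_b, [1, 2, 3, 4]))
```

Because the only disagreements in the toy data are between adjacent categories, the quadratic-weighted kappa is noticeably higher than the simple kappa, which treats every disagreement as equally serious.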
Percent exact agreement statistics, while intuitive, are difficult to compare across protocols, because as the number of response categories increases, the likelihood of exact agreement falls accordingly (e.g. for CLASS, which has 7 categories). Simple kappa statistics take into account base-rate chances of agreement, making them well-suited for comparing reliability across protocols with different base rates, and especially different sub-domains within a given protocol (which generally have the same number of response categories). Finally, quadratic-weighted kappa statistics additionally adjust for the number of response categories (i.e. ratings in near rather than far categories count as partial evidence of consistency), making them more comparable across protocols and providing the most holistic comparison to the mathematical model of independence (akin to a chi-square statistic). 8
Research Question Four: The teachers' own contribution to instruction. For the fourth question, we provide indirect evidence on how sensitive the MET protocols might have been to influences of social norms and process beyond the teachers' control. We build closely on Campbell and Ronfeldt's (2018) analysis of FFT to analyze the association between the teacher rating from each of the protocols, measured by averaging scores across subdomains, and classroom characteristics, including average prior test performance on ELA and math, percentage black, Hispanic, Asian, other non-white, male, special education, English language learners, free/reduced-price lunch, class size and teacher value-added. The extent to which classroom characteristics are individually or jointly significant predictors of FFT, CLASS, MQI or PLATO may be suggestive of influences beyond the teacher's control. We also assess summative significance by considering the total explanatory power. Controlling for teacher value-added helps to address the alternative explanation that certain classroom characteristics are being matched to more effective teachers.

Research Question Five: A continuum of effective practice.
Are sub-domain scores always positively correlated? For question 5, we examined the covariance matrix of the sub-domains of the MET protocols. A weak test of this question would rely on the covariance matrix for the data as a whole. However, in this case consistently positive pair-wise correlations (e.g. domain 1 of FFT paired with domain 2, then with domain 3, etc.) are likely to be found because teacher training, effort, etc. induce a positive association. Instead, we focus on intra-teacher variance (i.e. looking only within class sections, a teacher and specific class of students, and pooling the results). Thus, we consider how often negative pair-wise correlations occur for specific sections/teachers, which might be indicative of potential tradeoffs that teachers may face in choosing emphasis among the subdomains. Given that we have about 8 observations per class section (varying across protocols), some weak negative correlations will occur merely due to chance. Thus, we count the number of pair-wise correlations that were more negative than −0.2 (a somewhat arbitrary cutpoint, but it should remove some of the chance negative correlations). For example, for FFT, we consider 28 pair-wise correlations among 7820 observations nested within 921 sections. We present the proportion of negative pair-wise correlations, averaging across sub-domains, along with the subdomain pairings with the highest average (most negative instances) and lowest average (fewest negative instances) incidence rates of apparent tradeoffs.
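The within-section counting procedure described above can be sketched as follows. This is a minimal Python illustration under several assumptions: the data layout (a dictionary of sections, each holding a list of per-observation score dictionaries), the function names, and the toy scores are hypothetical, and pairs in which a sub-domain does not vary within a section are simply skipped because the correlation is undefined. With 8 sub-domains the loop visits 28 pairs per section, matching the FFT count mentioned above.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(x, y):
    """Pearson correlation, or None when either series has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def tradeoff_rates(sections, threshold=-0.2):
    """Share of sections in which each sub-domain pair correlates below threshold.

    sections: dict mapping a section id to a list of observations, where each
              observation is a dict of sub-domain name -> score.
    """
    flagged, usable = Counter(), Counter()
    for observations in sections.values():
        names = sorted(observations[0])
        for a, b in combinations(names, 2):
            r = pearson([o[a] for o in observations], [o[b] for o in observations])
            if r is None:
                continue  # undefined within this section
            usable[(a, b)] += 1
            if r < threshold:
                flagged[(a, b)] += 1
    return {pair: flagged[pair] / usable[pair] for pair in usable}


# Toy data: two sections, three sub-domains, four observed lessons each.
toy_sections = {
    "s1": [{"A": 1, "B": 4, "C": 1}, {"A": 2, "B": 3, "C": 2},
           {"A": 3, "B": 2, "C": 3}, {"A": 4, "B": 1, "C": 4}],
    "s2": [{"A": 1, "B": 1, "C": 2}, {"A": 2, "B": 2, "C": 2},
           {"A": 3, "B": 3, "C": 2}, {"A": 4, "B": 4, "C": 2}],
}
print(tradeoff_rates(toy_sections))
```

Averaging the resulting rates across pairs gives an overall incidence of apparent tradeoffs, and sorting them identifies the pairings with the highest and lowest rates.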

Imprecise Discriminations in Lesson Quality
While the comprehensive focus of existing global observational protocols is noteworthy, one risk of trying to capture so many dimensions of teaching may be that only rough, imprecise distinctions can be made concerning specific domains of instruction. To answer our first research question, Table 1 reports observation-level descriptive statistics for the four protocols, highlighting the sub-domains with the smallest and largest standard deviations. As a representation of the tendency for observations to cluster in one or a few categories (yielding a low standard deviation), we report the proportion of observations that fall into adjacent modal categories. 9 For example, on the 8 FFT sub-domains, the proportion of teacher observations in the middle two out of four categories (basic, proficient) ranged from a low of 87% (Using Question and Discussion Techniques) to a high of 96% (Communicating with Students). For FFT, the sub-domains exhibit little variation in their distributions, with highly similar standard deviations and modest differences in skewness and kurtosis. PLATO is also scored on a four-category scale, and shows somewhat greater spread than FFT. Yet, even the sub-domain of PLATO with the most variable scores (Strategy Use and Instruction) has 81% of observations in the bottom two categories. CLASS is scored on a 7-point scale anchored with "low," "middle," and "high," and thus offers the possibility of greater distinctions. Yet, even the most variable sub-domains (three are tied with a SD of 1.28) contain 73% or more of the observations in just 3 out of 7 categories, while the least variable (lack of negative climate) contains 98% of the observations in three categories (and 92% in the top two categories). For MQI, we show both the holistic scores and the individual 7.5-minute segment scores. The sub-domain with the greatest variability in both segment and holistic scoring, Explicitness and Thoroughness, contained 48% and 58% respectively in the bottom and middle category, though this was only evaluated for a small proportion of observations. The sub-domain with the least variability, Student Participation in Meaning Making and Reasoning, contained 82% and 81% of observations in the lowest category in the segment and holistic ratings respectively. We also report results for the holistic Overall Mathematical Quality of Instruction score, where 82% of lessons score in the middle category. In both the CLASS and MQI protocols, there are fairly substantial differences across subdomains; some are quite "peaky" while others are less so.
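One way to compute a clustering statistic of this kind is sketched below. The exact definition of "adjacent modal categories" is given in a footnote not reproduced here, so the operationalization is an assumption: we take the largest share of observations that falls within any run of `width` neighboring score categories. The function name and the example scores are hypothetical.

```python
from collections import Counter


def adjacent_modal_share(scores, categories, width=2):
    """Largest share of scores falling in any `width` adjacent categories.

    categories: the ordered list of possible score values, e.g. [1, 2, 3, 4].
    Assumption: this is one plausible reading of "adjacent modal categories".
    """
    counts = Counter(scores)
    windows = (
        sum(counts[c] for c in categories[start:start + width])
        for start in range(len(categories) - width + 1)
    )
    return max(windows) / len(scores)


# Made-up four-category example in which the middle two categories dominate.
example = [1] * 2 + [2] * 50 + [3] * 40 + [4] * 8
share = adjacent_modal_share(example, [1, 2, 3, 4], width=2)
```

For a 7-point scale such as CLASS, the same function with `width=3` would correspond to the "3 out of 7 categories" comparisons reported above.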
Overall, while it may be the case that in fact a strong majority of teachers' lesson qualities are indeed generally adequate and "in the middle," we worry that the tendency for scores to cluster in a few modal categories in global protocols limits their ability to guide improvement. Further, teachers may disregard protocol results when they do not have examples of high or low ratings from which they can learn how to improve their teaching. They may also become discouraged about teaching when they only rarely see the potential to excel and lack positive feedback from these ratings (e.g. in the case where 5% or less of observations achieve an exemplary score).

Lack of Independence in Sub-Domains
Do global observational protocols capture multiple domains of instructional practice, such that teachers can receive targeted feedback on what areas of instruction most need improvement? In the systems of observation used in MET and the Understanding Teaching Quality (UTQ) study, individual domains of instruction, ostensibly critical to focusing improvement efforts, are not as separable in practice as they are in theory. Liu et al. (2019) examined the covariance structure of FFT observation scores in three sets of data, low-stakes observations from the research-focused UTQ study, and two practice-based implementations in the Understanding Consequential Assessment Systems for Teachers study. In all cases, they found high correlations across the four FFT domains and eight sub-domains such that a single factor structure best fit the data.
In Table 2, we report results of a basic exploratory factor analysis for all four observational protocols (with multiple specifications of MQI). In all cases, whether one focuses on the number of eigenvalues above 1, the number needed to reach 90%, or the apparent point of diminishing returns as additional factors are added, it is possible to specify a simpler structure, with fewer discernible latent factors than the number of sub-domains that the protocols consist of and intend to measure separately. One possible explanation for the consistency in sub-domain scores is that training and other teacher quality factors create a similar level of competence across domains. Another source of consistency is that while the sub-domains are discretely defined, they are also in some cases conceptually similar and part of a larger construct (e.g. in MQI, Richness of the Mathematics and Errors and Imprecision both relate to teachers' content and pedagogical content knowledge). An alternative explanation is that features of the observation system, such as a tendency for overall perceptions to create a halo effect, create artificial consistency in subdomain scores (Liu et al., 2019; McCaffrey et al., 2015). Humphry and Heldsinger (2014) argue that consistency in the structural design of rubrics (global teacher observation protocols fall into the general category of a rubric), where all scoring domains have the same small number of response categories, can create similarity in domain scores, both because raters are prevented from making the finer distinctions they are capable of for some domains and because the common structure can generate conceptual overlap and repetition/redundancy in score descriptions. While the protocols we analyze seem susceptible to the structural concern Humphry and Heldsinger describe, actual conceptual overlap, and/or rater focus on over-arching concepts (e.g. student-centered instruction) rather than discrete domains of teacher practice, would remain a concern even with structural modifications.
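The first two retention criteria mentioned above can be illustrated with a short sketch. It is an approximation under assumptions, not the specification behind Table 2: it uses eigenvalues of the sub-domain correlation matrix (a principal-components style shortcut rather than a full factor model), it reads "reach 90%" as cumulative share of total variance, and the function name and input layout are hypothetical.

```python
import numpy as np


def factor_retention_summary(scores, variance_target=0.90):
    """Summarize eigenvalue-based retention criteria.

    scores: 2-D array, rows are observations and columns are sub-domains.
    """
    corr = np.corrcoef(scores, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted largest first
    cumulative_share = np.cumsum(eigenvalues) / eigenvalues.sum()
    return {
        "eigenvalues": eigenvalues,
        # Kaiser-style rule: count eigenvalues above 1.
        "n_above_one": int((eigenvalues > 1).sum()),
        # Smallest number of components whose cumulative share meets the target.
        "n_for_target": int(np.searchsorted(cumulative_share, variance_target) + 1),
    }
```

If sub-domain scores are driven largely by one common factor, `n_above_one` will typically be 1 even when the protocol defines many sub-domains, which is the pattern the text describes.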

Reliable Measurement Requires Training and Monitoring
How dependent on expert training are current global protocols in offering reliable assessments of instruction? Certainly, in using any observational data to make inferences about teacher effectiveness, a robust sampling process, with multiple representative observations per teacher, is needed. In much prior research, four observations per teacher has been used as a target in making inferences at the teacher level (e.g. Gamoran et al., 1995; Kane et al., 2012; Kelly, 2007). The reliability of measured instruction improves considerably, from, for example, two to four observations (Kane et al., 2012, Table 11), but thereafter begins to reach a point of diminishing returns (Kelly et al., 2018, endnote 8). At a more basic level, affecting reliability at both the teacher and observation level, robust observational measurement requires adequate training and monitoring of observers. Results of observational studies to date suggest that achieving reliable observational ratings of teaching quality is challenging.
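The diminishing returns to additional observations can be illustrated with the standard Spearman-Brown formula for the reliability of an average of k parallel measurements. This is a textbook approximation offered for intuition only; it is not the method used in the cited studies, and the single-observation reliability of 0.30 below is a made-up value.

```python
def spearman_brown(single_observation_reliability, k):
    """Reliability of the mean of k observations (Spearman-Brown formula)."""
    r = single_observation_reliability
    return k * r / (1 + (k - 1) * r)


# Hypothetical single-observation reliability, for illustration only.
curve = {k: spearman_brown(0.30, k) for k in range(1, 9)}
gains = [curve[k + 1] - curve[k] for k in range(1, 8)]
```

Each successive entry in `gains` is smaller than the one before it, so under this model every added observation buys less reliability than the previous one.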
One way to demonstrate the importance of training and rater competency and concentration is to compare the inter-rater reliability under "normal" or at-scale conditions, to inter-rater reliability in a university or research-center setting (in MET, scores from content experts at the research firm contracted to collect the data are available). Table 3 reports interrater agreement statistics from the MET data. For each of the four protocols, the domain with the smallest and largest discrepancy in the inter-rater reliability (simple kappa) between rater pairs where both were trained normally and rater pairs where one rater was an expert coder are shown. While rates of exact agreement are often above 70% for the domains in Table 3, agreement in the kappa metric is sometimes not much above chance. Low kappa statistics here reveal that when there are only a few categories to choose from, and some categories are rarely selected by any rater, 70-75% agreement is not particularly impressive.
For the FFT protocol, where 90% of the time raters were adjudicating between just one of two middle categories, basic and proficient, exact agreement by two local raters ranged from only 47.3% to 65.8% across domains, and simple kappas ranged from .05 to .28 (or .21-.45 quadratic weighted). However, reliability improves substantially for "back-scored" videos, where one of the raters was more extensively trained; the simple kappas jump to .39-.48, with about 70% exact agreement. Other protocols show similarly large increases in reliability when expert raters are used on some domains.
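The gap between exact agreement and kappa can be made concrete with a small sketch of Cohen's kappa in its simple and quadratic-weighted forms. The function name and the example ratings are hypothetical and are not MET data; the example is constructed so that both raters use the two middle categories almost exclusively.

```python
def cohen_kappa(rater_a, rater_b, categories, weights=None):
    """Cohen's kappa; pass weights="quadratic" for quadratic weighting."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Joint proportion table of the two raters' scores.
    table = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        table[index[a]][index[b]] += 1 / n
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weights == "quadratic":
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    observed = sum(disagreement(i, j) * table[i][j]
                   for i in range(k) for j in range(k))
    expected = sum(disagreement(i, j) * row[i] * col[j]
                   for i in range(k) for j in range(k))
    if expected == 0:
        return float("nan")  # both raters used one identical category
    return 1 - observed / expected


# Made-up ratings: 84% exact agreement, nearly all scores in categories 2 and 3.
a = [2] * 82 + [2] * 8 + [3] * 8 + [3] * 2
b = [2] * 82 + [3] * 8 + [2] * 8 + [3] * 2
simple = cohen_kappa(a, b, [1, 2, 3, 4])
quadratic = cohen_kappa(a, b, [1, 2, 3, 4], weights="quadratic")
```

In this constructed example, exact agreement is 84% but the simple kappa is only about 0.11, because agreement expected by chance is already 82% when both raters choose category 2 roughly 90% of the time.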
Overall, both the statistics in Table 3 and results published elsewhere show that in practice the reliability of classroom observations is quite variable, depends on adequate training and monitoring, and at the low end is problematic (Cash et al., 2012; Cohen & Goldhaber, 2016; Gitomer et al., 2014; Kane et al., 2012). White's (2018) analyses of the UTQ data suggest that current standards for rater accuracy and consistency may be too low. However, it would not be appropriate to label these protocols simply as "unreliable," as levels of agreement among expert raters are well above chance/independence (see quadratic weighted kappas) despite the inherent complexity of the phenomena being rated.

Identifying the Teachers' Own Contribution to Instruction
Do global observation protocols primarily reflect the teachers' own contribution to instruction, or are they heavily impacted by features of the learning environment beyond their control? Given their widespread use in evaluation, ideally, global protocols would carefully distinguish between "teacher moves," the teachers' own contribution to instruction, and the enacted quality of instruction, which is influenced by social norms and processes beyond the teachers' control. Consider for example the FFT sub-domain, "creating an environment of respect and rapport." Rubric examples for the proficient category include "teacher greets students by name as they enter the class or during the lesson," which is more obviously a teacher move, as well as "students attend fully to what the teacher is saying," which is less obviously related to teacher moves. A focus on the enacted quality of instruction may be desirable in assessing instruction and instructional growth toward a target level in an applied setting but may hinder causal research on teacher effectiveness.
To provide indirect evidence on how sensitive the MET protocols might have been to influences of social norms and processes beyond the teachers' control, Table 4 reports regression models showing the association between compositional features of the classroom and protocol scores. As in Campbell and Ronfeldt (2018), we find that classroom achievement composition, racial composition, percentage free lunch, and even percentage male are associated with overall FFT scores (e.g. a coefficient of .091 for class mean achievement in our analyses is the same to two significant digits as estimated by Campbell and Ronfeldt). Classroom composition measures are jointly significant predictors of FFT and CLASS at the 99% confidence level, but this is not the case for MQI or PLATO. Yet, examining the change in R² between the models with and without classroom controls, the additional explanatory power from including the rich classroom composition measures is quite small, at most 0.003. Considering the magnitude of estimated effects of average classroom initial achievement, teachers who have 1 standard deviation higher average classroom achievement will be rated 0.09 of a standard deviation higher on FFT and 0.07 on CLASS. It remains an open question whether effects of this magnitude would substantially contaminate teacher evaluation, evaluations of curricular reform, etc., but the finding does raise a note of caution, particularly when comparing teachers with very different student compositions in high-stakes settings. We also note that the correlations with classroom composition could reflect teacher adaptation, and thus a potentially productive aspect of teaching, though we cannot distinguish a measurement problem from adaptation in these protocols.
Notes: *** denotes significance at the 1% level, ** at the 5% level and * at the 10% level. Standard errors are clustered at the teacher level. Regressions also control for indicators for year, grade and whether the lesson was an ELA lesson in the case of CLASS and FFT. a The dependent variable for these regressions is an average of the scores for each protocol's components. b These variables, starting with black students and ending with FRPL students, represent proportions at the classroom level. c Within the MQI instrument, HOL refers to a rating of the whole lesson while Non-HOL is at the segment level. (Each lesson was divided into 7.5-minute segments.) The MQI components Classroom Work Connected to Mathematics and Explicitness and Thoroughness are excluded since they greatly reduce the sample.
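The comparison of explanatory power with and without classroom composition controls can be sketched as a difference in R² between two nested least-squares models. This is a simplified illustration with hypothetical names: it omits the teacher-clustered standard errors and the joint significance tests reported in Table 4, and the split of regressors into `base` and `controls` is an assumption about how the models are nested.

```python
import numpy as np


def r_squared(y, X):
    """R-squared from an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    centered = y - y.mean()
    return 1 - (residuals @ residuals) / (centered @ centered)


def delta_r_squared(y, base, controls):
    """Gain in R-squared from adding `controls` to a model with `base`."""
    full = np.column_stack([base, controls])
    return r_squared(y, full) - r_squared(y, base)
```

Here `y` would hold standardized protocol scores, `base` the year, grade, and subject indicators, and `controls` the classroom composition measures. A very small value from `delta_r_squared` corresponds to the pattern described above, where composition is statistically related to scores yet adds little explanatory power.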

A Focus on a Continuum of Effective Practice
Are existing global observational protocols only effective at analyzing instruction along a continuum of effective practice? While a focus on effective practice aids use in evaluation, it may hinder basic research on instruction and learning, which involves tradeoffs in terms of how time is spent and in emphasis. In contrast, many fine-grained approaches to classroom observation, as well as measures of assignment quality, measure instruction more agnostically. For example, in Nystrand and Gamoran's program of research on ELA instruction, classroom time-use is exhaustively coded, but no a priori judgement is made about the most appropriate ratio of, say, small group work to whole-class instruction (Nystrand & Gamoran, 1997). Likewise, assignment quality protocols like the Intellectual Demand Assignment Protocol do not privilege particular teaching practices, or do so less inherently (Joyce et al., 2018; Wenzel et al., 2002). Thus, in such systems, it is eminently possible to detect trade-offs in instructional practice (as the case of time-use exemplifies).
To investigate the potential difficulty in detecting instructional trade-offs, we examined the covariance matrix of the sub-domains of the MET protocols. At an aggregate level across the Year 1 observations as a whole, every single one of the pair-wise correlations between sub-domains within the protocols is positive, suggesting that teachers who are effective in one domain are generally effective in other domains. 10 However, at the level of a given observation, the protocols may in fact detect trade-offs, with a teacher scoring low on one domain but high on other(s). Table 5 shows that across all sub-domains, instances (proportions) of pair-wise negative correlations within class sections (a specific teacher and group of students) ranged from .08 (FFT) to .17 (CLASS). Interestingly, for MQI, this is somewhat more likely to occur in segment-level scoring than in holistic scoring. 11 Although there are several instances in which particular sub-domains are highly unlikely to negatively co-vary with other domains of instructional practice (e.g. those listed in the low column), there are also a number of other domains that routinely, if not typically, negatively co-vary with other domains (e.g. CLASS's Behavior Management domain or PLATO's Intellectual Challenge). Thus, we conclude from this analysis that while detecting trade-offs in instructional practice is clearly not a strength of these protocols (in particular FFT), it does in fact occur in practice.
Notes: a For this analysis, a correlation is considered negative if it is < −0.2. b The average proportion across components pairs a single component with every other component from that observation protocol and takes the average of the proportion of negative correlations across pairs. c Raters performed scoring using the MQI protocol on segments of a lesson (Non-HOL) but also gave a score to the whole lesson (HOL).

Discussion
While we have focused on posing and answering questions about the limitations of global protocols for teacher observation, it is important to remember the advantages of such tools, which are used in many districts to measure teacher effectiveness. An underlying goal of global observational protocols is to organize and communicate best practices for teachers and those who support them. As such, some of the characteristics we have highlighted, like their comprehensive nature and focus on a continuum of effective practice, are logical design features and give the protocols a wide array of instructional improvement uses. These features may also make the protocols especially useful for some research purposes, helping to move educational research beyond studies of achievement alone to create a richer understanding of opportunity to learn. For example, we have used the MET protocols to document the distribution of opportunity to learn across schools. How disparate is instruction in different schools, and are school-to-school differences in instruction associated with students' family background and other compositional features of schools? Comprehensive measures appropriately rooted in the best practices literature are well suited to answering those questions.
The utility of global observational protocols must also be understood in the context of the use of standardized test score data to identify teacher effectiveness. Because simple observable characteristics like degree attainment, certification and even experience do not adequately capture teacher effectiveness, the literature has focused on a fairly low-cost alternative (at least in a regime with annual student testing) of value-added based measures of teacher effectiveness (Gamoran, 2012; Kane et al., 2012). Value-added measures have several well-known limitations (Koedel et al., 2015; Jackson et al., 2014; Stacy et al., 2018), including concerns about imprecision and that they do not provide useful information to help teachers improve their teaching. Combining insights of value-added measures with global observation protocols has the potential to help address both of these concerns and is a practice that is increasingly being adopted in districts. In fact, among the important insights of the MET study was that combining multiple measures of teaching practice, including global observation protocol scores, provides more stable estimates of teacher effectiveness than value-added measures alone (Cantrell & Kane, 2013; Mihaly et al., 2013). The MET study also revealed important insights for increasing reliability of ratings, such as having more than 1 observer per teacher, observing more than 1 lesson and supplementing full lesson observations with short observations (Cantrell & Kane, 2013).
Yet in this study, we sought to document some of the limitations that have emerged in our analysis of the global protocols as used in MET, particularly in their potential to help teachers improve practice and for researchers to answer important questions about effective teaching practice. We find that global protocols provide only imprecise discriminations in lesson quality; some of the distinctions are so rough they may only be useful for guiding instructional reform for a small minority of teachers. In other cases, including examples in MQI and CLASS, greater discriminatory power emerges.
Each of the protocols used in MET is composed of multiple sub-domains, such that it is possible in theory to identify teachers strong in one area, but not in others, and to provide feedback on specific areas that need improvement. However, analysis of the covariance structure of the subdomain scores finds that they are not as orthogonal in practice as they appear to be in their construction. It is not clear whether this phenomenon reflects true underlying similarity in teacher competence across domains, or some issue in measurement. If the latter, then this feature limits the protocols' use in formative feedback.
We also summarize a finding evident in analyses of the MET rating process published in the Observation Measures Report; the reliability of the protocol is highly dependent on training and monitoring, and highly variable under the conditions of the MET study. At the low end, levels of agreement are only slightly better than chance. A further concern was that global observation protocols might have difficulty separating the teachers' own contribution to instruction from what students bring to class at the start of the year. This concern seems evident at times in the construction of the protocols themselves, and there is indirect evidence of this possibility in Campbell and Ronfeldt (2018), Steinberg and Garrett (2016), and our own replication of Campbell and Ronfeldt's analysis of the relationship between classroom composition and FFT scores. However, when we examined the larger set of protocols, we found that in fact classroom composition is not always related to protocol scores, and certainly not in a predictable fashion. Thus, in our view, it is possible to design global protocols that adequately capture teachers' own contribution to instruction.
Finally, we raised the concern that many global observation protocols focus only on a continuum of effective practice and cannot readily detect tradeoffs in instructional emphasis that occur as teachers adapt to students. This seems highly evident in the construction of the protocols. However, our analyses searching for examples of instructional trade-offs occurring in the data found rates of negatively correlated sub-domain scores that suggest the protocols can at times document tradeoffs, though it is not a strength of these protocols.
These limitations may combine to make many inferences about important instructional processes difficult. For instance, in our work with the MET data (Aucejo et al., 2019), we studied how the benefits of a given practice vary with the composition of the classroom and how teachers adapt to classroom compositions by adjusting their practice. One of our original intentions was to see if these adaptations in themselves might be a measure of teacher effectiveness (Corno, 2008; Nurmi et al., 2013). Our hypotheses implicitly assume that teachers face tradeoffs in how they spend their time in the classroom, and in choice of curriculum and pedagogy. We anticipated, in theory based on descriptions of subdomains in the different protocols, that underlying aspects of teaching practice, such as student-centered approaches, would be common across protocols and thus separable from other aspects of practice. In reality, we found that subdomains within protocols were not as separable as might be hoped for in examining instructional tradeoffs. We also anticipated being able to exploit multiple measures of teaching practice across multiple years to study teacher adaptivity, but found little systematic adaptation, perhaps because many of the measures confound teacher and classroom moves such that it was not really possible to identify teacher adaptations. Ultimately, while we made some useful progress in understanding how instructional effectiveness is moderated by classroom composition and elucidated important associated implications for accountability systems, we gleaned from our experience that it is simply not possible to adequately test many important hypotheses about instruction, such as the tradeoffs teachers face in adapting instruction to diverse student needs, with many global protocols.

Policy Implications of Unreliability and other Global Protocol Limitations
In Rothstein and Mathis's (2013) response to final reports prepared by the Measures of Effective Teaching Study researchers (as opposed to researchers later conducting secondary data analysis), they argue that findings from MET on the reliability of observational methods and on the relationship between observational scores and value-added, "say little about how best to conduct teacher evaluations in the real world." In this analysis, we have taken a more expansive view of observational protocols, which can be used for evaluation, but also professional development and research purposes.
Depending on the use case, some of the concerns we have outlined here present more serious policy implications than others. Problems of sub-domain independence (limitation #2) and focusing only on a continuum of effective practice (#5) are grave concerns for research. In contrast, imprecise discrimination in lesson quality (#1) is a major problem for use in professional development activities. Limitations 1, 3, & 4 are all potentially problematic for use in evaluation, although limitations 1 and 3 seem most serious in these data. However, at this time we believe that educational professionals setting and implementing policy should not make decisions about the role of observations in teacher evaluation on the basis of these limitations alone, because the system-level impacts of teacher evaluations at the school level and beyond are such critical factors.
Although system-level evidence comparing, for example, districts placing high-emphasis on teacher observation vs. low-emphasis is not available, research has evaluated principals' use of observation data more broadly. As part of a study of principals' data use for human capital decision making in six districts, Goldring et al. (2015) report principals find "numerous productive uses [of observational tools] for decision making in their schools." Cannata et al. (2017), analyzing the six-district data along with Charter Management Organization administrators, report that 70% of principals utilized teacher observation data in making hiring decisions, while 60% reported teacher observations were "very important" in making hiring decisions. Yet, some principals question the reliability of observational data (Cannata et al., 2017, p. 210), and overall, studies find principals exercise much discretion in carrying out teacher observations as part of teacher evaluations, both in the number, duration, and formality of the evaluations and the extent to which the observations generate critical performance feedback (Cohen et al., 2019;Donaldson & Woulfin, 2018).
Even if teachers are primarily allocated to middle evaluation categories, and unreliably so, the observational frameworks themselves may still focus teacher attention and reflection on appropriate domains of instruction or enhance professionalization by providing a shared pedagogical language. For example, the principals in Cohen et al.'s study reported "… using the observation rubrics as ongoing frameworks for high-quality practice and useful tools for promoting more formative conversations about instructional improvement" (2019, p. 20). That is to say, unreliability does not appear to completely preclude positive impacts. Nor do certain forms of bias necessarily reduce the effectiveness of an evaluation system. Harris et al. (2013) show that school principals' valuing of teachers' sociability and organizational contributions to the school does affect principal ratings, but the ultimate impact of this "bias" on school effectiveness is not easy to predict since it involves factors school leaders understand to be important to school functioning. Kelly (2012) argued that if teacher evaluation systems were to be implemented in such a way that large numbers of teachers were erroneously labeled as failures (as was the case with school accountability labels in the 2000s), this would be a policy disaster that would erode teacher motivation and commitment. However, as shown in Table 1, across a wide array of protocols, most teachers end up rated in the middle, and this appears to be the case in evaluative use as well. Kraft and Gilmour (2017) report that in many states less than 1% of teachers are rated as unsatisfactory, although that finding references overall ratings, not the observation component alone. Taking into consideration these low rates of negative evaluation, and existing evidence on use value, we cannot say that the major limitations outlined here preclude effective use in teacher evaluation.
Would incremental improvement in global protocols help address the limitations outlined here, further improving their use value? On this question we are more pessimistic. In the MET data we do find variation across protocols, suggesting some are more prone to specific limitations than others. Yet, all of the protocols suffer from each of these limitations to a greater or lesser extent, and we hypothesize that the overall quality of measurement may stem inherently from the ambitious, comprehensive goal of these protocols: to rate the entirety of a teacher's instructional effort. Thus, we conclude with consideration of an alternative to global protocols. We argue that the limitations of global observation protocols should no longer be accepted so readily, because fine-grained measures are emerging as an alternative. These tools reveal the limitations of global protocols in especially sharp relief. Moreover, unless a compelling alternative exists, many educational professionals and researchers may not be much persuaded to address the measurement limitations described here, incrementally or otherwise.

Fine-Grained Measures as an Alternative to Global Protocols
In studies of alcohol use, bold social scientists have pioneered the use of breathalyzers to collect fine-grained, occasion-specific measures of alcohol consumption as an alternative to traditional self-reported survey measures (Beirness et al., 2004;Smith et al., 2001;Wells et al., 1997). 12 Global observation protocols now give researchers and practitioners an occasion-specific view of classroom instruction, but they are not yet fine-grained. Fine-grained observational systems record and carefully analyze individual utterances, questions, turns at talk, etc., offering the potential for exceptionally reliable and precise teacher assessment and feedback. Fine-grained measures also offer the possibility of a more fundamental shift in the quality of information gleaned from classroom observation featuring: greater independence in assessing individual components of instruction; greater ability to identify teachers' own contribution to instruction; and an agnostic coding of instruction better suited to understanding teacher adaptation and change.
Historically, fine-grained measures of instruction, observational or otherwise, have been critical in documenting important basic features of American schooling, such as the low prevalence of genuine discussions in American classrooms (Nystrand & Gamoran, 1997), the wide variability in content and test standards in the US (Porter et al., 2011), or more recently, the content of texts read by diverse students (Northrop et al., 2019).
Yet, by their very nature, fine-grained measures tend to be difficult and expensive to collect/implement. Thus, in the past, due to their labor-intensive nature, such systems have been primarily used in research settings (e.g. Gamoran & Kelly, 2003;Howe et al., 2019;Murphy et al., 2009;Taylor et al., 2003) and in pre-service teacher preparation (e.g. Caughlan et al., 2013;Kucan, 2009).
Fine-grained systems also tend to narrow the focus of inquiry and inference. For example, while Nystrand and Gamoran's system of observation provided time summary statistics on more than a dozen basic classroom instructional formats (e.g. lecture, various forms of small group work, etc.), analyses of classroom discourse focused on only a few basic features such as authentic questions, uptake, and cognitive level. Indeed, as Kachur and Prendergast's (1997) analysis in Opening Dialogue indicates, a class that seems non-dialogic based on Nystrand's primary indicators may in fact have an overall learning environment that takes student ideas seriously and generates high levels of engagement. In contrast, the global observational protocols in use today are exceptionally comprehensive. Some protocols even include elements of curricular planning that go beyond lesson observation (e.g. FFT). Moreover, beyond the protocol rubrics, documentation, and training materials, they elicit a qualitative, nuanced appraisal that additionally draws on the expert rater's internal frame of reference, memory, and training.
Fine-grained measures of instruction are thus promising but not fully tested. Automated methods under development by teams of educational researchers and computer scientists may soon overcome much of the inherent difficulty and expense associated with human observation and coding, which is a crucial step in making fine-grained measurement more widely available to researchers and practitioners (Wang et al., 2014). For example, Kelly et al. (2018) demonstrated that it is possible to automatically detect and estimate the proportion of authentic questions in a class session with a reliability sufficient to complement or even replace human coding in research efforts. That result was obtained under technical requirements and constraints that preface widespread use; only a teacher mic was used, without cameras or individually mic'ing students (see full discussion in D'Mello et al., 2015). More recently, these results have been replicated with teachers collecting data autonomously, without the need for research staff present (Stone et al., 2019). At the same time, experiments with a range of recording systems showed that the fidelity of the audio recording itself is important and must be evaluated in any automated measurement system; low or even medium quality audio may not yield sufficiently reliable estimates of classroom discourse properties.
Overall, the potential for widespread use of automated, fine-grained measures of instruction may mean that instead of incremental improvement of existing global protocols, researchers should pursue entirely new approaches. Yet, as promising as the automated systems sound, it is reasonable to wonder what might be sacrificed by a focus on more specific, discrete aspects of instruction. Even if fine-grained measures are inherently more precise and reliable, might they miss the forest for the trees, presenting a quantitatively accurate but qualitatively misleading portrait? Just as we have interrogated global protocols in this analysis, researchers must provide balanced evaluations of fine-grained measures of instruction that take such possible limitations seriously and validate them on the many dimensions that affect robust use.

SPECIAL ISSUE Policies and Practices of Promise in Teacher Evaluation
education policy analysis archives, Volume 28, Number 62, April 13, 2020, ISSN 1068-2341. Readers are free to copy, display, distribute, and adapt this article, as long as the work is attributed to the author(s) and Education Policy Analysis Archives, the changes are identified, and the same license applies to the derivative work. More details of this Creative Commons license are available at https://creativecommons.