Putting Teacher Evaluation Systems on the Map: An Overview of States’ Teacher Evaluation Systems Post–Every Student Succeeds Act

The Every Student Succeeds Act (ESSA) loosened the federal policy grip over states’ teacher accountability systems. We present information, collected via surveys sent to state department of education personnel, about all states’ teacher evaluation systems post–ESSA, while also highlighting differences before and after ESSA. We found that states have decreased their use of growth or value-added models (VAMs) within their teacher evaluation systems. In addition, many states are offering more alternatives, besides test score growth, for measuring the relationships between student achievement and teacher effectiveness. State teacher evaluation plans also contain more language supporting formative teacher feedback. States are also allowing districts to develop and implement more unique teacher evaluation systems, while acknowledging the challenges of supporting varied systems and the resulting incomparability of data across schools and districts.


The Policy Topography
Six years before the publication of this article, Collins and Amrein-Beardsley (2014) researched and presented an overview of states' teacher evaluation systems throughout the US after the passage of Race to the Top, a program used to incentivize states into reforming their teacher evaluation systems, primarily via states' consequential uses of data that linked teacher performance to their students' test scores (2011, with data collected in 2012). This descriptive study is an update in the wake of the federal government passing the Every Student Succeeds Act (ESSA, 2016), which eliminated much of the federal role in enforcing test-based accountability across states' teacher evaluation systems. As stated in Ross and Walsh's recent NCTQ report (2019): The ESSA indicated that states would have more freedom to alter their teacher evaluation policies while (re)embracing more local control (Klein, 2016). However, the rhetoric surrounding ESSA may now be at odds with the current course of teacher evaluation development, in which states have already invested significant financial and human resources developing teacher evaluation systems based on previous federal incentives (Jones, Khalil, & Dixon, 2017). In other words, despite the intention of ESSA, some states may be staying the prior course for a multitude of reasons, which may be as varied as the states themselves (Slotnik, Bugler, & Liang, 2016). The specifics of ESSA gave states more freedom to interpret federally mandated concepts, such as including quantitative or test-based "data on student growth…as a significant factor" of their teacher evaluation systems (e.g., using growth or value-added models, henceforth referred to as "VAMs"; USDOE, 2012). ESSA also allowed states and districts to develop homegrown teacher evaluation systems that used alternative methods and measures to evaluate and attribute student growth to teachers and their effects.
However, it is unclear whether states are, in practice, reducing the use of VAMs in teacher evaluation systems or continuing to use VAM output, combined with other measures, in consequential ways. It is also unclear whether states are using the new-found flexibility provided by ESSA to ameliorate what many have argued are the harmful side effects of VAM use (see, for example, Education Week, 2015) and the harmful effects of educational accountability that also characterized NCLB.
A study by the National Council on Teacher Quality (NCTQ), for example, indicated that states are not making huge changes post-ESSA (Walsh, Joseph, Lakis, & Lubell, 2017). Researchers in that study used handbooks, guidelines, state websites, and references to legislation to assess such changes. For this study, we collected survey and interview data to investigate how and to what extent states have changed the purposes of, as well as the actual components of, their teacher evaluation systems pre- and post-ESSA; the degree to which states are, in practice, reducing the use of VAMs in teacher evaluation systems; and the degree to which states are actually using VAMs in consequential ways.

Re-Surveying the Terrain
The purpose of this article, accordingly, is to provide an updated overview of all states' teacher evaluation systems following the passage of ESSA (2016), and to also include insights into how state department of education personnel view the strengths and weaknesses of their new and re-reformed teacher evaluation systems. Our two-fold objectives for this study draw strength from providing both an outside view (i.e., a summary of state plans post-ESSA) and an inside view (i.e., an aggregated analysis of common perceptions from the personnel who created and oversee states' evaluation systems).
We collected the same general data as in Collins and Amrein-Beardsley's (2014) prior study, but we asked refined questions to better match the current context. For example, in the earlier study (Collins & Amrein-Beardsley, 2014), VAMs and the high-stakes consequences tied to teacher evaluation systems that relied heavily on VAM output dominated the discourse around states' teacher evaluation systems. However, because ESSA allowed states more leniency over their teacher evaluation systems, we sought more holistic information in this study about states' teacher evaluation measures, including but not limited to VAMs.
We present findings in visual (e.g., a series of maps) and raw versions (e.g., a table displaying data on each state's current teacher evaluation measures) so readers can directly access states' information. Comparable data before and after ESSA are also presented as a series illustrating changes over time, including a table detailing how certain features of each state's teacher evaluation system have changed post-ESSA. Prior to presenting findings, though, it is important to review the relevant literature used to both situate and frame this study.

Relevant Literature
With the passage of NCLB (2002), the early 2000s throughout the US marked a new era in educational accountability policies, with federal policies increasingly promoting accountability-based systems that held students, teachers, and schools responsible for improved student achievement results. Some research indicated that teachers affected student performance and that teacher performance differed within schools (Rivkin, Hanushek, & Kain, 2005; Rockoff, 2004). Despite this, most teacher evaluation systems, based primarily on principal observation, indicated that almost all teachers received satisfactory results (Weisberg, Sexton, Mulhern, & Keeling, 2009). Hence, the theory of change was that by holding schools, teachers, and students accountable for meeting higher standards, as measured by student performance on standardized assessments, administrators would supervise public schools better, teachers would teach better, and students would take their learning more seriously. As a result, students would learn and achieve, or rather progress more, particularly in the lowest performing schools.
However, many researchers now agree that NCLB did not meet its intended effects (100% student mastery of higher standards by 2014). More specifically, research suggests that since the passage of NCLB, many students, especially those in the country's lowest performing schools, have been increasingly susceptible to unprofessional test-based practices including teaching to the tests (not to be confused with teaching to the standards); teaching using scripted and prefabricated curricula to ensure that what is taught aligns with what is tested; teaching test preparation, test practice, and test rehearsals instead of curricular content; teaching while hyper-emphasizing the rote memorization of facts and basic skills likely to be on tests; narrowing the curriculum to match the content and concept areas tested; and, relatedly, teaching the tested subject areas that "count" (i.e., mathematics and reading/language arts) while marginalizing or even eliminating other curricular areas and activities that do not "count" on high-stakes tests (i.e., social studies, sciences, art, music, physical education, library sciences, and recess; see, for example, Amrein & Berliner, 2002; Haney, 2002; Nichols & Berliner, 2007). Also, typically low-scoring students, including inordinate numbers of non-English proficient and special education students, have been purged (i.e., expelled, suspended, or simply excused) from school during test administrations to keep them from participating and pulling test scores down. Students have also been counseled out of school, convinced to explore other options (e.g., alternative, "last chance," or adult education schools), or persuaded to strive for General Education Diplomas (GEDs) instead of traditional high school certificates. Eliminating undesirable students eliminates their scores, that is, the scores that, if included or preserved, would pull composite test scores down (see, for example, Amrein-Beardsley & Berliner, 2002; Haney, 2000; Nichols & Berliner, 2007).
Students whom educators have deemed the least likely to post high enough test scores, the same students as mentioned above, have also been academically shunned. This has occurred particularly during the weeks leading up to high-stakes tests, as educators who are being held accountable often perceive these students as the most hopeless and, hence, the most undesirable when test scores punitively matter. Undesirable students have also been known to be retained in grade or credit hours to keep them from being eligible for high-stakes testing cycles (e.g., by thwarting progression in high school, whereby sophomores/juniors might not be eligible to test in their sophomore/junior year; see, for example, Haney, 2000). In some cases, undesirable students have altogether disappeared from school rosters when administrators have created rosters and registered students for high-stakes testing purposes (see also Amrein & Berliner, 2002; Nichols & Berliner, 2007). Otherwise, underperforming students have been wrongly moved into exempt categories (e.g., special education and English Language Learner [ELL] categories), as misclassifying these students will prevent them from dragging down the performance of the teachers or the schools as a whole (Amrein & Berliner, 2002; Haney, 2000). Recognizing this as an issue, the federal government started mandating minimum rates of test participation (NCLB, 2002), but it seems such practices are still occurring.
Conversely, educators have focused inordinately on the students who are on the edge of passing high-stakes tests. The belief here is that if educators teach to the test well enough, these students just might clear the cut scores and pass, thus helping to bump composite test scores upwards, even if ever so slightly. Educators have used "selective seating" practices in which the students expected to post high scores are seated among the students expected to post low scores, covertly encouraging cheating. Educators have also overtly cheated, for example, by erasing and changing students' incorrect answers to correct ones, explicitly giving students correct answers, persuading students to revisit incorrect answers, and the like. Such cheating instances have been widely publicized, for example, in Atlanta and Washington D.C. (Perry & Vogell, 2009; Rhee, 2011) as well as in Arizona (Amrein-Beardsley, Berliner, & Rideau, 2010; see also Toppo, Amos, Gillum, & Upton, 2011).
Likewise, some argue that these unintended effects (as well as others; see also Darling-Hammond, 2007; Figlio & Getzler, 2006) may have outweighed some of the positive effects noted, including but not limited to an increased focus on measuring and monitoring the gaps between marginalized and non-marginalized student populations (see, for example, Grodsky, Warren, & Kalogrides, 2009; Koretz, 2017; Nichols & Berliner, 2007). The results, of course, are controversial, with others arguing that the positive effects of the NCLB era outweighed the negative effects (Dee & Wyckoff, 2015; Winters, Trivitt, & Greene, 2010; see also Hanushek & Raymond, 2005; Stotsky, Bradley, & Warren, 2005).
Regardless, after collectively acknowledging some of the issues with NCLB, the federal government used federal funds again to entice states and districts to move in new directions. Consequently, the federal government (e.g., via Race to the Top, 2011, and the aforementioned NCLB waivers [USDOE, 2014]¹) incentivized states to adopt new and improved tests (e.g., those developed by the Partnership for Assessment of Readiness for College and Careers [PARCC] or Smarter Balanced Assessment Consortium [SBAC]), to adopt and implement new and improved educational policies, and to use both (i.e., improved tests and improved test-based accountability policies) to hold teachers accountable for their students' growth in learning and achievement over time. The federal government began advocating the use of test results not only to measure students' growth in learning over time, but also to measure teachers' causal impacts on students' growth in learning over time.
Soon after Race to the Top (2011) was underway, 40 states and the District of Columbia were using, piloting, or developing some type of VAM, again, as federally incentivized (Collins & Amrein-Beardsley, 2014). The tests required under NCLB were used across states for measuring teacher-level value-added. The most common open-source VAM was the student growth percentiles (SGP) model (Betebenner, 2009, 2011), with multiple states adopting or endorsing it for teachers statewide (i.e., Arizona, Colorado, Georgia, Massachusetts, and Washington). The SGP model compares each student's growth from one year to the next with that of similar peers.² The most common proprietary model was the Education Value-Added Assessment System (EVAAS; Sanders & Horn, 1994; Sanders, Wright, Rivers, & Leandro, 2009; SAS Institute Inc., n.d.), with five states adopting it statewide (i.e., North Carolina, Ohio, Pennsylvania, South Carolina, and Tennessee). Unlike the SGP model, the EVAAS model is a proprietary statistical model with an unknown algorithm for measuring the impact of teachers on student learning. The most common high-stakes consequences attached to systems that included VAM results were, but were not limited to, teacher tenure, termination, and teacher compensation or merit pay (Collins & Amrein-Beardsley, 2014).
While all teacher evaluation systems adopted and implemented at this time included at least one other indicator or measure of teacher effectiveness (i.e., systemic classroom observations of teachers), the primary focus across states was on the objective, assessment-based (and often VAM-based) components to "meaningfully differentiate [teacher] performance…including as a significant factor, data on student growth [in achievement over time] for all students" (USDOE, 2012). Some research supported the use of such teacher evaluation systems (Chetty, Friedman, & Rockoff, 2014a; Kane & Staiger, 2012). This strategy was written into federal policy and subsequently implemented across the nation, although some states (e.g., Florida, Louisiana, Nevada, New Mexico, New York, Tennessee, and Texas) valued or systemically weighted student growth (i.e., teachers' value-added) much more heavily in their systems than others (e.g., California, Connecticut, Vermont, Washington, and Wisconsin).
¹ It is important to also note here that the federal government granted states waivers from meeting No Child Left Behind (NCLB, 2002) goals for their students to reach 100% academic proficiency by 2014 if states also created and adopted stricter teacher evaluation systems based, at least in part, on VAMs (US Department of Education, 2014). Most states applied for these waivers, thereby shifting most states' teacher evaluation systems to their highest-accountability versions.
² The main differences between growth models and value-added models (VAMs) are how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, the student growth percentiles (SGP) model is more simply intended to measure the growth of similarly matched students to make relativistic comparisons about student growth over time, without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against or in reference to the growth levels of their peers, which de facto controls for these other variables. Thereafter, determinations are made in terms of whether students increase, maintain, or decrease in growth percentile rankings as compared to their academically similar peers. Accordingly, researchers refer to both models as generalized VAMs throughout the rest of this manuscript unless distinctions between growth models and VAMs are needed or required.
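The peer-referenced comparison described in note 2 can be illustrated with a toy calculation. The sketch below is not the SGP model itself (Betebenner's implementation is considerably more involved); it simply assumes that "academically similar peers" means students with an identical prior-year score and reports the share of those peers whom a student outscored in the current year. The function name, the student records, and the scores are all hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def growth_percentiles(students):
    """Toy peer-referenced growth percentile (an assumption, not the SGP model).

    students: list of (student_id, prior_score, current_score) tuples.
    Returns {student_id: percent of same-prior-score peers scoring strictly lower}.
    """
    # Group current-year scores by prior-year score (our stand-in for "similar peers").
    peers = defaultdict(list)
    for _, prior, current in students:
        peers[prior].append(current)
    for scores in peers.values():
        scores.sort()

    result = {}
    for student_id, prior, current in students:
        group = peers[prior]
        below = bisect_left(group, current)  # peers with a strictly lower current score
        result[student_id] = round(100 * below / len(group))
    return result


# Hypothetical records: two peer groups defined by prior-year score.
students = [
    ("a", 300, 310), ("b", 300, 320), ("c", 300, 330), ("d", 300, 340),
    ("e", 400, 390), ("f", 400, 420),
]
percentiles = growth_percentiles(students)
print(percentiles)
```

Note that student "e" has a higher current score than student "d" yet a lower growth percentile, because each student is ranked only against peers who started from the same place. No background variables enter the calculation, which mirrors the note's point that the peer comparison stands in for explicit statistical controls.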
Around this time, research on VAMs, especially in conjunction with teacher evaluation systems, increased heavily. VAMs, in the simplest of terms, classify teachers' effectiveness according to teachers' statistically measurable (and purportedly) causal impacts on their students' standardized test scores over time. While there is debate about the extent to which VAMs can be used to separate out a teacher's impact from other classroom-level factors (see, for example, Rothstein, 2009, 2010), the intent of VAMs is to help to identify teachers whose students outperform their projected levels of growth as effective or of "value-added" and teachers whose students fall short as the inverse (Sanders, 2003). Views on such assessment-based systems are controversial when attaching high-stakes consequences to such measures of teacher effectiveness (American Statistical Association). These controversial views led to court challenges to states' VAM-based teacher evaluation systems (i.e., in Florida, Louisiana, Nevada, New Mexico, New York, Tennessee, and Texas; see Education Week, 2015).³
Plaintiffs raised the following main criticisms of VAMs within teacher evaluation systems, namely that VAMs can be: (1) unreliable, whereby current research suggests that teachers classified as "effective" one year will have a 25%-59% chance of being classified as "ineffective" the next year, or vice versa, with other permutations possible (Chiang, McCullough, Lipscomb, & Gill, 2016; Martinez, Schweig, & Goldschmidt, 2016; Schochet & Chiang, 2013; Shaw & Bovaird, 2011; Yeh, 2013); (2) invalid, whereby very limited research evidence supports the claim that VAMs can be used to draw accurate inferences about the extents to which different teachers cause changes (i.e., add value) in collective groups of students' test performance over time (see, for example, Amrein-Beardsley, 2008; Braun, 2005, 2015; Hill, Kapitula, & Umland, 2011); (3) biased, whereby current research suggests that, almost regardless of the sophistication of the statistical controls used to block bias, VAM-based estimates sometimes present biased results, especially when relatively homogeneous sets of students (i.e., ELLs, gifted and special education students, free-or-reduced lunch eligible students) are non-randomly concentrated in schools and teachers' classrooms (Baker et al., 2010; Capitol Hill Briefing, 2011; Collins, 2014; Green, Baker, & Oluwole, 2012; Kappler Hewitt, 2015; Koedel, Mihaly, & Rockoff, 2015; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004; Newton, Darling-Hammond, Haertel, & Thomas, 2010; Rothstein & Mathis, 2013); (4) not transparent, with the main issue being that VAM-based estimates do not often make sense to those at the receiving ends of the estimates (e.g., teachers and principals) and, subsequently, these same groups are reportedly quite-to-very unlikely to use VAM-based output for formative purposes (see, for example, Eckert & Dabrowski, 2010; Gabriel & Lester, 2013; Goldring et al., 2015; Graue, Delaney, & Karch, 2013); and (5) unfair, with the fundamental issue being that states and districts can only produce VAM-based estimates for approximately 30-40% of all teachers, leaving the other 60-70% (which sometimes includes entire campuses of teachers) ineligible under comparable evaluation and accountability systems (Baker, Oluwole, & Green, 2013; Gabriel & Lester, 2013; Harris, 2011).
In light of the recent critical research and court cases regarding VAMs, observational systems used for similar teacher evaluation purposes, which were also deeply criticized and subsequently spurred some of the federal government's reforms (Weisberg et al., 2009; see also Kraft & Gilmour, 2017), are now even more common across states' new and re-reformed (i.e., post-ESSA, 2016) teacher evaluation systems (Ross & Walsh, 2019; Steinberg & Donaldson, 2019). They are still, however, also confronting their own sets of empirical issues. Such issues include but are not limited to whether the observational systems are psychometrically sound for such purposes, how output from observational systems might be biased by the supervisors observing teachers in practice, and how output might also be biased by contextual factors like the types of students with whom a teacher works, how a teacher's gender interplays with his/her students' gender, and other factors (Bailey, Bocala, Shakman, & Zweig, 2016; Geiger & Amrein-Beardsley, 2017; Steinberg & Garrett, 2016; Whitehurst, Chingos, & Lindquist, 2014). The same sorts of potential biases seem to hold true with student surveys, regardless of whether they are used to evaluate teachers in Pre-K or instructors in higher education, given selection biases.⁴ Nonetheless, the new freedom that ESSA (2016) has afforded states means they could be (and anecdotally are) moving away from such high-stakes and assessment-based accountability models, especially from those based primarily on VAMs. Ideal components of a teacher evaluation system would include standards-based teacher observations across the year, systems that provide timely formative feedback, multiple sources of evidence of student learning, and greater collaboration between teachers or between teachers and administrators (Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012).
Essentially, ideal components of a teacher evaluation system would reflect the latest standards of educational and psychological testing, meaning the results would be reliable, valid, fair, unbiased, and transparent (AERA, NCME, APA, 2014). However, policymakers also need to be wary of the unintended consequences caused by imposing new measures. The potential for unintended consequences is one reason that Darling-Hammond et al. (2012) recommend teacher evaluation systems that encourage greater collaboration between teachers or between teachers and administrators.
Accordingly, this study aims to uncover whether states are actually taking advantage of the purported flexibility within ESSA (2016) policy and to what extent, for example, by uncovering whether states are moving in new directions, away from such common-because-they-were-federally-incentivized models, and away from using VAMs as their primary teacher evaluation and accountability measures.

Methodology
We conducted a survey research study using an electronic survey along with phone interviews to contact non-respondents, to follow up for clarification, and for validation purposes. We engaged these methods to gather central and supplementary information about all states' restructured teacher evaluation systems post-ESSA. We collected all survey- and phone-based information from state department of education personnel directly. Some state department personnel referred us to pertinent state-level documents (e.g., state policies and other legislative pieces, as well as state ESSA plans) online. Additionally, for states that did not respond to survey invitations or phone calls, we evaluated ESSA plans and referred to state education department websites.
The four research questions that we examined for this study were: (1) What measures are being used by each state to evaluate teachers? (2) How have states' teacher evaluation systems changed following the adoption of ESSA? (3) What do state personnel see as the strengths and weaknesses of their post-ESSA teacher evaluation systems? (4) How have state personnel's perceptions of the strengths and weaknesses changed post-ESSA?

Participants
Study participants included state education personnel from every state and the District of Columbia, hereafter generally referred to in plural as states (n = 51), representing those most knowledgeable of each state's teacher evaluation system post-ESSA. To locate the most knowledgeable personnel to participate in this study, we first searched for state personnel online, looking at job titles relating to teacher evaluation, teacher quality, or teacher accountability. We then emailed or called to verify whether they were the best source of information for our study or whether we should contact a different source. In some cases, where we did not find appropriate job titles, we simply called the state department of education and asked whom we should contact. Contacts were provided a description of our study along with a description of the survey before choosing to participate, to ensure that we ultimately communicated with those who were the most knowledgeable.
Participants ultimately included leaders and directors of states' teacher quality departments, leadership divisions, evaluation offices, and accountability and assessment divisions. Of the 51 departments contacted, personnel representing 34 (67%) states responded to the online survey and personnel representing four (8%) states answered survey questions via phone interviews. Additionally, representatives from four (8%) state departments did not answer the questions specifically, so we referred to online resources instead. In sum, personnel from 42 (82%) state departments of education responded via survey, and for the other nine departments (18%), we captured states' missing information by reading publicly available state websites and states' ESSA plans. Accordingly, we indicate their sources of information by state (e.g., whether information was collected through personal contact or through state websites) within the findings presented.
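The participation counts above can be re-derived with a few lines of arithmetic. The sketch below takes the counts exactly as reported in the text; the category labels and helper name are ours, introduced only for illustration.

```python
TOTAL_STATES = 51  # 50 states plus the District of Columbia

# Counts as reported in the text; the labels are our own shorthand.
reported = {
    "online survey": 34,
    "phone interview": 4,
    "contacted, information taken from online resources": 4,
    "public websites and ESSA plans only": 9,
}


def share(count, total=TOTAL_STATES):
    """Express a count as a whole-number percentage of all states."""
    return round(100 * count / total)


for label, count in reported.items():
    print(f"{label}: {count} ({share(count)}%)")

# The first three categories make up the 42 departments reached directly.
direct_contact = sum(list(reported.values())[:3])
print(f"reached directly: {direct_contact} ({share(direct_contact)}%)")

# Every state should be accounted for exactly once.
all_accounted_for = sum(reported.values()) == TOTAL_STATES
print("all states accounted for:", all_accounted_for)
```

Run as written, the rounded shares match the percentages given in the paragraph, and the four categories together cover all 51 states.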

Survey Instrument
We developed the survey instrument used to collect state data over the course of three months in order to increase the validity, accuracy, and relevancy of the instrument, but also to increase the likelihood of states' participation. To develop the survey instrument, we developed overarching questions based on Collins and Amrein-Beardsley's (2014) study prior to ESSA. Thereafter, we developed additional questions given the aforementioned, expanded goals and objectives for this study.
Following guidelines for effectively conducting survey research studies (Kelley, Clark, Brown, & Sitzia, 2003), we first conducted content analysis with state department of education personnel within our own state and pilot tested the instrument with three other state personnel and teacher evaluation experts to ensure that the content and format of the survey were clear, comprehensive, and relevant given states' realities and expectations post-ESSA. During the pilot tests, we observed participants and asked them whether each question made sense and whether their responses were indeed the information we intended to gather via the survey; we also collected overall feedback on wording, length of survey, and other practical matters. For the states that participated via phone interview or for which we analyzed documents (e.g., states' post-ESSA teacher evaluation plans), we manually input data into the same survey instrument to allow for one primary database, which kept all data collected constant, consistent, and comparable. The full survey instrument that we validated and used for these purposes is available online.

Procedures
We distributed the survey instrument to all state personnel online via Qualtrics Survey Software (2019). As explained previously, data collection also consisted of making phone calls to state personnel to encourage the participation of non-respondents, to ask clarifying questions, to verify that nothing had changed from previous communications, and to ensure that states' data were accurate and representative of the most up-to-date teacher evaluation situations by state. Again, we entered the data collected via phone interviews into the same survey instrument, as if the person on the phone were completing the survey themselves.

Data Analyses
For the survey items that yielded quantitative information, we calculated frequencies and descriptive statistics. For the survey instrument items that yielded qualitative responses (e.g., items that solicited personnel opinions on the strengths and weaknesses of their teacher evaluation systems), we aggregated these data to protect the anonymity of the state responses. Once aggregated, we followed the methods and procedures outlined in Miles and Huberman (1994) using a sourcebook to "[track] out lawful and stable relationships among social phenomena based on the regularities and sequences that link these phenomena" (p. 174) during the processes of data reduction, data display, and drawing conclusions. Lastly, we used Tableau Software (2019) for constructing map visualizations of the descriptive data for ease in interpretation.
It should be noted here, though, that because state plans often change, some state-level information may have changed between data collection and publication. On the flipside, the reported and perceived strengths and weaknesses of states' systems from participating personnel may indicate the direction of said changes. Regardless, both of these points should be noted so that consumers do not interpret the forthcoming results as fixed.

Results: The National Landscape
The results section maps onto the aforementioned research questions, and within each section we present results in three ways: (1) as aggregate tables, (2) as series of maps, and (3) in prose. We chose to present the results in these ways because the purpose of this paper is to present as complete a picture as possible of the state of states' teacher evaluation systems within the constraints of a journal article. We understand that tables containing information from all 50 states and the District of Columbia would be unwieldy, so we designed the presentation of data in such a way that readers might have direct or immediate access to what we deemed to be the most important results (e.g., via the maps and prose forthcoming). However, we also created larger, searchable, and sortable tables with more in-depth, by-state results, which we uploaded online as a set of accessible and anonymous spreadsheets.

Research Question 1: States' Teacher Evaluation Measures
In this section, we break down the results of the survey by each teacher evaluation measure now being used by states including (1) VAMs (defined prior), (2) Teacher-Level Observations (used to purposefully examine teachers' teaching practices in context through systematic processes of data collection, analysis, and reflection; Bailey, 2001), (3) Student Surveys (used to systematically obtain students' opinions about different aspects of their teachers' attitudes, instruction, and pedagogical practices; Geiger & Amrein-Beardsley, 2017), and (4) Student Learning Objectives (SLOs; used to measure teachers' students' growth using one or more traditional [e.g., state-wide standardized tests] or non-traditional assessments [e.g., district benchmarks, school-based assessments, teacher and classroom-based measures]; see Lacireno-Paquet, Morgan, & Mello, 2014; USDOE, n.d., p. 1).⁵ For each of these measures, we provide a map illustrating which states adopted which of these measures post-ESSA (2016). This section concludes by presenting an anonymized link to a full table indicating each state's teacher evaluation measures.
Value-added models (VAMs). As stated previously, the state of states' continued uses of VAMs post-ESSA (2016) was unknown, given ESSA rolled back some test- and growth-based mandates for all states' teacher evaluation systems. Findings herein indicate that 15 states explicitly use or encourage state-wide use of VAMs (29%, 15/51), many of which offer VAMs as state-supported or endorsed options for districts that do not have the resources (e.g., budget or personnel hours) to develop a homegrown VAM or growth model. Twenty-two states explicitly do not use or encourage state-wide use of VAMs (43%, 22/51), and 14 states (27%, 14/51) report the use of "other" approaches regarding VAMs (see Figure 1). Of the roughly one-quarter of states claiming they now use or endorse "other" approaches, 10 states (20%, 10/51) reported that they had passed these choices on to districts in the name of local control (i.e., local educational authorities such as school districts can choose to use VAMs), two states reported that VAMs were now being used formatively or for only informative purposes (4%, 2/51), one state reported that its VAM was still in development (2%, 1/51), and one state's current situation in this regard remains unknown (2%, 1/51).
Examples of states that offer, but do not mandate, state-wide VAMs include Maine, which has two models from which local districts can choose to evaluate teacher performance. One model uses a VAM to measure student growth, and the other uses an SLO. Another state, Texas, emphasizes local control. Its department of education allows student growth to be measured in several ways, including SLOs, portfolios, district-level pre- and posttests, and VAMs in state-tested subjects.
Yet other states are still using VAMs, but they are using them in less traditional ways. For example, North Carolina uses and reports scores from the aforementioned EVAAS, but state personnel use the results to drive teacher professional development and no longer as a high-stakes teacher evaluation measure. In fact, in their ESSA plan, North Carolina recommends that student growth scores be discussed with teachers mid-year as a way of checking on progress towards instructional practice goals set at the beginning of the school year. The plan explicitly calls for EVAAS scores to be used to stimulate discussion as one of multiple measures of teacher effectiveness. Put differently, although North Carolina technically encourages the use of one VAM to evaluate its teachers, the state encourages VAMs' formative over summative uses, which was not nearly as prevalent prior to the passage of ESSA (2016; see more on this forthcoming; see also Figure 1).
Teacher-level observations. Teacher-level observations are also a dominant feature across states' current teacher evaluation systems, with 36 of 51 states (71%) reporting use. Six states (12%, 6/51), including Wyoming, do not report using teacher-level observations. These states may ultimately use teacher-level observations, given local control to select elements of their teacher evaluation plans; however, they do not explicitly indicate their use, as compared with the 36 states that indicated their widespread use. Additionally, five states (10%, 5/51) explicitly (e.g., via state-level policy) allow for local control in terms of using teacher-level observation systems (see Figure 2).
Of the 36 states in which teacher-level observations are encouraged, 18 (50%) use or encourage Danielson's Framework for Teaching observational system, or a modified version (Danielson, 2012; Danielson & McGreal, 2000), and 11 (31%) use or encourage the Marzano Causal Teacher Evaluation Model (Marzano & Toth, 2013).⁶ There is some overlap among states that use or encourage Danielson's Framework for Teaching and the Marzano Causal Teacher Evaluation Model, with eight of the 36 states (22%) either using or encouraging both of these models or others. For example, Alabama uses an observation framework based on a combination of its Alabama Quality Teaching Standards and the work of both Danielson and Marzano. Alaska allows local school districts to select from several major frameworks including but not limited to Danielson and Marzano.

⁶ Briefly, both of these models are based on a specific conceptualization of the elements of teaching. Danielson's Framework for Teaching conceptualizes teaching as a complex activity with four main responsibility domains: a) planning and preparation, b) classroom environments, c) instruction, and d) professional responsibilities. Within each of these domains, the activity of teaching is further broken down into 22 components with 76 subcomponents (Alvarez & Anderson-Ketchmark, 2011). Danielson's observational framework emphasizes collecting evidence based on these components, interpreting such evidence, and conducting professional conversations with teachers about the evidence (Danielson, 2012). The Marzano Causal Teacher Evaluation Model uses a similar framework based on four domains: a) classroom strategies and domains, b) planning and preparing, c) reflecting on teaching, and d) collegiality and professionalism. Within these domains, as in the Danielson Framework, teaching is broken down further, into 60 elements, with the majority falling under the umbrella of classroom strategies and domains (Marzano, 2012).
Other states encourage various observational systems or multiple observation systems including homegrown rubrics that are developed from a state model (8%, 3/36), outside rubrics aligned to a state rubric (8%, 3/36), and the National Institute for Excellence in Teaching's (NIET's) TAP System for Teacher and Student Advancement (11%, 4/36) (NIET, n.d.; see also, Barnett, Rinthapol, & Hudgens, 2014).

Figure 2. States that include observations as part of their teacher evaluation systems (2018).
Note: Thirty-six states use teacher observations (71%, 36/51), six states do not use teacher observations (12%, 6/51), five states report local control (10%, 5/51), and four are classified as "other" (8%, 4/51).

Student surveys. Student surveys of teachers are used much less frequently than VAMs and observations, but they are on the rise in terms of development, adoption, and implementation (Geiger & Amrein-Beardsley, in press). Indeed, 14 of 51 states (27%) reported using or encouraging the use of student surveys to evaluate their teachers, and one state (2%), Washington, is currently piloting a state-wide student survey system. While 16 of 51 states (31%) explicitly noted not using or encouraging student surveys, it is evident that these teacher evaluation measures are more common now than post-Race to the Top (2011; see also Kane & Staiger, 2012). Additionally, 13 of 51 states (25%) allow local control with regard to student survey systems, which can also take many forms. For example, the Colorado Department of Education neither specifies nor recommends specific student surveys; however, state statute requires that a student survey be a viable option for districts when evaluating their teachers. In other words, local educational authorities can decide whether or not to even use the measure. Arkansas, on the other hand, encourages the use of perceptual data from multiple stakeholders including students, but the formats via which these data are collected are left to local authorities to decide. As not all states clearly distinguish whether they use student surveys, seven of 51 states (14%) also remain unknown in this regard (see Figure 3).

Note: Fourteen states include student surveys (27%, 14/51), 16 states do not include student surveys (31%, 16/51), 13 states report local control (25%, 13/51), one state is classified as "other" (2%, 1/51), and seven states are unknown in this regard (14%, 7/51).

Student learning objectives (SLOs).
More than half of the states (28 of 51 states; 55%) use or encourage SLOs in their teacher evaluation systems with seven of 51 (14%) not explicitly using or encouraging SLOs statewide. Another nine of 51 (18%) use SLOs as a substitute for VAM data for teachers whose subject areas do not align with state tests (e.g., for primary grade and noncore subject area teachers). Three of 51 states (6%) report local control for this indicator including Texas, which encourages teachers to set goals for student learning but does not prescribe that local education agencies use SLOs specifically. Lastly, four of 51 states (8%) do not clearly state whether they use SLOs and are accordingly classified as "unknown" (see Figure 4).
Unlike teacher-level observation frameworks, which are relatively well-developed and have been in development and refinement for decades (Sloat, Amrein-Beardsley, & Sabo, 2017), SLOs do not appear to be nearly as well-developed, conventionally used, or established as the other teacher evaluation measures in play across states (see also USDOE, n.d.). For example, in Nebraska SLOs are officially encouraged, but their use is not yet widespread. In Nevada, teachers and their supervisors use tools to create Student Learning Goals (SLGs), but the processes by which these are created vary widely by teacher and supervisor. Both practices are akin to what other SLOs might involve or look like, but nowhere are SLGs differentiated from SLOs, despite their similarities. In Illinois, SLOs are the default teacher evaluation measure: if school districts cannot come to consensus on another so-called growth-based system, 50% of all teachers' overall evaluation scores rely upon their SLO data.

Note: Twenty-eight states include SLOs (55%, 28/51), seven states do not include SLOs (14%, 7/51), three states report local control (6%, 3/51), nine states use SLOs as a substitute for VAMs (18%, 9/51), and four states are unknown in this regard (8%, 4/51).
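The by-measure percentages reported in this section all share one denominator (51: the 50 states plus the District of Columbia). The following is a minimal sketch, not part of the original analysis, of how those shares can be recomputed and how each measure's categories can be checked to sum to 51. The counts are transcribed from the text and figure notes above; the category labels, the dictionary name `measures`, and the function name `shares` are illustrative assumptions.

```python
# Category counts transcribed from the text and figure notes above.
# Labels and names are illustrative, not the authors' coding scheme.
TOTAL = 51  # 50 states plus the District of Columbia

measures = {
    "VAMs": {"use": 15, "do not use": 22, "other": 14},
    "Observations": {"use": 36, "do not use": 6, "local control": 5, "other": 4},
    "Student surveys": {"use": 14, "do not use": 16, "local control": 13,
                        "other": 1, "unknown": 7},
    "SLOs": {"use": 28, "do not use": 7, "local control": 3,
             "substitute for VAMs": 9, "unknown": 4},
}


def shares(counts, total=TOTAL):
    """Return each category's share of `total` as a whole-number percentage.

    Raises ValueError if the categories do not account for every state.
    """
    if sum(counts.values()) != total:
        raise ValueError("category counts must sum to the total")
    return {name: round(100 * n / total) for name, n in counts.items()}


for measure, counts in measures.items():
    print(measure, shares(counts))
```

Because each share is rounded independently, the rounded percentages for a measure need not sum to exactly 100 even when the underlying counts sum to 51.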
While the preceding section presents the results for our first research question via maps, descriptive statistics, and summary paragraphs, we gathered more detailed information about states' teacher evaluation systems that can be found in Table 1. Again, this online anonymous table includes state-by-state information that is much more in-depth than the figures and text included above. This table includes information collected via the survey instrument such as VAM-specific legislation, types of assessments used to measure student growth, consequences attached to teacher evaluation measures, and percentage of overall teacher evaluation determined by student growth.

Research Question 2: How States' Teacher Evaluation Systems Have Changed Post-ESSA
The following section transitions from explaining the status of states' current teacher evaluation systems and measures to highlighting how states' systems may have changed since Collins and Amrein-Beardsley (2014) last collected data post-Race to the Top (2011). Again, the 2014 study collected information specifically regarding VAMs and VAM use. Therefore, the information included next includes only comparative data on states' VAM-related information, given no other information about states' teacher evaluation measures was collected in Collins and Amrein-Beardsley (2014).
Accordingly, and in order to compare the actual data from 2014 with the data from this study, we recreated maps from Collins and Amrein-Beardsley (2014) using the raw data available in that particular study, which we reclassified into more general bins for comparative purposes (i.e., to more easily compare the data, then and now; see Figure 5).

Note: The number of states using VAMs decreased from 21 to 15 (from 41% to 29% of states, which is a decrease of 29%). The number of states not using VAMs increased from seven to 22 (from 14% to 43% of states, which is an increase of 214%). The number of states reporting local control increased from three to 10 (from 6% to 20% of states, which is an increase of 233%). The number of states using VAMs only formatively increased from zero to three (from 0% to 6% of states). The number of states with VAMs in development decreased from 18 to one (from 35% to 2% of states, which is a decrease of 94%). Lastly, the number of states classified as "other" decreased from two to one (from 4% to 2% of states, which is a decrease of 50%).
Most important to note from Figure 5 is that the number of states using state-wide VAMs decreased since 2012 (as per Collins & Amrein-Beardsley, 2014) from 41% to 29% of states (i.e., a decrease of 29%). Related, and perhaps more notably, the number of states that explicitly do not use or encourage VAM use substantially increased from seven of 51 states (14%) to 22 of 51 states (43%; i.e., an increase of 214%, or roughly a tripling). Another important result of note is that in 2012 many states were still developing or piloting VAMs (including the aforementioned SGP), but by 2018 many of these states had reversed their former VAM plans and trajectories. More specifically, in 2012, 18 of 51 states (35%) were piloting or developing a VAM; yet, in 2018 only the state of Mississippi (2%) reported having a VAM in development (i.e., a decrease of 94%). Additionally, the number of states that now leave decisions about VAM use to local school districts has increased from three to 10 (6% to 20%; i.e., an increase of 233%). This also demonstrates a substantial, and perhaps anticipated, change following ESSA's (2016) shift toward more local control.
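The relative changes discussed above follow from the standard percentage-change formula, (new − old) / old × 100. The following is a minimal sketch, not part of the original analysis, that applies the formula to the 2012 and 2018 state counts as reported in the note to Figure 5; the function name `percent_change` and the dictionary `vam_status` are illustrative assumptions.

```python
def percent_change(old, new):
    """Relative change from `old` to `new`, in percent.

    Returns None when `old` is 0, because a relative change
    from a zero baseline is undefined.
    """
    if old == 0:
        return None
    return 100 * (new - old) / old


# (2012 count, 2018 count) per category, as reported in the note to Figure 5.
vam_status = {
    "use VAMs": (21, 15),
    "do not use VAMs": (7, 22),
    "local control": (3, 10),
    "formative use only": (0, 3),
    "in development": (18, 1),
    "other": (2, 1),
}

for label, (old, new) in vam_status.items():
    change = percent_change(old, new)
    text = "undefined (zero base)" if change is None else f"{change:+.0f}%"
    print(f"{label}: {old} -> {new} states, {text}")
```

Because the formula subtracts the baseline, a move from seven to 22 states is an increase of about 214%, whereas the ratio 22/7 (about 314%) expresses the new count as a percentage of the old one; the two quantities differ by exactly 100 percentage points.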
Additional state-by-state details regarding VAMs, as per this research question, can be found in Table 2. Again, this online table includes information about the types of assessments and grade levels included in states' VAMs (if used), the consequences attached to VAMs (if used), and the percentage of teachers' evaluation scores for which VAMs are to count (if used) for both 2012 (as per Collins & Amrein-Beardsley, 2014) and 2018. Table 2 is an extension of Table 1, but it also includes state-by-state information from 2012 and 2018 side-by-side so that readers can compare specifics that are too detailed and space-intensive to include along with these general results.

Research Question 3: Perceived Strengths and Weaknesses of States' Post-ESSA Systems
The following section includes an explanation of the perceived strengths and weaknesses of states' post-ESSA (2016) teacher evaluation systems (i.e., for states for which state personnel responded to this part of the study). Recall that 39 personnel from states' departments of education (76%, 39/51) responded to the survey in total. Of those 39 individuals, 36 (71%, 36/51) responded to this part of the survey, and only 22 (43%, 22/51) were willing to discuss their states' weaknesses. We aggregated these data into broad themes to protect the anonymity of state sources; hence, we present no illustrative maps revealing state-by-state data.

Strengths and weaknesses.
The two overarching themes regarding strengths were increased stakeholder input in the process and increased formative feedback in the process. In terms of weaknesses, four overarching themes were evident. State department personnel were concerned that there was too much variety among teacher evaluation systems. Related, personnel were concerned that there was not enough capacity to support such variety and that there was a dearth of communication between states and local educational authorities (e.g., districts). Additionally, some personnel felt that the language of official policies should change to reflect a different attitude towards teachers (see Table 3).

Strengths. For strengths, one major theme reflected the increased local control supposedly provided by ESSA. A majority of state department respondents (24 of 36; 67%) presented increased stakeholder input as their new systems' primary strength. This theme is consistent with the results in both sections above regarding increased local control. Department personnel identified increased stakeholder input, particularly at the local level, as the primary factor that also helped to change and improve relationships between teachers and other education leaders and authorities (e.g., from "combative to cooperative"). Less prevalent, but still widely evident in the data (12 of 36; 33%), many state department personnel indicated that systems meant to be more collaborative than punitive were a strength of their states' post-ESSA (2016) teacher evaluation systems. These respondents emphasized the collaborative nature of their post-ESSA systems, noting, more specifically, that they built their new systems around their conceptualizations of and understandings about how their states' teachers are to be evaluated in a new and, perhaps, reformed light.
Instead of employing measurement tools imposed in authoritative manners, respondents noted that their new teacher evaluation systems are, again, meant to be collaborative and to help teachers improve their pedagogical practices via professional development and training.
Weaknesses. As for weaknesses, or areas for improvement across states' teacher evaluation systems, 22 of 36 (61%) state department personnel provided feedback. Maybe paradoxically, seven of these personnel (7 of 22; 32%) reported that the sheer variety of teacher evaluation systems created by local school districts now makes it difficult for states to compare districts within and across their states. More specific concerns in this area (5 of 22; 23%) included the extent to which personnel (on behalf of their respective states) felt that they might be able to provide policy and system support on a state-level scale (e.g., interpreting data from multiple albeit unfamiliar and unique district systems). Also, state department personnel (5 of 22; 23%) considered communication and contact points with local school districts to be an area of weakness regarding teacher evaluation systems. Improvements here could include, for example, states' websites and the teacher evaluation information made public online, as well as states' communication systems for training and support regarding states' teacher evaluation systems.
Lastly, other personnel (5 of 22; 23%) in this group reported that their states' teacher evaluation system language does not often match their new philosophies, policies, and general takes on their states' approaches to teacher evaluation. For example, statements explicating that states' systems are now meant to be more formative than summative are missing, as are broad statements about how ranking teachers as "ineffective" does not contribute to the philosophies and intentions underlying states' new teacher evaluation systems. In other words, these state department personnel would like to change the language or the content in official policies to include or reflect more about the intention of evaluation systems to help teachers learn, not to punish teachers.

Research Question 4: How Have Perceived Strengths and Weaknesses of States' Systems Changed Post-ESSA?
Collins and Amrein-Beardsley (2014) included a similar set of questions posed to state department personnel about the perceived strengths and weaknesses of their states' teacher evaluation system prior to ESSA; hence, below are some key results also pertinent to state differences between now and then.
In 2012 (i.e., post-Race to the Top, 2011) the main concerns expressed by state personnel regarding their states' teacher evaluation systems largely pertained to issues with assessing student progress in non-tested areas (i.e., fairness, as described prior), general validity (as defined prior), and challenges with or desires to use the models formatively (versus summatively, which was the primary intent written into Race to the Top, 2011 and the NCLB waivers the federal government put into place around the same time, as also explained prior). Inversely, state department personnel in 2012 cited system strengths such as having comparable scores across districts (given states were federally incentivized to have uniform teacher evaluation systems at the time), having similar scores for core teachers across their states, having more measures for evaluating teachers (which were largely noted as the teacher-level observation systems described prior; see also Kane & Staiger, 2012), and having more "predictive power" (see also predictive validity described prior⁷) regarding future student success, again, as largely based on VAMs.
In 2018, state personnel's strength and weakness responses centered around the seemingly changed perceptions and intentions of states, as made explicit via states' post-ESSA (2016) teacher evaluation systems. Namely, states are now to allow for more formative feedback to help teachers improve upon their pedagogical practices, more collaboration, and more stakeholder input and feedback (e.g., in the development, execution, and refinement of states' systems). However, while some state department personnel lauded increased communication between teachers and training offered to teachers, other state department personnel warned and worried that more local control meant less capacity for state departments to support diverse and multifarious teacher evaluation systems (e.g., in terms of providing districts support, training, appropriate communication systems, and appropriate quality controls). This was clearly evidenced as a policy and practice conundrum. A related issue, for example, was the extent to which states are now permitting districts to use multiple assessments to measure student growth, in varied ways, but also the extent to which districts understand how important it is to have the assessments that they adopt and use validated for their intended purposes. This is also, now more than prior, a noteworthy challenge (see also Sloat, Amrein-Beardsley, & Holloway, 2018).
The notable shifts in responses pre-and post-ESSA indicate states have taken more holistic views of and approaches towards their teacher evaluation systems, especially in comparison to the relatively more objective teacher evaluation systems in place prior. States' teacher evaluation policies and systems encourage more flexibility in practice, given multiple ways of measuring teacher effectiveness (also given the competing strengths and weaknesses of those additional measures). Put more simply, among state department personnel, there has been a profound change in how state leaders and personnel are talking and thinking about teacher evaluation post-ESSA.

Conclusions
We addressed what states' teacher evaluation systems look like post-ESSA (2016) and how states' teacher evaluation systems in 2012, post-Race to the Top (2011), compare to those in place now (i.e., how states' teacher evaluation systems have changed over this 2012-2018 period of significant education policy enactment). While the purpose of this study was not to discover the underlying causes of such a complex shift in teacher evaluation systems in the US, researchers can infer the role that predominantly federal policies have played and continue to play in the state-level policies reviewed herein and prior. Rather, the purpose of this study was to provide an overview of data related to all states' teacher evaluation systems before and after the passage of ESSA (2016), especially because the rhetoric of ESSA may not match the actual policies.
First, VAMs are still in use as a component of teacher evaluation systems, but they are losing traction among state departments of education. This general trend is clear as per the data presented herein, and is what would likely be expected after ESSA (2016) loosened the reins on the federal incentives tied to states' use(s) of states' formerly reformed teacher evaluation models. More specifically, while some states continue to use VAMs, they do not include them as parts of the teacher evaluation scores or processes nearly as often, for nearly as much weight if still used, and definitely not nearly as often for high-stakes, consequential purposes. Instead, if VAMs are still being encouraged or used, they are being used to yield data which teachers might use to understand and then improve upon their own pedagogy and practice, as best they can (e.g., given some of the transparency and formative use issues with using VAMs, as discussed prior, are still at play). The implication of this finding is that VAMs may still play an important role in the new wave of teacher evaluation systems, despite some belief that the passage of ESSA might eliminate VAMs. However, post-ESSA teacher evaluation systems which continue to use VAMs, overall, have reduced the weight of such measures in teachers' overall evaluation and have reduced or removed consequences tied to VAMs.
Second, the Danielson and Marzano observational frameworks seem to now be driving much of the action across teacher evaluation systems across the US, as likely related to the renewed and formative values and intentions clearly inherent in states' post-ESSA teacher evaluation systems. That such observational systems align better with states' new and apparent enthusiasm for teacher evaluation systems bent on formative use is also clear as per the evidence collected herein. This is also in line with recent research about effective teacher evaluation practices (Reinhorn, Moore Johnson, & Simon, 2017). Hence, we are starting to see a shift away from quantitative test score measures towards measures using scores from research-based conceptual frameworks like the Danielson or Marzano frameworks, which break the complex activity of teaching into scored subcomponents meant to be used for formative purposes (e.g., discussion and professional development). The implication here is that policymakers or practitioners working on teacher evaluation systems in the current era should consider these additional evaluation frameworks, or at minimum, recognize the additional subcomponents that can be factored into teacher evaluation data.
Third, while there is still a legacy of emphases on VAMs as student growth measures, the definition of student growth is changing as well. In 2012, student growth essentially referred to growth as measured by states' standardized assessments of student achievement, aggregated, and then attributed to students' teachers' effects (e.g., as measured via VAMs). In 2018, student growth now includes other, more diverse, multiple measures, still including observational systems but also now including student surveys and SLOs. Put differently, the underlying construct (i.e., student growth) is the same, but the ways of defining and measuring it are different, more custom-made, and more holistic, given ESSA.
Fourth, there is a heightened emphasis on local control post-ESSA (2016) across states. While state department personnel expressed concerns about efficiently training and supporting local school districts with a large variety of systems, states have apparently responded to ESSA (2016), by and large, by allowing districts within their states to create what are essentially endorsed, curated, or completely homegrown teacher evaluation systems that can be customized to local school districts' desires, philosophies, and needs. State practices in this area unquestionably walk the line between manageability and flexibility. However, such practices may also set precedent for future teacher evaluation systems by providing both flexibility and support to local districts. Additional research should consider whether local control creates a better environment for navigating the practical challenges of creating and implementing a teacher evaluation system. We recommend that policymakers continue to monitor how the heightened emphasis on local control plays out with regard to teacher evaluation systems.
Finally, the myriad lawsuits filed between teacher unions and state departments of education over the last decade (Education Week, 2015; see also Amrein-Beardsley & Close, 2019) may have driven some of the philosophical changes noted, especially in terms of more cooperative and formative, and less punitive and consequential teacher evaluation systems. Some state department personnel cited that new teacher evaluation systems with a focus on stakeholder involvement even changed teachers' and state leaders' relationships from "combative to cooperative." Perhaps this new era of teacher evaluation even reflects an honest effort to correct some of the pugnaciousness of the previous federal policies.
What is ultimately evidenced is that ESSA has impacted the ways in which states are thinking about and enacting or endorsing teacher evaluation systems, which do look different now than they did post-Race to the Top (2011). The reversal of trends, many would argue, constitutes a step in the right direction, though those who still believe in high-stakes accountability systems at the teacher level, or at the student or school level, may argue that it is a step in the wrong direction. Regardless of stance, any persons interested in or concerned about the current state of states' teacher evaluation systems post-ESSA (2016) should have data, via this study, to understand changes in these systems over time, at minimum. This should, accordingly, be of historical but also timely "value-added."

About the Authors

Kevin Close
Arizona State University Email: kevin.close@asu.edu ORCID: https://orcid.org/0000-0003-1643-5124 Kevin Close is currently pursuing a PhD in the Learning, Literacies, and Technologies program at Arizona State University. His research focuses on digital adaptive assessments, nation-wide teacher evaluation systems based on high-stakes tests, and design in education. His interests lie in using technology to change the way we assess and measure progress.

Audrey Amrein-Beardsley
Arizona State University Email: audrey.beardsley@asu.edu ORCID: https://orcid.org/0000-0002-1250-2281 Audrey Amrein-Beardsley, PhD., is a Professor in the Mary Lou Fulton Teachers College at Arizona State University. Her research focuses on the use of value-added models (VAMs) in and across states before and since the passage of the Every Student Succeeds Act (ESSA). More specifically, she is conducting validation studies on multiple system components, as well as serving as an expert witness in many legal cases surrounding the (mis)use of VAM-based output.

Clarin Collins
Arizona State University Email: clarin.collins@asu.edu ORCID: https://orcid.org/0000-0003-1630-9881 Clarin Collins, Ph.D., is Director of Scholarly Initiatives in the Mary Lou Fulton Teachers College at Arizona State University. Her research interests include national and state policy implementation at the local level, teacher interaction with and influence on education policy, and education accountability and evaluation systems.

About the Guest Editor

Audrey Amrein-Beardsley
Arizona State University audrey.beardsley@asu.edu Audrey Amrein-Beardsley, PhD., is a Professor in the Mary Lou Fulton Teachers College at Arizona State University. Her research focuses on the use of value-added models (VAMs) in and across states before and since the passage of the Every Student Succeeds Act (ESSA). More specifically, she is conducting validation studies on multiple system components, as well as serving as an expert witness in many legal cases surrounding the (mis)use of VAM-based output.

SPECIAL ISSUE Policies and Practices of Promise in Teacher Evaluation
education policy analysis archives Volume 28 Number 58 April 13, 2020 ISSN 1068-2341
Readers are free to copy, display, distribute, and adapt this article, as long as the work is attributed to the author(s) and Education Policy Analysis Archives, the changes are identified, and the same license applies to the derivative work. More details of this Creative Commons license are available at https://creativecommons. Please send errata notes to Audrey Amrein-Beardsley at audrey.beardsley@asu.edu