The Stability of Teacher Performance and Effectiveness: Implications for Policies Concerning Teacher Evaluation

The last five to ten years have seen a renewed interest in the stability of teacher behavior and effectiveness. Data on teacher performance and teacher effectiveness are being used increasingly as the basis for decisions about continued employment, tenure and promotion, and financial bonuses. The purpose of this study is to explore the stability of both teacher performance and effectiveness by determining the extent to which the performance and effectiveness of individual teachers fluctuate over time. The sample consisted of 132 teachers for whom both observational and state standardized test data were available for five consecutive years. Neither teacher performance nor effectiveness was highly stable over the multiple years of the study. The observed relationship between teacher performance and teacher effectiveness was reasonably stable over time, but the magnitude of the relationship was quite small. Teacher performance ratings were also likely to be inflated in low-performing schools. We also discuss when different observed patterns may be acceptable based on the purpose for which the data are used.


Introduction
Over the last decade, we have seen renewed desire on the part of an increasing number of educators, parents, and policy makers to ensure that every child is taught by a highly qualified, competent teacher. This may be attributed in part to empirical evidence that teachers are the most important school-based determinant of student achievement (Rivkin, Hanushek, & Kain, 2005; Hattie, 2009) as well as philanthropic efforts, such as those by the Bill and Melinda Gates Foundation, to improve teacher quality. Furthermore, state and federal laws (e.g., Race to the Top, No Child Left Behind) have created high-stakes situations for teachers by linking their classroom performance and effect on students to decisions about their employment, promotion, and compensation (Welsh, 2011). Not surprisingly, increased attention has been given to the most valid and reliable ways of measuring teacher performance (i.e., what they do) and effectiveness (i.e., what impact they have on their students) (Medley, 1982). At the same time, however, there is remarkably little research to guide such critical decisions as which teachers to hire, retain, remunerate, and promote (Rice, 2003).
Examining the validity of the effectiveness data is quite complex and requires multiple examinations and inferences. Classroom observations are the primary sources of data on teacher performance although document analysis (e.g., lesson plans, student assignments) and teacher surveys may also be used. Student test scores and, occasionally, student surveys are used to collect data on teacher effectiveness. Validity evidence must be collected and shown for teacher performance and effectiveness data.
The validity of the performance measures stems primarily from the connection between the items included on the instruments and results derived from research on teaching (Grossman et al., 2010; Hill et al., 2008) and/or frameworks developed by professional associations and organizations, such as the Interstate Teacher Assessment and Support Consortium (InTASC) (Danielson, 2007) and the California Commission on Teacher Credentialing (2009). The individual items are organized around standards or components which, in turn, are combined to form domains. Danielson's (2007) Framework for Teaching, for example, exemplifies this structure, with 76 elements, 22 components, and four domains. The four most common domains are Planning, Classroom Environment, Instruction, and Professional Responsibilities, although each of these domains can be, and has been, subdivided. Instruction, for example, has been broken down into Teaching Strategies and Assessment/Evaluation. Similarly, Professional Responsibilities has been parsed into Collegiality/Professionalism and Communication/Community Relations.
Occasionally, the construct validity evidence for the instruments is examined using factor analysis (see, for example, Cambridge Education, 2013; Fish & Dane, 2000; Piburn et al., 2000). In other cases, the criterion-related validity of instruments is explored using correlational and/or regression analysis. This can be done using the total score or overall rating on the performance measures as the independent variable and a measure of student achievement as the dependent variable. Increasingly, the student achievement measure is based on change in test scores over time, resulting in what are referred to as "value-added" scores (Polikoff, 2013). Regression analysis using value-added scores has shown that the variance in value-added scores that can be attributed to teacher performance rarely exceeds 10 percent (Daley & Kim, 2010; Mihaly, McCaffrey, Staiger, & Lockwood, 2013). Explanations for this relatively low degree of relationship include measurement error (Goldhaber & Hansen, 2010), restricted range of teacher performance scores (Ho & Kane, 2013), and context effects such as subject matter, grade level, and class size (Angrist & Lavy, 1999; Polikoff, 2013).
The first step in arguing for the validity of value-added ratings is to review the alignment of items on the achievement tests with the curriculum standards (be they state or federal) (Anderson & Krathwohl, 2001; McDonnell, 1995). The second step is to check the accuracy with which students are matched across multiple years, are matched with appropriate teachers, and have been enrolled in the teacher's classes for some minimum length of time (American Statistical Association, 2014; Bill & Melinda Gates Foundation, 2010; Reardon & Raudenbush, 2009). The third step is to determine that the minimum number of students needed to ensure credible value-added data has been reached, examine the class composition and decide whether adjustments need to be made based on differences in class composition, and conduct other similar investigations (Amrein-Beardsley, 2008; Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012; Reardon & Raudenbush, 2009).
Once sufficient validity of the data and the inferences made based on the data have been established, concerns generally shift to the reliability of the data. The reliability of the document analysis and, particularly, the observational data is typically limited to the extent of agreement between trained observers (Daley & Kim, 2010). However, in a few studies reliability has been examined using statistical methods associated with generalizability theory (Goldhaber & Hansen, 2010; Hill, Charalambous, & Kraft, 2012; Shavelson & Dempsey-Atwood, 1976). The results of these studies suggest that teacher performance is inconsistent across grade levels and when classrooms have different student compositions (e.g., racial composition, class size). Furthermore, some components of teacher performance are more stable than others (Polikoff, 2013).

Why and When is Stability Important?
The last five to ten years have seen a renewed interest in the stability of teacher behavior and effectiveness (Darling-Hammond et al., 2012; Goldhaber & Hansen, 2010; Papay, 2011; Polikoff, 2013; Welsh, 2011). Data on teacher performance and teacher effectiveness are being used increasingly as the basis for decisions about continued employment, tenure and promotion, and financial bonuses. The awarding of financial bonuses can arguably tolerate a reasonable degree of instability because they are awarded primarily on evidence of effective teaching at one point in time; however, sound decisions about continued employment, tenure, and promotion are predicated on some degree of stability over time. It is imprudent to make such decisions when the performance and effectiveness of a teacher are rated "Excellent" one year and "Mediocre" the next.
In light of these considerations, the desirability of stability is largely a function of the purpose for which the data are to be used. Rogosa, Floden, and Willett (1984) identified four possible patterns that remain relevant to current policy decisions: invariance, trend, scatter, and trend plus scatter. Invariance indicates that the same rating was observed for a teacher across time. Trend indicates that the observed ratings for a teacher changed systematically (increased or decreased) across time points. Scatter indicates random fluctuation, which may or may not be accompanied by an overall trend.
Different "patterns" are acceptable for different purposes. For employment/dismissal and promotion decisions, stability is very important (i.e., consistently poor and consistently high ratings, respectively). For evaluating professional development programs, one may only be interested in "positive trends" (i.e., improvement over time). Conversely, professional development programs designed to provide remediation of some sort may benefit from identifying people with "negative trends." For compensation, the desirability of stability further depends on what pay increases are linked to. If the pay is linked solely to performance in one particular year, then consistency of ratings over time is not a consideration. If the pay is intended to retain excellent teachers, then stability is important (i.e., consistently high ratings). If the pay is intended to reward improvement over time, then positive trends are most appropriate. Regardless of the purpose, examining the score stability of individual teachers across time is the only way to obtain the information necessary to make these determinations.

Research Questions
In light of the previous discussion, the primary purpose of this study is to explore the stability of both teacher performance (behavior) and effectiveness by determining the extent to which the performance and effectiveness of individual teachers are stable over time. Specifically, two questions guided the analysis of the data collected and presented here: 1) How stable are teacher ratings based on expert observations and value-added student achievement over four years? 2) How stable are the differences between observational and value-added ratings across schools and, if instability exists, is it related to the overall school value-added rating?

Method

Population and Sample
The population consisted of 160 upper elementary (grades 4 and 5) and 128 middle grade (grades 6 through 8) teachers in 23 South Carolina schools that participated in a five-year program funded by the federal Teacher Incentive Fund (TIF). The poverty indices for the schools ranged from 71 to 99, with a median of 90. A poverty index of 90 means that 90% of the students in the school are eligible for Medicaid services, qualify for free or reduced-price lunches, or both.
The four primary goals of the program were to:
• Improve student achievement by increasing teacher and principal effectiveness;
• Reform teacher and principal compensation systems so that teachers and principals are rewarded for increases in student achievement;
• Increase the number of effective teachers teaching poor, minority, and disadvantaged students in hard-to-staff subjects; and
• Create sustainable performance-based compensation systems.
To achieve these goals, the program emphasized ongoing professional development opportunities and activities led by master and mentor teachers with content determined primarily from classroom observations and students' test scores. In terms of compensation, teachers received bonuses based in part on (1) their observed teaching performance, (2) improved test scores of their students, and (3) overall school effectiveness.
Teachers were observed four times per year with the results of each observation summarized on a structured, standardized rubric (see discussion below). Prior research has suggested that three to four observations per year yield relatively stable estimates of teacher performance (Hill, Charalambous, & Kraft, 2012; Smolkowski & Gunn, 2012). Each year students in grades 3 through 8 were administered the Palmetto Assessment of State Standards (PASS) in mid-March (writing) and early May (reading and mathematics). Students in grades 4 and 7 also were administered science and social studies tests in early May. Students in grades 3, 5, 6, and 8 were administered either science or social studies tests in alternate years, also in early May. Students in the majority of schools were administered the Measures of Academic Progress (MAP) periodically each year. MAP results were used primarily for professional development, whereas PASS scores were used (along with observation ratings) to determine teacher bonuses.
The final analytic sample consisted of 132 teachers for whom both observational and PASS data were available for five consecutive years. Of this sample, 79 taught students in grades 4 and 5 and 53 taught students in grades 6, 7, or 8. Sixty-one teachers were lost over the five-year period because they (1) changed grade levels, (2) changed schools and/or school districts, or (3) retired. Because Year 1 was primarily a planning year, the Year 1 data were considered baseline data. Therefore, four years of data are included in this study (Years 2 through 5).

Variables
There were three primary variables: observation ratings, teacher value-added ratings, and overall school ratings. Each is discussed in this section.
Observational ratings. The rubric used to record the observations was developed by researchers at the National Institute for Excellence in Teaching, who operate the TAP program. The instrument contains 19 indicators that "provide sufficient breadth to ensure that evaluation ratings reflect the kind of effective instructional practices that predict positive learning outcomes" (Jerald & Van Hook, 2011, p. 4). Each of the 19 indicators measured one of three domains: Designing and Planning Instruction, Learning Environment, and Instruction. Designing and Planning Instruction, for example, was associated with three indicators: Instructional Plans, Student Work, and Assessment. Learning Environment was associated with four indicators and Instruction was associated with 12. Figure 1 contains an example of the rating scale for one indicator associated with the domain, Learning Environment.

Figure 1. Sample rating scale for one indicator associated with the Learning Environment domain, with narrative descriptions at Exemplary (5), Proficient (3), and Emerging (1).

Each teacher was observed four times each year by a school administrator, master teacher, or mentor teacher. For each of the 19 indicators, the observer assigned a rating from 1 to 5 based on what was observed. Narrative descriptions were available for three of the rating categories: Exemplary (5), Proficient (3), and Emerging (1). Although narrative descriptions were not available for the other two rating categories (2 and 4), observers were trained to assign these intermediate ratings when what they observed did not fit clearly in any of the categories with narrative descriptions.
For each teacher, the results of each observation can be summarized as a vector of 19 ratings, with each element representing a particular indicator and ranging from 1 to 5. Over four observations, the results can be summarized as a 19 × 4 matrix. In part because a greater number of indicators are associated with the Instruction domain, ratings on that domain received a greater weight (75%) in a computational formula. Using the computational formula, a single rating from 1 to 5 was assigned to each teacher for each observation. The ratings for the four observations were then averaged and the result rounded to the nearest half point. For any given year, nine possible observational ratings could be assigned: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, and 5.0.
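The scoring procedure described above can be sketched as follows. Note this is an illustrative reconstruction, not the program's actual formula: the source states only that Instruction carries a 75% weight, so the even split of the remaining 25% between the other two domains is an assumption, as are the function names.

```python
import numpy as np

# Indicator-to-domain assignment from the rubric: 3 Designing and Planning,
# 4 Learning Environment, 12 Instruction (19 indicators total).
DOMAIN_SLICES = {"planning": slice(0, 3), "environment": slice(3, 7), "instruction": slice(7, 19)}
# Instruction carries 75%; splitting the remaining 25% evenly is an assumption.
WEIGHTS = {"planning": 0.125, "environment": 0.125, "instruction": 0.75}

def observation_score(ratings):
    """Combine one observation's 19 indicator ratings (1-5) into a single score."""
    ratings = np.asarray(ratings, dtype=float)
    return sum(WEIGHTS[d] * ratings[s].mean() for d, s in DOMAIN_SLICES.items())

def yearly_rating(observation_matrix):
    """Average the four observation scores (columns of the 19 x 4 matrix)
    and round the mean to the nearest half point."""
    scores = [observation_score(col) for col in np.asarray(observation_matrix).T]
    return round(np.mean(scores) * 2) / 2

# A teacher rated 3 on every indicator in all four observations scores 3.0.
flat = np.full((19, 4), 3)
print(yearly_rating(flat))  # 3.0
```

The rounding step (`round(x * 2) / 2`) is what limits each year to the nine possible ratings listed above.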
Teacher value-added ratings. The Palmetto Assessment of State Standards (PASS) is a battery of tests that are administered to students in grades 3 through 8. The tests have been carefully aligned with the state academic standards. PASS tests in English language arts and mathematics are administered to students enrolled in all six grade levels every year. PASS tests in science and social studies are administered to students enrolled in grades 4 and 7 every year. For students in grades 3, 5, 6, and 8, PASS tests in science or social studies are administered in alternate years.
To support the TIF program, schools contracted with SAS data services in Cary, North Carolina to compute value-added ratings based on two-year, longitudinally matched sets of PASS scores. SAS uses the Education Value-Added Assessment System (EVAAS) to compute the ratings. Using mixed-model equations, EVAAS estimates student progress vis-à-vis state-level normative data from the covariance matrix of the longitudinal data sets. In the model, each student acts as his or her own control and no other covariates are used. Because two years of data are needed, third grade teachers are not included in the sample since PASS tests are first administered in grade 3. Once the EVAAS scores have been computed, they are converted to a five-point scale, with 5 being the most positive. Unlike the observation ratings, half points are not assigned.
School effectiveness ratings. School effectiveness ratings are also based on EVAAS scores. The scores used to determine the school effectiveness ratings are aggregated directly to the school level. Like the teacher value-added ratings, the school effectiveness ratings are on a five-point scale, with 5 being the most positive.
Relationships between teacher ratings. The estimated polychoric correlation coefficients for years 2 through 5 of the study were respectively .19, .22, .29, and .42. Although the correlation coefficients increased across time, the highest correlation observed was .42. In other words, the variance shared between value-added and observational ratings was at best about 18% in this study.
We also tested the equality of the four correlation coefficients by computing simultaneous 95% confidence intervals using 1,000 bootstrapped samples. To control the Type I error rate at 5%, we used a Bonferroni adjustment such that each confidence interval was based on the .00625 and .99375 quantiles of the bootstrapped sampling distribution. The confidence interval for year 2 was (-.05, .42), year 3 was (.00, .44), year 4 was (.03, .50), and year 5 was (.22, .60). Due to the overlap between the intervals, we could not conclude that the correlation between value-added and observational ratings at the teacher-level in any given year was statistically different from any other year.
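The bootstrap procedure can be sketched as follows. For simplicity, the sketch substitutes the Pearson correlation for the polychoric coefficient (which requires specialized estimation routines not shown here), and the two rating arrays are hypothetical stand-ins for the study data:

```python
import numpy as np

def bonferroni_bootstrap_ci(x, y, n_boot=1000, family_alpha=0.05, n_tests=4, seed=0):
    """Bootstrap CI for one correlation, Bonferroni-adjusted so that
    n_tests simultaneous intervals jointly control the Type I error rate."""
    rng = np.random.default_rng(seed)
    n = len(x)
    boot_r = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample teachers with replacement
        boot_r[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    tail = family_alpha / (2 * n_tests)           # .00625 for four intervals at 5%
    return np.quantile(boot_r, [tail, 1 - tail])  # .00625 and .99375 quantiles

# Hypothetical ratings for 132 teachers in one year
rng = np.random.default_rng(1)
va = rng.integers(1, 6, size=132).astype(float)
obs = np.clip(va + rng.normal(0, 1.5, size=132), 1, 5)
lo, hi = bonferroni_bootstrap_ci(va, obs)
```

As in the study, overlap among the four resulting intervals would preclude concluding that the year-to-year correlations differ.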

Data Analysis
In this study, we conceptualized stability in terms of the change in scores and/or observational ratings across time, and there are multiple aspects of change. For example, change can be shown by the number of different scores across time and/or by the magnitude of the differences in scores. For each research question, we conducted multiple types of analyses in an effort to capture different aspects of score stability and ultimately provide a more robust examination of the fluctuation of scores and ratings across time. The analysis was primarily descriptive in nature and did not require distributional assumptions about the errors. Furthermore, all 132 records were complete.
To answer the first research question, we first generated frequency distributions of the value-added and observational rating range for each teacher. Second, we generated frequency distributions for the number of adjacent years for which teachers had discrepant value-added and observational ratings. Third, we constructed a panel plot for a random subsample of nine teachers in order to examine the variation in individual value-added and observational rating patterns. Last, we categorized each of the teachers into one of the four longitudinal patterns discussed above (i.e., invariant, trend, scatter, trend plus scatter).
For the second research question, we computed the mean school-level value-added rating across the years of the study and then classified the schools on the basis of the computed means. For teachers within each school, we subtracted the value-added rating from the observational rating in each year of the study. Next, we estimated the mean difference for all teachers within each school. Finally, we examined the distribution of mean difference scores within each school-level value-added category.

Results
For each research question, we conducted multiple types of analysis in an effort to capture different indicators of score stability. The results from this study are organized by research question.

Research Question 1
First, we computed the range of scores for each teacher across the four years of the study. For example, a teacher with a score pattern of 1-1-1-5 has a range of four, and a teacher with a score pattern of 3-4-3-4 has a range of one. Among the value-added ratings, eight teachers (6.1%) had the same score in all four years, 38 teachers (28.8%) had a range of one, 68 teachers (51.5%) had a range of two, 16 teachers (12.1%) had a range of three, and two (1.5%) teachers had a range of four. Nearly two-thirds of the teachers had value-added ratings that differed two or more points over the course of the study. This suggests that the value-added rating may not be stable across time.
For the observational rating distribution, 14 teachers (10.6%) had a range of zero, 63 teachers (47.7%) had a range of 0.5, 43 teachers (32.6%) had a range of 1.0, 11 teachers (8.3%) had a range of 1.5, and one teacher (0.8%) had a range of 2.0. It should be noted that it would be very difficult to observe observational ratings at the extremes (i.e., near 1 or 5) given that the observational rating for each year is based on the mean of the teacher observations across multiple measurements within that academic year. The observational ratings observed in the study ranged from 2.0 (n = 5) to 5.0 (n = 1). In light of this distribution, the fact that nearly 90% of the teachers had different observational ratings across the study and over 40% had observational ratings that differed by at least 1.0 point suggests that the observational ratings may not be stable.
Third, we reported the number of year-to-year score discrepancies per teacher. For example, a teacher with a score pattern of 1-1-1-5 would receive a one, and a teacher with a score pattern of 3-4-3-4 would receive a three. A discrepancy value of zero indicates that ratings were the same across the years of the study, and the maximum discrepancy score is three, which indicates that a teacher's score was never the same in adjacent years. The discrepancy scores for value-added and observational ratings and their associated frequencies are presented in Table 1 below. Nearly 94% of the teachers had a different value-added rating in at least two years, and one out of four teachers had a different value-added rating in every year of the study. Similarly, almost 90% of the teachers had different observational ratings in two or more of the years of the study, and half of the teachers had different observational ratings in three or four of the years. This analysis provides strong evidence for the instability of the value-added and observational ratings of teachers.
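The two stability measures used so far can be computed directly from each teacher's four-year rating vector; a brief sketch (the function names are illustrative, not from the study):

```python
def rating_range(ratings):
    """Range of a teacher's ratings across the four years."""
    return max(ratings) - min(ratings)

def adjacent_discrepancies(ratings):
    """Number of adjacent-year pairs with different ratings (0-3)."""
    return sum(a != b for a, b in zip(ratings, ratings[1:]))

# The two example patterns from the text:
print(rating_range([1, 1, 1, 5]), adjacent_discrepancies([1, 1, 1, 5]))  # 4 1
print(rating_range([3, 4, 3, 4]), adjacent_discrepancies([3, 4, 3, 4]))  # 1 3
```

The two measures capture different aspects of instability: 1-1-1-5 has a large range but only one adjacent-year change, while 3-4-3-4 has a small range but changes in every adjacent pair.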
Fourth, we generated line graphs of small subsamples to demonstrate the variation in teacher value-added and observational rating patterns. The panel plot of value-added ratings for five teachers is presented in Figure 2, and the panel plot of observational ratings for six teachers is presented in Figure 3. Ideally, one of two patterns would be observed: 1) scores would generally increase as time progressed (i.e., left-to-right in the plot), or 2) scores would be high in Year 2 and remain high across all years of the study. As indicated in the plots, there is considerable variation in the patterns among the selected teachers. For both plots, the scale of the vertical axis should be noted. That is, the plot of value-added ratings (Figure 2) shows values from the entire range (i.e., one to five), whereas the plot of observational ratings (Figure 3) only shows values between 2.5 and 4.5. The restriction in range is discussed in a previous section.

In Figure 2, Panel A shows a general upward trend from Times 1 to 4, although no change was observed between Times 2 and 3. Panels B and C show no change in value-added rating from Times 1 to 4. Panel D shows sizeable changes in value-added scores between adjacent years. Panel E shows an upward trend from Times 1 to 3 but then a drastic decline at Time 4. Panel A is an example of a positive trend. Panels B and C are examples of invariant patterns. Panels D and E are examples of scatter.

In Figure 3, Panel A shows sizeable increases from Time 1 to Time 3 before leveling off at Time 3. Panel B shows a general upward trend from Times 1 to 4, although no change was observed between Times 2 and 3. Panel C shows an initial decrease from Time 1 to Time 2, then no change between Times 2 and 3, and then a sharp increase at Time 4. Panel D shows a general downward trend from Times 1 to 4, although no change was observed between Times 2 and 3. Panels E and F show no change in observational rating from Times 1 to 4.
Panels A and B are examples of positive trend. Panel C is an example of scatter. Panel D is an example of negative trend. Panels E and F are examples of invariant patterns.
Last, we categorized each teacher's change pattern into one of the four patterns described by Rogosa et al. (1984). The frequency with which each of these patterns was observed in the present study is provided in Table 2. Invariant patterns were those that did not fluctuate at all. Positive trend (with and without scatter) patterns were those that showed (a) positive changes between at least two pairs of consecutive years and (b) no negative changes. Negative trend (with and without scatter) patterns were those that showed (a) negative changes between at least two pairs of consecutive years and (b) no positive changes. Scatter patterns were those that showed (a) change between only one pair of consecutive years or (b) both positive and negative changes within the same pattern. Examples of these patterns are shown in Figures 2 and 3. Among the value-added ratings, scatter was the most commonly observed pattern (62.9%), and the most commonly observed pattern of observational ratings was positive trend (43.9%). A "trend" does not have to be "perfect." For example, a value-added rating pattern of 2-3-4-5 is clearly a positive trend, but a pattern of 2-3-3-4 was also considered a positive trend since the overall change across time was positive.
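One way to operationalize these classification rules is sketched below; note the ordering matters, since a pattern with change in only a single pair of consecutive years falls to scatter even when that change is positive (the function name is illustrative):

```python
def classify_pattern(ratings):
    """Classify a four-year rating pattern per Rogosa et al. (1984),
    as the rules are operationalized in the study."""
    diffs = [b - a for a, b in zip(ratings, ratings[1:])]
    n_pos = sum(d > 0 for d in diffs)
    n_neg = sum(d < 0 for d in diffs)
    if n_pos == 0 and n_neg == 0:
        return "invariant"
    if n_neg == 0 and n_pos >= 2:
        return "positive trend"   # e.g., 2-3-3-4: overall change is positive
    if n_pos == 0 and n_neg >= 2:
        return "negative trend"
    return "scatter"              # single change, or mixed directions

print(classify_pattern([2, 3, 3, 4]))  # positive trend
print(classify_pattern([3, 4, 3, 4]))  # scatter
```

Under these rules, a pattern such as 1-1-1-5 is scatter (one changed pair) rather than a positive trend.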

Research Question 2
To explore the potential relationship between the teacher-level value-added and observational ratings based on the school-level value-added rating, we first computed the mean school value-added rating across the four years. Next, we classified each of the 23 schools into one of three categories on the basis of the school-level value-added means. Schools with means between 1.0 and 2.99 were assigned to group 1 (n = 4), between 3.0 and 3.99 were assigned to group 2 (n = 10), and between 4 and 5 were assigned to group 3 (n = 9). To examine possible discrepancies between teacher-level value-added and observational ratings, we computed the difference between the two ratings such that positive difference scores indicated that the observational rating was higher than the value-added rating, negative difference scores indicated that the observational rating was lower than the value-added rating, and a zero difference score indicated that the observational and value-added ratings were the same. The mean difference score was computed for each school, which allowed us to generate a scatterplot of the school-level value-added ratings against the computed mean difference scores of teachers within each school. The scatterplot is shown in Figure 5. The mean difference scores for groups 1 through 3 were respectively 0.70, -0.07, and 0.14. For the schools that were classified into groups 2 or 3, the distribution of difference scores was centered near zero. This suggests that, on average, the observational ratings were the same as the value-added ratings in schools performing average or above average. On the other hand, schools that were poor performing (i.e., group 1) tended to have higher mean observational ratings than value-added ratings for their teachers. This indicates that on average the observational ratings may overestimate teacher effectiveness in lower performing schools.
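The grouping and difference-score computation can be sketched as follows (the data values and function names are hypothetical, for illustration only):

```python
def school_group(mean_va):
    """Assign a school to group 1, 2, or 3 from its mean value-added rating."""
    if mean_va < 3.0:
        return 1
    if mean_va < 4.0:
        return 2
    return 3

def mean_difference(teachers):
    """Mean of (observational - value-added) across a school's teachers.
    Positive values mean observation ratings ran above value-added ratings."""
    diffs = [obs - va for obs, va in teachers]
    return sum(diffs) / len(diffs)

# Hypothetical low-performing school: observational ratings run above value-added
teachers = [(3.5, 2), (3.0, 3), (4.0, 3)]
print(school_group(2.7), round(mean_difference(teachers), 2))  # 1 0.83
```

A positive mean difference in a group-1 school, as in this hypothetical example, is the pattern the study found: observational ratings tended to exceed value-added ratings in the lowest-performing schools.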

Discussion
This study focused on the stability of individual teacher performance and effectiveness ratings across time. The importance of examining stability at the level of the individual teacher can be exemplified by the work of Ho and Kane (2013), who examined administrator ratings of teachers. In their study, administrators rated the performance of their own teachers higher than did administrators from other schools; however, the correlation between the two sets of administrators' ratings was 0.87. Thus, even though there was bias in the performance data, the bias did not impact the reliability of the differences between teachers. The absence of within-teacher examination, however, prevents one from having a complete set of information on which to draw conclusions.
In the current study, neither teacher performance nor teacher effectiveness ratings were highly stable. Teacher performance was somewhat more stable than teacher effectiveness, but this may be the result of the restricted range of the teacher performance ratings (see also Ho & Kane, 2013). Additional discussion is warranted regarding the desirability of stability. On the one hand, economic theory commonly models unobserved worker quality as a given parameter that is relatively fixed over time (Goldhaber & Hansen, 2010). Under this theory, stability is a prerequisite for many of the policy implications offered by economists. For example, the feasibility of Hanushek's (2011) suggestion that replacing the bottom 5-8 percent of teachers with average teachers could move the U.S. near the top of international math and science rankings would first require that teacher quality be a stable characteristic. On the other hand, many educators argue that good teaching requires adaptations and accommodations that entail some degree of instability. Berliner (1976), for example, stated that the commonly held standard of excellence in teaching implies a teacher whose behavior is inherently unstable.
In our study, we showed the frequency with which each of the patterns identified by Rogosa et al. (1984) was observed in Table 2. The most commonly observed patterns for value-added ratings and observational ratings were scatter and positive trend, respectively. These data suggest that while teachers in the study tended to improve their performance ratings over time, the improvement in classroom performance did not translate into improved effectiveness ratings, on average. Though not a part of the guiding research questions, this finding is consistent with our analysis of the relationship between the value-added and observational ratings. We observed a generally weak positive relationship between the two sets of ratings. These findings support the use of multiple measures of teacher competence, teacher performance, and teacher effectiveness in all teacher evaluation systems, which has been widely recommended (American Statistical Association, 2014; Bill & Melinda Gates Foundation, 2010; Steele, Hamilton, & Stecher, 2010).

There are several possible explanations for the inconsistency between teacher performance and teacher effectiveness ratings. As suggested by Goldhaber and Hansen (2010), the first source is measurement error. Correcting correlations for attenuation can increase the percent of variation accounted for by 10 to 15 percent (see Polikoff, 2013). Variables have also been proposed that may mediate the relationship between teacher behavior and student achievement, such as task perception, self-regulation, motivation, teacher efficacy, and curriculum alignment (Doyle, 1978; Gallagher, 2004; Winne, 1987). Finally, there are contextual differences, such as grade level, subject matter, and class size and composition. It is possible that different observation protocols are necessary for observing teachers of different grade levels.
Subject-specific observation forms may also be necessary to accurately reflect the differences in practices employed by teachers of different subjects, as has been recommended by the Bill & Melinda Gates Foundation (2010).
Finally, our analysis indicated that teacher performance ratings were, on average, inflated in low performing schools. This suggests that performance ratings may be school-specific. Performance ratings are based on the performance of teachers relative to other teachers in the same school, independent of where the school stands in terms of its overall effectiveness (e.g., value-added test scores). This must also be taken into account in teacher quality evaluation programs. The use of independent, external observers may mitigate this school effect.
The stability of teacher performance and effectiveness has important implications for the formation of teacher policies. If, as Goldhaber and Hansen (2010) suggest, the assertion that quality is relatively fixed over time is valid, then perhaps the best way to improve our educational system is to weed out poor performers. The validity of this assumption lies at the heart of Hanushek's (2011) assertion mentioned previously. If, however, teacher performance and/or effectiveness tend to be unstable characteristics, then it may be necessary to "radically re-think the direction of teacher-based accountability" (Goldhaber & Hansen, 2010, p. 2).

Readers are free to copy, display, and distribute this article, as long as the work is attributed to the author(s) and Education Policy Analysis Archives, it is distributed for noncommercial purposes only, and no alteration or transformation is made in the work.