The Effect of Summer on Value-added Assessments of Teacher and School Performance

This study examines the effects of including the summer period on value-added assessments (VAA) of teacher and school performance at the early grades. The results indicate that 40-62% of the variance in VAA estimates originates from the summer period, depending on the outcome (i.e., reading or math achievement gains). Furthermore, when summer is omitted from the VAA model, 51-61% of the teachers and 58-61% of the schools change performance quintiles, with many changing 2-3 quintiles. Extensive statistical controls for student background and classroom and school context reduce the summer effect, but 36-47% of the teachers and 42-49% of the schools are still in different quintiles. Furthermore, besides misclassifying teachers and schools, the results show that including summer tends to bias VAA estimates against schools with concentrated poverty. The results suggest that removing summer effects from VAA estimates will likely require biannual achievement assessments (i.e., fall and spring).


Introduction
As the educational accountability movement gained traction over the past three decades, federal, state, and local policies have increasingly tied teacher and school performance assessment to student achievement test scores. These policies have led to considerable research and debate on how to best gauge the contributions that individual teachers and schools make to their students' achievement. In recent years value-added assessment (VAA) has emerged as the most recommended statistical approach for this purpose (Glazerman et al., 2010; Harris, 2011; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2003; Tekwe et al., 2004). Consequently, the application of VAA for teacher and school accountability assessment has experienced an enormous expansion over the past decade.
The increased use of VAA has not gone without criticism, particularly for high stakes applications that may result in sanctions against "low performing" teachers or schools. The primary concerns are that VAA estimates can be unreliable and that measurement issues with achievement test scores can bias VAA estimates. For instance, research shows that VAA estimates of teacher effectiveness tend to vary substantially from year to year and from one standardized achievement test to another (Lockwood et al., 2007; Papay, 2011). Moreover, measurement issues with achievement tests, such as ceiling or floor effects, nonlinearity in the test's scale, or imperfections in the vertical equating of tests that are used to estimate change in achievement over time, can bias VAA estimates (Haertel, 2013; Koedel & Betts, 2010; Reardon & Raudenbush, 2009). These and other shortcomings have led several prominent education scholars to advise against using VAA for high-stakes personnel decisions (Amrein-Beardsley, 2008; Baker et al., 2010; Braun, Chudowsky, & Koenig, 2010; McCaffrey et al., 2003).
One factor that potentially impacts VAA that has not received adequate research attention is the inclusion of the summer period when students are not attending school. This is a noteworthy deficit in the literature because VAA is typically based on annual gains in student achievement from one spring to the next, which includes the summer period when teachers and schools tend to have little control over what students do. This is problematic because research indicates that student achievement tends to drop over summer and that demographic achievement gaps primarily develop during summer (Alexander, Entwisle, & Olson, 2001; Cooper, Nye, Charlton, Lindsay, & Greathouse, 1996; Heyns, 1978). Hence, using annual spring-to-spring assessments may introduce variation originating from summer that biases VAA estimates. Moreover, given that the rate at which achievement gaps develop accelerates over summer, VAA estimates that include summer may be biased against teachers and schools serving inordinately high proportions of disadvantaged children (Baker et al., 2010; Harris, 2011). However, the degree to which including summer impacts VAA estimates and whether estimates are biased against teachers and schools serving disadvantaged populations remains unclear. It is also unclear whether any summer effects on VAA can be ameliorated using statistical control covariates that are typically available for accountability modeling (e.g., student demographics and free or reduced lunch status).

Research Questions
The present study examines the impact of including the summer period on VAA estimates of teacher and school performance based on gains in student reading and math achievement test scores. VAA estimates derived from the typical annual achievement gains testing schedule (spring of one year to spring of the next year) are compared with VAA estimates derived from the school year gains testing schedule (fall to spring of the same school year). Comparisons are based on correlations, the proportion of the variance in VAA estimates derived from annual gains that originates from summer, and quintile classification differences. A nationally representative sample of first graders and their teachers and schools was used to address the following research questions:
1. To what extent does including summer impact VAA estimates of teacher and school performance?
2. Can any summer effect be ameliorated without biannual assessments (i.e., fall and spring) using control covariates that are typically available to school districts, such as student demographics and contextual characteristics of classrooms and schools?
3. To what degree does including summer in VAA estimates result in biases against teachers and schools serving low income and ethnic minority children?

Background
Summer Learning and Achievement Gaps
Research has documented substantial differences in the rate at which children learn during the school year compared with over summer when they are not attending school (Alexander, Entwisle, & Olson, 2001; Borman & Boulay, 2004; Cooper et al., 1996; Heyns, 1978). In one of the first major studies on this topic, Heyns (1978) conceptualized student achievement as the result of innate ability and a mixture of three environmental influences: home, community, and school. Whereas the home and community factors are essentially year-round influences, the effect of schooling is mostly limited to when school is in session. Heyns found that the socioeconomic achievement gap primarily develops over summer, suggesting it is largely the product of socioeconomic differences in home and community influences. She concluded that because children spend far less time in those settings when school is in session, the rate of increase in socioeconomic achievement gaps tends to slow dramatically during the school year.
Over the past 30 years several additional studies have been conducted on summer effects that mostly support Heyns' findings (Alexander et al., 2001; Borman & Boulay, 2004; Cooper et al., 1996). A meta-analysis by Cooper et al. (1996) of 13 select studies found that, on average, students lose approximately one grade-equivalent month of achievement over summer, although the magnitude of the summer loss tends to be larger for math than reading. Moreover, the meta-analysis indicated that the summer loss is associated with SES, but only for reading. In fact, middle-class students tended to show summer gains in reading achievement. However, some recent research suggests that socioeconomic and ethnic achievement gaps do increase during the school year and not just over summer (Palardy, 2015; Palardy & Rumberger, 2008) and that school-year increases are due in part to differences in instructional practices and teacher effectiveness (Betts, Zau, & Rice, 2003; Murnane, Willett, Bub, & McCartney, 2006; Stipek, 2004). If, as these recent studies conclude, achievement gaps increase during the school year and teachers contribute to that increase, it is less clear whether VAA that includes summer will bias estimates against teachers and schools serving children from educationally disadvantaged backgrounds. It is also unclear whether any such bias in VAA estimates can be addressed by controlling for student demographics.

Value-Added Models and Summer Bias
An implication of the research on summer learning to the assessment of school and teacher performance is that the timing of achievement test administration impacts estimates of achievement gains and, by extension, may impact VAA estimates. Consistent with this implication, a recent study concluded that test timing is the largest single source of measurement error and instability in VAA of teacher effectiveness; it is more important than the specification of the model, the sample of students, or the achievement test used (Papay, 2011). One may assume that the optimal testing schedule for VAA will provide data to accurately estimate change in achievement during the period school is in session. Because American schools are typically not in session for several weeks over summer, optimal estimation of VAA may require a minimum of two annual achievement tests, one administered in fall near the beginning of the school year and one in spring near the end. However, fall achievement testing is uncommon in U.S. schools, which has resulted in VAA typically being based on a spring-spring testing schedule.
Considering the literature on summer learning and its potential implications to VAA, there is surprisingly little research on the impact of including summer on VAA estimates. However, one study found that including summer can impact school VAA quintile rankings. Downey, von Hippel, and Hughes (2008) found that of schools classified in the bottom or top VAA quintile when summer was included, 20% to 35%, depending on the achievement test subject, were classified in another quintile when summer was excluded. This suggests that a substantial percentage of the schools that are classified as failures or successes based on spring-spring achievement gains are classified as satisfactory when assessed based on fall-spring achievement gains. While their results were highly revealing, Downey et al. (2008) limited their focus to schools, omitting teacher performance assessment, and did not investigate whether summer effects on school performance can be reduced using control covariates or whether including summer results in biased VAA estimates against high-poverty schools. Another recent study on VAA that speaks to the issue of test timing argued that ignoring the summer period in VAA is tantamount to ignoring non-linearity in a growth model (Palardy, 2010). The results of that study indicated that ignoring non-linearity in VAA will inflate the variance in teacher effectiveness and bias VAA estimates against teachers and schools whose students have the most negative summer achievement gains (Palardy, 2010). Given the prevalent use of VAA in the U.S., more research is needed to better understand the effects of including the summer period.

Methodology
Data Source
This study uses data from ECLS-K, a nationally representative and longitudinal sample of 1998 kindergarteners, their parents, teachers, and schools (NCES, 2002).1 Several characteristics of ECLS-K make it highly suitable for addressing the research questions of this study. First, the student sample is approximately nationally representative. This is desirable because accountability practices are commonly implemented in response to federal legislation (e.g., No Child Left Behind). Having a national sample, as opposed to a local sample, broadens the generalizability of the results so that they are more applicable to federal policy. Second, ECLS is the only national database that includes both fall and spring student achievement test scores, which is necessary for studying summer effects. Third, these test scores were set to an interval scale for each testing period and vertically scaled using item response theory (IRT) methods across testing periods. Interval scaled test scores are essential for assuring the gain unit is equivalent across the distribution of scores, while vertical scaling links tests of different difficulty such as the kindergarten and first grade tests. Fourth, ECLS-K includes many measures of student demographics and classroom and school context that are necessary for examining the viability of using control covariates to address any summer effects on VAA estimates.
The ECLS-K first grade longitudinal sample has 5,034 children. Students without teacher or school IDs were omitted, as were a small number of students who had missing test scores or who changed schools during first grade. We also limited our analysis to public schools because federal accountability legislation typically applies to the public sector and because private schools are more prone to selectivity biases that can confound VAA estimates.2 The sample for the present study included 2,251 students, 682 classrooms, and 168 schools.3

Value-added Assessment Models
All VAA models used in this study have the same general form, differing only in terms of the covariates that are included. The general form was selected based on its strong performance for recovering value-added estimates in two recent simulation studies (Guarino, Reckase, & Wooldridge, 2015; Henry, Rose, & Lauen, 2014). It is a three-level hierarchical linear model (HLM) with an outcome of year-over-year (YoY) or school-year (SY) achievement gains in reading or math. Levels one, two, and three correspond to students, classrooms, and schools, respectively. An advantage of the three-level HLM, as opposed to a two-level HLM, is that teachers are effectively compared with other teachers working in the same school. This helps separate teacher effects from school effects. The teacher and school VAA estimates are the level-two and level-three residuals, respectively. The teacher residuals are essentially the mean gains of the students in the respective teacher's classroom adjusted for the covariates that are included in the model. Similarly, the school residuals are essentially the mean classroom gains at the respective school, again adjusted for the covariates in the model.4 Henry et al. (2014) used a three-level HLM that is highly similar to that of the present study, which they found to perform better than five other commonly used and highly sophisticated models they tested for recovering teacher value-added estimates. Details on the model, model building, and model specification are provided next.
Model building. For each outcome, four sequential models were estimated: null, base, demographics, and context. Each subsequent model includes the covariates from the previous model plus a new set of covariates. The null model only includes a covariate that adjusts for student differences in the amount of time between the first and second achievement test administration, which varied across schools. Compared with the null model, the base model has only one additional covariate: a measure of achievement at the start of the gain score period. This removes the dependency of achievement gains on achievement at the start of the period (Cohen, Cohen, West, & Aiken, 2003).5 This model is considered the base model in that the control covariates are limited to what is typically recommended as the minimal controls for VAA. Comparing the null and base model results is instructive because recent research suggests that prior achievement is the most critical control for reducing selection biases in VAA (Chetty, Friedman, & Rockoff, 2014; Kane & Staiger, 2008). However, it is unclear whether controlling for prior achievement is also critical for addressing summer effects on VAA estimates.
The demographics model adds eleven student background and demographic variables, six classroom demographic composition variables, and six school demographic composition variables to the base model (see the Appendix Table for a list of the demographic variables used in this study). The demographic composition variables are included because student composition may be associated with summer learning above and beyond the demographic backgrounds of individual students. For example, if a school takes in an inordinately high percentage of students with demographic characteristics correlated with negative summer achievement trajectories, the instructional progression at the school may need to be altered to accommodate a more pronounced summer setback, which may result in a smaller average achievement gain during the school year.
The context model includes all demographics model variables plus nine additional measures of classroom context and ten additional measures of school context. The additional variables measure aspects of the educational context that previous research suggests are associated with student learning and may also be associated with summer effects. For example, the contextual variable "proportion new" measures the proportion of the students who transfer into the school after the start of the school year. Recent research suggests that such transfer students tend to disrupt the learning environment (Palardy, 2015). Moreover, the rate at which students transfer in after the start of the school year is arguably more of a proxy of neighborhood instability than a measure of teacher or school effectiveness, and if so is likely to be associated with summer effects.
Model specification. As described above, the outcome is YoY or SY gains in reading or math achievement, measured from spring to spring or from fall to spring, respectively. The subscripts (i, c, and s) denote the nested structure of the data; students (i) are nested in classrooms (c), which are nested in schools (s). The model controls for prior achievement, the time duration between the test administrations, and whether the child attended summer school. For the YoY outcome, spring of kindergarten achievement test scores are the prior achievement control, whereas for the SY outcome, fall of first grade scores are used. In addition, a set of eleven student (mostly demographic) background control variables is included (see the Appendix Table for descriptions of the variables used in this study). To adjust VAA estimates for differences in student inputs, continuous control variables are grand mean centered, while dummy variables are uncentered. All slope coefficients are fixed. $\pi_{0cs}$ represents the conditional mean of the outcome for each classroom c. $e_{ics}$ represents the student residuals, which describe the deviation of each child's achievement gains from the mean gain of the classroom of which the student is a member. $\sigma^2$ is the estimated variance of the student residuals in the population.
In the level two (classroom) equations, conditional classroom mean achievement gains ($\pi_{0cs}$) in reading or math are the outcomes. $\beta_{00s}$ represents the conditional mean on the outcome for each school s. $r_{0cs}$ represents the classroom residuals, which describe the deviation in the adjusted mean achievement gains for each classroom from the mean classroom achievement gains of the school.6 These residuals are also the teachers' value-added estimates. $\tau_{\beta}$ is the variance in the classroom residuals and describes the variance in achievement gains among classrooms within schools.
In the level three (school) equations, the outcome is conditional reading or math achievement gains at each school ($\beta_{00s}$). The intercept, $\gamma_{000}$, is the adjusted grand mean achievement gains. The school model includes two sets of covariates: six measures of school demographics and ten measures of school context. All of these covariates are grand mean centered. The school residuals ($u_{00s}$) represent the deviation in the adjusted mean achievement gains of each school from the grand mean of achievement gains. This residual is also the school value-added estimate. $\tau_{\gamma}$ represents the estimated variance in the school residuals.
It is worth noting that in comparing YoY and SY VAA estimates, the research design used in the study factors out many conditions that potentially confound such comparisons, including that the same sample of children, teachers, and schools are used for the YoY and the SY estimates, and estimates are based on the same achievement test batteries. The only difference is whether summer is included. This strengthens the internal validity of the design for making inferences about summer effects.
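For reference, a three-level specification consistent with the symbols defined in this section can be sketched as follows. The covariate vectors $\mathbf{X}_{ics}$, $\mathbf{W}_{cs}$, and $\mathbf{Z}_{s}$, the fixed slope vectors $\boldsymbol{\pi}$, $\boldsymbol{\beta}$, and $\boldsymbol{\gamma}$, and the normality of the residuals are notation and assumptions introduced here for illustration (the normality assumption is the conventional one for HLM); the exact published equations may differ in detail.

```latex
\begin{aligned}
\text{Level 1 (students):}\quad
  & Y_{ics} = \pi_{0cs} + \boldsymbol{\pi}'\mathbf{X}_{ics} + e_{ics},
  & e_{ics} &\sim N(0, \sigma^{2}) \\
\text{Level 2 (classrooms):}\quad
  & \pi_{0cs} = \beta_{00s} + \boldsymbol{\beta}'\mathbf{W}_{cs} + r_{0cs},
  & r_{0cs} &\sim N(0, \tau_{\beta}) \\
\text{Level 3 (schools):}\quad
  & \beta_{00s} = \gamma_{000} + \boldsymbol{\gamma}'\mathbf{Z}_{s} + u_{00s},
  & u_{00s} &\sim N(0, \tau_{\gamma})
\end{aligned}
```

Here $Y_{ics}$ is the YoY or SY gain for student i in classroom c in school s. In the null and base models the classroom and school covariate vectors would be empty, so the teacher estimate $r_{0cs}$ and the school estimate $u_{00s}$ are adjusted only for the student-level controls.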

Research Question 1
To promote policy relevance, the first question is addressed using estimates from the null and base models because they are specified to be consistent with the recently enacted federal accountability guidelines. Specifically, to receive waivers from NCLB accountability regulations, states are forbidden from adjusting for demographics such as race/ethnicity, free or reduced price lunch (FRL), or school composition in their accountability models (US DOE, 2010). We quantify the effect of including summer on VAA estimates using two methods: (a) linear associations, including the correlations and squared correlations (R²) between YoY and SY VAA estimates; and (b) the quintile ranking differences of YoY and SY VAA estimates. Quintile ranks are of policy relevance because VAA are often used to identify and target low-performing teachers and schools for professional development or other remediation and to recognize high-performing teachers and schools for exemplary status.
YoY-SY correlation and R². The null model YoY-SY VAA correlations range from 0.61 for schools on math gains to 0.77 for teachers on reading gains (see Table 1). While these correlations are moderate to strong in an absolute sense, they are rather weak for variables purported to measure the same outcome (i.e., teacher or school performance based on gains in student achievement test scores in reading or math). The R² values show this more vividly. The R² values indicate that the null model SY VAA estimates account for only 38-60% of the variance in YoY estimates, depending on whether the outcome is reading or math achievement gains. The rest of the variance in YoY VAA estimates, 40-62%, originates from the summer period. Note that the YoY-SY associations tend to be weaker for math than for reading. That pattern was expected because math learning is more school-based than is reading. That is, children across demographic groups tend to have little exposure to math over summer, but children from higher-SES families tend to engage in considerable verbal and some written communications with their more educated parents over summer, which can maintain or even build reading and literacy skills over summer (Burkam, Ready, Lee, & Logerfo, 2004; Cooper et al., 1996).
Compared with the null model, the base model correlations are all higher, now ranging from 0.827 to 0.909, and the R² values are substantially higher in some cases, now ranging from 0.68 to 0.83. This indicates that controlling for the dependency between the achievement gain outcome and prior achievement reduces the effect of including summer on VAA estimates. The HLM results (not shown due to space limitations) provide an explanation: controlling for prior achievement accounts for considerable variation in mean classroom and mean school summer achievement gains, but a much smaller proportion of mean classroom and mean school SY gains. Note that the summer period is the difference between YoY and SY VAA estimates. Hence, in controlling for prior achievement and reducing the summer effect, the associations between YoY and SY VAA estimates are strengthened.
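The step from a YoY-SY correlation to the share of YoY variance attributed to summer is simple arithmetic: square the correlation and subtract it from one. The sketch below illustrates that calculation; the function names and the five teacher estimates are hypothetical and are not taken from ECLS-K.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def summer_share(r):
    """Share of YoY variance not accounted for by SY estimates (1 - R^2)."""
    return 1.0 - r ** 2

# Hypothetical value-added estimates for five teachers (illustration only).
yoy = [0.30, -0.10, 0.05, -0.25, 0.12]
sy = [0.20, 0.02, 0.10, -0.15, -0.05]

r = pearson_r(yoy, sy)
print(f"r = {r:.2f}, R^2 = {r ** 2:.2f}, summer share = {summer_share(r):.2f}")
```

Attributing all of the unexplained variance to summer follows the study's design argument that the YoY and SY estimates differ only in whether summer is included.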
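The relationship between the two gain definitions can be written out explicitly. Assuming $Y^{\text{spK}}$, $Y^{\text{fall1}}$, and $Y^{\text{sp1}}$ denote a student's scores in spring of kindergarten, fall of first grade, and spring of first grade (labels introduced here for illustration), the raw gains decompose as:

```latex
\underbrace{Y^{\text{sp1}}_{i} - Y^{\text{spK}}_{i}}_{\text{YoY gain}}
  \;=\;
\underbrace{\left(Y^{\text{sp1}}_{i} - Y^{\text{fall1}}_{i}\right)}_{\text{SY gain}}
  \;+\;
\underbrace{\left(Y^{\text{fall1}}_{i} - Y^{\text{spK}}_{i}\right)}_{\text{summer gain}}
```

This identity is exact for raw gains. For the VAA estimates themselves, which are covariate-adjusted residuals from separately fitted models, it should be read as an approximation rather than an exact equality.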

Table 1 YoY-SY Correlation and Quintile Rank Difference by Model
Percent quintile rank differences describe the percent of teachers or schools whose YoY and SY VAA ranks differ by zero, one, two, three, and four quintiles, where zero indicates no difference. For example, the null model results for schools on the reading gains outcome show that 39.3% of the schools have the same YoY and SY rank, while 0.6% differ by four quintiles. The mean is the average YoY-SY quintile difference. For example, in a sample of two schools, if one school has no quintile difference and the other school has a two-quintile difference, the mean is 1.00.
YoY-SY quintile rank differences. The results of the null model quintile comparisons indicate that a large percentage of the teachers and schools are in different effectiveness quintiles for YoY and SY VAA estimates (see Table 1). Between 50.9% and 61.1% of the teachers and schools were in different YoY and SY performance quintiles, depending on whether the outcome was reading or math achievement gains. Moreover, 13.7% to 22.4% differed by two or more quintiles, with several differing by 3 and even 4 quintiles.7 These differences in YoY-SY quintile ranks are solely due to whether the summer period was included in the achievement gains estimates.
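The quintile comparison described in the table note can be sketched in a few lines. The code below is an illustration under stated assumptions, not the study's procedure: quintiles are assigned by simple rank order with ties broken by position, and the function names and the ten teacher estimates are hypothetical.

```python
from collections import Counter

def quintile_ranks(estimates):
    """Assign quintiles 1 (lowest) to 5 (highest) by rank order."""
    n = len(estimates)
    order = sorted(range(n), key=lambda i: estimates[i])
    ranks = [0] * n
    for position, idx in enumerate(order):
        ranks[idx] = position * 5 // n + 1
    return ranks

def quintile_difference_summary(yoy, sy):
    """Percent of units differing by 0-4 quintiles, and the mean difference."""
    q_yoy = quintile_ranks(yoy)
    q_sy = quintile_ranks(sy)
    diffs = [abs(a - b) for a, b in zip(q_yoy, q_sy)]
    counts = Counter(diffs)
    n = len(diffs)
    percents = {k: 100.0 * counts.get(k, 0) / n for k in range(5)}
    mean_diff = sum(diffs) / n
    return percents, mean_diff

# Hypothetical estimates for ten teachers (illustration only).
yoy = [-0.42, -0.31, -0.18, -0.09, -0.02, 0.04, 0.11, 0.19, 0.27, 0.40]
sy = [-0.20, -0.35, 0.05, -0.12, -0.30, 0.10, 0.02, 0.33, 0.15, 0.22]

percents, mean_diff = quintile_difference_summary(yoy, sy)
print(percents, round(mean_diff, 2))
```

The two-school example in the note corresponds to differences of 0 and 2, whose mean is (0 + 2) / 2 = 1.00.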
7 Note that a two-quintile difference equates to an average teacher (middle quintile) being classified as very low-performing or very high-performing (quintiles 1 or 5) and a four-quintile difference equates to a very low-performing teacher (quintile 1) being classified as very high-performing (quintile 5) or vice versa.
Similar to the results for linear associations, controlling for prior achievement in the base model reduced the quintile rank differences considerably. The magnitude of the reduction can be gauged by comparing the mean quintile rank difference for the null and base models. The mean quintile differences for teachers on the reading and math gains outcomes were reduced by 31% (from 0.68 to 0.47) and 27% (from 0.92 to 0.67), respectively. The reductions for schools were highly similar. However, even after controlling for prior achievement, between 40.6% and 51.3% of the teachers and schools were still in different quintiles for YoY and SY VAA estimates, and between 4.2% and 12.2% differed by two or more quintiles, depending on the outcome. Figures 1a-d show that there is approximately the same number of positive and negative quintile misclassifications. That is, teachers and schools are approximately as likely to be underestimated based on YoY VAA quintile rankings as they are to be overestimated. Note, however, that teachers and schools in the lowest YoY quintile can only be classified as equal or higher on SY quintile rank because there is no possibility of being in a lower quintile. Similarly, teachers and schools in the highest YoY quintile can only be classified as equal or lower on SY rank. It follows that the misclassification rate is highest in the middle YoY quintiles because the teachers and schools can be classified higher or lower on SY. That is, teachers and schools in YoY quintile 3 are more likely to be misclassified than are teachers and schools in YoY quintiles 1 or 5.

Research Question 2
The purpose of this question is to determine whether summer effects on VAA estimates can be ameliorated using control covariates that are predictive of summer learning, or if biannual assessments (fall and spring) are necessary. To address this, two additional sequential models were fit, including the demographics model and context model (described above). Relevant to policy, the measures included in these models are typically available to districts and thus can be implemented in VAA.
The results (see Table 1 and Figure 2a-d) show that compared with the base model, the demographics model provides only minor improvements in terms of the strength of the linear association between YoY and SY VAA estimates and differences in quintile rank. Similarly, compared with the demographics model, the context model reduced the linear association between YoY and SY only slightly and the differences in quintile rankings are minor. Therefore, including an extensive number of demographic and contextual variables does not substantively reduce the summer effects on VAA estimates. Moreover, after controlling for these extensive sets of variables, substantial YoY-SY quintile rank differences remain. These results suggest that twice-annual assessments (fall and spring) may be necessary to remove the summer effects from VAA estimates.

Research Question 3
To address this research question, the summer part of the base model YoY VAA estimates was isolated from the school-year part. Again, the base model was used because it conforms to the new federal accountability waiver provision that forbids adjustments for demographics (US DOE, 2010). The summer part was isolated by regressing the base model YoY VAA estimates on the SY estimates and saving the model residuals. That was done for teachers and schools separately. These summer VAA effects were then regressed on two measures of student composition that may bias VAA estimates against teachers and schools serving disadvantaged populations: 1) the proportion of students in the classroom or school who receive FRL, and 2) the proportion of students who are black or Hispanic.
The base model results (see Table 2) show no biases among teachers in the same schools. This was expected because first grade teachers in the same school tend to serve highly similar students in terms of students' economic and ethnic backgrounds. However, a significant negative association was found between mean summer gains and the proportion of students at the school who qualify for FRL (reading gains = -0.29, p < 0.01; math gains = -0.18, p < 0.05). Whether proportion FRL and proportion minority are associated with summer biases was also tested for the demographics model. Recall that the demographics model controls for those and other demographic factors, so no biases were expected. The results (see Table 2) confirm that.
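The two-step procedure described above (residualize YoY estimates on SY estimates, then relate the residuals to student composition) can be sketched as follows. This is a simplified illustration: it uses one predictor at a time and unstandardized slopes, whereas the study does not state here whether the two composition measures were entered jointly or whether coefficients were standardized. All names and data values are hypothetical.

```python
def ols_simple(x, y):
    """Intercept and slope from a one-predictor least-squares fit."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def summer_component(yoy, sy):
    """Residuals from regressing YoY estimates on SY estimates."""
    intercept, slope = ols_simple(sy, yoy)
    return [y - (intercept + slope * s) for y, s in zip(yoy, sy)]

# Hypothetical school-level value-added estimates and FRL proportions.
sy = [0.10, -0.20, 0.30, 0.00, -0.10, 0.25]
yoy = [0.02, -0.15, 0.28, -0.05, -0.22, 0.31]
prop_frl = [0.80, 0.20, 0.35, 0.55, 0.90, 0.10]

# Step 1: isolate the part of YoY not predicted by SY (the "summer part").
summer = summer_component(yoy, sy)

# Step 2: regress the summer part on a composition measure.
_, frl_slope = ols_simple(prop_frl, summer)
print(f"slope of summer component on proportion FRL: {frl_slope:.3f}")
```

A negative slope in step 2 would indicate that units serving higher proportions of FRL students tend to have lower summer components, which is the direction of bias the research question asks about.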

Table 2
The association between the summer component of VAA estimates and proportion underserved students in the classroom or school.
This result is consistent with the literature on seasonal effects, which indicates that students from low SES families tend to have greater declines in reading achievement over summer, but learn at similar rates as other children during the school year (Alexander et al., 2001; Cooper et al., 1996; Heyns, 1978).

Implications to Practices for Reducing Summer Biases
Twice-annual assessments.The findings for research question 2 suggest that addressing summer effects on teacher and school VAA estimates will require twice-annual assessments (fall and spring).That is because even after employing extensive statistical controls for student background and demographics, as well as controls for classroom and school context, substantial differences in YoY and SY VAA estimates remained.
Controlling for prior achievement.A comparison of the results for the null and base models shows that controlling for prior achievement reduces the summer effect considerably.Previous research has shown that controlling for prior achievement reduced selection biases in VAA estimates (Chetty, Friedman, & Rockoff, 2014;Kane & Staiger, 2008).That conclusion appears to extend to selection biases originating from the summer.Hence, if twice-annual assessments are not conducted, controls for prior achievement seem to be the best method for minimizing summer effects.
Student assignment practices.The results suggest that once enrolled at a school, first graders are not randomly assigned to teachers.That is, students attending the same school are expected to vary in terms of summer learning rates, but if the first graders enrolled at a given school are randomly assignment to their first grade classrooms, then the mean summer learning rates among classrooms in the same school would be expected to exhibit only random variation.Yet, the results show a substantial degree of variation in summer learning rates among classrooms in the same school, suggesting that children are not randomly assigned to classrooms.This finding is not surprising, as previous research has concluded that random assignment of students to classrooms is uncommon (Authors, 2015;Burns & Mason, 1998;Kalogrides & Loeb, 2013;Paufler & Amrein-Beardsley, 2014;Praisner, 2003;Rothstein, 2010).The results for the base model suggest that students' prior achievement plays a role in student assignment because controlling for prior achievement substantially reduced summer effects on teacher VAA.However, the results for the demographics model suggest that demographics play only a very minor role.Hence, other than prior achievement, it is not clear what the precise student placement mechanisms are that contribute to summer effects in VAA teacher estimates.
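The random-assignment reasoning in the student assignment paragraph can be illustrated with a permutation check: if students were assigned at random, reshuffling them among the same classrooms should often produce between-classroom variation in mean summer gains as large as what was observed. This is not the method used in the study, which relies on HLM variance components; it is a minimal sketch with hypothetical function names and made-up data for a single school.

```python
import random

def between_class_variance(gains, classrooms):
    """Variance of classroom mean gains around the mean of those means."""
    groups = {}
    for gain, room in zip(gains, classrooms):
        groups.setdefault(room, []).append(gain)
    means = [sum(v) / len(v) for v in groups.values()]
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means) / len(means)

def permutation_p_value(gains, classrooms, n_perm=2000, seed=0):
    """Share of random reassignments with at least the observed variance."""
    rng = random.Random(seed)
    observed = between_class_variance(gains, classrooms)
    shuffled = list(gains)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_class_variance(shuffled, classrooms) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical summer gains for eight students in two classrooms,
# deliberately sorted so that classroom A holds the lowest gains.
gains = [-3.0, -2.5, -2.0, -1.5, 1.5, 2.0, 2.5, 3.0]
classrooms = ["A", "A", "A", "A", "B", "B", "B", "B"]

p = permutation_p_value(gains, classrooms)
print(f"permutation p-value: {p:.3f}")
```

A small p-value means the observed sorting is unlikely under random assignment. With real data, classroom sizes would be larger and the check would be run within each school.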
A recent study by Paufler and Amrein-Beardsley (2014) provides insight into what those student assignment mechanisms might be. The authors surveyed over 300 elementary school principals in Arizona, 98% of whom reported using student and teacher information during the placement process in an effort to match learning and teaching styles, personalities, and special needs with the objective of maximizing student outcomes. The student information that principals reported giving the strongest consideration to was prior academic achievement, prior behavioral issues and/or perceived behavioral needs, language status and/or proficiency, and prior grades. The present study controls for prior achievement and language status, but not behavioral issues and needs or grades, because good measures of those variables were not available for the ECLS data. Research is needed to examine whether student assignment practices that take into account students' behavioral issues/needs or grades contribute to the summer effects on VAA estimates.
Reducing measurement error. The high rates of quintile rank differences between YoY and SY VAA indicate that including summer adds considerable measurement error to VAA estimates, which undermines their reliability.8 Previous research on VAA has shown that teacher and, to a lesser extent, school VAA estimates are unreliable from year to year; however, those studies did not examine the degree to which including summer contributed to the unreliability (Lockwood et al., 2007; Papay, 2011). Similarly, previous research has shown that the reliability of teacher VAA estimates can be improved substantially by pooling data across multiple years (see McCaffrey, Sass, Lockwood, & Mihaly, 2009). However, it is not clear whether pooling data across multiple years will address the summer effects. Pooling data across years improves reliability of VAA estimates by accounting for year-to-year fluctuations due to measurement error and other random factors, as well as year-to-year fluctuations in true performance. Yet, if the summer effect is based on the same mechanisms across years (e.g., student assignment practices), pooling the data across years will not likely reduce it. Research is needed to determine the degree to which summer effects are consistent across years and whether twice-annual testing addresses the more general issue of year-to-year instability among VAA estimates.
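The pooling argument can be illustrated with a simple decomposition. This is a sketch under assumptions that are not part of the study's models: the symbols below are hypothetical, true performance is treated as constant across years, and the summer component is treated as fully persistent. Suppose the YoY estimate for teacher j in year t is

\[
\hat{\theta}_{jt} = \theta_j + s_j + e_{jt}, \qquad \operatorname{Var}(e_{jt}) = \sigma_e^2 ,
\]

where \theta_j is true performance, s_j is a summer component that recurs each year, and e_{jt} is independent random error. Averaging over T years gives

\[
\bar{\theta}_j = \theta_j + s_j + \frac{1}{T}\sum_{t=1}^{T} e_{jt}, \qquad \operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} e_{jt}\right) = \frac{\sigma_e^2}{T} .
\]

Under these assumptions, pooling shrinks the random error by a factor of T but leaves s_j unchanged. To the extent the summer component instead varies from year to year, pooling would reduce it as well, which is why the consistency of summer effects across years is an empirical question.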

Policy Implications
Cost-benefits of twice-annual assessments. The results of this study have important implications for educational policy regarding the inclusion of the summer period in VAA. Perhaps the most critical implication is that fully addressing summer effects will likely require twice-annual achievement testing. However, such a proposal may be met with opposition due to concerns about costs associated with additional testing and the time it would take from learning activities. Yet, the validity of the cost concern is questionable. For example, a recent Brookings Institution study found that achievement test batteries cost an average of $27 per pupil in grades 3-9, which represents a minuscule percentage of total annual per-pupil expenditures (Chingos, 2012).9 In addition, the study concluded that the already low costs of testing can be reduced by a third or more if states participate in a testing consortium such as Smarter Balanced Assessment, which distributes test development and scoring expenses across an extremely large number of students. With the onset of Common Core, most states have already joined a testing consortium. Therefore, concern about additional expenses is not a good reason for rejecting biannual testing.
A more realistic concern than the monetary expenses associated with additional testing is the time it will take from learning activities. Standardized achievement test batteries typically take 3-8 hours to administer. Furthermore, when stakes are attached to the results, time may be spent on test-specific preparation that is of questionable value to academic development. Given the results of this study, federal and state agencies should consider policies that encourage exploration of the costs and benefits of twice-annual testing. Critical to that analysis is a better understanding of how much time an additional annual test battery is expected to take from learning activities and whether that can be reduced.
High-stakes personnel decisions. The results show that when summer is included, VAA model estimates can be very different compared to when summer is not included. This raises concerns about the use of YoY VAA estimates for high-stakes personnel decisions. While an argument can be made that YoY VAA estimates still contain useful information about the performance of individual teachers or schools (e.g., see Glazerman et al., 2010), their marginal reliability suggests they should not be the sole basis for gauging performance for high-stakes decisions.
Biases against schools serving disadvantaged children. Another finding of this study with policy implications is that VAA estimates are biased against schools serving higher percentages of children who qualify for FRL. This summer effect can easily be addressed by controlling for differences among schools in the proportion of students who receive FRL. However, recent federal policy on accountability waivers forbids the use of such controls (US DOE, 2010). The results of this study challenge the fairness of that policy, suggesting that it will result in systematic bias against high-poverty schools, which can create false perceptions that such schools and the teachers and administrators working there are ineffective, when their performance is average or even above average. Biased VAA estimates and the perceptions they create can have negative consequences for staff morale and efforts to recruit and retain effective teachers and administrators.

Limitations and Future Research
A limitation of this study is that the results are based on data from first grade. The reason for that limitation is that ECLS-K only has spring-fall test scores for one year, between kindergarten and first grade. It is unclear whether the results of this study generalize to higher grade levels. However, due to age proximity and similarities in instructional methods and classroom structure in early elementary school, the results may generalize to second and third grades. Research is needed to examine summer effects on VAA at higher grade levels.
Another limitation is the size of the classroom sample.10 The average classroom sampled had 3.3 students. Having data on all students in each classroom would improve the reliability of the individual teacher VAA estimates. However, it is not clear how this affects VAA quintile misclassification rates. To examine that, a sensitivity analysis was conducted. The analysis used a subsample of teachers who had the largest number of children in their sample. The cut-point for inclusion in the sensitivity analysis was 8 or more students sampled (n = 31). A comparison of the results for this analysis (see Table 3) with the results for the full sample (Table 1) shows consistent misclassification rates. There is no evidence that sample size affects the misclassification rate in a systematic manner. The results of this sensitivity analysis are not surprising because this study does not examine the reliability of VAA estimates per se, but rather the impact of summer on VAA misclassification. These are two different issues, with the former highly affected by sample size and the latter apparently much less so.
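The quintile comparison behind this kind of sensitivity check can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the tie-breaking rule, and the ten example teachers with their values are all hypothetical, whereas the study's estimates come from multilevel models fit to ECLS-K data.

```python
from collections import Counter


def quintile_ranks(values):
    """Assign quintile ranks 1-5 by sorted position (ties keep input order)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for pos, i in enumerate(order):
        ranks[i] = pos * 5 // n + 1
    return ranks


def quintile_shift_summary(yoy, sy):
    """Return the share of units whose quintile differs between the two
    sets of estimates, and a count of units by absolute quintile shift."""
    shifts = [abs(a - b) for a, b in zip(quintile_ranks(yoy), quintile_ranks(sy))]
    rate = sum(1 for s in shifts if s > 0) / len(shifts)
    return rate, Counter(shifts)


def restrict_to_large_samples(yoy, sy, n_students, min_students=8):
    """Keep only teachers with at least `min_students` sampled students."""
    keep = [i for i, n in enumerate(n_students) if n >= min_students]
    return [yoy[i] for i in keep], [sy[i] for i in keep]


# Hypothetical estimates for ten teachers (illustrative values only).
yoy = [0.1, 0.5, -0.3, 0.9, -0.8, 0.2, -0.1, 0.7, -0.5, 0.4]
sy = [0.3, 0.4, -0.2, 0.1, -0.6, 0.8, -0.4, 0.6, -0.7, 0.0]
n_students = [3, 9, 2, 8, 4, 10, 3, 8, 2, 12]

full_rate, full_shifts = quintile_shift_summary(yoy, sy)
sub_yoy, sub_sy = restrict_to_large_samples(yoy, sy, n_students)
sub_rate, sub_shifts = quintile_shift_summary(sub_yoy, sub_sy)
print(full_rate, dict(full_shifts))
print(sub_rate, dict(sub_shifts))
```

Note that quintiles are recomputed within the restricted subsample here; whether to re-rank within the subsample or reuse full-sample quintiles is a design choice, and the text does not say which the study used.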
It is also worth noting that while a substantial number of control variables were used to test whether the summer effects were due to student demographics or classroom and school context, other types of variables may have contributed to summer effects. The control variables used in this study were selected for two reasons: 1) they are typically available to school personnel and therefore can readily be implemented in accountability models; and 2) previous research suggests they are associated with summer effects. Research is needed to examine whether other types of control variables, such as summer activities and neighborhood effects, can reduce summer effects on VAA estimates.

Summary and Conclusions
The findings of this study show that between 40% and 62% of the variance in YoY VAA estimates, depending on the outcome, originates from the summer period. This summer measurement error alters teacher and school quintile rankings considerably. For example, 51% to 61% of the teachers and 58% to 61% of the schools, depending on the outcome, change performance quintile rank when the summer period is omitted, with many teachers and schools changing 2 to 3 quintiles and a few changing 4 quintiles. Furthermore, this summer effect invariably underestimated the performance of teachers and schools in the lowest quintile of summer change and overestimated the performance of teachers and schools in the highest quintile of summer change. While controlling for prior achievement reduces the YoY-SY VAA differences, extensive statistical controls for student background, demographics, and classroom and school context did not substantially alter the summer effect. Finally, including the summer period in VAA estimates created biases against schools serving high concentrations of children who qualify for FRL, and while statistical controls can neutralize those biases, current federal policy forbids their use for accountability assessments. Together, these findings indicate that including summer in VAA substantially undermines the reliability of VAA estimates and that addressing the problem will likely require biannual (fall and spring) achievement testing.

Figures 1a-d: Base Model Quintile Misclassification by YoY Quintiles.

Figures 2a-d: School and Teacher YoY-SY Quintile Rank Differences by Model.

The multilevel equations for the context model are shown below. Note that the other models are reduced forms of the context model for which sets of covariates are omitted. The level one (student) equation is: