Policy Incentives in Canadian Large-Scale Assessment: How Policy Levers Influence Teacher Decisions about Instructional Change

Large-scale assessment (LSA) is a tool used by education authorities for several purposes, including the promotion of teacher-based instructional change. In Canada, all 10 provinces engage in large-scale testing across several grade levels and subjects, and all share the expectation that the results data will be used to improve instruction in classrooms. Yet despite agreement among ministries that instructional change based on LSA results is a positive development with data-based decision making at its heart, there remain significant differences in the kinds of incentives written into assessment policies in Canada. Implementation of the policies is also far from uniform across schools and school divisions. Using mixed methods (survey data and follow-up interviews), this study examines which policy factors have the most significant impact on teacher decisions regarding the use of data. The findings indicate that highly incentivized policies correlate well with instructional change, including aspects of both teaching (to) the curriculum and teaching to the test. Since the latter is neither an educationally defensible practice nor a stated policy goal, the statement that 'incentives work' does not fully capture the nature of these impacts.

1 Due to extenuating circumstances, and with the permission of the author and the lead editor, this article was published post-peer review and revisions, but prior to a final proof by the author or article metadata translations into Spanish and Portuguese (translations may be available upon request).

Education Policy Analysis Archives Vol. 25 No. 115

used will be examined; f) the limitations of the study will be enumerated; g) the results from qualitative and quantitative data gathering will be detailed; h) the policy implications of these findings will be discussed; and i) the paper will be summarized in the conclusion.

Literature Review
Large-scale assessment has been a fundamental tool for educational accountability for decades in some jurisdictions, and for at least 10 years across all provinces in Canada (Klinger, DeLuca & Miller, 2008). Canadian ministries have set out several specific and wide-ranging goals for these assessments (see Figure 3 below). Such large-scale policies and policy goals are intended to be explicit and clear to teachers (Huber & Skedsmo, 2016; Linn, 2003). Yet having more policy goals for such assessments makes the administration of LSAs more problematic and complex. Classroom-level changes require disaggregated census-style tests, while policy-level goals could be met with less onerous sample-style assessments much like the Pan-Canadian Assessment Program and the Programme for International Student Assessment (Hargreaves & Shirley, 2011; Morris, 2011; Volante & Ben Jafaar, 2008).

Unintended Consequences
LSA policies that promote the use of the results data must also take into consideration the instructional methods used to fulfill those expectations. Choices made about item types and content influence teacher choices and in some cases come to supersede the full and expansive curriculum (Koretz, 2009; Luna & Turner, 2001; Shepard, 2000). The issue of curriculum narrowing is well documented and is generally considered to expend too much time and other resources on a limited scope of tested material (Holcombe, Jennings & Koretz, 2013; Volante, 2004). Limiting the manner in which students are allowed to demonstrate mastery of given outcomes is not considered the best classroom practice, but it is a key feature of standardized LSAs adopted by teachers to prepare their students (Bauer, 2000; Cizek, 2000; Datnow, Park & Kennedy-Lewis, 2012).
There are other unintended consequences of LSA. For example, when social actors are aware of the metrics that have been devised to monitor their behaviour, they react in predictable ways to have the results show them in the best possible light (Abrams, 2004; Olah, Lawrence & Riggins, 2010; Webb, 2006). This is defined in this paper as 'reactivity', which is outlined in the theoretical framework section below. In short, the more importance that is applied to a given metric, the more likely it is that reactive behaviours will be apparent. The focus of this paper is incentives, which act to increase the relative import of LSAs when pressure, stakes, or sanctions are conditional upon results scores (Altrichter & Kemethofer, 2015; Dee & Wyckoff, 2013; Hamilton & Berends, 2006).

High Stakes
Concerns about how teachers will use these data are particularly acute in cases where high-stakes testing occurs. Stakes are almost always evident to the students who must take these assessments, but this paper examines the considerable policy-related and public pressure on teachers (Spillane et al., 2002; Wößmann, 2003; Young, 2006). These professionals bear much of the blame when results are below expectations, and in some cases have justified concerns about professional consequences (Amrein & Berliner, 2002; Ben Jaafar & Earl, 2008). While not all provinces have LSAs that are high-stakes (for example, only Ontario and New Brunswick have graduation-requirement tests), many teachers across Canada feel the stakes involved are high, as is the pressure applied to improve instruction. On the other hand, vocal support for high-stakes testing comes from researchers who assert LSA can ensure learning is taking place; hold schools and teachers accountable for their work; and level the playing field for teachers in any one jurisdiction (Allan, 2002; Bishop & Wößmann, 2004; Finnigan & Gross, 2007; Marsh, Farrell, & Bertrand, 2014; Wiliam, 2010).

Improving Instruction
All provincial ministries promote the use of LSA data to guide decision making. In general terms, the use of data to inform school-based or classroom decisions is known as data-driven decision making (DDDM). Making appropriate use of organized and meaningfully collected information, both qualitative and quantitative, helps to 'guide decisions' in schools (Schildkamp, Poortman & Handelzalts, 2016) and goes hand in glove with accountability policy. It is only reasonable to expect that data from LSAs are put to some constructive use. Datnow, Park and Kennedy-Lewis (2012) point out that being driven by data does not fix all in the education system, nor does simply accessing data change instructional practices for the better. Goertz, Oláh and Riggan (2009) support this thesis, stating that assessment literacy and professional development are necessary to make effective use of these data. An expansive perspective on DDDM is shown in Mandinach and Gummer (2016), who noted 53 skills and sources of knowledge required by teachers to effectively and appropriately use data. A summary perspective is given in Hamilton et al. (2009), which focused on five research-approved instructional DDDM practices.

Other Testing Models
Different models of test-based accountability policies have been proposed, employed and studied. Nichols and Harris (2016) point out that lower-stakes testing (Australia), sample-style assessments (Finland), and relying on experts via school inspections (New Zealand) are all reasonable alternatives to high-stakes, sanctions-driven policies. Altrichter and Kemethofer (2015) examined the school-inspection model in Europe and found some positive and some questionable reactivity effects. This was also true of Ehren and Shackleton's (2016) examination of the Dutch inspectorate. Even the Canadian and Australian 'low-stakes' models suffer from extreme public pressures after media reports of LSA data. Publication, the ranking of schools, and the desire of teachers to appear effective tend to increase the stakes for teachers involved in testing. Higher stakes lead to the well-documented unintended consequences of LSAs, namely political pressure for educational reform and the risk of teachers using gaming behaviours (Breakspear, 2012; Nichols & Harris, 2016; Rutkowski & Rutkowski, 2016).
In the end, this study focuses on the provincial policy incentives which drive teacher behaviours in Canada. It is clear that policy choices can affect DDDM practices, although what drives data use may be teacher attitudes, appropriate and available support, or policy incentives. An extensive amount of research and policy examination has been done with regard to individual schools, school divisions or provinces (Ben Jaafar & Earl, 2008; Campbell & Levin, 2009; Fullan, 2009; Hargreaves & Shirley, 2011; Levin, Glaze & Fullan, 2008; Scott, Webber, Aitken & Lupart, 2011; Volante & Cherubini, 2010; Wideman, 2002). The research study upon which this paper is based (Copp, 2015) was the first to examine the implications of LSA policy on a national scale. This paper is one of several, published and awaiting peer review, on different aspects of LSA policy.

Research Questions
This study asks how and why teachers change their instructional methods in reaction to LSA results, and more specifically about the policy incentives intended to promote these changes. The research questions addressed here are: 1) Are policy incentives regarding LSA correlated with instructional change in classrooms? 2) Which incentives variables are most closely correlated with the instructional use of LSA data? 3) Are classroom-based instructional changes more strongly correlated with teaching (to) the curriculum or with teaching to the test?
Instructional change made in reaction to LSA data is the dependent variable of this study. It first had to be established that teachers do indeed 'react' in some way to LSA results in order to move on to the more detailed questions about which incentives promote change and what kinds of changes are made in practice.

Theoretical Framework
Reactivity explains the noticeable changes in behaviour when people know they are being externally observed or evaluated. Whether conscious or unconscious, reactivity has a direct impact on the objectivity of the metrics used. The concept was first examined in detail by Campbell (1957) as one possible design flaw in social sciences experimental methodology. In the context of this paper, it is hypothesized that teachers are reactive to LSAs at least in part because of the incentives written into related policies. The reactivity framework was chosen to provide this practical perspective on policy incentives.
Reactivity studies since Campbell's time have looked at different fields and found the same general flaw in external assessment methods: many people are clever enough to figure out how they are being evaluated and find ways to 'game' the results. Espeland and Sauder (2007) examined US law schools and how they were reactive to ranking tables from US News & World Report. Public service performance targets can also lead to gaming behaviours, as examined by Hood (2006) in the UK. Manipulation of metrics is 'the performance paradox' explained by Van Thiel and Leeuw (2002) or the 'choreographed performances' noted by Webb (2006). Whatever term is used, such behaviours are known to create distortions of reality when the performance metrics are known.
In this study, teachers were asked to respond to questions about which kinds of instructional changes were made in their classrooms in response to LSA results. They were presented 10 instructional strategies and asked to indicate the frequency of their use from the choices 'always', 'sometimes' or 'never'. A key feature of this unique model was the differentiation of these 10 strategies into two groupings based on their educational defensibility and related alignment with stated policy goals. Figure 1 shows the grouped survey prompts.
There are five strategies that closely adhere to the professional standards set out in, for example, the Saskatchewan Teachers' Federation (2015) Code of Professional Conduct, as well as stated ministry policy goals. These are designated as teaching (to) the curriculum (TTC). TTC includes only those practices which are considered ethical and which increase the number or variety of outcomes taught to students. These approaches are less likely to result in higher scores on any specific test, but they should provide the skills to improve achievement in different situations by avoiding the potential pitfalls of teaching geared to a single instrument (Popham, 2001).
Teaching to the test (TTT) includes those educational strategies which are considered either unethical or which decrease the number or variety of outcomes taught to students. TTT methods are the most direct way to improve scores on a specific test, but even the practices in this grouping which might be ethical do not have the transferability, the increased 'leverage,' upon which TTC strategies are based (Au, 2007; Cullen & Reback, 2006; Jacob, 2002; Van Thiel & Leeuw, 2002). These strategies do not meet the terms of the Saskatchewan Teachers' Federation (STF) code of conduct, especially when used to excess or by default. They also do not align well with the ministry goals as set out in policy documents (Copp, 2016b).

The dependent variable of this study is the instructional change teachers make in reaction to LSA data; the use of either TTC or TTT strategies qualifies in this regard. The distinction made between TTC and TTT strategies was based on the judgment of the author, the terms of the STF code of conduct, and the education ministries' policy goals. These three considerations showed great congruence in the selection of the groups. There is a distinction made in the literature about types of reactivity, and it seemed only appropriate to treat TTC and TTT as substantially different reactions to LSA data (Koretz, 2009). It is important to note that it is not being argued that the use of these strategies in specific cases is necessarily unethical. Teachers suit their instructional methods to student needs and make choices about what will work best throughout the school year, and this is just as it should be. What is suggested by the author is that the use of TTT strategies by default, with regularity, or with entire class groups is less appropriate than using TTC strategies. The difference between the groups is one based on judgement and was not determined, or expected to be shown, with statistical methods. It is understood that some readers will disagree.

Canadian Context
In Canada, each province has jurisdiction over education policy, including which subject areas and grade levels are assessed using LSAs. Figure 2 shows the information gathered from provincial ministry websites (in 2015). It indicates that grades 3, 6, 9 and 12 have the most testing, but subjects, methods, and relative stakes for students and teachers differ. The policies also differ in terms of incentives for improvement. In each province, the results are made public either through an education ministry release and/or publishing in newspapers and other media.

Figure 1. Grouped survey prompts on instructional change.

Q: Think about the ways your instruction may have changed in classes which write provincial assessments as compared to those classes that do not write these tests. Choose a response for all the following statements. (Response choices for each statement were: not at all; somewhat; a great deal.)

Teaching (to) the Curriculum
- I have looked for Professional Development to improve my instructional practices.
- I have requested additional resources related to testing.
- I have worked with other teachers to make sense of the data.
- I cover a wider range of topics in the curriculum.
- I hold group study sessions or provide extra help after school.

Teaching to the Test
- I cover material I know will be on the test very well.
- I focus more on test-taking strategies like the process of elimination.
- I use the format of the test to give similar types of practice questions.
- I focus more on subjects that have provincial tests.
- I review old exam questions.

Policy-level choices made based on the results vary from staying the course, tweaking tests or test design, and monetary support for new programs, all the way to the complete overhaul of school curricula. In many provinces, the test items are a closely guarded secret so as to avoid the risk of teachers giving students more information than they need. In these provinces teachers do not even see the tests until the day of the assessment (Prince Edward Island and Alberta are examples). Other jurisdictions freely distribute the tests and even suggest running through practice tests (Ontario and Saskatchewan are examples). There has recently been a furor in the press in Ontario about the fact that education funding has increased but provincial math assessment (EQAO) scores are going down (Maharaj, 2017). The controversy and scrutiny over scores from LSA results do have a significant policy impact.

Note (Figure 2): Short forms for the provinces: Alberta (AB); British Columbia (BC); Manitoba (MB); New Brunswick (NB); Newfoundland and Labrador (NL); Nova Scotia (NS); Ontario (ON); Prince Edward Island (PEI); Quebec (QC); and Saskatchewan (SK).
The assessed subjects always include English or French and mathematics, and regularly science and social studies. Other subject assessments are administered only in British Columbia, and these will be discontinued in the 2016-2017 school year. There is great variation in the kinds of tests given to students and in what stakes are applied to teachers. Certainly there is no talk in Canada of firing teachers or principals, or of shutting schools, as a result of poor LSA results. There is pressure to improve applied in each province, although the type of test somewhat dictates the kind of pressure. Public and parent scrutiny is the source most cited by interview respondents, especially when tests are a graduation requirement or when the grades are significant factors for university admission and scholarship opportunities (more on this topic follows). Figure 3 shows the numerous policy goals set by the 10 provincial education ministries in Canada. These goals are intended to be addressed by the administration and analysis of LSAs. There is a clear tendency to include both policy-level and classroom-level goals for these assessments.

Methodology
Mixed methods are used in this study to identify both statistical correlations and also to expose explanatory details from interviews (Flick, 2006). This study employed a sequential explanatory design with a survey as the primary data collection, followed by second-phase interviews, which were analyzed in terms of the established frames from the quantitative analyses (Blaikie, 2000).

Surveys
Surveys were emailed to participating schools in all Canadian provinces and to teachers at all grade levels, Pre-K through 12. A review of the literature revealed useful field-tested questions from relevant research studies, including Skwarchuk (2004), Hamilton and Berends (2006), Brown (2004), Wayman, Cho, Jimerson and Spikes (2012), and Boyle, Lamprianou and Boyle (2005). Questions from these studies were adapted to meet the needs of this study. There were several themes from the literature examined: test design and data; teacher attitudes; supports for data use; and policy incentives. Each of these is treated separately in one paper in a series from the author (Copp, 2016a, 2016b, 2017). The complete survey is shown in Appendix 2 alongside the values assigned to different responses for analysis. All surveys were sent in the 2013-2014 school year in a cross-sectional design. The sampling unit was individual teachers, Canadian public school teachers who administer LSAs were the target population, and the probability sampling was clustered in school divisions. School divisions provided non-overlapping geographical areas of study. The selection of participants was random, yet it was subject to the reality that many school divisions chose not to take part, making this as much a voluntary sample (at the division level) as a random one.

Note (Figure 3 legend, adapted from Copp, 2016b, p. 7): X = purpose evident from ministry literature; * = not explicitly stated, but apparent from ministry literature; † = exams must be written but need not have a passing grade; ‡ = exams are mandatory when teachers are not accredited. Total number of stated/implied purposes per province: 6, 8, 6, 8, 6, 8, 9, 5, 8, 7.
While the target population included several strata, there is a lack of demographic data available about teachers at both the national and the provincial level. Statistics Canada collects national data on only two of these strata (sex and age). National sample age data are 92.6% congruent with Statistics Canada (2007) numbers, and sex data are 99.4% congruent. This level of comparability, even for a limited number of comparators, makes it more likely that these results could be generalized to the wider Canadian teaching population (McMillan & Schumacher, 2010). Single-province samples were too small for anything more than summary analyses, but the larger 'n' of the nation-wide data meant that all strata were well represented.

Table 1
Response rates for the teacher survey

All teachers in a given school had the opportunity to respond (n=1071), but the data and analysis here come from the subset of teachers (n=453) who administer LSAs in their classrooms. The minimum number of respondents in this group from each province was set at 30. The overall response rate was only moderate (Nardi, 2006). It is uncertain, though, whether simple nonresponse itself casts doubt on the results of any reasonably responded-to instrument, as nonresponse bias is a function of both the rate of response and the difference between the responses of respondents and nonrespondents (Couper, 2000; Olsen, 2006). Response rates for the survey are given in Table 1.
The dependent variable from the larger study was teacher reactivity, of both TTT and TTC types. Other grouped variables examined were tests and test design; teacher attitudes about testing; supports to use data; and incentives to use data. Rankings were derived from the survey reactivity questions; given the low numbers of respondents in some provinces, these rankings should be considered cautiously. Table 3 shows the results from regressions using these grouped variables. Background information about the respondent teachers was also collected and analyzed. The only data closely examined in this paper are those related to reactivity and incentives (the full survey is in Appendix 2).

The quantitative analyses are built around a conception of the dependent variable (Y) in this study (Diez, Barr & Cetinkaya-Rundel, 2016): Yi represents the dependent variable, the reactivity level (use of test data) for teacher i; X1i is the first independent variable, X2i the second, and so on, these being explanatory variables for teacher i (all of the several incentives variables are named in Table 3); the intercept β0 is the expected value of Y when all Xs equal 0; β1 is the regression coefficient of X1, β2 the coefficient of X2, and so on up to βk; and ei is the residual of the regression.

The survey data were operationalized by assigning values to responses. Any practice or policy that seemed to promote, or make easier, teachers' use of LSA data was assigned a positive value. A practice or policy that was thought to make using these data more difficult was assigned a negative value (shown in Appendix 2). For reactivity variables, a similar process was used. Three choices were presented to respondents about the frequency of using specific strategies as a result of LSA data. Responses of 'always' were scored ±1, 'sometimes' was scored ±0.5, and 'never' was scored as 0 (TTT responses received negative values). These values are shown alongside the survey in Appendix 2.
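In equation form, the regression model just described is:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + e_i$$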
Each of the four lines of inquiry consisted of several survey questions which were aggregated for the preliminary analysis seen in Table 3. Each of the four aggregated scales was given equal weight in this regression in order to see which lines of inquiry had the most statistical impact on respondents' use of data.
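As a rough illustration of the scoring rule described above, the aggregation of one respondent's reactivity items might be sketched as follows (the item names and responses here are invented for illustration, not drawn from the actual survey):

```python
# Sketch of the reactivity scoring rule: 'always' = ±1, 'sometimes' = ±0.5,
# 'never' = 0, with teaching-to-the-test (TTT) items scored negatively and
# teaching-(to)-the-curriculum (TTC) items scored positively.
# Item names and responses below are hypothetical.

SCORE = {"always": 1.0, "sometimes": 0.5, "never": 0.0}

def reactivity_score(responses, ttt_items):
    """Aggregate one teacher's responses into a single reactivity score."""
    total = 0.0
    for item, answer in responses.items():
        value = SCORE[answer]
        if item in ttt_items:
            value = -value  # TTT items receive negative values
        total += value
    return total

# Hypothetical respondent: two TTC items, two TTT items
responses = {
    "sought_pd": "always",           # TTC
    "wider_topics": "sometimes",     # TTC
    "old_exam_questions": "always",  # TTT
    "test_format_practice": "never", # TTT
}
score = reactivity_score(
    responses, ttt_items={"old_exam_questions", "test_format_practice"})
print(score)  # 1 + 0.5 - 1 - 0 = 0.5
```

A respondent who leans on TTC strategies thus ends up with a positive score, while one who leans on TTT strategies ends up with a negative one.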
Within each of the groupings, the questions asked were analyzed to see which of the independent variables had the greatest statistical impact on the use of LSA data to improve instructional practices (Table 4). Incentives variables are the main focus of this paper. Five questions were asked about incentives: 1) which jurisdiction (the school, school division, provincial ministry, or none) had expectations that the data be used to improve instruction; 2) which jurisdiction (the school, school division, provincial ministry, or none) followed up on how LSA data were used in the classroom to improve instruction; 3) perceived pressure from LSAs, rated between the choices 'none', 'low' and 'high'; 4) perceived stakes for the LSAs, rated on the same scale; and 5) awareness of results, respondents indicating whether their class, school or school division results were higher than average, lower than average, or average (the other options were 'I don't recall' and 'the results were not provided'). Spearman's rank order correlation tests were done for these items (and items for all other lines of inquiry) to verify the validity of these numerical scales. Significant correlations between the independent variables indicated that the scales aligned, that many of the variables were relevant, and that values had been ascribed properly. The Spearman's correlations proved to be both significant and positive, but no distortion from multicollinearity was noted (Appendix, Figure A1).
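For readers unfamiliar with the rank-order test used here, Spearman's correlation is simply the Pearson correlation of the rank vectors of the two variables. A minimal pure-Python sketch (the published analyses would have used a statistics package, and the data below are illustrative only):

```python
# Minimal Spearman rank-order correlation: rank both variables
# (averaging ranks for ties), then take the Pearson correlation of the ranks.

def _ranks(xs):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Perfectly monotone illustrative data: rho = 1.0
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

A rho near +1 for two incentive items indicates their ordinal scales align, which is the check described above.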
Cronbach's alpha was calculated to verify the internal consistency of the scale variables in this table and in the analyses that follow. The alpha is often used as an index of reliability to check that items appear to measure the underlying construct (Jimerson, 2016; Santos, 1999). All reactivity scale items had alpha scores of 0.70 or more, and the scale alpha was 0.742, indicating adequate congruence of the items in the aggregated score. For the incentives scale, the alpha was 0.731, while the lowest item score was 0.649 (Appendix, Figures A2 and A3). Figures for these and other statistical analyses are found in Appendix 1. Alpha values are subject to interpretation, which is best left to the reader, noting that both scales have good consistency across items (Tavakol & Dennick, 2011).
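Cronbach's alpha is computed from the item variances and the variance of the total score: alpha = k/(k-1) * (1 - sum(item variances)/variance(total)). A minimal sketch with invented data (the alphas of 0.742 and 0.731 reported above came from the actual survey items, not from these numbers):

```python
# Cronbach's alpha for a set of scale items.
# items: list of k lists, each holding one item's scores across respondents.
# Data below are illustrative only.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

def cronbach_alpha(items):
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent totals
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))

# Three hypothetical items answered by four respondents
items = [
    [1.0, 0.5, 0.5, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.0],
]
print(round(cronbach_alpha(items), 3))  # 0.865
```

When items move together across respondents, the variance of the total outstrips the sum of the item variances and alpha approaches 1; an alpha above roughly 0.7 is the conventional threshold the paper applies.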

Interviews
The interview data collection used a semi-structured format to follow up on the themes from the survey instrument. This was in order to triangulate the qualitative and quantitative data (Hamilton et al., 2009; Jick, 1979). While interviews were done to explore the lines of inquiry in more depth, contradictory data were also unearthed. Knowing there is a wide range of opinion on the topic of LSA, the data sets are complementary in the sense that only through the exploration of qualitative data can one hope to fully explain relationships found using quantitative methods (Onwuegbuzie & Leech, 2005).
Subjects for interviews were purposively selected to represent the instructional strategies self-reported by teachers (Flick, 2006). Selection was based on stratified sampling at the high and low ends of the reactivity range (those teachers with high scores for TTC and also those with high scores for TTT were contacted). The sample was multilevel since it also included in-school and division-level administrators (McMillan & Schumacher, 2010). Only classroom teachers had completed the survey, but these in-school and division-level staff were interviewed in order to compare their responses to those of teachers.
The interview sample included teachers and administrators from across Canada and was made up of 13 classroom teachers, 10 in-school administrators, and four division-level staff. Respondents from such a small sample could not be expected to be representative of the survey population or to meet an external validity standard. They were chosen purposively to provide extra insight into the quantitative results (Flick, 2006). Interview data were not collected for generalization but rather to elaborate on specific topics.

Limitations
The data set from the survey was extensive but lacked some variability in locations where many schools and/or school divisions chose not to participate. The data were collected under strict confidentiality, which meant the exclusion of school identifiers that might have made possible analysis using hierarchical linear modeling (HLM) methods. HLM methods would have been well suited to these data had identifiers been collected, and as a result these findings may be seen as less precise. The response rate for the survey was reasonable but not high enough to dismiss the concern of nonresponse bias. While survey items were drawn from peer-reviewed and influential education studies, this is no guarantee of construct validity. The interviews provided valuable insights into both policy and instructional choices, but were drawn from a small sample, and thus their value is limited to the insight they provide in conjunction with the survey data analysis.

Findings

Four Lines of Inquiry
The independent variables used in the survey were identified initially from the literature on large-scale assessment and included four main lines of inquiry operationalized for quantitative analysis purposes: (1) test data and design; (2) supports provided for teachers; (3) incentives for teachers to use these data; and (4) teachers' attitudes regarding the large-scale assessment. The first category asked about the types of items on the LSA and how the data were returned to teachers. The supports category asked what type and number of supports were made available to teachers. The third grouped variable consisted of questions about incentives, which are seen below in Table 4. The attitudes variable asked teachers' opinions on how they thought the data might best be used. The complete survey is found in Appendix 2 and makes clear the categorized variable groupings.
Table 3 presents the lines of inquiry with beta (standardized) coefficients in order to show the relative strength of scale variables in an unbiased comparison. Coefficients are from multivariate regressions for the nationwide dataset, which allows one to see which scale variables had the most significant effects on different types of reactivity. For detailed analysis of lines of inquiry other than incentives, see Copp (2016a, 2016b, 2017).
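A beta (standardized) coefficient simply rescales a raw regression coefficient by the ratio of the predictor's standard deviation to the outcome's, which is what makes coefficients from differently scaled survey items comparable. A minimal sketch with hypothetical numbers, not values from the study:

```python
# Standardized (beta) coefficient: beta = b_raw * sd(x) / sd(y).
# Rescaling puts predictors measured on different scales onto a
# common footing so their relative strength can be compared.

def stdev(xs):
    m = sum(xs) / len(xs)
    return (sum((v - m) ** 2 for v in xs) / (len(xs) - 1)) ** 0.5  # sample sd

def standardized_beta(b_raw, x, y):
    return b_raw * stdev(x) / stdev(y)

x = [0, 1, 2, 3, 4]             # e.g. a hypothetical incentives scale
y = [0.0, 0.5, 1.0, 1.5, 2.0]   # e.g. a hypothetical reactivity score
b_raw = 0.5                     # hypothetical raw slope from a regression
print(standardized_beta(b_raw, x, y))  # 0.5 * sd(x)/sd(y) = 1.0
```

The resulting value is interpreted as the number of standard deviations the outcome moves per standard-deviation change in the predictor.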
The variables with the greatest correlation with TTC, based on beta scores, are the attitudes variables (Copp, 2016a). They have no significant impact on TTT effects. Similarly, supports variables have a significant correlation with TTC but none with TTT (Copp, 2016c). Tests and data variables have no significant impacts on either TTC or TTT. Most relevant to this paper, though, are incentives variables, which have a uniquely significant impact on both TTC and TTT. With three significant scale variables in play, the amount of variance in TTC explained by these lines of inquiry is a relatively high 20%. With only one significant scale variable working on TTT effects, the adjusted R2 value falls to only 3%. It can be seen that incentives variables are significant in their correlation with LSA-based instructional change.

There were five independent variables under the grouping 'incentives' examined in this study. The most significant of these (as seen in Table 4) was 'perceived pressure.' It shows highly significant correlations with both TTC and TTT effects before and after the addition of provincial dummy variables (these being included to account somewhat for variations between testing programs in different jurisdictions). Perceived pressure was also the only variable that proved significant at all on the TTT side of the table after provincial dummies were included in the regression model.

Table 3
Four lines of inquiry from the survey correlated with instructional changes (TTC and TTT). Adapted from Copp, 2016a, p. 12.

The other variables that indicated significant results were 'follow up data use' and 'perceived stakes,' these having significant correlations with only TTC effects. Each of these variables will be considered in turn alongside qualitative (interview) data used to either support or challenge the statistical correlations shown here.
Perceived pressure. The perceived pressure variable was highly significant in terms of TTC and TTT effects on teachers. There is evidence here to support the idea that incentives used to increase the pressure on teachers to use LSA data are quite effective at this task. It is less encouraging that perceived pressure does not significantly increase the chances that teachers are using the data in the manner expected by provincial education ministries or the STF Code of Professional Conduct (2015). Incentives that increase pressure on teachers increase both TTT and TTC effects, and so make the educational effectiveness of the resulting instructional changes less certain (Appendix, figure A8).
Interview respondents frequently commented on the high levels of pressure teachers felt when teaching classes that administered LSAs.
As an administrator I try to assure them [teachers] that this is part of your teaching, this is all doing the things you do normally. I still think there is a stress there, that their kids, if their kids haven't met the standards there is obviously a stress there that they are not doing their job. (MB, Elementary school principal, female)

We talk about the journey being a 10 year journey and so if you're missing some essential foundational learnings, and umm, then you have a Math assessment at the end of grade 9 and you haven't really achieved, I think our grade 9 teacher feels like, you know, a lot of pressure for her… We keep trying to remind her that they didn't 'not learn it' over one year, they 'not learned it' over a number of years. (PEI, K-9 school principal, female)

And that's the unfortunate side-effect of a test culture, right? Like if you put all of the worth on the test, then of course your energies are going to be focused on the test, which is not where they should be focused. (AB, High school Math teacher, female)

The source of the pressure was sometimes attributed to the public demand for accountability, made explicit through the publication of assessment results, and to implicit expectations from school or senior administration to perform well. More commonly, though, respondents spoke about professional responsibility as a source of pressure.
A teacher is a teacher because of their intrinsic drive. I don't think they need any external forces pushing them to be better. (NB, Middle years homeroom teacher, female)

You know I feel under a lot of pressure, but you know, it's self-induced, really, to make sure they do well, and the students are under that same pressure… You see across the hallway there are students over there that do courses that don't have provincial exams and they seem to really be enjoying their courses… But in my room, you know, I don't want anyone knocking on the door from 9:00 until 10:00 - we're doing problems and nobody moves. (NL, High school Science teacher, male)

It was difficult in most cases to fault the teachers for their conduct, knowing that the accountability systems in which they acted were not of their creation and were not designed according to their wishes. This reality was especially clear in the minds of division-level administrators.
It makes up such a big part of kids' marks that in lots of ways it determines the kids' marks, and those marks are, for most kids, only relevant when it comes to applying for post-secondary and applying for scholarships. The OSSLT is distinguished from the other 3 tests, namely grade 3, grade 6, and grade 9, because it is a graduation requirement. So, you know, it is not ethical as an educator, no matter where you stood on the fence with this, to not do your absolute best to allow students to be successful. (ON, Division staff, male)

It was clear from interviews (which also proved supportive of the quantitative findings) that pressure was the single most important driver of teachers' response to LSA testing and the release of results data. Whether internally or externally imposed, pressure caused significant stress to teachers and in some instances affected their choices about which grades or subjects they wanted to teach.
Follow up on data use. Following up on data use is an important aspect of policies being faithfully implemented (i.e., as policy makers had intended). The results from Table 2 appear to bear out this conclusion, yet the correlation is only weakly significant. Teachers had vastly differing experiences with this kind of oversight, some saying it was not evident, while others indicated that it was both prominent and important. The follow up that was noted by respondents came from different levels of administration, but school-level oversight was most frequently cited (Appendix, figures A5 and A6).
It would vary widely depending on the administrative team at a given school. I have worked with principals for whom diploma results [LSA scores] would be the 'be-all and end-all' and for whom every kid would be exempted from diploma exams if that were an option… it is highly inconsistent. (AB, High school Math teacher, female)

The current administration is content with the fact that we are talking with each other and we talk to them and we discuss the concerns that we have. Everything tends to be a proactive approach. I think that's one of the values I share with my current administration. (AB, High school Science teacher, male)

No [division-level follow up], none. None whatsoever. We choose to do [common assessments] because that's the only way we can grow as a school, to monitor the progress of our students. And it is important for us to have that data… to try to become a more data-driven school and show how, ahh, how we can improve as a result of using the data from previous years. (NB, High school principal, male)

For other schools, it was noted that the leadership to follow up on data use was not modeled from the division level or from the school administration. Some respondents painted a picture of schools and teachers sailing in a ship without a rudder in terms of data use.
The division seems to be scattered and all over the place… We don't see anyone from the board… And, you can see it kind of falling away now - they don't ask us for reports anymore, 'Yeah, go ahead and do that…' (NL, High school Science teacher, male)

Well, I know as an administrator you're certainly in and checking and following along with what [teachers] should be doing, but I don't know there would be necessarily a prescription for a specific teacher to follow this type of direction. And I would not be made aware of it from a board or department level that, 'Mr. Smith needs to improve on this portion of his lesson and it is important for you… to go in and watch and see that this is happening.' (PEI, High school Math teacher, male)

I don't think that expectation was ever given to us on what to do with them. It was sort of like, I felt like, oh, you have to give it to see where students are, and then there was no follow up from anyone. It was just, this is what it is… and that was it. (SK, Middle years homeroom teacher, female)

Considering the wide range of responses regarding following up on data use, it is no surprise that the correlation is only weakly significant. A stronger correlation to reactivity effects would depend both on more teachers indicating that some kind of follow up on their data use practices was done and on policy documents making it very clear to school and senior administration that they have this particular role to play to ensure common, faithful implementation.
Perceived stakes. High stakes testing is quite different in the United States as compared to Canadian LSAs. The stakes in Canada are directly applied only to students, and indirectly to teachers and schools. The evidence on the efficacy of stakes as an incentive is inconclusive, but stakes have been more closely tied to TTT effects than the TTC variety.
Many teachers made note of the lack of professional stakes applied to teachers based on their students' LSA scores (Appendix, figure A9).
I can't recall anyone having paid a professional price for bad results. So, yes, it is high stakes, teachers are apprehensive, but I think that it has been enough of a routine that teachers manage that. (AB, High school Science teacher, male)

As an administrator the conversation about, 'you need to improve those scores or else', kind of thing, has never happened. Yeah, we don't go that road. (MB, Elementary school principal, female)

I see the newer teachers worried like, 'Am I going to get to teach this class again?' Because there is still a perceived hierarchy in the courses you get to teach. (AB, High school Math teacher, female)

Many respondents also perceived stakes applied from higher jurisdictional levels that can, for good or for ill, have an effect on their professional development and/or their teaching practices.

Yeah, just from knowing [the provincial exam] is coming up, you know, even weighing my decision about whether or not I wanted to loop with my kids or not. I wanted to loop but I knew that [the exam] was at the end of the year, I was going to have to face that exam, those exams. (QC, Middle years homeroom teacher, female)

What happens in literacy is entirely, well, not entirely, but the teacher has the plasticity around what happens in that time frame. The same way in their math block… It is less about robbing Peter to pay Paul and more about looking at much broader strokes of how we approach literacy and numeracy. (ON, Division staff, male)

Whereas the math [assessment], yes, absolutely. I know 100% that it is used in our board to compare teachers to teachers. (ON, High school English consultant, female)

Some concern was raised about how the data can help create perceptions that are not entirely accurate. A snapshot was not seen as a very good metric for measuring teacher effectiveness.

I have to be honest, I only look at our own school, you know. And generally I think we are getting better, or the marks are getting better. Whether that means [teaching is] getting better, I don't know… Sometimes I think it is just the group of kids you have. (PEI, Elementary homeroom teacher, female)

When you are comparing from year to year and using that as data for improvement it is kind of like trying to judge the population trends of height, for example, by watching people pass by a window. A tall person passes, then a medium person passes, then a short person passes. If you use that process to do your data gathering, then you are going to conclude that people are getting shorter. (QC, High school English teacher, male)

Since stakes for teachers are applied at much lower levels than they are in some other high stakes environments (many interview respondents referenced the US education system), this proved a less significant variable than the pressure that is more clearly apparent from students, parents, and the public at large.
Other variables. Some of the variables examined did not prove significant despite their prominence in the LSA literature. Being aware of the results of LSAs is clearly the entry point to using the data (Datnow, Park & Kennedy-Lewis, 2012; Wideman, 2002), but this variable did not prove significant for either instructional change strategy, barring the weak correlation to TTT before the addition of provincial dummies. Teachers reported that their levels of awareness were quite dependent on the attitudes of school and divisional leadership (Appendix, figure A7).
Our administration is very supportive and do not use the data for anything within our school, but provincially the data is used to rank schools in the whole province. (BC, Elementary homeroom teacher, female)

Now... all of our resources are supposed to be allocated and all of our teaching is supposed to be data-driven. Right? So we are all about collecting the data and we've got testing up the wazoo and stuff. So there is on some level, there is an expectation that we are trying to improve our results every year. (PEI, Elementary homeroom teacher, female)

Another aspect of implementation literature examined in this study was the effectiveness of making expectations explicit and clear. Even when expectations were relayed to teachers, no correlation with reactivity effects appears in the quantitative data. The qualitative data indicate that teachers have very different experiences regarding this variable (Appendix, figure A4).
It is district-driven. We have a school improvement plan that we have to do every year and basically it is instructional-based… So we do that at the beginning of the year. When we go in in August we're gonna talk about what are we looking at that we really want to focus our instruction on. So using the data, the district requires that we fill out an SIP. (NB, Middle years homeroom teacher, female)

I think unless there is involvement of the administration in the, you know, in a literacy culture, in a standardized test culture in the school, all classroom teachers can close their doors and do whatever they want. (ON, High school English consultant, female)

It appears that expectations, much like all the other variables in this section, are relayed in very different ways to different teachers and in some cases are not transmitted at all. Since policy expectations across Canada differ, it may be considered unfair to judge the practical effects of implementation for such different LSA programs. Yet the policy goals of LSA are strikingly similar across all 10 provinces, and the improvement of instruction (referred to in this paper as teaching [to] the curriculum) is the ultimate goal of all provincial education ministries. In sum, these responses show that the uneven and unclear implementation practices reported by interview subjects align well with (triangulate) the survey results to illuminate practices that: a) do not always align with stated policy goals; b) have unintended consequences for instructional practices; and c) divert the attention, financial resources, and time of educational professionals and leaders.

Discussion
LSAs are used in all 10 Canadian provinces as a means both to improve the academic results of students and to improve the quality of teaching in public schools. Unfortunately, LSAs have not proven to be a uniformly effective means of promoting positive instructional changes. This study examined the use of LSAs and the incentives built into assessment policies in the context of both the policy goals stated by provincial ministries and the instructional methods held up as effective in the STF Code of Professional Conduct (2015). It has been seen that policy-based incentives do correlate with more reactivity, but it comes predominantly in the form of teaching to the test, which is in line with neither ministry goals nor the principles of the STF Code. This being established, there are some lessons for policy makers to draw.

'Incentives Work'
Incentives variables do prove to be quite effective at inspiring instructional change. Of the scale variables examined in the larger study upon which this paper is based, the incentives scale alone showed highly significant results in terms of teaching (to) the curriculum effects as well as teaching to the test effects. Incentives are clearly very effective at increasing the level of reactivity in teachers. While TTC is the stated goal of both the provincial ministries and the STF Code, TTT is not, yet it has an equally significant relationship with policy incentives. Recent research bears out the fact that TTT is a common and less desirable consequence of the single-minded pursuit of higher test scores (Hill, Mellon, Laker & Goddard, 2016).
So while incentives are effective at promoting instructional change, the changes which result from policy incentives do not necessarily help ministries achieve instructional improvement goals. Thus the use of policy-based incentives is at best a questionable method of trying to achieve either better teaching or better learning.

High Stakes
The correlation found in the literature between high stakes testing and high pressure testing environments is an important one to consider in this study (Finnigan & Gross, 2007; Luna & Turner, 2001; Madaus, 1988). Of the 10 provinces in Canada, seven currently employ some form of high stakes exam, whether it be a minimum competency exam or one that is factored into grades for summative purposes. Since graduation, university admissions, and scholarships all depend on these results, teachers feel great pressure from students, parents and administrators to produce solid results. There is an apparent correlation between giving these high stakes, high-pressure exams and greater reported reactivity effects. Table 3 lists the provinces in terms of reactivity effects.
The three provinces which do not give high stakes high school exams are Nova Scotia, Prince Edward Island, and Saskatchewan. These provinces appear near the bottom of the TTT scale and the middle of the TTC scale. The same group also appears near the top of the net reactivity scale, which (by adding the positive values of TTC scores to the negative values applied to TTT) gauges the balance between the self-reported uses of these strategies. If LSA policies (including incentives) were acting as proposed, the net reactivity scale should be heavily tilted into the positive range. We can see that there is less reactivity generally in these three provinces, but that they have the highest degree of balance between TTC and TTT effects (a perfect balance would produce a net reactivity score of 0).
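The net reactivity scale described above is a simple signed sum: TTC scores count positively and TTT scores negatively. A minimal sketch, using entirely hypothetical province scores (the function name and numbers are illustrative, not values from the study):

```python
def net_reactivity(ttc, ttt):
    """TTC counts as positive and TTT as negative, so a perfect
    balance between the two strategies yields a net score of 0."""
    return ttc - ttt

# Hypothetical examples: a high-stakes province reporting more TTT
# lands in negative territory; a province with balanced strategy
# use lands near zero.
high_stakes_province = net_reactivity(ttc=3.1, ttt=3.4)   # negative
balanced_province = net_reactivity(ttc=2.6, ttt=2.6)      # zero
```

The design point is that a large positive net score, not merely high overall reactivity, is what would indicate policies working as proposed.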
In terms of policy, the use of high stakes or minimum competency exams is evident in the majority of Canadian educational jurisdictions. Seeing that all the provinces that employ such assessment tools also score lower in net reactivity than the three provinces that do not use them should give ministry officials reason to reconsider their use. If the use of high stakes exams is correlated to TTT effects, it is of no help in reaching stated goals for improved teaching.

Pressure
The most significant of the variables within the incentives scale is perceived pressure. Teachers reported feeling high levels of pressure despite the Canadian context of LSA showing notably less professional consequence for educators as compared to US policy models (Finnigan & Gross, 2007; Koretz, 2009). It seems obvious that the policy-based administration of pressure needs to be considered in terms of the effects it has on instructional practices. Where reactivity effects contradict fundamental policy goals, both expectations and incentives should be re-evaluated.
It is difficult for policy makers to control a fuzzy variable dealing with 'perceptions,' but it should be emphasized that whether pressure is applied inadvertently, explicitly, or by educational stakeholders (such as the community, students or parents), it is strongly correlated with the use of TTT strategies. This 'test anxiety for teachers' is not conducive to meeting ministry policy targets.
It may not seem particularly constructive to point out only things that policy makers and education ministry officials should not do. But in fields of endeavour where our actions can cause harm, we should keep in mind that medicine's ancient Hippocratic Oath begins with an admonition to 'do no harm.' This is what we would now call the precautionary principle: the practice of exercising caution when understanding is incomplete. When incentives are examined as a policy tool, limiting harm must be considered a primary goal, since there is so very little to say about the positive influence that policy incentives have on pedagogical practice.

Conclusion
The inclusion of incentives in provincial assessment policies across Canada has been shown to be an effective means of promoting the use of data in classrooms: teachers are significantly reactive to incentives. If this study did not differentiate between TTT and TTC reactivity effects, this would be seen as further proof that incentives are a suitable and appropriate lever. The main driver of this reactivity appears to be perceived pressure, coming from test preparation, the high stress and sometimes secretive methods of test administration, and the public release of results. Explicit policy goals and guidelines for teacher conduct do not advocate teaching to the test as a preferred or effective instructional strategy, yet LSA policies are as likely to have this effect as they are to lead to teaching (to) the curriculum. An educational experiment with such unpredictable outcomes can hardly be called a success. Only when it can be said with some certainty what effect the pressure from incentives will have on teachers can policy makers claim to have achieved some measure of successful implementation.
Based on these results, the obvious conclusion is that for too long the incentives model, itself inspired by capitalist economic models, has been used to guide policy in the public sphere. This is despite growing evidence that even in business incentives have unintended consequences (see, for example, Ariely, Gneezy, Loewenstein & Mazar, 2009). It seems equally apparent that policies built on the presumption of 'improvement' should be tested against objective criteria that accurately gauge their practical effects. This program evaluation study has asked that basic question: whether assessment policy is actually meeting the outcomes originally put forward in ministry literature.
The results of this research go some way to showing that incentive-based policies, especially those rooted in high-pressure testing, do not result in better instructional practices, since TTC improvements are eclipsed by changes to TTT methods (as in Table 2). Whether or not increasing LSA scores indicates improved teaching (an important question not explored here), high quality instruction is the stated and laudable goal of all education ministries in Canada. In order to achieve it, more objective examinations of the practical results of LSA must be conducted and examined with a critical eye. This paper is drawn from a larger research study, and the author has several papers built on this very premise, all of which examine the practical effects of LSAs on teaching in Canadian public schools using the reactivity model. It is hoped that these papers and the ongoing research on assessment will help policy makers avoid the possible unintended consequences of incentives-based programs and both devise and implement LSAs that support and promote the more effective instructional strategies which help teachers teach (to) the curriculum.

Figure 1: Reactivity prompts from the data gathering survey.

Figure 2: Subjects and grade levels assessed in Canadian provinces.

Legend for Figure 2:
1 - Graduation requirement (must be passed)
2 - Graduation requirement only when teacher not accredited
3 - Mark on exam assigned a designated value of final grade
4 - Re-write of graduation requirement exam
5 - Piloting in the 2013-2014 school year
6 - Student must write both EF and SS
7 - Suspended since the 2012-2013 school year
EF - Core English or French
M - Mathematics
S - Science
SS - Social Studies
O - Other

Figure A3: Cronbach's alpha for incentive scale items. The higher n for expectation and follow up data is a result of the inclusion of all surveyed teachers in these item responses, whether or not they gave LSAs in their classrooms. Comparisons of the two groups are found below in figures A9 and A11.
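The scale reliability statistic reported in Figure A3 can be illustrated with a short sketch. This is a generic textbook implementation of Cronbach's alpha, not the author's analysis code, and the toy data in the usage note are hypothetical Likert-type responses.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) response matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)
```

Items that move together (respondents high on one tend to be high on the others) push alpha toward 1, which is what justifies combining survey items into a single incentives scale; unrelated items push it toward 0.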

Figure A4: Provincial / national data on the perceived expectation to use LSA data. The national average is relatively in line with most provincial numbers, but the low n for provincial survey data makes detailed analysis problematic.

Figure A5: Respondents rated following up on instructional change. National averages align well with most provincial responses.

Figure A6: Breaking down expectations and follow up on data use by jurisdiction. National data are shown here to illuminate from which jurisdictional level responding teachers reported expectations and follow up.

Figure A7: Averaging results-awareness scores across class, school and divisional levels. Self-reported awareness is shown here; awareness was 'not apparent' for respondents who did not know of or receive LSA data.

Figure A8: Teachers reported how much pressure they feel in relation to LSA testing.

Figure A9: Teachers rated the level of stakes they think are applied to teachers by LSAs.
And then you are looking at this kid gets in because they have an 86 and this kid doesn't because they have an 85. (AB, High school Math teacher, female)

I don't think it is appropriate for a teacher to get old FSA exams and teach to that… Whereas when it starts counting, if you will, towards the kids' marks and their future and you know that this is a reality that the kids are facing I would say that it is appropriate, not necessarily the best educational thing ever, but it is appropriate because teachers are supposed to help kids. (BC, Division staff, male)

And I think [LSAs] do provide unnecessary pressure on teachers and I think they are misrepresented definitely in the media. And so, of course, people who don't understand and they just hear a test score or they just see, you know, Nova Scotia's results in literacy and math are low, and therefore our school system is poor and our teachers are not teaching effectively. Well, that is just not true. (NS, Division staff, female)

c. Selects students for education / employment opportunities

35. Provincial testing:
a. Identifies student strengths and weaknesses (School Improvement)
b. Helps students improve their learning
c. Is integrated with teaching practice
d. Allows different students to get different instruction
e. Changes the way teachers teach

36. Provincial testing:
a. Interferes with appropriate teaching (Negative test attitudes)
b. Data only get used when stakes are high
c. Has little impact on teaching practices
d. Results are filed and ignored
e. Is an imprecise process

37.
Appropriate uses: Assign or re-assign students to classes; Identify learning needs of students who are struggling; Discuss student progress or instructional strategies with other educators; Form small groups of students for targeted instruction; Discuss data with a parent; Discuss data with a student; Choose which parents to contact; Meet a specialist about data, e.g. instructional coach (all 'Appropriate' responses [1]; all 'Not appropriate' responses [-1])