Interrogating the Generalizability of Portfolio Assessments of
Beginning Teachers: A Qualitative Study
Pamela A. Moss
LeeAnn M. Sutherland
Laura Haniford
Renee Miller
David Johnson
University of Michigan
Pamela K. Geist
Denver, Colorado
Stephen M. Koziol, Jr.
University of Maryland
Jon R. Star
Michigan State University
Raymond L. Pecheone
Stanford University
Citation: Moss, P.A., Sutherland, L.M., Haniford, L., Miller, R.,
Johnson, D., Geist, P.K., Koziol, S.M., Star, J.R., Pecheone, R.L.
(2004, July 20). Interrogating the generalizability of portfolio
assessments of beginning teachers: A qualitative study. Education
Policy Analysis Archives, 12(32). Retrieved [Date] from
http://epaa.asu.edu/epaa/v12n32/.
Abstract
This qualitative study is intended to illuminate factors that
affect the generalizability of portfolio assessments of beginning
teachers. By generalizability, we refer here to the extent to
which the portfolio assessment supports generalizations from the
particular evidence reflected in the portfolio to the conception
of competent teaching reflected in the standards on which the
assessment is based. Or, more practically, “The key question
is, ‘How likely is it that this finding would be reversed or
substantially altered if a second, independent assessment of the
same kind were made?’” (Cronbach, Linn, Brennan, and
Haertel, 1997, p. 1). In addressing this question, we draw on two
kinds of evidence that are rarely available: comparisons of two
different portfolios completed by the same teacher in the same
year and comparisons between a portfolio and a multi-day case
study (observation and interview completed shortly after portfolio
submission) intended to parallel the evidence called for in the
portfolio assessment. Our formative goal is to illuminate issues
that assessment developers and users can take into account in
designing assessment systems and appropriately limiting score
interpretations. (Note 1)
Introduction
A growing number of states are using some form of standardized
assessment to assist in licensure decisions about beginning
teachers. Among the 42 states requiring such tests in 2000, the
most widely used were paper-and-pencil tests assessing varied
combinations of basic skills, content knowledge, or pedagogical
knowledge (NRC, 2001b). The National Research Council's "Committee
on Assessment and Teacher Quality" concluded that "paper and
pencil tests provide only some of the information needed to
evaluate the competencies of teacher candidates" (NRC, 2001b, p.
69). The committee called for additional research into the
development of licensure systems that include assessment of
teaching performance. As evidenced in the work of the National
Board for Professional Teaching Standards (NBPTS), portfolio
assessment provides one credible means for the large-scale
high-stakes assessment of teaching performance. The Interstate New
Teacher Assessment and Support Consortium (INTASC) is building on
the pioneering work of the NBPTS to develop subject-specific
portfolio assessments of beginning teachers. Their work provides
the basis for this study.
This qualitative study is intended to illuminate the factors
that affect the generalizability of this portfolio assessment of
beginning teachers. By generalizability, we refer here to the
extent to which the portfolio assessment supports generalizations
from the particular evidence reflected in the portfolio to the
conception of competent teaching reflected in the standards on
which the assessment is based. Or, more practically, “The
key question is, ‘How likely is it that this finding would
be reversed or substantially altered if a second, independent
assessment of the same kind were made?’” (Cronbach,
Linn, Brennan, and Haertel, 1997, p. 1). In addressing this
question, we draw on two kinds of evidence that are rarely
available: comparisons of two different portfolios completed by
the same teacher in the same year and comparisons between a
portfolio and a multi-day case study (observation and interview
completed shortly after portfolio submission) intended to parallel
the evidence called for in the portfolio assessment. The case
studies lasted 3 - 5 days, depending on each teacher's schedule.
Consistent with Cronbach’s (1988, 1989) “strong”
program of validity, this study is explicitly
disconfirmatory; it is intended to illuminate
potential problems with assumptions about generalizability. Our
formative goal is to raise issues that assessment developers and
users can take into account in designing assessment systems and
appropriately limiting score interpretations.
Conceptions of Generalizability
Messick (1989, 1996) characterized generalizability as "an
aspect of construct validity" that is meant to "ensure that the
score interpretation not be limited to the sample of assessed
tasks but be generalizable to the construct domain more broadly"
(1996, p. 250; see also 1989). He noted that generalizability has
two important senses: (a) "generalizability as reliability …
refers to the consistency of performance across the tasks,
occasions, and raters of a particular assessment which might be
quite limited in scope" (p. 250) and (b) "generalizability as
transfer … refers to the range of tasks that performance on
the assessed tasks is predictive of" (1996a, p. 250). Thus,
inferences about the broader domain (in our case, competent
teaching performance as defined by a set of standards) from a
particular sample of evidence (as contained in a portfolio) can be
productively conceived of in at least two distinct steps: from the
observed performance to the more limited scope of what we will
call the assessment domain (reliability) and then from the
assessment domain to the outcome or standards domain (transfer or
extrapolation). This distinction between kinds or levels of
generalization is drawn by others as well, albeit with somewhat
different language (e.g., Brennan and Johnson, 1995; Haertel,
1985; Haertel and Lorie, in press; Kane, Crooks, and Cohen,
1999) (Note 2).
Within psychometrics, generalizability has typically been
evaluated in terms of quantitative indicators of reliability or
transfer. These concepts from psychometrics will be useful--even
though this is a qualitative study--for helping us frame and learn
from the results of our comparisons. The comparisons we offer will,
in turn, suggest the limitations of conventional theory for
illuminating the complexity of variations involved in teaching
practice and making well-warranted decisions that accommodate that
variation.
This first level of inference (reliability) involves
generalization from a set of representative observations to a well
specified assessment domain (or universe of generalization)
consisting of similar observations (Kane et al., 1999; Brennan,
2001). We are not simply interested, for instance, in how an
examinee performed on a particular set of tasks on a particular
occasion; rather, we are interested in estimating how an examinee
would perform on tasks/occasions like these. Further, we
want some assurance that the score is not based on the
idiosyncrasies of a particular judge but that similarly qualified
judges would likely interpret the performance in the same
way.
Reliability is appropriately conceptualized and investigated as
a faceted concept that encompasses multiple sources of
“error” or variations over which we want to generalize
(differences in tasks, raters, occasions, and so on that are
intended as samples from the same assessment domain). A set of
scores can have multiple reliabilities and errors of measurement
depending on which sources of variation are taken into account.
The appropriate domain of generalization, including the sources of
variation over which we want to generalize, depends on the
decision to be made (Cronbach et al., 1997). For those sources of
variation over which we want to generalize, empirical studies that
examine these variations—across tasks, occasions, raters,
etc.—are required to support the generalization. As Brennan
(2001) argued, the notion of “replication” is central
to an understanding of reliability. Generalizability theory
(Cronbach, Gleser, Nanda, and Rajaratnam, 1972; Brennan, 1983;
Shavelson and Webb, 1991) is, perhaps, the most commonly used
theoretical model that enables the effects of various sources of
error to be “disentangled” and estimated
simultaneously, although other models, especially those based on
Item Response Theory (IRT) (e.g., Engelhard, 1994, 2002; Myford
and Mislevy, 1995; Wilson and Case, 1997) are becoming more widely
used (see Mislevy, Wilson, Ercikan, and Chudowsky, 2002, and NRC,
2001a, for a discussion of alternative models). (Note 3) With generalizability theory,
reliability is idealized as a statistical generalization based on
“random” samples from the assessment domain. Brennan
(2001) acknowledged that the notion of random sampling is an
“idealization that is not fully supported”, but noted
that “the central conceptual distinction is not so much
between fixed and random in the literal sense of
‘random,’ as it is between fixed and ‘not
fixed’” (p. 302). Reliability estimates can be quite
misleading if a facet that varies in the assessment domain
(possible essay prompts, for instance) is not included in the
estimated error of measurement. An unfortunate practice is
reporting reliability estimates for performance assessments based
on differences among readers but ignoring potential differences
among tasks even though the intended generalization is to a
broader domain of tasks like these. This can seriously
overestimate the quality of the generalization to the intended
assessment domain.
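To make the preceding point concrete, the following is a minimal sketch, not drawn from the study, of how a generalizability coefficient changes depending on whether tasks are treated as a random facet or as fixed. It assumes a fully crossed person x task x rater design and uses the standard relative-error formulas from generalizability theory. The variance components, the class name VarianceComponents, and the function g_coefficient are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceComponents:
    """Hypothetical variance components for a person x task x rater design."""
    person: float        # true differences among examinees
    person_task: float   # examinee-by-task interaction
    person_rater: float  # examinee-by-rater interaction
    residual: float      # three-way interaction confounded with error


def g_coefficient(vc: VarianceComponents, n_tasks: int, n_raters: int,
                  tasks_random: bool = True) -> float:
    """Relative generalizability coefficient for a mean over tasks and raters.

    tasks_random=True  : generalize over both tasks and raters.
    tasks_random=False : treat the tasks as fixed, so generalize over raters only.
    """
    rater_error = (vc.person_rater / n_raters
                   + vc.residual / (n_tasks * n_raters))
    task_term = vc.person_task / n_tasks
    if tasks_random:
        # Task-related variation counts as measurement error.
        universe = vc.person
        error = task_term + rater_error
    else:
        # With tasks fixed, task-related variation is absorbed into the
        # universe-score variance and no longer counts as error.
        universe = vc.person + task_term
        error = rater_error
    return universe / (universe + error)


if __name__ == "__main__":
    # Illustrative values only; they are not estimates from any real assessment.
    vc = VarianceComponents(person=0.30, person_task=0.40,
                            person_rater=0.02, residual=0.28)
    print("tasks random:", round(g_coefficient(vc, 4, 2, tasks_random=True), 3))
    print("tasks fixed: ", round(g_coefficient(vc, 4, 2, tasks_random=False), 3))
```

With these made-up components, treating tasks as fixed yields a noticeably higher coefficient than treating them as random, which is the kind of overestimate the paragraph above warns about when the intended generalization is to a broader domain of tasks.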
Turning to the second level of generalization (transfer or
extrapolation), this involves generalization from the more limited
and carefully specified assessment domain to a broader outcome
domain, which includes the full range of performances about which
we would like to generalize. As Kane and colleagues noted, most
educational concepts are quite broad; rarely are we interested
simply in how examinees perform on other (test) items like these.
Using reading comprehension as an (often cited) example, the
outcome domain of interest might include a wide range of types and
genres of text (e.g., newspapers, magazines, novels, instructional
manuals, technical reports, text books, friendly letters, business
letters, signs, forms, lists, tests), read for a variety of
purposes, in many different contexts, requiring various kinds and
depths of background knowledge to understand, with readings
represented in multiple ways (writing, conversation, mental images
or concepts, drawing, marks on answer sheets, and the like).
This level of generalization clearly spills over the bounds of
reliability into validity more generally and typically involves a
more tenuously warranted set of inferences. Warrants for transfer
generalizations include logical or theoretical arguments about the
relationship between the assessment domain and outcome domain. A
common approach “is to argue that the skills needed for good
performances in the universe of generalization (e.g., problem
definition, problem solving) are essentially the same as, or are a
critical subset of, those needed in the full target domain”
and “that anyone who performs well on the assessment should
also be able to perform well in the target domain and anyone who
performs poorly on the assessment should also perform poorly in
the target domain…” or at least that “the skills
being assessed are necessary (if not sufficient) for effective
performance in the target domain” (Kane et al., 1999, p.
11).
Empirical studies supporting transfer generalizations might
involve “criterion studies,” examining the
relationship between test performance and some “especially
thorough (and representative)” sample from the outcome
domain (Kane et al., 1999, p. 10) or, more practically, a
“series of small experiments regressing various outcomes on
test performance” (Haertel, 1985, p. 35). Given the near
infinite range of possible studies, some means of deciding which
are most important to undertake given limited resources is
necessary. As Kane and colleagues noted, “in practice, the
argument for extrapolation is likely to be a negative
argument.”
A serious effort is made to identify differences
between the universe of generalization and the target domain that
would be likely to invalidate the extrapolation. If no
major differences are found, the extrapolation is likely to be
accepted. If the impact of some differences on the plausibility of
extrapolation is unclear, it may be necessary to check on their
importance empirically. (Kane et al., 1999, p. 11)
Empirical Evidence of Generalizability
With Performance Assessments, in General
With performance assessments, the most commonly examined
sources of error are those due to raters and tasks. Empirical
studies of reliability or generalizability with performance
assessments are quite consistent in their conclusions that (a)
reader reliability, defined as consistency of evaluation across
readers on a given task, can reach acceptable levels when
carefully trained readers evaluate responses to one task at a
time, and (b) adequate task or "score" reliability, defined as
consistency in performances across tasks intended to address the
same capabilities, is far more difficult to achieve (e.g., Breland
et al., 1987; Brennan and Johnson, 1995; Dunbar, Koretz, and
Hoover, 1991; Gao and Colton, 1997; Gao, Shavelson, and Baxter,
1994; Lane, Liu, Ankemann, and Stone, 1996; Linn and Burton, 1994;
McBee and Barnes, 1998; Swanson, Norman, and Linn, 1995). In the
case of portfolios, where the tasks may vary substantially from
student to student and where multiple tasks may be evaluated
simultaneously, inter-reader reliability may drop below acceptable
levels for consequential decisions about individuals or programs
(e.g., Koretz, McCaffrey, Klein, Bell, and Stecher, 1992;
Nystrand, Cohen, and Martinez, 1993). Adequate levels of score
(reader and task) reliability have typically been achieved by
further standardizing the task directions, choosing tasks with
higher intercorrelations, disaggregating the portfolio into
separate tasks that can be scored one at a time, and then
estimating generalizability as one would with any collection of
performance tasks. Brennan (2001) cautioned that tasks and raters
are only some of the sources of error that are likely to matter.
He cited other sources of variation that should likely be taken
into account. These included different occasions, both occasions
of testing as well as occasions of scoring, and different methods
of testing as sources of error. He noted that some of these, such
as different methods, are better conceptualized as convergent
validity studies (rather than as reliability studies per se). (Note 4) Of course, certain types
of estimate are often deemed not feasible, including parallel
forms reliabilities with portfolios and assessments of performance
in different contexts (ETS, 1998; Harris, 1997; NRC, 2001b; Porter
et al., 2003).
Special studies involving performance assessments have looked
at relationships among methods of assessment: between multiple
choice and performance assessment (e.g., Lane et al., 1996;
Crehan, 2001); (Note 5) between
different methods of performance assessment, such as direct
observation of scientific experiments and analysis of students'
notebooks (Shavelson et al., 1991); and between on-demand and
school-based tasks (Gentile, 1992, in Brennan and Johnson, 1995).
The general conclusion is that different methods appear to be
getting at somewhat different constructs (e.g., Brennan, 2001;
Brennan and Johnson, 1995). Fewer operational assessments in
education undertake this sort of empirical research, relying
instead on empirical evidence of reliability and logical arguments
about content-relevance and representativeness. And, indeed, while
the Standards for Educational and Psychological Testing
(AERA, APA, NCME, 1999) require at least some sort of empirical
evidence about reliability, they mention “external
validity” as only one potential source of evidence, but
leave the choice of validity evidence up to the assessment
developer and user.
Some authors note a tradeoff between these two levels of
generalization. Strengthening the faithfulness with which the
assessment represents the outcome domain often undermines the
reliability of assessment (as reflected in the many technical
problems with performance assessment) and enhancing reliability,
for example by employing a larger number of shorter tasks,
undermines fidelity (e.g., Kane et al., 1999).
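The reliability side of this tradeoff is often illustrated with the Spearman-Brown formula, which is not cited in the sources above and is offered here only as a sketch. It assumes the tasks are parallel (equal true-score variance and equal error variance), an assumption that performance tasks rarely satisfy exactly. Here \rho_1 denotes the reliability of a single task and k the number of tasks averaged; both symbols are introduced for illustration.

```latex
\rho_k = \frac{k\,\rho_1}{1 + (k - 1)\,\rho_1}
```

For example, with \rho_1 = 0.30, moving from one task to k = 4 tasks gives \rho_4 = 1.2 / 1.9 \approx 0.63. The formula is silent about fidelity: it says nothing about whether the shorter tasks still represent the outcome domain.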
With Teaching Performances, in Particular
Research into the generalizability of performance assessment of
teaching has tended to emphasize much the same sort of evidence
described above, focusing primarily on consistency among tasks and
judges. There are two major programs of research that are most
relevant to our study, the portfolio assessments of the National
Board for Professional Teaching Standards and the
observation/interview assessments of Praxis III. Both of these
assessments are developed by the Educational Testing Service.
The National Board’s standards-based assessments are designed
to certify the accomplishment of experienced teachers with at
least three years of service. Assessments are developed or
underway for over thirty different certificates (differentiated by
subject area and age of students taught). The ten performance
tasks that comprised the assessment in each certificate area (when
the research described here was undertaken) are divided into two
parts: a portfolio completed by candidates in their home schools
across a year and a one-day assessment-center experience. The
school-based portfolio consists of (a) four tasks that ask
candidates to document their practice, through videotapes and
samples of student work, and to provide "extensive analytical and
reflective commentary" (Pearlman, in Jaeger, 1998, p. 191), and
(b) two tasks that ask candidates to document their
accomplishments outside the classroom and explain why they are
important. The four assessment-center tasks provide candidates
with materials such as student work samples, assessment records,
instructional resource materials, or professional reading and ask
them to use the materials to diagnose the status of student
learning, plan instruction, and so on (Pearlman, in Jaeger, 1998,
p. 191). (Note 6) Each exercise
is scored independently by two reviewers. The resulting scores for
each exercise are weighted and aggregated to form an overall
composite score for each candidate. This composite is then
compared to a predetermined passing score.
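The scoring procedure just described can be sketched as follows. This is a hedged illustration rather than the National Board's actual algorithm: the source does not say how the two reviewers' scores are combined, so averaging them is an assumption, and the weights, the passing score, and the function names composite_score and meets_standard are hypothetical.

```python
from typing import Sequence, Tuple


def composite_score(rater_scores: Sequence[Tuple[float, float]],
                    weights: Sequence[float]) -> float:
    """Weighted sum of exercise scores.

    rater_scores holds one (reviewer_1, reviewer_2) pair per exercise.
    Assumption: the two reviewers' scores on an exercise are averaged.
    """
    if len(rater_scores) != len(weights):
        raise ValueError("need exactly one weight per exercise")
    return sum(w * (a + b) / 2 for (a, b), w in zip(rater_scores, weights))


def meets_standard(composite: float, passing_score: float) -> bool:
    """Assumption: a composite equal to the passing score counts as passing."""
    return composite >= passing_score


if __name__ == "__main__":
    # Ten exercises: six portfolio entries and four assessment-center tasks.
    # All scores, weights, and the cut score below are made up for illustration.
    scores = [(3, 4)] * 6 + [(2, 2)] * 4
    weights = [2.0] * 6 + [1.0] * 4
    total = composite_score(scores, weights)
    print("composite:", total, "passes:", meets_standard(total, 48.0))
```

Because a single composite is compared with the cut score, a weak performance on one exercise can be offset by stronger performances elsewhere; how much offsetting is possible depends entirely on the weights, which are invented here.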
The National Board’s Technical Analysis Report
(ETS, 1998) described four relevant sources of error:
- Assessors: Would a candidate, given a different set of
assessors, fare similarly on the assessment?
- Exercise Sampling: Would candidates perform similarly on a
different set (sample) of exercises?
- Assessment occasions: Would candidates fare similarly if they
took the same assessment on a different occasion?
- School context: Would candidates fare similarly if they
happened to teach in a different school?
They noted that it is not feasible for them to provide evidence
of reliability across school contexts or assessment occasions.
With assessment occasions, they argued that there is likely to be
a learning effect such that one would expect a candidate to fare
differently (better) and so reliability may not feasibly be
assessed.
They provided empirical evidence with respect to assessors and
exercise sampling, concluding that both are adequate to
support the assessment for its intended use (ETS, 1998, p. 125;
see also Myford and Engelhard, 2001). (Note 7) With respect to exercise sampling,
they cautioned readers about the limitation of such evidence since
the set of tasks was explicitly designed to represent a
multidimensional domain:
Whether an assessment with the current design can be
considered to allow for alternative forms in a traditional
measurement sense is debatable. It is possible to argue that the
exercises are but one possible sample from a larger domain of
accomplished teaching or that the exercises, for all intents and
purposes, comprise a fixed assessment of accomplished teaching.
(ETS, 1998, pp. 107-108)
This is, in fact, typical of the way in which task
generalizability is investigated with portfolio assessments (e.g.,
Klein et al., 1995; Koretz et al., 1992; Reckase, 1995; Nystrand
et al., 1993); what we have is an estimate of internal consistency
(based on tasks that were designed to access quite different
elements of teaching practice) and that treats as fixed a wide
range of factors that may in fact vary. Following Brennan (2001),
this is not really a replication “using two full length
operational forms” (p. 313).
With respect to "transfer," Bond, Smith, Baker, and Hattie
(2000) examined the relationship between scores on the National
Board's assessment (in two certificate areas for 65 teachers) and
1-3 hour observations of teaching accompanied by interviews with
teachers and some students. The casebooks produced from the visits
were scored according to thirteen dimensions of accomplished
teaching identified in an extensive literature search. Using
discriminant analysis, they were able to correctly classify 84% of
teachers as to whether they had been certified using the National
Board’s assessment. Other studies are currently underway
(see www.nbpts.org). While the
National Board’s goal was primarily documenting consistency
across the sources in support of the validity of the NBPTS
assessment, our purpose is to illuminate both similarities and
differences at the level of particularity that qualitative methods
allow.
ETS’s PRAXIS series, which is intended for use with
beginning teachers, involves three sets of assessments: PRAXIS I
focuses on basic skills, PRAXIS II on content knowledge and
general pedagogical knowledge, and PRAXIS III on teaching
performance. The PRAXIS III assessment involves direct
observations of classroom performance over a series of
“assessment cycles.” An assessment cycle consists of a
preliminary description of the context, the students, and the
lesson-to-be-observed, prepared by the beginning teacher; an
observation of a lesson of instruction by a trained assessor
(experienced teacher); and semi-structured pre- and post-observation
interviews. The assessor’s notes are then scored on a list
of nineteen criteria (that were developed through an extensive
literature review and job analysis survey) and an overall score
given. “Summative decisions are made based on cumulated
data from two or more assessors based on two or more assessment
cycles” (Dwyer, 1998, p. 8). In addition to the obvious
differences in methods, PRAXIS III is intended for use across
grade levels and subject areas, and the criteria for classroom
observation have not been tailored to particular subject areas as
with INTASC and the National Board. Although this leads to a
somewhat different emphasis, Porter et al. (2003) noted the similarity of
the PRAXIS criteria to the general principles of the National
Board and INTASC. While there are multiple studies of assessor
reliability, there are no reports of generalizability across
assessment occasions that we could locate (Dwyer, 1998; Myford and
Lehman, 1993; NRC, 2001b; Porter, Youngs, and Odden, 2003; Myford,
personal communication, 3/5/03; Wylie, personal communication,
5/2/03). With respect to generalizability across occasions,
assessment developers caution:
“The purpose and consequences of the assessment,
particular local circumstances, and the beginning teacher's level
of performance (both absolute and in terms of improvement) are
factors that determine how many assessment cycles will be carried
out. Guidelines governing Praxis validity and use prohibit
decision-making on the basis of a single assessment cycle or on
the judgment of a single assessor (Educational Testing Service,
1993b).” (Dwyer, 1998, p. 171)
Thus, the comparisons in the study reported here--which involve
full length replications of portfolio assessments, methods of
performance assessment, and classroom contexts in which the same
tasks can be implemented--begin to address an important gap in our
understanding of the generalizability of portfolio assessments of
teaching and, perhaps, of performance assessments more
generally.
Research Design
Our study draws on qualitative methods to address questions of
portfolio generalizability through comparative content analyses
across different portfolios and different methods of assessment
for the same teachers. Consistent with Kane and colleagues’
(1999) conception of a negative argument, built from a serious
effort to disconfirm, our goal is to illuminate differences that
challenge assumptions about generalizability. Where to locate
these comparisons in terms of the level of generalizability
described in the previous section is an open question. At face
value, one might argue that the portfolio-portfolio comparison is
a reliability issue (different occasions on which the same tasks are
performed), and the portfolio-case comparison is a transfer issue
(different methods and different occasions). And yet, as we discuss
when we return to this issue after sharing our findings, the nature
of variations
that the different occasions afford makes this problem far more
complex--as occasion is confounded with uncontrollable aspects of
context--and raises important questions about the nature of the
assessment domain to which we can appropriately generalize. These
are the variations that can be invisible when portfolio
reliability is examined via intercorrelations among tasks and
readers.
We begin with a brief description of the INTASC portfolio
assessment system and then describe data collection for the two
comparative studies--portfolio-portfolio comparison and
case-portfolio comparison--which were replicated in secondary
English Language Arts (ELA) and Mathematics (Math). Since the
comparative content analyses for both studies follow a similar
pattern, we describe those activities in a fourth section. While
the data sets are small from a quantitative perspective (29
comparative cases across the two studies and two subject areas),
our goal was to understand each comparative case in depth and to
illuminate issues for assessment developers and policy makers to
consider.
INTASC Portfolio Assessment
The portfolio assessments are intended for teachers in their
first, second, or third year of teaching. To guide the portfolio
assessment, INTASC has developed a set of general and subject
specific standards based on INTASC's Principles for Beginning
Teachers and standards from the relevant professional communities.
The standards and related assessments are intended to provide a
coherent developmental trajectory with those of the National
Board. The assessments ask candidates for licensure to prepare a
portfolio documenting their teaching practice with entries that
include: a description of the contexts in which they work, goals
for student learning with plans for achieving those goals, lesson
plans, video tapes of actual lessons, assessment activities with
samples of evaluated student work, and critical analysis and
reflection on their teaching practices. Unlike the National Board
portfolios (which contain four separate entries), these entries
are organized around one or two units (8 – 10 hours) of
instruction such that the portfolio cannot easily be broken into
parts for separate evaluation. Judges evaluate the portfolios in
terms of a series of “guiding questions” focused on
the portfolio but based on the standards described above; they
record evidence relevant to each guiding question and develop
interpretive summaries or “pattern statements” that
respond to the question; then they determine an overall decision
about the candidate (Note 8). As
developed by INTASC, the portfolios were intended both for
professional development and for informing decisions about
licensure. Of the 10 INTASC states that participated in the
development of the portfolio assessment, only Connecticut is
currently using it to inform licensure decisions.
For this study, participants were recruited from field tests in
multiple INTASC states in 1998-2000. Because our interest
in this paper is about the generalizability of portfolios for
licensure decisions, we chose to evaluate the portfolios using the
guiding questions and decision guide as they were used by
Connecticut for field tests in 1999-2000, even though the
participating teachers were recruited from multiple states. As it
was implemented in Connecticut in 2000, there were four possible
levels to the overall decision: conditional, basic, proficient,
and advanced. Judges also completed a "feedback rubric" on which
they selected performance levels that best characterized the
portfolio with respect to each guiding question. The assessment
occurred as part of a 2-3 year induction program in which
beginning teachers who had an initial three-year license were
provided with a mentor in the first year and the opportunity to
attend state-sponsored workshops to prepare them for the
assessment. When fully operational, teachers who did not pass the
portfolio assessment in their second year would continue in the
program for another year. If they did not pass in the third year,
they would be required to reapply for the initial license after
successfully completing additional course work or a state approved
field placement.
Portfolio-Portfolio Comparison Data Collection
A small sample of secondary beginning teachers in math (n=7)
and ELA (n=6) was recruited to complete two portfolios during the
same year, choosing classes and units of instruction that differed
as much as possible within their routine teaching assignments.
They were compensated for the second portfolio. Not surprisingly,
it was very hard to find beginning teachers willing to assume the
burden of two portfolios, and it is impossible to fully understand
how these stalwart volunteers might have differed from their
colleagues. We can say that their portfolios do reflect a range of
performance levels, teaching practices, and school contexts and
that their paired portfolios do illuminate an instructive array of
differences, consistent with the goals of the study.
Case-Portfolio Comparison Data Collection
Another small sample of secondary teachers in math (n=8) and
ELA (n=8) was asked to allow case studies of their teaching
shortly after they submitted their portfolios. The sample was
recruited to include differences in gender, ethnicity, school
context, and performance level (based upon a quick read through by
the portfolio developers). The case studies took place over 3-5
days (depending on the teacher’s schedule) during which
researchers observed classes; conducted entry, exit, and brief
daily interviews with the teacher; and interviewed the school
principal and, if possible, a mentor regarding the support
available to the teacher. [See Ball, Gere, and Moss, 1998; Moss,
Rex, and Geist, 2000a, 2000b for fieldwork and case write-up
guidelines.] Case study researchers observed two classes: the
class used in the preparation of the portfolio and a second class.
As with the portfolio-portfolio comparison, we asked for a class
that differed as much as possible within the teachers' routine
teaching assignment (but sometimes we were only able to observe a
different section of the same class). Our intent was to parallel
the information collected in the portfolio as closely as possible
and to gather additional information about the teacher's
background, school context, and experience preparing the portfolio
to address additional questions of fairness. Teachers were given a
small honorarium for participating in the case study. As before,
it is not possible to know how these volunteers differed from the
larger population of beginning teachers.
Case study researchers, all experienced teachers in the
appropriate field, were taken through an abbreviated course of
study (with practice and feedback) in taking fieldnotes and
conducting interviews relevant to the project. Tape recordings and
artifacts were used as back-up. Field and interview notes were
read by a senior researcher and questions of clarification and
elaboration were raised to guide revisions (which could be
supported with audio-recordings and artifacts). Case study
researchers were then asked to draw on their notes in responding
to the Guiding Questions used to evaluate the portfolios. Again, a
senior researcher reviewed the responses (with fieldnotes at hand)
and raised questions to facilitate revision.
Comparative Analyses
The comparative content analyses for both studies were
undertaken in a similar fashion. Research assistants (experienced
teachers in the content area with graduate research training) used
the guiding questions (and the dimensions contained in the related
feedback rubric) to develop a coding scheme for the two sources of
evidence. Videotapes were roughly scripted for coding. Then,
answers to each of the guiding questions were developed for each
source based upon a comprehensive review of the evidence,
including the search for counter examples to challenge developing
interpretations. Similarities and differences were then noted,
organized by guiding question and overall. Justifications for
perceived differences in performance level with respect to the
criteria were developed. For the portfolio-portfolio comparisons
in ELA, each pair of portfolios was read twice, in reverse order,
by two research assistants, who then met to develop a consensus on
any differences. (Note 9) For
the portfolio-case comparisons and the portfolio-portfolio
comparisons in math, a single comparison document was developed,
and the process was audited by another researcher. The comparative
content analyses typically took 3-5 days per teacher and generated
30-70 pages of text each. These comparisons were then condensed
into 2-3 page versions that highlighted substantial differences
both at the level of the guiding question and overall.
It is important to note that we have, for the purposes of this
paper, bracketed questions about consistency among readers.
Elsewhere we address concerns about differences in the way
knowledgeable readers evaluate portfolios in different social
settings when trained to reach consistent decisions and when
allowed to draw on their own criteria of competent teaching (Moss
and Schutz, 2001; Moss, Schutz, Haniford, and Miller, in
preparation; Schutz and Moss, in press). Here, we present findings
whose validity is based upon in-depth analyses, in which relevant
differences in perspective between readers were resolved through
consensus-seeking dialogue. The issue for us is not the validity
of a specific score; rather it is the validity of an
interpretation of difference between two portraits of teaching and
an argument for whether the observed differences are likely to
matter in light of the evaluation criteria. We present our
evidence for which differences are likely to matter in sufficient
detail that readers can reconsider these judgments for
themselves.
Structural Differences Between Data Sources and Asymmetrical
Questions of Comparison
By structural differences, we mean those differences between
data sources that could be anticipated in light of the different
methods and which are, in fact, typically present in our data.
With respect to the portfolio-case comparisons, beyond the obvious
differences in data collection methods, it is important to note
the following. While we attempted to have case study researchers
present on days when teaching consistent with what is expected in
the portfolio was occurring, it was not always possible to observe
all the aspects of teaching called for in the portfolio. For
instance, while the ELA portfolio required evidence of students'
response to literature and students’ processes in writing,
the lessons observed in the case study might not cover both areas.
The case-based evidence is typically weak with respect to formal
assessment procedures since often no formal assessment was
occurring. However, the case study provides substantially more
evidence about daily classroom interactions. The case also
provides rich information about the context in which the teacher
worked and about teaching practices not foregrounded in the
portfolio evidence. With respect to the portfolio/portfolio
comparisons, the portfolio completed second is invariably shorter,
often considerably so. It typically contains fewer artifacts and
shorter commentary (sometimes with reference back to the first
portfolio). These structural differences led us to frame an
asymmetrical comparison and research question:
To what extent does the second portrait (case study or
second portfolio) cause us to reconsider the evaluation of the
teacher's performance in (what we'll call) the primary
portfolio?
Findings
Our comparative analyses were set up to uncover differences in
the two portraits of beginning teachers and to evaluate whether
the differences were likely to result in different decisions in
light of the INTASC standards (as instantiated in the guiding
questions and the decision guide as adapted and used in
Connecticut in 2000). We make no attempt to estimate the frequency
with which these sorts of differences are occurring; our evidence
is not appropriate for that purpose. Again, our formative goal is
to illuminate issues for assessment developers and users to
consider in designing an assessment system, characterizing the
appropriate domains of inference, and limiting interpretations
appropriately.
We present our findings in the following sections: We begin
with an overview of the variations in context of the classes and
units selected by these teachers. Then, we illustrate our
comparison methodology in substantial detail with comparisons in
both math and ELA. In the first comparison, we provide an example
of a case in which the differences observed do not seem to matter
in terms of the relevant criteria (which was, we should note, true
in the majority of cases). In the second comparison, we provide an
example in which the evidence in the portfolio is, we’ll
argue, ambiguous, because the artifacts (videotapes, handouts)
only partially support the written representation of the class;
the case appears to clarify the ambiguity. Whether this is a
difference that would “matter” depends on how the
portfolio readers weigh the partially conflicting evidence. Thus
this comparison, even more so than the first, illustrates some of
the interpretive problems we encounter with these sorts of
data, problems that we have tried to address through far
more in-depth readings than would be possible in operational use.
Then, we present a series of briefer vignettes that describe
situations in which the second portrait caused us to question the
conclusions we drew from the primary portrait.
Contextual Variations
As indicated above, for both sets of comparisons, we asked
teachers to choose a second class that differed as much as
possible within their routine teaching assignments. The classes
they selected are presented in Tables 1 and 2 for secondary ELA
and math teachers respectively. Given their selections, it is
important to note the many different kinds of (often intersecting)
contextual variations that are present in the comparisons we
examine. These include: different sections of the same course
(which entail differences in time of day and whether the teacher
has taught the lesson before); differences in (perceived) ability
levels and groupings of students, including those designated by
the school directly (remedial, AP, and the like) or indirectly
(scheduling in ELA resulting from math assignments) and those
perceived by the teacher; different courses; different grade
levels; different units within the same course; differences in
(mix of) cultural backgrounds of students; different times of year
(which involves differences in teachers' knowledge of and
relationship to students); differences in class sizes; differences
in availability of curriculum and support materials; differences
in extent to which these materials are consistent with the
standards. These are all variations that are fixed for a given
portfolio assessment of a teacher and are unexamined when all we
have is the single set of performances from a given class. (Note 10)
Illustration of Analysis in ELA with a "Complementary"
Portfolio/Case Comparison
To illustrate our comparison methodology in ELA, we focus on
one portfolio/case comparison, "Ms. Bertram" (Note 11), in which the activities we observed
differed substantially, and yet we found the portrait of the
teacher conveyed in the case study provided quite consistent
evidence with respect to the general evaluation criteria. We
illustrate this comparison in some detail both to document our
practices of analysis, and to show how two quite different
activity contexts can nevertheless support similar conclusions
about the teacher. We begin with a discussion of the ELA portfolio
guidelines, the guiding questions (developed by INTASC and revised
by Connecticut) to evaluate completed portfolios, and the way in
which we applied them for this study. Then we return to the
specific case of Ms. Bertram.
The ELA portfolio handbook asks candidates to complete two
distinct entries: one each in teaching response to literature (RL)
and processes of writing (PW). Teachers may choose the same class
or different classes for these two components. Across these two
exhibits, we have the following sources of evidence from each
teacher: (a) the teacher's rationale for her choice of literature
and writing assignment, (b) the teacher's daily logs for 10
lessons in which she describes the activities she and the students
engaged in (providing copies of instructional artifacts) and
writes brief reflections about how the day’s lesson went,
(c) videotapes of two to three activities reflecting different
participation structures, (d) the teacher's reflections on the videos,
(e) five samples of student writing, including multiple drafts,
with teacher's comments on the writing, (f) the teacher's
reflections on the students’ writing, and (g) the teacher's
general reflections on her teaching in the unit.
In the case study, we have fieldnotes depicting the activities
and the discourse in the classroom across three days for each of
the two classes. Through notes from a series of interviews, we
learn about the teacher’s goals, her specific plans for
daily lessons, her reflections on how the lessons went (in general
and for particular students), and her goals for professional
development.
The guiding questions for ELA (initially prepared by INTASC and
revised by Connecticut for use in 1999-2000) are organized into
four separate categories. (1) Questions about literacy
focus on "connections among responding, interpreting, and
composing" with an emphasis on the extent to which students
develop their own meanings. (2) Questions about instruction
focus on how the teacher organizes students' learning--including
questions about alignment between goals and instructional
strategies, about integration of activities within and across
lessons, and about materials--with an emphasis on the extent to
which instruction provides learning opportunities (challenges) for
all students and promotes independence. (3) Questions about
analysis of learning focus on formal and informal
assessment of students’ work--how the teacher monitors
students’ progress, communicates with them about their
learning, and uses that information to inform instruction. (4)
Finally, questions about analysis of teaching focus on how
the teacher reflects on student learning and uses that reflection
to inform her practice. (Note
12)
While some of the guiding questions were quite specific and
descriptive in nature (e.g., "Describe how the teacher helps
students use a writing process, including context, purposes, and
conventions of standard written English."), others involved
much higher levels of inference that required integrating multiple
types of evidence (e.g., "Describe the ways in which the teacher
creates a learning environment that provides all students with
opportunities to develop as readers, writers, and
thinkers."). In analyzing the portfolios, we found ourselves
following a multi-step process. We began with describing the
various sources of evidence (e.g., describing the teacher's goals,
outlining the progression of lessons, scripting the videotape,
characterizing the artifacts in terms of the nature of
students’ responses and any written comments by the teacher;
illustrating the ways in which teachers reflected on their
students' work in their commentary). Then we developed
interpretations that coordinated various sources of evidence
(e.g., considering the relationship between the teacher’s
goals and the progression of lessons to evaluate alignment and
scaffolding or between the teacher's commentary on the video and
what we had observed to evaluate quality of reflection). Finally,
we moved to the level of responding to some of the
higher-inference guiding questions (e.g. "Describe how the
teacher uses knowledge about students to meet their needs in
instruction and provide them with opportunities to learn" or
"Describe the ways in which the teacher creates a learning
environment that provides all students with opportunities to
develop as readers, writers, and thinkers."). For the
case studies, the fieldnotes from the classroom observation and
notes from a series of interviews with the teacher allowed us to
engage in much the same process. Our task was somewhat easier as
the case study writer had constructed responses to the guiding
questions that drew on evidence from the field and interview
notes. We nevertheless reviewed the field and interview notes in
light of the case study writer’s conclusions and often
included the additional detail in our comparisons.
The Appendix provides brief excerpts from the 70-plus page
portfolio/case comparison document prepared by LeeAnn Sutherland.
It shows brief examples of the sort of evidence we have from the
two methods and illustrates the way we have combined the evidence
to develop interpretations and comparisons relevant to the guiding
questions. Below we offer some general conclusions based on
the comprehensive evidence in the longer document, for which the
Appendix provides only brief examples.
Ms. Bertram teaches sixth grade English Language Arts in a
middle school located in what the case study writer describes as a
small, relatively affluent suburban community. In the portfolio
and the case, we see a "reading and writing" class of 24 students
who meet daily for two periods. In addition to this reading and
writing class, the case study writer observed another section that
covers only writing.
For the response to literature exhibit in her portfolio, Ms.
Bertram selected a series of lessons based on the study of a
novella commonly used with this age group. Three separate tasks
require students to (a) identify the character traits of the main
characters, (b) compose a written response citing which character
they felt they were most similar to/could relate to best/liked the
best and why, and (c) use that reflection as the foundation for
creating a simulated journal. In the processes of writing exhibit,
we see Ms. Bertram guide students through the development of a
poem using metaphors to describe their mothers in preparation for
Mother's Day.
The case study describes three days of parallel lessons in two
classes where the teacher focuses on having students select three
best pieces of writing from their notebooks, complete evaluation
sheets about each one, exchange with a partner who would name his
or her choice for the writer’s best piece on a
‘Nomination Ballot,’ revise that piece of writing, and
publish it on a web page.
Even though the activities are substantially different, there
is nothing in the case that would cause us to question our
evaluation of the primary portfolio. Both portraits show the
teacher using a variety of activities to help students use
literature to make connections, take others’ perspectives,
and explore concepts, scaffolding their learning through the
activities she creates and the discussion she guides. She also
uses a variety of activity structures (e.g. small group, whole
class). We have ample evidence of similar classroom interaction
wherein the teacher poses questions to which students respond
initially (to begin an activity or class session), and she then
builds from students’ responses to guide subsequent
questions, consistently validating their contributions. In both
portraits, we see the teacher employ a variety of strategies to
guide students in developing as readers, writers, and thinkers.
Either portrait would tell us that this is a highly reflective
teacher who uses that reflection to shape practice immediately and
to think about changes in her future practice. She consistently
addresses both the strengths and weaknesses of each lesson as well
as their relationship to the larger unit. Thus the evidence in the
case reinforces the conclusions from the portfolio in somewhat
different contexts of teaching.
Illustration of Analysis in Math with an Ambiguous Portfolio
and Clarifying Case
Complex evidence of the sort contained in the portfolio and
case often presents substantial interpretive problems to readers.
While this is not the focus of this paper, reading problems do
impact the nature of the conclusions we draw. Here we illustrate
our analytic practices with a math comparison and present a
situation where the evidence in the portfolio is somewhat
ambiguous and where the additional evidence in the case appears to
support one potential portfolio interpretation over the other.
The math portfolio handbook focuses on a single 8-12
hour unit in mathematics and requests artifacts similar to those
requested in the ELA handbook. Here the portfolio contains a
description of the classroom context, descriptions of a series of
lessons with instructional artifacts (e.g., handouts,
assignments); videotapes, student work, and reflections on two
featured lessons; a cumulative evaluation of student learning with
accompanying reflection; a focus on three students across the
featured lessons and cumulative evaluation of learning; and
analysis of teaching and personal growth. As with the ELA
handbook, then, we have partially independent artifacts (including
the videotape, instructional artifacts, and samples of
students’ work) against which to evaluate (some parts of)
the teacher’s description and reflection/evaluation on what
happened.
The guiding questions in mathematics are organized into five
categories (as initially prepared by INTASC and revised by
Connecticut for use in 1999-2000). (1) Tasks focuses on the
appropriateness (variety, richness, challenge, accessibility) of
the tasks selected by the teacher and on how effectively they are
implemented (clarity, accuracy, alignment, and responsiveness to
students’ interests, styles and experiences). (2)
Discourse focuses on how effectively the teacher
orchestrates discourse, uses tools and materials to support
discourse, and promotes discourse among students in which powerful
kinds of thinking predominate (defined as students exploring a
variety of approaches to problems and explaining their reasoning
with evidence). (3) Learning environment focuses on how
effectively the teacher manages the physical, time, and social
aspects of the classroom and encourages participation and
engagement by all students. (4) Analysis of learning
focuses on how effectively the teacher assesses students’
learning (accuracy, variety, and alignment with objectives and
tasks) and communicates with students about expectations and
feedback. (5) Finally, analysis of teaching focuses on how
the teacher learns from and improves teaching. The comparison
methodology was similar to that presented in ELA. (Note 13)
The mathematics teacher in this portfolio/case comparison works
at a large urban high school, in which 68% of students receive
free/reduced lunch. The portfolio presents an 11th grade
Integrated Geometry course. The teacher, Ms. Fleming, explains
that this course is the lowest level geometry course offered by
the school and that she closely follows the text. The unit
presented in the portfolio concerns tessellations and triangles.
The case study follows the same Integrated Geometry class and a
9th – 10th grade Math I class. Ms. Fleming reports that Math
I is the lowest level math course offered by this school with the
exception of remedial math. At the beginning of the course the
students shared textbooks with another class; however, Ms. Fleming
indicates that these texts soon disappeared. Ms. Fleming uses
worksheets left by a previous teacher and generates her own
curriculum worksheets. She reports that approximately 75% of Math
I students are failing. The lessons presented in the Math I class
focus on basic arithmetic, naming of geometric objects and
measurements. The Integrated Geometry lessons observed for the
case focus on triangles, angles, and parallel lines. The case was
conducted late in the year when both classes were reviewing
material for a final exam.
We begin with an extended discussion of the portfolio because
it alone raises a complex interpretive problem when the partially
independent artifacts (videotapes, handouts) are compared to the
teacher’s descriptions of what is happening. In the interest
of space, we focus on connections across lessons, nature of
mathematical tasks, and the implementation of tasks in classroom
discourse. Then we turn to the case where the portrait of the
teacher is substantially different from what is portrayed in the
written portion of the portfolio. [Both descriptions draw heavily
on Pamela Geist’s extended comparison document.]
The evidence in the portfolio creates a picture of a teacher
who sees how the mathematics of a unit connects across ideas and
to prior and later learning. Early in the portfolio, Ms. Fleming
describes some of the mathematical connections she believes are
important for students to understand. She writes,
Knowing the properties of triangles is important to
the student of mathematics because it is the starting point for
learning the properties for special triangles and enclosed
figures, namely polygons. For example, the Triangle Angle-sum
Theorem can be used to derive the sum of the interior angles of a
quadrilateral and convex polygons. It also lays the foundation for
students to learn about pyramids and other three dimensional
figures.
Ms. Fleming describes in detail how the seven lessons across
the unit connect mathematically and what students will learn
across the unit to accomplish learning goals and objectives. For
example, she explains,
It was important to show the relationship between
the exterior angle of a triangle with the adjacent interior and
remote interior angles. Once the properties of a single triangle
were established, it was necessary to establish the relationship
between a pair of congruent triangles and how to use the
postulates to establish congruence. In order to establish this,
students had to learn how to make congruence correspondence and
congruence statement. Of course this also leads us to establishing
proof, but my department recommended not to introduce proofs with
this level class.
In the portfolio, Ms. Fleming develops a strong case for the
predominance of discovery-type tasks and learning. She explains
that hands-on discovery type tasks dominate her practice and that
students learn best in these types of lessons. She writes,
The tasks that are most effective are of a
‘discovery’ or ‘hands on’ type…
[explaining] when my students “see it” and
“find it” the learning is retained …. I
try to let my students have the experience of discovery even when
it seems small. I have found that using the discovery method works
best for my students so I have tried to use it
often…
Using this method students get to see ideas. For example,
when students put together the angles from a triangle and actually
saw that it made a straight line, they knew that adding the
interior angles of a triangle would equal 180°.
They saw that it worked for all triangles regardless of the size
and shape.
In the portfolio artifacts, we see a range of tasks including
tasks that appear to offer opportunities for discovery and those
that focus more on recall and application of definitions and
facts. Consistent with the teacher’s description, a series
of problems presented in one of the instructional artifacts asks
students to work at drawing and measuring the various angles
within the triangles and record their data. The questions that
follow ask students to detect a pattern in the data and develop
statements or conclusions about various relationships within the
triangles. There is much writing in the portfolio explaining that
these tasks and others like them are selected because they support
students’ opportunity to formulate conjectures, reason about
mathematical ideas, and justify results. The teacher also provides
evidence of other types of tasks that are designed to check on
students’ general understanding of geometric shapes and
their properties. For instance, they ask students to classify
shapes, recall definitions and theorems, use definitions and
properties to find other measures, and to justify answers with a
known theorem or definition. The tasks appear on daily activity
worksheets, homework assignments, and on assessments such as tests
or quizzes.
We turn next to the videotape to see how these sorts of tasks
are implemented in classroom interaction. Here, we focus in detail
on one of two videotaped lessons, providing excerpts from our rough
transcript of the videotape and Ms. Fleming’s reflections on
what occurred. The videotape is less effective in making the
teacher’s case for discovery-type learning. We see Ms.
Fleming guiding the discourse with students responding in short
statements that restate a definition or fact. The excerpt
we’ve selected begins about 4 minutes into the tape after
Ms. Fleming has finished reviewing, through brief question and
answer segments, the previous day’s lesson for students.
T Okay. So let’s look at a triangle (she has
an example on the overhead - (4:37 into the tape)). We have remote
interior angles, we have exterior. We have adjacent interior.
Let’s look in relationship to this one angle (points to
image on the screen). Here we have an exterior angle. It’s
outside the triangle. The adjacent interior is the one that is
what?
S Sharing the same sides.
T Sharing the same sides. So it’s adjacent.
Adjacent means?
S Next to.
T Next to, okay? Remote means, we said?
S Far away.
T So these two are far away from this exterior
angle. Right? These are going to be your remote interior.
S (unintelligible)
T Now look at this, if I said 2 is your exterior
angle, what is the adjacent angle for 2. Where would it be
located?
S (A student is asked to come up and point out the
specified angle. Other students were calling out some helpful
comments as well as “I know, I know” as he points to
various angles the teacher asks him to identify - remote interior.
She asks him to confirm that he is pointing to remote interior or
remote exterior. He confirms remote interior).
T Very good. So depending on which angle you pick,
your remote interior angles will be switching sides. Okay, this is
my exterior angle 1, right (she points)? So what angle is
adjacent to that and inside the triangle?
S What’s adjacent?
T Adjacent means next to, it’s touching.
It’s sharing the same side. So angle A over here would be
CAB. Correct? CAB is adjacent to angle ...? I’m looking at
angle 1. What is it?
T I’m looking at angle 1. What kind of angle
is angle 1?
S Exterior
T What is the adjacent angle to angle 1?
S 4
T What are the remote interior angles?
S 5 and 6
T 5 and 6. Much better.
T & S (At 7:28 in the tape) (Some students need
clarification so students and teacher have a brief discussion on
the different types of angles and their relationships to each
other).
T (Teacher moves around the triangle she has on the
overhead and asks for students to quickly identify remote,
adjacent, and exterior angles).
T Now, today you’re going to look at the
relationships between these angles. Okay? I’m going to hand
out a worksheet and you’re going to do that.
[Break in sequence. Students are now working together in small
groups on the assigned worksheet. The teacher walks around to
answer questions and check their work. Students are comparing
work. They use rulers and protractors to measure angles and help
each other construct the various triangles on the worksheet.
(It’s hard to hear what students are saying to one another
but the teacher’s voice can be heard from time to time.)]
In reflecting on this lesson, Ms. Fleming describes how she
interacted with students to arrive at a solution:
I did not offer ‘answers’ for the
students, but guided them using questions to arrive at a solution.
For example, when the male student attempted to identify the
angles on the transparency, I realized that he was trying to
‘bluff’ his way out of it. I guided him by repeating
the names of the angles, emphasizing the words adjacent and
interior.
And about the small group time, she writes:
When a student asked me if she measured an acute
angle correctly, (she did not by reading the protractor
incorrectly), I asked her if her angle was greater or less than
90°, and if her answer made sense. When student
A asked me about the measures of her angles, I asked her how she
could check them. Once she ‘got it’, she proceeded to
help another member of her group.
This is not an unreasonable representation of what occurred,
and it helps us understand why she made some of the choices she
did. Viewed in light of this interpretive commentary, and taken
together with the descriptions of all the lessons that reflect a
privileging of discovery-type learning, it is possible to situate
the evidence in the videotape within a larger picture that
mitigates its dominant impression. While not presented here, the
other videotape and reflections surrounding it raise similar
issues; the teacher’s description surrounding the lesson
creates a different image than what we might infer from the
videotape alone.
As teachers, we know that even in the most
‘learner-centered’, discovery-oriented classroom,
there are often (with good reason) stretches of dialogue that
resemble what we see here. That we can’t hear what is
happening among the students on the videotape allows the
teacher’s characterization to shape our impressions. Viewed
alone, this portfolio can be constructed as a relatively strong
performance, better than just passing, even though the evidence
provided by the artifacts is a bit uneven.
Turning to the case study, what we observe there reinforces what we see
in the videotape and, taken together with what the case study
researcher reports from his interviews with the teacher, presents
a substantially different portrait. We focus on the same aspects
of the teacher’s practice, presented in essentially the same
order: connections across lessons, nature of tasks, and
implementation. About her characterization of connections across
tasks, the case study researcher writes: “In the pre-lesson
interviews when asked to describe her objectives for the next
day’s lesson, they were always in terms of discrete topics
to be covered, sometimes by book chapter.” For instance, Ms.
Fleming explained her plans for her Math I class, “I am
presenting material that is very close to that they are seeing on
the exam. I will do Chapter 10 tomorrow, reading graphs, finding
mean, median, mode, and range.” Or, following a Geometry
lesson, she says: “they can do triangles, but not the
parallel lines. I keep throwing these [parallel line problems] at
them so they keep seeing them.” In his search for
counter evidence about this developing pattern, the case study
researcher offers the following quotation:
We do introduce the concept of showing how
triangles can be congruent and we will ask them to give reasons.
The last thing they were doing was perimeter and area for
rectangles, parallelograms, and other quadrilaterals. We did
Pythagorean Theorem and area under the curve using a trapezoid.
And we try to reason with them by making ‘cubes’ under
the curve and having them count the
‘cubes’.
The case study researcher argues that the teacher merely
mentions ideas that were presented in earlier lessons or other
contexts but does not offer any deeper understanding of how ideas
are connected, only that they are.
The case study researcher develops a very different portrait of
the dominant kinds of tasks offered and the kinds of learning they
promote. The case study writer concludes, “there is little
diversity or richness in the problems offered,” and,
“the majority of tasks are one-step applications of
definitions and theorems.” He describes, “Both in the
integrated geometry and in the geometry content of the Math I
course, she emphasized fundamental skills such as naming and
applying simple definitions. In the first period I observed of
Integrated Geometry the first set of problems all are based on
knowing definitions (e.g. altitude, median, congruent) or theorems
(e.g. corresponding angles congruent). The questions are one-step
applications of definitions where the only probe mentioned in the
problem, ‘How do you know?’ is more a
reference to naming the correct specific theorem used to solve the
problem.” He presents numerous examples of these kinds of
tasks. He concludes, “Ms. Fleming’s objectives across
the tasks she offered were centered around coverage of facts,
definitions and theorems students had memorized and not on the
development of particular skills or understandings of broader
concepts.”
The case study writer offers a description of the typical
discourse pattern, “There was one dominant pattern of
interaction around the tasks offered in Ms. Fleming’s
classes. My characterization of this pattern is based on three
observations in the context of two different mathematics
classes – Integrated Geometry and Math I. Ms. Fleming offered
students a set of problems, similar to those given earlier, in the
form of a worksheet. Ms. Fleming engaged students in a
Question-Response-Evaluate type of dialogue around the problems
offered on the worksheet. The pattern consisted of Ms. Fleming
going over the problems with the students as a whole class. She
would move in order through the problems on the worksheet they
were currently discussing and for each question, the pattern would
be essentially the same.” The case study writer explains,
“As can be seen from the example, Ms. Fleming asks a
question, or reads a question from the sheet to initiate the
conversation; next, a student responds to the question with a
specific piece of information, either a number, theorem name, or
yes/no with little or no emphasis on reasoning or justification;
in the next turn of talk Ms. Fleming evaluates the student’s
response, and then either gives a correct answer if the student
answer is incorrect, or poses a new question, which implies the
student answer was acceptable.” He notes two exceptions to
the pattern: (a) a TV game simulation where students are allowed
to call on one another for help if they aren’t confident of
the answer to a question and (b) a small group activity where
students worked together on a problem in groups of three or four
where, he notes, the groups often took on much the same dynamic as
the class overall: students worked on the same problems and
usually agreed on an answer, which other students in the group
then copied from those who ‘got it’.
Which portrait presents the more credible representation of the
teacher’s practice? What might explain the differences? The
difference in the quality of representation and reflection could
be attributed, in part, to the differences in format: spontaneous
comments in informal conversations and unprepared interviews are
unlikely to show the depth of the teacher’s considered
reflections. And, the written reflections may have been completed
with full access to curriculum resources and feedback from
colleagues. The handbook in fact encourages collaborative
reflection with colleagues. It’s also important to keep in
mind that the case study occurred at the very end of the year when
the teacher was reviewing for the final exam. Does this make the
classroom discourse atypical? Given the evidence, it is impossible
to know. [We address the issue of ambiguous evidence in more
detail in Schutz and Moss (in press).] Whether this would count as
a difference that matters depends, in part, on how portfolio
readers cope with the ambiguous evidence in the portfolio.
Additional Comparative Vignettes
We examined all 29 of the comparisons in ELA and Math at the
level of detail described and illustrated in the previous two
cases. In this section, we present vignettes from five additional
comparisons in which the differences we observed did seem to
matter in terms of the relevant criteria and raise, we argue,
dilemmas that assessment developers and policy makers should
consider in the design of assessment systems.
Consistent with the intent of the paper, our vignettes are
developed to foreground important differences for a particular
comparison; we do not describe, as we have above, the similarities
in these comparisons. In the interests of space, we summarize our
conclusions with brief illustrations. [We hope the extended
examples described above and in the Appendix illustrate the
attention to detail that underlies these conclusions.] Each
vignette follows a similar pattern: we first characterize the
issues and context differences that the vignette raises (so that
readers can choose whether to read the vignette) and then we
provide brief illustrations of those issues. (Note 14) This section concludes with a brief
mention of additional sorts of differences we noted but thought
were unlikely to matter in terms of the criteria used. We reserve
discussion of the issues the vignettes raise until the final
section of the paper where we propose some possible paths for
resolution.
Vignette 3: Mr. Richards
In this portfolio/case comparison we see a case in which an
English teacher's performance looks substantially different in an
honors class than in his third level class. Mr. Richards teaches
in what the case study researcher describes as a rural school of
about 600 students, 97% of whom are white. Distinguishing among
students' placements in the school's tracking system, Mr. Richards
indicated, "Honors kids are chosen because of their work ethic
and their intelligence." Students in the second level,
“have the work ethic, but they just can't grasp the
material. They will eventually, but their work ethic keeps their
nose above water.” For the students in the third level,
“the content is watered down.” While the
portrait of the honors class is relatively consistent across the
two methodologies, the case study highlights how it is that his
beliefs, as well as institutional tracking, seem to shape his
practice with students in different tracks. [The original
comparison was prepared by LeeAnn Sutherland.]
Only the honors class is represented in the portfolio as they
complete a poetry unit. The case study writer observed the honors
class as well as a third level class. Mr. Richards explained that
the poems for the third level are not as difficult; they use a
narrative, abridged version of the literature selection from their
textbook (honors classes read the unabridged version); he uses the
textbook much more; he has different expectations of
students’ writing (the focal "correction areas" are
different). Mr. Richards’s rationales for his choices of
texts, for the activities he employs to engage students with
literature, and for his implementation of a writing process appear
to differ according to his understanding of students’ level of
ability.
As the case study writer describes it, both honors and third
level students read the same novel at different times during the
school year, but Mr. Richards assessed their interpretive needs
and abilities differently. The goal for third level students was
more “the story” and “trying to pick
out the basic elements [such as] plot, theme." He believed
that honors students, however, “can go beyond the
literal." Honors students had "a lot of discussion …
a lot of note-taking, explaining the concepts,” whereas
third level students answered primarily lower inference questions
on worksheets. Of honors students, Mr. Richards required
out-of-class reading and book reports that follow a genre
sequence—first fantasy/science fiction, second historical
fiction, and the like. Students in the other class did a single
book report on a biography or autobiography, and they reported on
the book by creating a poster or doing an in-character
presentation to the class. They ended the school year with a novel
based on a made-for-TV movie which Mr. Richards acknowledged has
"absolutely no literary merit" but that he chose
"because students like it and because it’s
reading." Students wrote an essay at the end of the
unit, a personal narrative that did not require them to make
connections with the text itself.
Another example of the difference in opportunities provided to
students depending on level was in composition study. Honors
students prepared for 10th grade by writing a persuasive essay
that included MLA documentation. About writing, Mr. Richards said
that honors students would be "mortified" to conference
with him individually as "they don’t like to be
embarrassed." He typically worked with small groups of these
students in the first semester, he said, but did not require
“rewrites” of them in the second semester
because they had already "mastered the guidelines" of
revision. Mr. Richards writes in the portfolio narrative about two
additional, “authentic” tasks Honors students
would complete— entries for a poetry contest and composing a
group poem to be read at graduation.
In contrast to honors students, third level students met with
Mr. Richards for individual meetings about their writing.
Conferences took place at the front center of the room, facing the
class, with the teacher seated at a low table and the student
whose paper was being reviewed seated on a high stool next to him.
Mr. Richards marked student papers ahead of time so that he would
remember what to tell them "they need to fix." He
"counseled kids" by skimming their papers and calling their
attention to each item he had marked as problematic. The case
study researcher observed 15 conferences he held over two days.
Mr. Richards emphasized form and mechanics in these meetings,
including frequent references to spelling, contractions,
capitalization, use of second-person pronouns, writing numbers in
word form, and the need to include information in its proper place
in an essay. Students spoke to answer his questions or to ask for
clarification of his suggestions. Following those meetings
students were to "rewrite," which offered two options.
Students could “rewrite the entire paper and make all the
corrections" or they could rewrite problem words ten times,
sentences three times, and in addition, write three things that
they learned. Vocabulary study for students in this class
consisted of writing definitions, parts of speech, and sentences
using each word.
Vignette 4: Mr. Johnson
Here we have a portfolio-portfolio comparison across two
different subject areas within mathematics. Both portfolios were
generated by a novice middle school mathematics teacher working in
a community that he describes as white, suburban, and blue collar.
Mr. Johnson works at a large middle school with about 900
students, almost all of whom are native English speakers. The two
portfolios present 8th grade math courses; both classes use a
popular textbook series. The more advanced of the two is an
Algebra I course for "average" students. The unit presented in the
portfolio from this class concerns linear relationships,
particularly the generation of algebraic equations for lines. The
other portfolio is from a Transitions course for "general ability"
students. The unit from this course covers statistics,
particularly the generation of multiple types of graphs to display
data. In a close reading of the two portfolios, important
differences emerge relating to the use of ‘real world’
applications in classroom tasks, to modes of final assessment, and
to the role of the teacher in classroom activities. [The original
comparison was prepared by Jon Star.]
The first category of difference concerns Mr. Johnson’s
use of real world examples and concrete materials. Connections to
real world examples play a very prominent role in the tasks in the
Transitions portfolio. Several of the lessons in this unit begin
with students collecting data that is subsequently made into a
chart. For example, students count Fruit Loops cereal pieces to
determine which colors occur most frequently; they work with box
scores of a basketball game; and they cut paper plates in their
examination of pie charts. In contrast, context plays little or no
role in the Algebra portfolio. Students work exclusively with
symbolic equations of lines: these equations are never given any
referent or context, nor are any real-life situations embodying
linear relationships introduced in class.
A second salient difference concerns the way Mr. Johnson
assesses students at the end of the portfolio units. In the
Algebra portfolio, the teacher assesses students in a traditional
manner -- using a written test, administered in a single class
period. Students are asked to complete 23 problems, all clustered
around the execution of procedures (finding an equation of a line
given a point and a slope, finding the equation of a line given
two points, and converting a line from point-slope form to general
form). There is significant repetition: for example, clusters of
four or five problems look identical, with only the numbers
changed from one to the next. In contrast, the final assessment in
the Transitions portfolio requires students (in groups) to collect
data, construct graphs, and give an oral presentation. This
assessment takes several days to complete; students are assessed
on the quality and accuracy of their graphs and on their oral
presentations. At the conclusion of this assessment, students meet
individually with the teacher to discuss their grade.
A third difference concerns the role Mr. Johnson appears to
take in conducting classroom activities. In the Transitions class,
the teacher seems to view his role as one of a background guide;
his actions and his commentary consistently indicate that his goal
is to largely remove himself from classroom activity. For example,
in one lesson plan, he writes that he plans to "step into the
background and let students proceed with their work on their
own." In another lesson, the teacher makes an explicit attempt
to re-direct questions posed to him back to students (and he
subsequently reflects that he was very happy with the results). In
general, almost all of Mr. Johnson’s lessons in this
portfolio consist of students being given a worksheet or an
activity to do in groups; the teacher spends much of each class in
the background, circulating from group to group and answering
students' questions when they arise. In contrast, the teacher
portrayed in the Algebra portfolio is much more directly involved
in student activity. Almost all lessons in the Algebra portfolio
involve the teacher conducting a recitation: standing in front of
the classroom, he demonstrates a procedure, asks frequent
questions of the class to guide him through his demonstration, and
then offers problems that the class should do for practice.
Although students are involved in these recitations via the
teacher's questions, the teacher is largely controlling the
activity and problem-solving that occurs in most classes. The
teacher writes that he views the recitation style of instruction
as appropriate for the more advanced Algebra class but less so for
the low-achieving students of the Transitions class: "Low
achievers and behavior disorder students could not stand more than
ten minutes of lecture... The style that works best for them is
more of an activity based learning."
Vignette 5: Mrs. Martin
This ELA portfolio-case comparison raises a complex chain of
issues: (a) we see the same class at two different points in time
engaging in substantially different learning activities; (b) we
learn from the case that some practices illustrated in the
portfolio were not consistent with the teacher’s practice,
were undertaken because the portfolio handbook prompted them, and
were not consistent with what she believed were her
students’ needs; and (c) this then causes us to question the
teacher’s judgments about her students’ capabilities.
[The original comparison was prepared by LeeAnn Sutherland.]
The teacher in this portfolio/case comparison works at a school
characterized by the case study writer as a “large inner
city school.” The students who attend this school are
predominantly Hispanic from poor and working class families. For
this teacher, Mrs. Martin, both the portfolio and the case study
are based on a 9th grade Writing Enrichment course. The course was
developed for students in a “transitional program” who
are “too old to be in Middle School” at 15-16
years of age, but are “earmarked as an at-risk
group.” There are 13 students in the class, 10 of whom
are bilingual; 3 are identified as special education students. The
portfolio literature exhibit is comprised of one 4-page short
story and one poem which is integrated with the writing exhibit.
The case study focuses on a drama unit with a two-page play from
an adolescents’ literary magazine as the primary text. The
case study writer observed two sections of the same writing
enrichment course.
The texts used in the literature section of the portfolio were
selected to focus on the theme of strong, courageous women. The
teacher characterizes her goals as helping students see the
connection among these pieces of literature and their own lives,
wanting them to “see the potential within
themselves”. The lessons focusing on these texts, across
more than 10 of the 15 days represented in the portfolio, take
students through a series of activities including the completion
of several charts focusing on elements of literature such as
character, theme, and imagery, a 2-paragraph
“mini-essay” describing one of the characters, and a
culminating essay in which students write three paragraphs which
compare a character from the short story with a character from the
poem. The comparison writer notes that the time spent on these
brief and straightforward texts seems excessive and that students
have little opportunity to develop their own ideas. The
teacher’s reflections suggest that she believes students
need this level of support to comprehend the story. She indicates
that “students have difficulty decoding words”
and that each story read in this course begins with an
uninterrupted oral reading (by the teacher) that gives the
students “the opportunity to hear the story first, get a
basic idea of the plot of the story, and minimize the frustration
of difficult vocabulary.” She notes that
“focusing on a few skills and then building on them,
ensures a complete understanding, and more importantly, retention
of the lesson. For this group of students, retention is the
key.”
In her portfolio, Mrs. Martin provides commentary and videotape
of two students as they conference about essays they have written.
Each of the students offers observations to the other, and the
author is seen to respond to those observations. They discuss
thesis statements, paragraphing, use of examples, and proper
citation form including line numbering [for the poem]. They also
discuss parts of the essay they found difficult to understand. The
comparison writer argues that this is one of the better
student-student conferences seen in portfolios, as participants
are actively engaged in dialogue about writing. While it is not a
substantively rich conference, many of the writing conferences
seen on videotape are teacher-directed, or the students speak
to one another but not with one another. The two
students seem to “get” the idea of how to engage in a
writing conference.
However, when the case study researcher asked the teacher a
question from the interview protocol, “Did the portfolio
involve things that were not part of your teaching
practice?” Mrs. Martin responded: “I thought the
peer editing and peer responses were phony for me because I
don’t do that yet. The kids were not really ready
yet.” If the teacher coached the girls on how to talk
for the camera, then that raises one set of issues. If she did
not, but simply asked them to conference, even though she usually
does not have them do so, then their relative success in the
conference raises questions about the teacher’s judgment of
students’ abilities.
The case study writer saw no writing in response to literature
and no process writing during the time of his observations. The
only writing he saw involved lists and definitions associated with
vocabulary words. Asked whether what the case study writer had
observed over the three-day period was “typical of your
teaching and your classroom,” Mrs. Martin indicated that
“the class is a writing enrichment class and most of our
time for the whole year was spent on enrichment.” She
stated that previous writing assignments for the course had
followed the [state’s] test format and students had also
written autobiographical essays. Neither source of evidence
provides examples of these types of writing.
About literature, Mrs. Martin said that the portfolio
requirements—again—did not jibe with her usual
practice: “I think having 7-8 hours of literature did not
really fit the curriculum I have for these kids.” She
told the case study writer that these students are not ready for a
novel or for the “7-8 hours of literature”
required for the portfolio, that they struggle with decoding, and
that students needed to read the screenplay twice in order to get
it. The case study writer observed the teaching of the magazine
play, and he reports that students’ oral reading over two
days was relatively fluent, and though little attention was paid
to students’ understanding of the play, students’
verbal comments, a question one student asked, their expressive
reading, and other verbal cues indicated that they did, indeed,
comprehend this particular text as they were reading. Again, this
raises questions for us about the teacher’s judgment of her
students’ capabilities and needs, questions that the
evidence is insufficient to address.
Vignette 6: Mr. Gere
In this vignette, we encounter a teacher who indicates that he
chose to use his portfolio as an opportunity to improve certain
areas of his teaching. In the portfolio he presents an Algebra I
class, where he reports the students reflected a wide range of
abilities and dispositions. The case study writer observed the
Algebra I class and a Pre-Calculus honors class. We learn from the
case that Mr. Gere had originally decided to focus on the
Pre-Calculus honors class for the portfolio but switched to the
Algebra I class because he felt the class needed extra attention.
In the portfolio, he indicates that he hoped the portfolio would
help him focus in on the difficulties he was having and turn them
around. [This vignette is based on the comparison originally
developed by Pamela Geist.]
While the two portraits of teaching present quite consistent
evidence about this teacher’s practice, the interpretive
commentary that surrounds them leaves the reader with a
substantially different impression about the effectiveness of his
teaching. Unlike the situation with Ms. Fleming, where her
interpretation highlights the strengths in her practice, Mr. Gere
focuses, it seems relentlessly, on his concerns about his teaching
and what his students are learning. The case study, then, presents
the teacher in a far more positive light.
The case study writer, acknowledging the predominance of
procedural work and the ongoing focus on manipulating expressions
and equations, nevertheless develops a picture of a teacher who
encourages students to look at underlying ideas and explore some
of the logic associated with working the procedures. The case
study writer concludes, “In all six of the classes observed,
the students’ oral responses, questions, homework, classwork
and quizzes indicated that Mr. Gere’s expectations were
accessible to most students.” She notes some differences
between Pre-Calculus and Algebra. In the algebra class, she
reports, the pattern was one of
fairly routine mechanics: first distributing with algebraic
expressions, then factoring algebraic expressions, and finally
solving quadratic equations that were factored and set equal to
zero. In the pre-calculus class, although the work appeared to be
quite mechanical, there was more problem solving involved because
of the number of possibilities when finding equations that fit a
set of data points. However, she also notes: “As the
material developed over the three days, the students played a
bigger role in the dialogue, offering their own strategies for
finding the equation from a set of data points. Mr. Gere also used
open-ended questions effectively: for example, about a quadratic
equation, in standard form, students were asked to give its
characteristics, in other words, to tell him what they could about
this function. The responses were extensive and showed depth of
knowledge about a quadratic.” The case study writer
creates an overall image of a fairly successful teacher, one who
takes his work seriously, is well-liked and respected by his
students, and works hard to create a practice that meets his goals
and expectations.
The comparison writer notes the similarities between this
representation and what she sees in the artifacts of the
portfolio. The portfolio artifacts show a similar continuum of
difficulty on daily worksheets, quizzes, and tests. Problems begin
with simple equations and progress to more complicated ones.
Initial tasks focus on the procedural steps to solve problems and
move toward using these steps in context. There usually is one
task that requires students to explain an idea or the logic
underlying steps. In this sense, both reports show that tasks
become progressively more difficult because they require that
students know more about the different scenarios represented in
algebraic equations and how to manipulate more complex expressions
and equations. In large group work, Mr. Gere demonstrates
procedures and talks students through his logic of the steps. Mr.
Gere asks next-step questions of students and students answer Mr.
Gere directly. And he effectively and accurately demonstrates the
procedures for students, using appropriate mathematical language
and notation to demonstrate how a system is solved, and students
practice and memorize the procedures eventually making them their
own process. Mr. Gere promotes student-to-student discourse in the
context of small group work and pairing students together to
complete a task. The video evidence illuminates that for the most
part students work productively in pairs and small groups
explaining to each other how to proceed with a task and comparing
procedures and answers with each other.
And yet, Mr. Gere’s reflective commentary on his practice
paints an entirely different image of his success. For instance,
talking about the difficulties he faced in facilitating
discussion, he writes “I regretted not soliciting a
variety of problem solving methods for this exercise and again
bypassed potentially rich mathematical discussion in the interests
of time. The decomposition of the problem’s solution into
discrete steps was worthwhile and helpful, but again lost
something due to the more directed discussion that resulted from
my sense of time pressure.” He notes further:
“My responses to students’ questions also reflect
my impatience such as the response to student A when I don’t
even let him finish his question before answering. I give him a
perfectly accurate and reasonable answer, but the tone of
impatience is more damaging in other areas. Another student
question is similar in outcome. I quickly give an accurate concise
answer to his question but would have benefited the other students
with the same misunderstanding by instead redirecting his question
to a few of the weaker students to make sure their understanding
was solid.” He worries “I have begun to
recognize that I have slowly adopted more and more of the
students’ inclination to ‘just let me see how to do
the problem so I can stop thinking.’” In fact, he
describes what he perceives to be an ongoing decline in
students’ efforts to succeed in his Algebra I class:
“I know that the effort level has declined precipitously
over the past 1 1/2 months in this class, and I worry that I am
enabling the very destructive tendencies that are plaguing this
class.”
Thus, there is a running theme across his reflections, one that
details the frustrations and disappointments of not being able to
change students’ attitude. The image he creates in the
portfolio is of a teacher struggling with changing his teaching
and at times, there is a sense of hopelessness. Because there is
little offered in the portfolio in the way of a rich analysis for
how he intends to turn this pattern around, the portfolio writing
produces the image of a teacher who sees himself as mostly
ineffective and struggling with supporting richer opportunities in
the discourse and at the same time offers few ideas for how to
change current patterns. In effect, the case study report paints a
much more positive image of the discourse patterns, indicating
that procedural goals are getting met through the patterns of
discourse and at times, especially in Pre-calculus, the discourse
supports a deeper and richer investigation into the mathematical
ideas. While the comparison writer, perhaps cued by the image in
the case study, was able to read behind Mr. Gere’s
commentary, this portfolio (which was also used in another study
involving multiple readers) elicits quite different reactions
depending on the weight the reader gives the teacher’s
negative commentary about himself. Whether this is a difference
that would matter depends on whether or not the portfolio readers
are willing and able to read behind the teacher’s
commentary.
Vignette 7: Mrs. Jacobson
In one sense, this portfolio/portfolio comparison provides
another example of a teacher whose practices look different in
classes she characterizes as comprised of students with different
ability levels. In this case, we observe differences in the
teacher's demeanor and attitude toward students in the two
classes. We also note that her expectations, her explanations for
her choices, and her reflections on students' performance in the
second portfolio (unlike the primary portfolio) are sometimes
framed in terms of cultural and linguistic differences. [The
primary comparison was prepared by Laura Haniford, drawing on
documents from Steve Koziol, Leah Kirell, and Suzanne Knight.]
This teacher, Mrs. Jacobson, submitted portfolios for two 7th
grade classes that are “theoretically heterogeneous”
but that are actually grouped, as she reports, based upon the
scheduling of math classes. There are 26 students in the first
class and 28 students in the second class; there appear to be 2-3
students of color in the first class while students of color are a
majority in the second class. The teacher is white. The base text
in both classes is a trade novel set during World War II and is
part of her department’s prescribed curriculum.
Mrs. Jacobson characterizes her students in the first class as
“bright and fun” and states that her
expectations for students reading this novel are that they
“learn the historical and cultural ramifications of World
War II. I intended that students examine the personal struggle of
the innocent civilians victimized during the war and the
incredible strength and courage of the survivors.” She
also states that this particular selection exposed the students to
diverse perspectives “other than the black/white issue
which is pervasive at this school.” In contrast, Mrs.
Jacobson characterizes the students in the second class as
“behavior problems” and her expectations for
them are different. She states that she would not have chosen this
book for them and that “the majority of these students
cannot--or will not--read it and understand it. These students are
intensely committed to being Black or Hispanic and did not relate
to the Holocaust…. They love violence and
injustice—most kids their age do…. [But] This was far
too sanitized for them.” Mrs. Jacobson also states that
she is more concerned that the students in the second class learn
the history of WWII as opposed to understanding any elements of
plot or character.
The teacher begins each class with a daily oral language (DOL)
experience. In class one, this takes many forms – open-ended
questions about literary terms followed by a discussion of some
examples from the novel and from students’ own experience; brief
comprehension questions on the reading; a vocabulary exercise
where extra credit is given for making the teacher laugh; a brief
review of grammatical terms. In class two, the DOL is
consistently a recitation/review of questions on the assigned
reading with answers given “swiftly” and
written answers handed in at the end of the week. Commenting on
an interaction in class one, she writes, “In the future,
I might take a hint from this class and compare movies they may
have seen with the books we’re reading. I always try to
relate what we’re doing to their own lives, but they like to
talk about movies.” Of class two, she writes
“With this group, I have to lead them with a strong hand,
although I try very hard not to tell them what they
‘should’ say: I want to hear what they want to say,
even if it is immature or downright silly.”
In both classes, students’ written response to literature
is related to preparing a five paragraph theme and to addressing
the state’s criteria for a persuasive essay. Beyond this,
her stated goals for the first class include learning to appeal to
all five senses, to write “great opening lines” and to
“engage” readers; for the second class, her goal is getting
students to write “something--anything.” In the first
class, the primary assignment asks students to take a position on
whether they would take in an escaped prisoner of war who came to
their home seeking refuge. In the second class, the same primary
assignment is given, together with an alternative prompt related
to a reading on the Civil Rights movement, because “They
identified with the black students.”
In class one, the primary assignment is grouped with two
others, a personal time narrative and a group diary designed to
personalize the story for students; there are no surrounding
assignments in class two and students in the second class are not
given the opportunities to work with one another that the students
in class one are. Samples of writing from each class suggest that
students understood the demands of the assignment and could
respond to it appropriately. Commenting on her concerns about a
student’s writing in the first class, she says: He
“does not create an effective visual in the opening
paragraph. Also…[he] does not respond to the opposition in
his fourth paragraph. He only states that there is another
side.” Of one student in the second class, she writes:
“[He] did not follow the guidelines, either. This
child’s family speaks Spanish in the home, and he had made
great improvements since September. He has learned to skip the
lines to make paragraphs, and he is writing sentences, rather than
one long sentence.”
The guidelines for the state's writing tests are the focus of
formal writing instruction in both classes and of video segments
on writing. In the first class, Mrs. Jacobson’s
introduction is brief, mainly an overview, and students have an
opportunity to look at some samples and to begin working in a
writing workshop format on their own essays – students read
aloud some of their drafts, they work together in peer editing,
the teacher guides the critique of samples, drawing from the
student samples to deal with topics in language use (e.g., using
over-used words), and she confers with individual students. On
the video, the teacher circulates around the room, talking with
individual students about their work. Overall, the interactions
appear positive and supportive.
In class two, the teacher guides students through a series of
questions about the state’s writing test guidelines, seeking
responses about what is to go into each paragraph and elaborating
on student responses. She moves to a whole class example –
on the topic of what if there were no teachers – which
begins to generate student responses, although the teacher appears
(on video, as well as in her comments) to be frustrated that the
students don’t seem to understand how to give reasons for
the “other side,” which she says is required by
the state’s guidelines. There is no small group work:
students are in whole class activity or working on their own; when
they are writing, the teacher circulates and has occasional
interactions with students.
Based on our observation of the video, the teacher's management
in class one appears to be smooth; students move from one type of
activity to another and from one arrangement to another with
little disruption; the teacher comments that this group is
especially active and noisy, although that doesn’t appear to
be evident from the tapes. In the second class, management issues
dominate more of the dialogue. The teacher writes:
“running a discussion with this group is like walking
through hip-deep jello. With every remark comes ambient noise and
chatter which drowns it out and everything has to be repeated. In
fact, as I watched this segment, I was bored just listening to
myself repeat the instructions more than 10 or 20 times. Virtually
nothing got accomplished.” In addition to this, Mrs.
Jacobson has several extended disagreements with individual
students in the second class that are conducted in front of the
entire class.
Mrs. Jacobson’s reflection about her teaching and about
students’ learning is not detailed or extensive. With class
one, she notes that some of her assignments were too vague and
that she was unprepared for how capable her students were,
something she would better prepare for in the future. She thinks
she will add some drama in the future, because the group would
have done well with this kind of reading and activity. With class
two, she notes that she was not particularly effective as a
teacher, but attributes this primarily to being required to teach
an inappropriate text and having to follow a district mandate that
doesn’t fit the students. She notes, “Sadly, any of
these students’ real problem is behavior. If they
would listen, if they wanted to produce, they could. Peer pressure
and stress at home makes it nearly impossible for them to succeed.
Patience and in-class time to do their work does increase their
chances of doing acceptable work.”
Additional Differences
Substantive differences existed in all the comparisons, as we
would expect in any dynamic teaching situation. It’s
important to note, however, that in the majority of cases examined
in math and ELA, we found that the second data source elaborated
but did not overturn our general impression of the quality of the
teacher’s performance with respect to the relevant criteria.
In some cases, we simply saw the same practices instantiated in a
different content; in some cases we saw somewhat different
practices that, taken together, presented a coherent portrait
across the two (e.g., Ms. Bertram) or clarified an ambiguity in
the original portfolio (e.g., Ms. Fleming); in some cases we saw
differences similar to those we represented here but not so
substantial as to overturn our judgment of the primary portrait.
Portfolios that contained inconsistent evidence--in which the
artifacts did not fully support the teacher’s descriptions,
as with Ms. Fleming--complicate the question of portfolio
generalizability with the problem of interpreting the initial
portrait. [We discuss issues of portfolio evaluation elsewhere
(Schutz and Moss, in press).] In some cases, we learned things
about the teacher in the case, which may not have been relevant to
the criteria, but which shaped our judgment of the teacher. For
instance, in one case (a likely “conditional” score),
the case study researcher observed multiple situations of
conflict, at least one potentially violent, during and outside of
class that the teacher skillfully resolved. In fact, we often
learned about the teacher’s relationship to, rapport with,
and work with students outside of class. We also learned about
numerous factors that influenced the teachers’ performances
that would not likely be mentioned in the portfolio or illuminated
by the criteria if they were: the presence or lack of a coherent
curriculum and/or text that is consistent with the standards; the
presence or lack of a supportive mentor in the teacher’s
subject area; large differences in professional development
opportunities and opportunities for collaborative work with
colleagues; differences in resources available to prepare the
portfolio (including release time and access to video equipment
for multiple days). Of course, whether and how these factors
influencing a teacher’s performance should or even could be
fairly taken into account in this assessment is an open
question.
Conclusions
As we indicated in the introduction, our goal is to use these
comparisons to illuminate issues for assessment developers to
consider in designing assessment systems. Consequently, our
analysis was disconfirmatory: It was not intended to document
consistency but rather to highlight the kinds of differences that
can occur across different representations of a teacher’s
practice and that point to potential problems with implicit
assumptions about generalizability. We want to caution readers
against drawing conclusions about the typicality of our
comparisons. There is no way to know how these
volunteers--teachers who were willing to complete a second
portfolio or to allow an observer into their classrooms for 3-5
days--might differ from the larger population of beginning
teachers. However, the dilemmas we have found--which would not be
illuminated in data that are routinely collected--highlight
important issues for educators, assessment developers,
psychometricians, and policy makers to consider.
We begin our conclusions with a review of the kinds of
differences that seem likely to matter (that is, likely to result
in different performance levels) in terms of the relevant
criteria. Then we return to the useful concepts of
generalizability with which the study was framed. What is the
assessment domain (or “universe”) to which we can
safely generalize? What is the (larger) outcome domain about which
we can reasonably draw inferences supported with logical arguments
and intermittent empirical studies? How consistent are these
domains with the domain implied in the decision about licensure?
We close with some more speculative thoughts about the nature of
assessment systems (and theoretical resources) that might support
well warranted decisions about teaching performance.
What Have We Learned about the Generalizability of Teaching
Portfolios?
The comparisons in this study begin to address an important gap
in our understanding of the generalizability of portfolio
assessments of teaching and, perhaps, of performance assessments
of teaching more generally. Taken together, these vignettes raise
a number of concerns, some of which relate directly to the topic
of generalizability and some of which spill over into concerns
about validity and ethics.
In the small set of comparisons we’ve examined here, it
is very clear that context matters. We’ve shown
differences in performance across classes that differ in
(perceived and/or institutionally designated) ability level of
students, in subject matter taught, and in cultural background of
students. For instance, in the case of Mr. Johnson, we saw
differences in performance across two subject matter domains:
statistics and linear equations in algebra. Perhaps it is easier
for novice teachers to develop “rich and challenging”
tasks that foster “connections” and “reflect
students’ interests, styles and experiences” in some
domains than in others. We've presented two clear examples of
differences in performance across classes that differ in perceived
ability level (Mr. Richards and Mrs. Jacobson). And we found other
cases (not described here) in which the differences were apparent
but far more subtle (as might be seen in the difference between
the Pre-Calculus and Algebra classes of Mr. Gere). In Mrs.
Jacobson’s case, perceived differences in ability were
coupled with differences in the cultural background of her
students. The differences in performance here are more troubling
because of the teacher's apparent attitude toward the students and
tendency to seek explanations of their performance outside her
practice, in district requirements and in their perceived needs as
members of different cultural groups. The rubric has no place for
descriptions of teachers' expectations and, indeed, if it did, it
would be easy to coach a teacher to eliminate problematic language
from her text.
That context matters will come as no surprise to those who
study classroom teaching or performance assessment. There are
complex and dynamic relationships among teachers’ social
backgrounds and experiences; their expectations, values, and
beliefs; their classroom practices; their students’
(inter)actions; and the larger social and institutional structures
in which they live and work (Gallego, Cole, and the Laboratory of
Human Cognition, 2002; Knapp and Woolverton, 1995; McLaughlin and
Little, 1993; McLaughlin, Talbert, and Bascia, 1990; McNeill, 1983;
Stodolsky and Grossman, 1995). Research in performance assessment
more generally, with tasks that are far narrower in scope than
those represented in teaching portfolios, shows us that different
people perform differently on different tasks (the person x task
interaction, in terms of generalizability theory), which
necessarily confounds the construct of interest with variations in
the context in which it is performed. A recent review of
approaches to performance assessment in health professions
(Swanson et al., 1995) leads to similar conclusions about the
difficulty of generalizing across the contexts presented by
different tasks. “Regardless of the assessment method used,
performance in one context (typically, a patient case) does not
predict performance in other contexts very well” (Swanson,
Norman, and Linn, 1995, p. 8). The social context of a classroom
seems even more complex than that of a health professional-patient
relationship. While both are certainly equally embedded in
societal and institutional structures, the classroom involves
dynamic relationships among as many as 30-35 individuals,
each with their own cultural/personal backgrounds that vary in
ways we can’t predict. Gallego and colleagues (2002) argued
that “every continuing social group develops a culture and a
body of social relations that are peculiar and common to its
members…. Hence,… we can expect that every classroom
will develop its own variant” (p. 992).
Two recent reviews of assessments of teaching (NRC, 2001b;
Porter, Youngs, and Odden, 2003) both raised concerns about the
lack of evidence of teaching performance across differing
classroom contexts, and our observations support those concerns.
It is hard to imagine, however, how a single assessment program
could adequately (and fairly) address those concerns. One could
ask for samples of teachers' performance in different classroom
contexts, as we tried to do, and yet the range of contexts available
within teachers’ yearly class loads varies quite substantially
from teacher to teacher, and all are considerably narrower than
the range of classes and school contexts in which they are
licensed to teach. (Note 15)
One could imagine other kinds of assessments in which teachers are
presented with cases from a range of classroom contexts, and this
might provide some relevant evidence; however, asking teachers to
plan or evaluate activities for students with whom they have
little experience would raise other kinds of validity questions.
And the experience in health-related professions with these sorts
of simulations suggests that questions of generalizability are
likely to remain. There are no straightforward solutions.
The case of Ms. Martin raises a second issue directly relevant
to generalizability. Here we find a teacher who perceives that she
is in the position of being required to show evidence of a
performance that is outside of her routine teaching practice. Does
that suggest the portfolio guidelines were too directive or
restrictive? Experience with National Board assessments has led
developers to conclude that it is important for teachers to
understand what is valued in the assessment; being explicit about
expectations, within the bounds of construct relevance, is
considered important for validity and fairness (Pearlman, in press
a, b), and INTASC has emulated their practice. Clearly, portfolio
assessments of this sort do not support conclusions about what is
typical. What we learn with a “passing” portfolio is
whether a teacher and a group of her students can engage in a
particular kind of practice and reflection in at least one
instance. Teachers may, of course, make choices that are not in
their best interests, as was the case with Mr. Gere, who chose a
class with which he was struggling and then emphasized his
shortcomings. While this is commendable and productive for a
professional development activity, it is less than strategic for a
high-stakes assessment. Careful instructions to candidates, and
examples of successful portfolios, will be important in helping
teachers demonstrate the strength of their practice with respect
to the standards. It is important to recognize, however, that not
all candidates will have commensurate opportunities to illustrate
their practice. Assessors should try to make sure that teachers
have the human and material resources they need, including
adequate time, access to competent mentors, and access to
audiovisual services. Of course, assessors cannot control
teachers’ work assignments or the schools in which they
work. We have to recognize that these factors influence the extent
to which teachers can demonstrate a performance consistent with
higher scores and design a system that is appropriately skeptical
of the validity of its conclusions about individual teachers.
Not surprisingly, differences across methods used here also
played a role, with different methods being more or less adequate
in providing evidence relevant to different criteria (as we
discussed above under structural differences). The portfolio
typically offered the teachers in our comparison a better
opportunity to explain their choices and reflect thoughtfully on
their teaching (although with a skillful interviewer, one could
imagine the opposite for some teachers who are uncomfortable with
writing); the case study provided more evidence (six full classes
vs. brief videotaped segments from two featured lessons) that
allowed stronger inferences about the pattern of discourse in
class. Of course, either method could be revised to better address
these concerns. If we want to draw conclusions about patterns of
classroom discourse, having access to two lessons may be
insufficient, especially if they do not support the written
description in the portfolio. Clearly, more research on how much
classroom evidence is needed to support such conclusions would be
beneficial.
While criteria were not varied in this study (and are typically
considered fixed), there are clearly many different ways to
instantiate the INTASC principles in specific criteria tied to
available evidence. Consider, for instance, the following two
principles taken from the ten INTASC (1992) principles on which
the subject specific standards are based:
Principle #2: The teacher understands how children
learn and develop, and can provide learning opportunities that
support their intellectual, social and personal
development.
Principle #5: The teacher uses an understanding of
individual and group motivation and behavior to create a learning
environment that encourages positive social interaction, active
engagement in learning, and self-motivation. (p.
16)
The portfolio assessment situates evidence and criteria
relevant to these principles within particular subject matter
contexts where particular approaches to learning are privileged.
Alternatively, assessment developers could, as PRAXIS III
assessments do, frame criteria and evidence more generally.
Consider the following criteria drawn from PRAXIS III Domain
B:
B1: Creating a climate that promotes fairness
B2: Establishing and maintaining rapport with students
B3: Communicating challenging learning expectations to each
student
B4: Establishing and maintaining consistent standards of
classroom behavior
B5: Making the physical environment as safe and conducive to
learning as possible
(Dwyer, 1998, pp. 21-22)
When we have asked INTASC portfolio readers what they would
like to attend to that isn’t addressed in the rubric, among
the issues that repeatedly arise are teachers’ relationships
with their |