
In explicating SEF, many closely related issues must be substantially bracketed. These related issues include (1) its validity (Cahn, 1987; Damron, 1996; Greenwald, 1996; Greenwald and Gillmore, 1996; Scriven, 1993; Seldin, 1984; Tagomori, Bishop, and Laurence, 1995), (2) the problem of defining teaching effectiveness, (3) general variables affecting SEF scores, (4) alternatives to SEF, (5) classroom politically correct or popular standards and perceptions, (6) low student academic preparation, (7) age and gender discrimination issues (Feldman, 1983, 1993), (8) strategies for change, and (9) other integrally related issues such as their being largely responsible for lowered course standards, and grade inflation. Though these are important and related issues, they can only be addressed here insofar as they directly impact the focus on SEF and academic freedom.
Robert Haskell suggests that we do away with ratings....The thrust is simply that we should do away with student ratings.... I am targeting the use (perhaps, misuse) of these issues in efforts to discredit or do away with faculty evaluation practices in general and student ratings in particular....building his case that ratings were an essentially unreliable form of data that should be done away with.
It is important to note at the outset, that it is not SEF per se that is the issue, but the impact of its use on salary, promotion, tenure decisions, and its impact on the delivery of quality education.
In addition, in endnote #6, I said again:
If used correctly (see Copeland and Murry, 1996; Kemp and Kuman, 1990; Scriven, 1995, 1993, 1991; Seldin, 1984), SEF can be very useful instructionally, and when used in conjunction with other methodologically sound evaluation procedures and criteria, it can assist in informing an institution when a faculty does not pass muster as an effective teacher.
SEF may indeed validly describe incompetence...given (a) the conflicting data on their validity, (b) the way many institutions have constructed SEF instruments, (c) the often unsystematic statistical method by which SEF are interpreted, and especially (d) given the considerable weight accorded negative comments by only a few students in making tenure and promotion decisions, it would seem SEF can all too easily be used as a covert instrument for the elimination of tenure candidates and other faculty who may threaten student tuition dollars and perhaps ideological and popular culture agendas.
Lately, there has been a lot of discussion attempting to causally link student ratings to problems such as "grade inflation".... "Ratings" become the cause of the downfall of higher education.... Robert Haskell (1997) suggested that ratings are.... the cause of grade inflation....it is very risky to point to one reason for these changes.
largely responsible for lowered course standards, and grade inflation....are responsible for a considerable amount of grade inflation.
seen as not only a novel idea, but as an attack on either students, or a general attack on evaluating faculty.
It should go without saying, that not all students are the same. SEF vary by maturity, intellectual level, i.e., graduate student evaluations v. undergraduate.
Many students understand the above described ensuing consequences. A glance at articles from online student newspapers reveals strong sentiments against what many students consider the erosion of standards created by SEF. One student writer went so far as to say, "We therefore suggest a boycott of the 1995 student/teacher evaluations. This boycott will provide a more effective means of communication than anything written on the evaluation itself. Something must be done about the trend of grade inflation. We as students refuse to contribute to the downfall of academia" (Stern and Flynn, 1995). Some students are thus quite aware of the effects of SEF on their education.
ratings are collected before students get their final grades and thus, their opinions must be based on expectations which are, in turn, based on performance to date. Ratings thus can not be said to reflect a disconfirmed expectancy about the overall course outcome (i.e., the course grade). This would only be the case if ratings were gathered after final grades were distributed and the final grade disconfirmed what was expected as a result of the experiences during the semester. So the ratings relationship is limited to experiences and results during the term rather than to the final grade.
The question now becomes one which deals with: 1) the appropriateness of course content; 2) the standards used for grading; and 3) the question of whether lenient coverage and grading were deliberately chosen in order to influence ratings. The first two items can be and should be dealt with via curricular mechanisms such as departmental review of courses and content, and faculty agreement on standards for student work.
Granting for the moment SEF validity and reliability (I would probably even accept their reliability), I would like to address 1) and 2) in the above quote (I addressed the third above). On a collective level, I agree with Theall that faculty have not been responsible in developing and enforcing standards. In fact, in endnote #34 I wrote:
This is not intended as a blanket apologia for academia. There are many problems within the academy. In many other areas, I am a severe critic of my colleagues' collective behavior.
Haskell included a long quote from a chapter Jennifer Franklin and I (1990) did in our New Directions for Teaching and Learning issue #43 (Theall & Franklin, 1990b). While our point was that ratings practice must be improved, Haskell used the quote to supplement other citations as evidence in building his case that ratings were an essentially unreliable form of data that should be done away with ... a gross misinterpretation of our intent.
I am perplexed for two reasons. First, it is really neither here nor there that the quote apparently does not reflect Theall's (et al.) intent. Whatever an author's intent was in using a quote does not necessarily reflect on other uses of the material, especially when the quote is not used to suggest an author's position.
Even given the inherently less than perfect nature of ratings data and the analytical inclinations of academics, the problem of unskilled users, making decisions based on invalid interpretations of ambiguous or frankly bad data, deserves attention. According to Thompson (1988, p. 217) "Bayes Theorem shows that anything close to an accurate interpretation of the results of imperfect predictors is very elusive at the intuitive level. Indeed, empirical studies have shown that persons unfamiliar with conditional probability are quite poor at doing so (that is, interpreting ratings results) unless the situation is quite simple." It seems likely that the combination of less than perfect data with less than perfect users could quickly yield completely unacceptable practices, unless safeguards were in place to insure that users knew how to recognize problems of validity and reliability, understood the inherent limitations of rating data and knew valid procedures for using ratings data in the contexts of summative and formative evaluation. (79-80).
The authors conclude by noting "It is hard to ignore the mounting anecdotal evidence of abuse. Our findings, and the evidence that ratings use is on the increase, taken together, suggest that ratings malpractice, causing harm to individual careers and undermining institutional goals, deserves our attention." (pp. 79-80).
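Thompson's point about imperfect predictors can be made concrete with a short calculation. The following is only a minimal sketch: the base rate, hit rate, and false-alarm rate are hypothetical values chosen for illustration, not figures from Thompson (1988) or from any ratings study, and the function name `posterior` is likewise an assumption.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive signal), computed with Bayes' theorem."""
    true_pos = sensitivity * prior                    # flagged and truly ineffective
    false_pos = false_positive_rate * (1.0 - prior)   # flagged but actually effective
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs (assumptions, not data from the cited literature):
#   10% of instructors are truly ineffective,
#   low ratings flag 80% of them,
#   low ratings also flag 20% of effective instructors.
p = posterior(prior=0.10, sensitivity=0.80, false_positive_rate=0.20)
print(f"P(ineffective | low ratings) = {p:.3f}")
```

Under these assumed values, fewer than one in three instructors flagged by low ratings would actually be ineffective, which is the kind of result that intuition tends to miss.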
Conversations with faculty and administrators...led increasingly to concerns about what users [e.g., chairmen; deans] were doing with the information we were providing. We saw that some departmental administrators, who routinely use ratings to make decisions about personnel, evaluation policy, and resource allocation, were not familiar enough with important ratings issues to make well informed decisions...Clearly stated disclaimers regarding the limitations of ratings data in particular circumstances appeared to have little effect on the inclination of some clients to use invalid or inadequate data...There are some fundamental concepts for using numbers in decision making. To the degree that these concepts are ignored, interpretations of data become, at best, projective tests reflecting what the user (e.g., a chairperson or dean) already knows, believes, or perceives in the data. Treating tables of numbers like inkblots ('ratings by Rorschach') will cause decisions to be subjective and liable to error or even litigation...
Three types of errors come to mind immediately. The first involves interpretation of severely flawed data, with no recognition of the limitations imposed by problems in data collection, sampling, or analysis. This error can be compared to a Type I error in research -- wrongly rejecting the null hypothesis -- because it involves incorrectly interpreting the data and coming to an unwarranted conclusion. In this case, misinterpretation of statistics could lead to a decision favoring one instructor over another, when in fact the two instructors are not significantly different...(pp. 87-88)
The second type of error occurs when, given adequate data, there is a failure to distinguish significant differences from insignificant differences. This error can be compared to a Type II error -- failure to reject the null hypothesis -- because the user does not realize that there is enough evidence to warrant a decision. In this case, failure to use data from available reports (assuming the reports to be complete, valid, reliable, and appropriate) may be prejudicial to an instructor whose performance has been outstanding but who, as a result of the error, is not appropriately rewarded or worse, is penalized. (pp. 87-88)
The third type of error occurs when, given significant differences, there is a failure to account for or correctly identify the sources of differences. This error combines the other two types and is caused by misunderstanding of the influences of relevant and irrelevant variables. In this case, a personal predisposition toward teaching style... may lead a user to attribute negative meanings to good ratings, or to misinterpret the results of an item as negative evidence when the item is actually irrelevant and there is no quantitative justification for such a decision. Any of these errors can render an interpretation entirely invalid. (pp. 87-88)
Let us...state our goal in the following way: "The user will make decisions that are based on valid, reliable hypotheses about the meaning of data." In this case, the user should receive or construct working hypotheses that do the following things:
- Take into account problems in measurement, sampling, or data collection and include any appropriate warnings or disclaimers regarding the suitability of the data for interpretation and use.
- Do not attempt to account for differences between any results when they are statistically not significant (probably <.05).
- Disregard any significant differences that are merely artifacts (for example, small differences observed in huge samples, which can technically be significant but are unimportant).
- Account for any practically important, significant differences between results in terms of known, likely sources of systematic bias in ratings or reliably observed correlations, as well as in terms of relevant praxiological constructs about teaching or instruction.
- The user should also refrain from constructing or acting on hypotheses that do not meet these conditions...(pp. 87-89)...
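The warning about differences that are technically significant but unimportant can be illustrated numerically. The sketch below assumes a simple two-sample z-test with a common, known standard deviation; the means, standard deviation, and sample sizes are hypothetical, and nothing here is taken from Franklin and Theall's own procedures.

```python
import math


def two_sample_z(mean_a, mean_b, sd, n):
    """Two-sided z-test for two equal-sized groups sharing a known sd.

    Returns (z, p_value, standardized_difference).
    """
    se = sd * math.sqrt(2.0 / n)             # standard error of the difference
    z = (mean_b - mean_a) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail probability
    d = (mean_b - mean_a) / sd               # difference in sd units
    return z, p, d


# Hypothetical ratings on a 5-point scale: the same 0.03-point gap,
# first with 10,000 ratings per instructor, then with 50.
z_large, p_large, d_large = two_sample_z(4.00, 4.03, sd=0.8, n=10_000)
z_small, p_small, d_small = two_sample_z(4.00, 4.03, sd=0.8, n=50)

print(f"n=10000: z={z_large:.2f}, p={p_large:.4f}, d={d_large:.4f}")
print(f"n=50:    z={z_small:.2f}, p={p_small:.4f}, d={d_small:.4f}")
```

With these assumed inputs the gap falls below the .05 threshold only in the huge sample, while the standardized difference is the same small value in both cases. That is the sense in which such a difference is an artifact of sample size and not a practically important result.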
SEF may indeed validly describe incompetence...given (a) the conflicting data on their validity, (b) the way many institutions have constructed SEF instruments, (c) the often unsystematic statistical method by which SEF are interpreted, and especially (d) given the considerable weight accorded negative comments by only a few students in making tenure and promotion decisions, it would seem SEF can all too easily be used as a covert instrument for the elimination of tenure candidates and other faculty who may threaten student tuition dollars and perhaps ideological and popular culture agendas.
Stake's reference is probably to "educational seduction", the skill of the infamous "Dr. Fox" who supposedly entertained students and received high ratings despite the fact that he delivered no content (Naftulin, Ware, & Donnelly, 1973). Many who do not care for ratings find one study that supports their position but ignore subsequent work (e.g., Perry et al., 1979; Marsh & Ware, 1982) which points out problems with the original study and proceeds to clarify the issue.
Frankly, I don't like to recommend articles like this to those not actively involved in ratings research or practice because such writings can mislead readers who aren't really familiar with the cited ratings literature. I'm sorry if that sounds elitist: it isn't intended to be but I do have a reason for noting it.
We (Franklin & Theall, 1989) found that ignorance of evaluation/measurement literature and methods correlated significantly with negative faculty opinions about students and student ratings. I note this because discussions about ratings are so often filled with misinformation.
Cohen, P. A. (1983). Comment on a selective review of the validity of student ratings of teaching. Journal of Higher Education, 54, 448-458.
Franklin, J. L. & Theall, M. (1990). Communicating student ratings to decision makers: design for good practice. In M. Theall & J. Franklin (Eds.) Student ratings of instruction: issues for improving practice. New Directions for Teaching and Learning #43. San Francisco: Jossey Bass.
Franklin, J. L. & Theall, M. (1989). Who reads ratings. Knowledge, attitudes, and practices of users of student ratings of instruction. Paper presented at the 70th annual meeting of the American Educational Research Association. San Francisco: March 31.
Greenwald, A. G. & Gillmore, G. M. (1997). No pain, no gain? The importance of measuring course workload in student ratings of instruction. Journal of Educational Psychology (in press).
Greenwald, A. G. & Gillmore, G. M. (1996). Applying social psychology to reveal a major (but correctable) flaw in student evaluations of teaching. University of Washington, Draft Manuscript, March 1.
Haskell, R. E. (1997b). Contributed Commentary on: Stake Response to Haskell: Academic Freedom, Tenure, and Student Evaluation of Faculty. Educational Policy Analysis Archives, 5 Available online: http://olam.ed.asu.edu/epaa/v5n8c1.html.
Haskell, R. E. (1997a). Academic freedom, tenure, and student evaluations of faculty: Galloping polls in the 21st century. Educational Policy Analysis Archives, 5. Available online: http://olam.ed.asu.edu/epaa/v5n6.html.
Naftulin, D. H., Ware, J. E., & Donnelly, F. A. (1973). The Dr. Fox lecture. A paradigm of educational seduction. Journal of Medical Education, 48, 630-635.
Scriven, M. (1988). The New Crisis in Teacher Evaluation: The Improper Use of 'Research-based' Indicators. Professional Personnel Evaluation News. (January) 1-8.
Stake, J. E. (1997). Response to Haskell: academic freedom, tenure, and student evaluations of faculty. Educational Policy Analysis Archives, 5. Available online: http://olam.ed.asu.edu/epaa/v5n8.html.
Theall, M. (1997). On Drawing Reasonable Conclusions about Student Ratings of Instruction: a Reply to Haskell and to Stake. Educational Policy Analysis Archives, 5. Available online: http://olam.ed.asu.edu/epaa/v5n8c2.html.
Thompson, G.E. (1988). Difficulties in interpreting course evaluations: Some Bayesian insights. Research in Higher Education, 28, 217-222.
1. Theall suggested that I misinterpreted research on the SEF. For example, he says:
Peter Cohen's most recognized contribution to the ratings literature is the meta-analysis of multisection validity studies in which performance on a common final exam was correlated with ratings (1981). There was a .43 correlation between ratings and performance on those exams. In other words, sections with highly rated instructors had higher average scores which could not be attributed to differential grading standards or sampling errors. This is evidence for the validity of ratings, the exact opposite of what Haskell says. Calling that relationship grade inflation is simply ridiculous!
I turn now to a second statistical matter, the interpretation of correlation coefficients. The numerical value of a correlation coefficient can be deceiving, even when it is 'statistically significant' (i.e., even when it is unlikely to have occurred by chance if no relationship exists). A correlation coefficient measures the degree to which change in the amount of the explanatory variable is accompanied by change in the amount of the effect variable, but the most beneficial feature of a coefficient is not its numerical value, which has no inherent, practical meaning. Rather, the square of the numerical value is the most advantageous aspect of a correlation coefficient, for the square indicates the proportion of variation in the effect variable that can be statistically attributed to variation in the explanatory variable. The research summary by Professor Cashin reported a correlation coefficient of .44 between student ratings of an instructor 'overall' and examination grades [8]. This coefficient means that 19.4 percent of the variation in student learning (as measured by course grades) is explained by variation in instructional quality (as measured by student ratings). If accurate, a correlation of this magnitude is 'practically useful,' as Professor Cashin said, though one must keep in mind that four-fifths of the variation in course grades remains unexplained and is attributable to other factors.
But does this correlation coefficient accurately estimate the relationship between student evaluations of teaching and student achievement? The best research on the magnitude of the relationship is the 'multisection validity study.'
When it is ideally designed, such a study possesses the following features: each course included in the study has numerous sections; students are randomly assigned to sections; the sections of a course have different instructors but a common textbook and the same examination(s); all examinations for a course are constructed by a person who does not teach any section; and subjective (essay) components of examinations are graded by the person who developed them. A review of multisection validity studies cites one work that, the author of the review asserts, eliminates at least in part 'many of the criticisms of the multisection validity study' and 'provide[s] strong support for the validity of students' evaluations of teaching effectiveness' [9, p. 721]. However, the cited work, which subjected the results of other multisection studies to a statistical analysis, did not control a number of critical variables that could have generated or enlarged the relationship between student ratings of teachers and student achievement [10]. Among the missing variables that might have explained the relationship was the rigor of the requirements of the instructor (such as checks for student preparation and amount of material assigned), a factor that may vary considerably across sections of a single course. If the variable was related both to student ratings of instructional quality and to student achievement, a control for the variable could have markedly weakened or entirely eliminated the relationship originally found between student ratings and student achievement (p. 339). Another variable that the work omitted was the students' level of interest in the subject matter of the course prior to exposure to the teacher they later evaluated. As will be suggested below, neither of these variables should have been excluded from the analysis and left uncontrolled.
While the work did not incorporate a number of potentially important variables into its data analysis, the work is the source of a set of correlation coefficients (including the coefficient of .44) that Cashin suggested are credible estimates of the relationship between student ratings of teachers and student achievement. A reader of the reproduced coefficients can easily be misled, however, because Cashin failed to make clear that the coefficients may have been seriously confounded by variables whose influence was not removed. The failure to clarify this point is surprising inasmuch as Cashin explicitly stated that a control is necessary for one of the variables omitted by the work, namely, the interest students initially exhibited in the subject [12, p. 5] (p. 340).
Cohen treated student judgments of 'the amount and difficulty of the work the teacher expects of students' as one component of instructional quality. Whether it is an element of teaching quality or a separate factor, the amount and difficulty of work should be controlled under the conditions mentioned because it may explain much or all of the relationship detected between, on the one hand, evaluations of an instructor overall or on specific dimensions and, on the other, performance on examinations. From the studies he reviewed, Cohen calculated a negligible mean correlation coefficient for the relationship between the amount/difficulty of work and student achievement, but he also found a substantial range for the coefficients reported by the studies. Specifically, the interval for 95% of the coefficients extended from -.42 to +.39. Id, at 293, 295. Individual studies may thus involve a nontrivial association between the perceived difficulty of teachers and the examination performance of students (p. 341).
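The squaring step described in the passage above is easy to check. As a minimal sketch, the snippet below squares the coefficients mentioned in this note (.44 from Cashin's summary, .43 from Cohen's meta-analysis, and the -.42 and +.39 endpoints of the reported interval); the function name `variance_explained` is an illustrative assumption.

```python
def variance_explained(r):
    """Proportion of variance statistically attributable to the predictor (r squared)."""
    return r * r


# Coefficients quoted in this note; the labels are descriptive only.
coefficients = {
    "Cashin, overall rating vs. exam grades": 0.44,
    "Cohen, ratings vs. common final exam": 0.43,
    "Cohen, difficulty interval, lower end": -0.42,
    "Cohen, difficulty interval, upper end": 0.39,
}

for label, r in coefficients.items():
    explained = variance_explained(r)
    print(f"{label}: r={r:+.2f}, explained={explained:.1%}, "
          f"unexplained={1.0 - explained:.1%}")
```

For r = .44 this reproduces the 19.4 percent figure in the quotation and leaves roughly four-fifths of the variation unexplained. Squaring also discards the sign, so the -.42 and +.39 endpoints correspond to similar shares of variance even though they point in opposite directions.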
2. I can understand how such quotes might be embarrassing (a) to an author who professes that SEF is a basically, if not completely, valid instrument, and (b) to a practitioner trying to convince faculty and administrators to use them in faculty evaluation programs. I have never heard of a consultant who suggests that SEF have been shown to be invalid. Given Theall’s critique, let me make it very clear that I am not here indicting consultants. I have done and occasionally do consulting myself. I would like to say that Theall (et al.), in word (I cannot speak for deed), does recognize most of what needs to be done to make SEF use in administrative decisions reasonably ethical.
3. Requests for prepublication copies of Greenwald’s articles should be sent to http://weber.u.washington.edu/~agg/
4. Theall goes on to say, "Stake says that ratings undermine the faith and trust students must place in teachers." Given his opinion of ratings, I assume that he must then feel that teachers have no need to place faith or trust in their students. This paternalism is antiquated and unrealistic. The logic by which it is unacceptable for Stake not to trust (valid?) perceptions of students, but acceptable for Theall not to trust the perceptions of scholars, escapes me, just as does his calling Stake’s view of students paternalistic while not seeing that his own view, that scholars cannot be trusted to read research correctly, is itself paternalistic. Though he does recognize that his view sounds elitist, he says it really is not.
5. I wonder what Theall would say about child psychologists who have no children of their own, or about healthy psychotherapists who treat the mentally disordered.