Education Policy Analysis Archives
Volume 9 Number 6
February 22, 2001
ISSN 1068-2341
Editor: Gene V Glass, College of Education, Arizona State University
Copyright 2001, the EDUCATION POLICY ANALYSIS ARCHIVES. Articles appearing in EPAA are abstracted in the Current Index to Journals in Education by the ERIC Clearinghouse on Assessment and Evaluation and are permanently archived in Resources in Education.
Teacher Test Accountability: From Alabama to Massachusetts
Abstract
Given the high stakes of teacher testing, there is no doubt that every teacher test should meet the industry guidelines set forth in the Standards for Educational and Psychological Testing. Unfortunately, however, there is no public or private business or governmental agency that serves to certify or in any other formal way declare that any teacher test does, in fact, meet the psychometric recommendations stipulated in the Standards. Consequently, there are no legislated penalties for faulty products (tests) nor are there opportunities for test takers simply to raise questions about a test and to have their questions taken seriously by an impartial panel. The purpose of this article is to highlight some of the psychometric results reported by National Evaluation Systems (NES) in their 1999 Massachusetts Educator Certification Test (MECT) Technical Report, and more specifically, to identify those technical characteristics of the MECT that are inconsistent with the Standards. A second purpose of this article is to call for the establishment of a standing test auditing organization with investigation and sanctioning power. The significance of the present analysis is twofold: a) psychometric results for the MECT are similar in nature to psychometric results presented as evidence of test development flaws in an Alabama class-action lawsuit dealing with teacher certification (an NES-designed testing system); and b) there was no impartial enforcement agency to whom complaints about the Alabama tests could be brought, other than the court, nor is there any such agency to whom complaints about the Massachusetts tests can be brought. I begin by reviewing NES's role in Allen v. Alabama State Board of Education, 81-697-N. Next I explain the purpose and interpretation of standard item analysis procedures and statistics. 
Finally, I present results taken directly from the 1999 MECT Technical Report and compare them to procedures, results, and consequences of procedures followed by NES in Alabama.
Teacher Test Accountability: From Alabama to Massachusetts
From its inception and continuing through present administrations, the Massachusetts Educator Certification Test (MECT) has attracted considerable public attention, both regionally and around the world (Cochran-Smith & Dudley-Marling, in press). This attention is due in part to two disturbing facts: 1) educators seeking certification in Massachusetts have generally performed poorly on the test, and 2) in many instances politicians have used these test results to assert, among other things, that candidates who failed are idiots (Pressley, 1998).
The purpose of the MECT is to ensure that each certified educator has the knowledge and some of the skills essential to teach in Massachusetts public schools (National Evaluation Systems, 1999, p. 22). The Massachusetts Board of Education has raised the stakes on the MECT by enacting plans to sanction institutions of higher education (IHEs) with less than an 80% pass rate for their teacher candidates (Massachusetts Department of Education, 2000). One consequence of this proposal is that most IHEs are considering requirements that the MECT be passed before students are admitted to their teacher education programs. In addition, Title II (Section 207) of the Higher Education Act of 1998 requires the compilation of state report cards for teacher education programs, which must include performance on certification examinations (U.S. Department of Education, 2000). What all of this means is that poor performance on the MECT could prevent federal funding for professional development programs, limit federal financial aid to students, allow some IHEs to be labeled publicly as low performing, and prove damaging at the state level when states are inevitably compared to one another upon release of the Title II report cards in October 2001.
Given the personal, institutional, and national ramifications of the test results, there is no question that the MECT should be expected to meet the industry benchmarks for good test development practice as set forth in the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999). At this time, however, there is no public or private business or governmental agency either within the Commonwealth of Massachusetts or nationally that can certify or in any other formal way declare that the MECT does (or does not), in fact, meet the psychometric recommendations stipulated in the Standards. The National Board on Educational Testing and Public Policy (NBETPP) serves as an independent organization that monitors testing in the US, but even it does not function as a regulatory agency (NBETPP, 2000). In addition to the absence of a national regulatory agency, many state departments of education do not have the professionally trained staff to answer technical psychometric questions directly. Nor do they usually have the expertise on staff to confront a testing company with which they have contracted and demand a sufficient response to a technical question raised by outside psychometricians. Furthermore, even when a database with the candidates' item-level responses is available for internal analysis, a state department of education does not typically conduct rigorous disconfirming analyses, e.g., analyses for evidence of adverse impact. Thus, most state departments are largely dependent on whatever information testing companies decide to release. The public is then left with an inadequate accountability process.
One purpose of this article is to highlight some of the psychometric results reported by National Evaluation Systems in their 1999 MECT Technical Report (NES, 1999). Specifically, this article identifies technical characteristics of the MECT that are inconsistent with the Standards.
A second purpose of this article is to voice one more call for the establishment of a standing test auditing organization with powers to investigate and sanction (National Commission on Testing and Public Policy, 1990; Haney, Madaus & Lyons, 1993).
The significance of the present analysis is twofold. First, psychometric results reported by NES for the MECT are similar in nature to psychometric results entered as evidence of test development flaws in an Alabama class-action lawsuit dealing with teacher certification (Allen v. Alabama State Board of Education, 81-697-N). That suit was brought by several African-American teachers who charged, among other things, that the State of Alabama's teacher certification tests "impermissibly discriminate[d] against black persons seeking teacher certification; the tests [were] culturally biased; and the tests [had] no relationship to job performance" (Allen, 1985, p. 1048). Second, there was no impartial enforcement agency to whom complaints about the Alabama tests could be brought, other than the court, nor is there any such agency to whom complaints about the Massachusetts tests can be brought. These two points are linked in an interesting and troubling way: NES, the Massachusetts Educator Certification Tests contractor, was also the contractor for the Alabama Initial Teacher Certification Testing Program (AITCTP).
Some of the criticism of debates about teacher testing, teacher standards, teacher quality, and accountability suggests that arguments are, in part, ideologically rather than empirically based (Cochran-Smith, in press). This may or may not be the case. This article, however, takes the stance that regardless of one's political ideology or philosophy about testing, the MECT is technically flawed.
Furthermore, because of the lack of an enforceable accountability process, the public is powerless in its efforts to question the quality or challenge the use of this state-administered set of teacher certification examinations. In this article I argue that the consequences of high-stakes teacher certification examinations are too great to leave questions about technical quality solely in the hands of state agency personnel, who are often ill-prepared and under-resourced, or in the hands of test contractors, who may face obvious conflicts of interest in any aggressive analyses of their own tests.
In the sections that follow, I begin by reviewing NES's role in Allen v Alabama. Then I explain the purpose and interpretation of standard item analysis procedures and statistics. Finally, I compare results taken directly from the 1999 MECT Technical Report with statistical results entered as evidence of test development flaws in Allen v Alabama.
NES and the AITCTP
Allen, et al. v. Alabama State Board of Education, et al.
In January 1980, National Evaluation Systems was awarded a contract on a non-competitive basis for the development of the Alabama Initial Teacher Certification Testing Program (AITCTP). Item writing for these tests began in the spring of 1981, and the first administration of the tests took place on June 6, 1981. Allen v Alabama was brought just six months later, on December 15, 1981. The Allen complaint challenged the Alabama State Board of Education's requirement that applicants for state teacher certification pass certain standardized tests administered under the AITCTP. On October 14, 1983, class certification (Note 1) was granted, and the first trial was set for April 22, 1985. Subsequent to a pre-trial hearing on December 19, 1984, and after substantial discovery was done (Note 2), an out-of-court settlement was reached on April 4, 1985. A Consent Decree was presented to the U.S. District Court on April 8, 1985 (Note 3). The Attorney General for the State of Alabama immediately publicly attacked the settlement (Allen, 1985, p. 1050), claiming that it was illegal. Nonetheless, the consent decree was accepted by the court on October 25, 1985 (Allen, Oct. 25, 1985). A succession of challenges and appeals on the legality and enforceable status of the settlement resulted (Note 4). For example, on February 5, 1986, the district court vacated its October 25th order approving the consent decree (Allen, February 5, 1986, p. 76). While the plaintiffs' appeal of the February 5th decision was pending at the 11th Circuit Court of Appeals, trial began in district court on May 5, 1986.
The AITCTP consisted of an English language proficiency examination, a basic professional studies examination, and 45 content-area examinations.
The purpose of the examinations was to measure specific competencies which are considered necessary to successfully teach in the Alabama schools (Allen, Defendants' Pre-Trial Memorandum, 1986, p. 21). A pool of 120 items for each exam was generated, 100 of which were scorable and mostly remained unchanged across the first eight administrations. Extensive revisions were incorporated into most of the tests at the ninth administration. By the start of the May 1986 trial the tests had been administered 15 times in all.
A team of technical experts (Note 5) for the plaintiffs was hired in November 1983 (prior to the ninth administration of the exams) to examine test development, administration, and implementation procedures. The team was initially unsure about the form of the sophisticated statistical analyses they assumed would have to be conducted to test for the presence of bias and discrimination, the bases of the case. That is, the methodology for investigating what was then called bias and is now called differential item functioning was far from well established at that time (Baldus & Cole, 1980). Nevertheless, when the plaintiffs' team received the student-level item response data from the defendants, their first step was to perform an item analysis. Such an analysis produces various item statistics and test reliability estimates. These initial analyses produced negative point-biserial correlations. Although point-biserial correlations are explained in detail below, suffice it to say at this point that it was a surprise to find negative point-biserial correlations between the responses that examinees provided on individual items and their total test scores. Such correlations are not an intended outcome from a well-designed testing program. These statistical results prompted a detailed inspection of the content, format, and answers for all the individual items on the AITCTP tests.
Content analyses yielded discrepancies between the keyed correct responses in the NES test documents and the keyed correct responses in the NES-supplied machine-scorable answer keys (i.e., miskeyed items were on the answer keys). This finding led to an inspection of the original NES in-house analyses, which revealed that negative point-biserials for scorable items existed in their own records from the beginning of the testing program and continued through the eighth administration without correction. What this meant for the plaintiffs was that NES had item analysis results in their own possession which indicated that there were miskeyed items. Nonetheless, they implemented no significant changes in the exams until they were faced with a lawsuit and the plaintiffs' hiring of testing experts to do their own analyses.
The defendants argued that it was normal for some problems to go undetected or uncorrected in a large-scale testing program because the overall effect is trivial for the final outcome. The problem with that argument was that many candidates were denied credit for test items on which they should have received credit, and some of those candidates failed the exam by only one point. In fact, as the plaintiffs argued, as many as 355 candidates over eight administrations of the basic professional skills exam alone should have passed but were denied that opportunity simply because of faulty items that remained on the tests (Millman, 1986, p. 285). It should be noted here that these were items that even one of the state's expert witnesses for the defense admitted were faulty (Millman, 1986, p. 280).
Establishing that there were flawed items with negative point-biserial correlations was critical to the plaintiffs' case. The plaintiffs presented as evidence page after page of so-called failure tables (Note 6) with the names of candidates for each test whose answers were mis-scored on these faulty items.
Based upon these failure tables, any argument from defendants that the miskeyed items did not change the career expectations for some candidates would most likely have failed. In the face of this evidence, the defendants argued at trial that "...the real disagreement is between two different testing philosophies. One of these philosophies would require virtual perfection under its proponents' rigid definition of that word. The other looks at testing as a constantly-developing art in which professional judgment ultimately determines what is appropriate in a particular case."
Plaintiffs counter-argued: "This case is not a philosophical case at all. This case is a case on professional competence ... this was an incompetent job, unprofessional, and as I said before, sloppy and shoddy, and in the case of the miskeyed items, unethical" (Madaus, 1986, p. 185).
Judge Thompson, in the subsequent Richardson decision which also involved the AITCTP, specifically agreed with plaintiffs on this point (Richardson, 1989, pp. 821, 823, 825). Excellent reviews of the diametrically opposed plaintiff and defendant positions may be found in Walden & Deaton (1988) and Madaus (1990).
At the same time that this case was proceeding, the plaintiffs' appeal to reverse the vacating of the original settlement was granted prior to a decision in this trial (Allen, Feb. 5, 1986, p. 75). The U.S. Court of Appeals decided the district court should have enforced the consent decree (Allen, April 22, 1987), which the district court so ordered on May 14, 1987 (Allen, May 14, 1987). Although the decision to uphold the original settlement was a positive ruling for the plaintiffs, it also was somewhat counter-productive for them because it was unexpectedly beneficial to NES at this stage in the proceedings. That is because the evidence presented above in Allen v Alabama was critical of the state and NES (NES was explicitly referred to in the court documents).
Thus, NES's best hope for avoiding a written opinion critical of their test development procedures was if plaintiffs' appeal were to be upheld and the original settlement enforced, as it was. Then there would be no evidentiary record, no court ruling, and no legal opinion that would reflect badly upon the NES procedures. Richardson v Lamar County Board of Education (87-T-568-N) commenced, however, and the actions of NES and the Alabama State Board of Education were openly discussed and critiqued in the court's opinion of November 30, 1989 (though NES was not mentioned by name in the Richardson, 1989 decision).
Richardson v Lamar County Board of Education, et al.
Like Allen v Alabama, Richardson v Lamar County also addressed issues of the racially disparate impact of the AITCTP (Richardson, 1989, p. 808). The Honorable Myron H. Thompson again presided, and testimony from Allen v Alabama was admitted as evidence (Richardson, 1989). Although the defendants denied in the Allen v Alabama consent decree that the AITCTP tests were psychometrically invalid, and even though no decision was reached in the abbreviated Allen v Alabama trial, the State Board of Education did not attempt to defend the validity of the tests in Richardson v Lamar and, in fact, it conceded at trial that plaintiff need not relitigate the issue of test validity (Richardson v Alabama State Board of Education, 1991, pp. 1240, 1246).
Judge Thompson's position on the test development process of NES was clearly stated: "In order to fully appreciate the invalidity of the two challenged examinations, one must understand just how bankrupt the overall methodology used by the State Board and the test developer was" (Richardson, 1989, p. 825, n. 37). While sensitive to the fact that close scrutiny of any testing program of this magnitude will inevitably reveal numerous errors, the court concluded that these errors were not of equal footing and the error rate per examination was simply too high (Richardson, 1989, pp. 822-24). Thus, none of the examinations that comprised the certification test possessed content validity because of five major errors by the test developer, and the test developer had made six major errors in establishing cut scores (Richardson, 1989, pp. 821-25).
Case Outcomes in Alabama
The Allen v Alabama consent decree required Alabama to pay $500,000 in liquidated damages and issue permanent teaching certificates to a large portion of the plaintiff class (Allen, Consent Decree, Oct. 25, 1985, pp. 9-11). The decree also provided for a new teacher certification process.
However, no new test was developed or implemented and the Alabama State Board of Education suspended the teacher certification testing program on July 12, 1988. In 1995 the Alabama State Legislature enacted a law requiring that teacher candidates pass an examination as a condition for graduation. Subsequently, another trial was held February 23, 1996 to decide the state's motions to modify or vacate the 1985 consent decree (Allen, 1997, p. 1414). Those motions were denied on September 8, 1997 (Allen, Sept. 8, 1997). Given the rigorous test development and monitoring conditions of the Amended Consent Decree, it was estimated by the court that the State of Alabama would not gain complete control of its teacher testing program until the year 2015 (Allen, Jan. 5, 2000, p. 23). Only recently has a testing company stepped forward with a proposal for a new Alabama teacher certification test (Rawls, 2000).
Plaintiff Richardson was awarded re-employment, backpay, and various other employment benefits (Richardson, 1989, pp. 825-26). Defendants (the State of Alabama and its agencies) in both cases were ordered to pay court costs and attorney fees (Richardson, 1989, pp. 825-26). However, even though NES was responsible for the development of the tests, NES was not named as one of the defendants in these cases and was not held liable for any damages (Note 7).
Psychometric and Statistical Background
At this point it is appropriate to discuss some of the psychometric concepts and statistics that are fundamental to any question about test quality. The purpose of this discussion is to illustrate that excruciatingly complex analyses are not necessarily required in order to reveal flaws in a test or individual test items. The first steps in test development simply involve common sense practice combined with sound statistical interpretations. If those first steps are flawed, then no complex psychometric analysis will provide a remedy for the mistakes.
One of the simplest statistics reported in the reliability analysis of a test like the MECT is the item-test point-biserial correlation. This statistic goes by other names such as the item-total correlation and the item discrimination index. It is called the point-biserial correlation specifically because it represents the relationship between a truly dichotomous variable (i.e., an item scored as either right or wrong) and a continuous variable (i.e., the total test score for a person). A total test score, here, is the simple sum of the number of correctly answered items on a test.
The biserial correlation has a long history of statistical use (Pearson, 1909). One of its earliest measurement uses was as an item-level index of validity (Thorndike, et al., 1929, p. 129). The point-biserial correlation appeared specifically for individual dichotomous items in an item analysis because of concerns over the assumptions implicit in the more general biserial correlation (Richardson & Stalnaker, 1933). It was again used as a validity index. It subsequently came to acquire diagnostic value and was re-labeled as a discrimination index (Guilford, 1936, p. 426). The purpose of this statistic is to determine the extent to which an individual item contributes useful information to a total test score.
Useful information may be defined as the extent to which variation in the total test scores has spread examinees across a continuum of low scoring persons to high scoring persons. In the present situation, this refers to the extent to which well qualified candidates can be distinguished from less capable candidates. Generally, the greater the variation in the test scores, the greater the magnitude of a reliability estimate.
Reliability may be defined many ways through the body of definitions and assumptions known as Classical Test Theory or CTT (Lord & Novick, 1968). According to CTT, an examinee's observed score (X) is assumed to consist of two independent components, a true score component (T) and an error component (E), so that X = T + E. One relevant definition of reliability may be expressed as the ratio of true-score variance to observed-score variance. Thus, the closer the ratio is to 1.0, the greater the proportion of observed-score variance that is attributed to true-score variance.
The KR-20 reliability estimate is often reported for achievement tests (Kuder & Richardson, 1937, Eq. 20, p. 158). Although reliability as defined above is necessarily positive, the KR-20 can be negative under certain extraordinary conditions (Dressel, 1940) but typically ranges from 0 to +1. In either case, the higher the value, the more internally consistent the items on a test. The magnitude of the KR-20, however, is affected by the direction and magnitude of the point-biserial correlations. Specifically, total test score reliability is decreased by the inclusion of items with near-zero point-biserial correlations and is worsened further by the inclusion of items with negative point-biserial correlations. This is because each additional faulty item increases the error variance in the scores at a faster rate than the increase in true-score variance.
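The effect of a faulty item on the KR-20 can be illustrated with a small sketch. The estimate is commonly written as KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / var(X)), where k is the number of items, p_i is the proportion of examinees answering item i correctly, q_i = 1 - p_i, and var(X) is the variance of the total scores. The function name `kr20` and the response data below are hypothetical and chosen only for illustration; they are not taken from the MECT or AITCTP records.

```python
def kr20(responses):
    """KR-20 for a list of examinee rows, each row a list of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    # Population variance of the total scores (observed-score variance).
    total_var = sum((t - mean_total) ** 2 for t in totals) / n_people
    # Sum of item variances p*q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_var)


# Hypothetical data: five examinees, four well-behaved items.
base = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same examinees with a fifth item that only the two lowest scorers
# "answer correctly", as a miskeyed item might behave.
bad_item = [0, 0, 0, 1, 1]
with_bad = [row + [b] for row, b in zip(base, bad_item)]

print(round(kr20(base), 3))
print(round(kr20(with_bad), 3))
```

In this toy data set the estimate falls from about 0.80 to about zero when the single negatively discriminating item is added, which is the direction of effect described above. Real tests with 100 scorable items would show a far smaller change per item.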
Technically, the point-biserial correlation represents the magnitude and direction of the relationship between the set of incorrect (scored as 0) and correct (scored as 1) responses to an individual item and the set of total test scores for a given group of examinees. In other words, it is a variation of the common Pearson product-moment correlation (Lord & Novick, 1968, p. 341). It can range in magnitude from zero to ±1. An estimate near zero indicates a poorly discriminating item that contributes no useful information. An estimate of +1 would indicate a perfectly discriminating item in the sense that no other items are necessary on the test for differentiating between high scoring and low scoring persons. A value of 1.0 is never attained in practice nor is it sought (Loevinger, 1954). Negative estimates are addressed below.
Ideally the test item point-biserial correlation should be moderately positive. Although various authors differ on what precisely constitutes moderately positive, a long-standing general rule of thumb among experts is that a correlation of .20 is the minimum to be considered satisfactory (Nunnally, 1967, p. 242; Donlon, 1984, p. 48) (Note 8). There is, however, no disagreement among psychometricians on the direction of the relationship: it has to be positive.
The direction of the correlation is critical. A positive correlation means that examinees who got an item right also tended to score above the mean total test score and those who got the item wrong tended to score below the mean total test score. This is intuitively reasonable and is an intended psychometric outcome. Such an item is accepted as a good discriminator because it differentiates between high and low scoring examinees. This is one of the fundamental objectives of classical test theory, the theory underlying the development and use of the MECT.
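As a minimal sketch of the statement that the point-biserial is a variation of the Pearson correlation, the code below computes it two ways: as an ordinary Pearson correlation between 0/1 item scores and total scores, and with the textbook form (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the mean total scores of examinees who answered correctly and incorrectly, s is the population standard deviation of the total scores, and p is the proportion answering correctly. The function names and the data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(item, totals):
    """Textbook form: (M1 - M0) / s * sqrt(p * q), with s the population SD."""
    n = len(item)
    right = [t for i, t in zip(item, totals) if i == 1]
    wrong = [t for i, t in zip(item, totals) if i == 0]
    p = len(right) / n
    mean_t = sum(totals) / n
    s = sqrt(sum((t - mean_t) ** 2 for t in totals) / n)
    m1, m0 = sum(right) / len(right), sum(wrong) / len(wrong)
    return (m1 - m0) / s * sqrt(p * (1 - p))


# Hypothetical 0/1 responses of five examinees to two items, plus total scores.
totals = [4, 3, 2, 2, 1]
good_item = [1, 1, 1, 0, 0]   # answered correctly by the higher scorers
bad_item = [0, 0, 0, 1, 1]    # answered "correctly" only by the lowest scorers

r_good = pearson(good_item, totals)
r_bad = pearson(bad_item, totals)
print(round(r_good, 2), round(r_bad, 2))
```

The two computations agree for both items, and the sign of the result depends only on whether the examinees credited with a correct answer score above or below the others on the total test.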
A negative point-biserial correlation, however, occurs when examinees who got an item correct tended to score below the mean total test score while those who got the item wrong tended to score above the mean total test score. This situation is contrary to all standard test practice and is not an intended psychometric outcome (Angoff, 1971, p. 27). A negative point-biserial correlation for an item can occur because of a variety of problems (Crocker & Algina, 1986). These include:
One additional point must be made. The point-biserial correlation can be computed two ways. The first way is to correlate the set of 0/1 (incorrect/correct) responses with the total scores as described above. In this way of computing the statistic, the item for which the correlation is being computed contributes variance to the total score; hence, the correlation is necessarily magnified. That is, the statistical estimate of the extent to which an item is internally consistent with the other items tends to be inflated (Guilford, 1954, p. 439). The second way is to compute the correlation between the 0/1 responses on an item and the total scores for everyone, but with the responses to that particular item removed from the total score (Henrysson, 1963). This is called the corrected point-biserial correlation. It is a more accurate estimate of the extent to which an individual item is correlated with all the other items. It is easily calculated and reported by most statistical software packages used to perform reliability analyses (e.g., SPSS's Reliability procedure).
Various concerns have been raised over the interpretation of the point-biserial correlation because the magnitude of the coefficient is affected by the difficulty of the item. The fact is, however, that all the various discrimination indices are highly positively correlated (Nunnally, 1967; Crocker & Algina, 1986). Furthermore, even though the magnitude of the point-biserial correlation tends to be less than the biserial correlation, all writers agree on the interpretation of negative discriminations. "No test item, regardless of its intended purpose, is useful if it yields a negative discrimination index" (Ebel & Frisbie, 1991, p. 237). Such an item lowers test reliability and, no doubt, validity as well (Hopkins, 1998, p. 261). Furthermore, on subsequent versions of the test, these items [with negative point-biserial correlations] should be revised or eliminated (Hopkins, 1998, p. 259).
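The difference between the two ways of computing the statistic can be sketched as follows. For each item, the uncorrected value correlates the item with the full total score, and the corrected value correlates it with the total score after that item's own 0/1 response has been subtracted. The function names and the small response matrix are hypothetical; this is an illustration of the idea, not a reproduction of any particular software package's procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses):
    """Return a list of (uncorrected, corrected) point-biserials, one per item."""
    totals = [sum(row) for row in responses]
    results = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        # Total score with item j removed, so the item cannot inflate it.
        rest = [t - i for t, i in zip(totals, item)]
        results.append((pearson(item, totals), pearson(item, rest)))
    return results


# Hypothetical data: five examinees (rows) by four items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

results = item_total_correlations(responses)
for j, (uncorrected, corrected) in enumerate(results, start=1):
    print(f"item {j}: uncorrected={uncorrected:.3f} corrected={corrected:.3f}")
```

In this toy example every corrected value is smaller than its uncorrected counterpart, which is the inflation described above. On a very short test the gap is large; on a test with 100 scorable items it would usually be much smaller.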
NES AND THE MECT
The 1999 MECT Technical Report
In July 1999 NES released their five-volume Technical Report on the Massachusetts Educator Certification Tests. Volume I describes the test design, item development description, and psychometric results. Volume II describes the subject matter knowledge and test objectives. Volume III consists of correlation matrices by test field. Volume IV consists of various content validation materials and reports. Volume V consists of pilot material, bias review material, and qualifying score material. The report was immediately hailed by Massachusetts Commissioner of Education David P. Driscoll: "I have said all along that I stand by the reliability and validity of the tests, and this report supports it" (Massachusetts Department of Education, 1999).
Field Trial
Technical Report Volume I contains the psychometric results for the first four administrations of the MECT (April, July, and October 1998, and January 1999). It does not, however, contain any results from a full-scale field trial, nor are any pilot test results reported (Note 9). There is no information on how many different items were tested, where the items came from, how many items were revised or rejected, what the revisions were to any revised items, or what the psychometric item-level results were. In fact, there is no field trial evidence in support of the initial inclusion of any of the individual items on the operational exams because there was no field trial.
Interestingly, the Department of Education released a brochure in January 1998 stating that the first two test administrations would not count for certification, implying that the tests would serve as a field trial. Chairman of the Board of Education John Silber, however, declared in March 1998 that the public had been misinformed and that the first two tests would indeed count for certification.
This policy reversal was unfortunate because of the confusion and anxiety it created among the first group of examinees and because it prevented the gathering of statistical results that could have improved the quality of the test.
NES had considered a field trial of their teacher test in Alabama but did not conduct one and presumably came to regret that decision. In Allen v Alabama they argued, "As the evidence will show, there was no need to conduct a separate large-scale field tryout in this case, since the first test administration served that purpose" (Allen, Defendants' Pre-Trial Memorandum, 1986, p. 113). That decision was unwise because it directly affected the implementation and validity of their procedures. For example, "The court has no doubt that, after the results from the first administration of those 35 examinations were tallied, the test developer knew that its cut-score procedures had failed" (Richardson, 1989, p. 823). In fact, the original settlement in Allen v Alabama stipulated that in any new operational examination, the items shall be field tested using a large scale field test (Allen, Consent Decree, Oct. 25, 1985, p. 3).
The first two administrations of the MECT would have served an important purpose as a full-scale field trial for the new tests, thus avoiding the mistake made in Alabama. However, that opportunity to detect and correct problems in administration, scoring, and interpretation was lost. The impact of the lack of a field trial is further magnified when it is noted that the time period between when NES was awarded the Massachusetts contract (October 1997) and when the first tests were administered (April 1998) was even shorter than the time period NES had to develop the tests in Alabama, a time frame that the court referred to as "quite short" (Richardson, 1989, p. 817).
Furthermore, even though NES may have drawn many of the MECT items from existing test item banks, items written and used elsewhere still must be field tested on each new population of teacher candidates.

Point-biserial correlations

In the NES Technical Report Volume I, Chapter 8, p. 140, there is a description of when an item is flagged for further scrutiny. One of the conditions is when an item displays an item-to-test point-biserial correlation less than 0.10 (if the percent of examinees who selected the correct response is less than 50). After such an item is found, "The accuracy of each flagged item is reverified before examinees are scored." The Technical Report, however, does not report the percent of persons who selected the correct response on each item. Nor is there an explanation of what the reverification process consisted of, how many items were flagged, or what was subsequently modified on flagged items. Thus, there is no way to determine the extent to which NES actually followed its own stated guidelines and procedures in the development of the MECT.

The distinction between what NES states as its review procedures and what it actually performed is relevant because in Alabama, under the topic of content validity, the defense argued that items rated as content invalid were revised by NES and that these revisions were approved by Alabama panelists before they appeared on a test. The court, however, found that no such process occurred (Richardson, 1989, p. 822).

The following table summarizes the point-biserial estimates reported for the MECT. Note that these are not the results prior to NES conducting the item review process. These are the results for the scorable items after the NES review.
Table 1

By administration date
| Date | Number tested | M/C items | <.00 | .00-.05 | .06-.10 | .11-.15 | .16-.20 | % of items in ranges shown |
| Apr-98 | 4891 | 315 | 1 | 7 | 15 | 24 | 46 | 29.5% |
| Jul-98 | 5716 | 443 | 0 | 2 | 14 | 17 | 39 | 16.3% |
| Oct-98 | 5286 | 379 | 2 | 5 | 10 | 15 | 32 | 16.9% |
| Jan-99 | 9471 | 507 | 1 | 4 | 14 | 35 | 49 | 20.3% |
| Total | 25,364 | 1,644 | 4 | 18 | 53 | 91 | 166 | 332/1644 = 20.2% |

By test
| Test | Number tested | M/C items | <.00 | .00-.05 | .06-.10 | .11-.15 | .16-.20 | % of items in ranges shown |
| Writing | 9750 | 92 | 0 | 0 | 0 | 1 | 1 | 2.2% |
| Reading | 9455 | 144 | 0 | 0 | 1 | 1 | 6 | 5.6% |
| Early Childhood | 936 | 256 | 0 | 3 | 18 | 30 | 46 | 37.9% |
| Elementary | 3125 | 256 | 0 | 2 | 0 | 3 | 27 | 12.5% |
| Social Studies | 259 | 128 | 1 | 0 | 1 | 6 | 14 | 17.2% |
| History | 108 | 64 | 0 | 0 | 2 | 6 | 5 | 20.3% |
| English | 695 | 256 | 0 | 3 | 11 | 12 | 29 | 21.5% |
| Mathematics | 345 | 192 | 1 | 0 | 4 | 4 | 7 | 8.3% |
| Special Needs | 691 | 256 | 2 | 10 | 16 | 28 | 31 | 34.0% |
| Total | | 1,644 | 4 | 18 | 53 | 91 | 166 | |
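As background for the statistic tallied in Table 1, the following is a minimal Python sketch of an item-to-test point-biserial correlation and of the flagging condition quoted from the Technical Report (correlation below 0.10 when fewer than 50 percent of examinees answer correctly). The function names and data layout are illustrative assumptions. The article does not say how NES computed the statistic, for example whether the item itself was excluded from the total score, so this should not be read as their procedure.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and the total test score.

    Assumes item_scores holds only 0 (wrong) or 1 (right), one per examinee.
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("need one item score and one total score per examinee")
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    sd = pstdev(total_scores)  # population SD of the total scores
    if not right or not wrong or sd == 0:
        raise ValueError("undefined without variation in item and total scores")
    p = len(right) / len(item_scores)  # proportion answering correctly
    return (mean(right) - mean(wrong)) / sd * sqrt(p * (1 - p))


def is_flagged(item_scores, total_scores, r_cut=0.10, pct_cut=50.0):
    """Flagging condition as quoted in the text; thresholds are parameters."""
    pct_correct = 100.0 * sum(item_scores) / len(item_scores)
    r = point_biserial(item_scores, total_scores)
    return r < r_cut and pct_correct < pct_cut


# Hypothetical data: five examinees, one item, made-up total scores.
item = [1, 0, 0, 0, 0]
totals = [2, 1, 2, 3, 2]
print(point_biserial(item, totals), is_flagged(item, totals))
```

A low or negative value means that examinees who answered the item correctly did not score higher on the test as a whole than those who answered it incorrectly, which is why such items are treated as suspect.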
A number of observations may be made from the information in
this table. First, of the 1,644 items administered over the
first four dates, 332 items (20.19%) had point-biserial
correlations lower than the industry minimum standard
criterion of .20. That is a very large percentage of poorly
performing items for a high-stakes examination. Second,
while there are relatively few suspect
items on the Reading and Writing tests, there are large
numbers of items with poor statistics on many of the subject
matter tests. The Early Childhood, English, and Special
Needs tests, in particular, consisted of extraordinarily
large percentages of poorly performing items (37.9%, 21.5%,
and 34%, respectively). Overall, of the 332 items with low
point-biserials, 322 (97%) occurred on the subject matter
tests. On the face of it, the results for the subject matter
tests are terrible. There is, unfortunately, no
authoritative source in the literature (including the
Standards) that tells us unequivocally whether or not
this overall 20.19% of poorly performing items on a
licensure examination with high-stakes consequences is
acceptable, not acceptable, or even terrible. Given the
steps that NES claims were followed in selecting items from
existing item banks and in writing new items, there simply
should not be this many technically poor items on these
tests.
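The percentages cited above can be re-derived from the counts in Table 1. The short Python sketch below does that arithmetic, using counts transcribed from the by-test rows of the table; the variable names are illustrative and not taken from the Technical Report.

```python
# Per test: (number of M/C items, item counts in the five point-biserial
# ranges <.00, .00-.05, .06-.10, .11-.15, .16-.20), transcribed from Table 1.
BY_TEST = {
    "Writing": (92, [0, 0, 0, 1, 1]),
    "Reading": (144, [0, 0, 1, 1, 6]),
    "Early Childhood": (256, [0, 3, 18, 30, 46]),
    "Elementary": (256, [0, 2, 0, 3, 27]),
    "Social Studies": (128, [1, 0, 1, 6, 14]),
    "History": (64, [0, 0, 2, 6, 5]),
    "English": (256, [0, 3, 11, 12, 29]),
    "Mathematics": (192, [1, 0, 4, 4, 7]),
    "Special Needs": (256, [2, 10, 16, 28, 31]),
}

total_items = sum(n for n, _ in BY_TEST.values())
total_low = sum(sum(counts) for _, counts in BY_TEST.values())

# Share of low-correlation items that fall on the subject matter tests,
# that is, every test other than Reading and Writing.
subject_low = sum(
    sum(counts)
    for name, (_, counts) in BY_TEST.items()
    if name not in ("Reading", "Writing")
)

pct_by_test = {
    name: round(100 * sum(counts) / n, 1) for name, (n, counts) in BY_TEST.items()
}

print(total_low, total_items, round(100 * total_low / total_items, 2))
print(subject_low, round(100 * subject_low / total_low))
print(pct_by_test)
```

Run against these counts, the totals agree with the figures in the text: 332 of 1,644 items overall, and 322 of the 332 on the subject matter tests.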
"Before any item was allowed to contribute to a candidate's score, and before the final 100 scorable items were selected, the item statistics for all the items of the test were reviewed and any items identified as questionable were checked for content and a decision was made about each such item" (Allen, Defendants' Pre-Trial Memorandum, 1986, pp. 113-14).

In fact, in Alabama there were negative point-biserial correlations in the original reliability reports generated by NES (their own documents reported negative point-biserial correlations as large as -0.70), and those negative point-biserial correlations for the same scorable items remained after multiple administrations of the examinations. Simply taking out the worst 20 items in each test did not remove all the faulty items, since each exam had to have 100 scorable items. As seen above in Table 1, the MECT has statistically flawed items on many tests; these items have been there since the first administration, and they may be the same items still being used in current administrations.
References

Allen v. Alabama State Board of Education, 612 F. Supp. 1046 (M.D. Ala. 1985).
Allen v. Alabama State Board of Education, 636 F. Supp. 64 (M.D. Ala. Feb. 5, 1986).
Allen v. Alabama State Board of Education, 816 F. 2d 575 (11th Cir. April 22, 1987).
Allen v. Alabama State Board of Education, 976 F. Supp. 1410 (M.D. Ala. Sept. 8, 1997).
Allen v. Alabama State Board of Education, 190 F.R.D. 602 (M.D. Ala. Jan. 5, 2000).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, D.C.: American Educational Research Association.
Associated Press Archives (October 4, 1998). State Administers Teacher Certification Test Amid Ongoing Complaints.
Baldus, D.C. & Cole, J.W.L. (1980). Statistical Proof of Discrimination. NY: McGraw-Hill.
Cochran-Smith, M. (in press). The outcomes question in teacher education. Teaching and Teacher Education.
Cochran-Smith, M. & Dudley-Marling, C. (in press). The flunk heard round the world. Teaching Education.
Consent Decree, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. Oct. 25, 1985).
Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. NY: Holt, Rinehart and Winston.
Daley, B. (1999). Teacher exam authors put to the test. Boston Globe, 10/7/98, B3.
Daley, B.; Vigue, D.I. & Zernike, K. (1999) Survey says Massachusetts Teacher Test is best in US. Boston Globe, 6/22/99, B02.
Defendant's Pre-trial Memorandum, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. May 1, 1986).
Donlon, T. (ed.) (1984). The College Board Technical Handbook for the Scholastic Aptitude Test and Achievement Tests. NY: College Entrance Examination Board.
Downing, S. & Haladyna, A. (1996). A model for evaluating high stakes testing programs: Why the fox should not guard the chicken coop. Educational Measurement: Issues and Practice, 15:1, pp.5-12.
Dressel, P.L. (1940). Some remarks on the Kuder-Richardson reliability coefficient. Psychometrika, 5, 305-310.
Ebel, R.L. & Frisbie, D.A. (1991) (5th ed.). Essentials of Educational Measurement. NJ: Prentice Hall.
Guilford, J.P. (1936) (1st ed.). Psychometric Methods. NY: McGraw-Hill.
Guilford, J.P. (1954) (2nd ed.). Psychometric Methods. NY: McGraw-Hill.
Haney, W., & Madaus, G. F. (1990). Evolution of Ethical and Technical standards. In R.K. Hamilton, & J. N. Zaal (Eds.), Advances in Educational and Psychological Testing (pp.395-425).
Haney, W.M., Madaus, G.F. & Lyons, R. (1993). The Fractured Marketplace for Standardized Testing. Boston: Kluwer.
Haney, W. (1996). Standards, Schmandards: The need for bringing test standards to bear on assessment practice. Paper presented at the annual meeting of the American Educational Research Association. NY: NY.
Haney, W., Fowler, C., Wheelock, A, Bebell, D. & Malec, N. (1999). Less truth than error?: An independent study of the Massachusetts Teacher Tests. Education Policy Analysis Archives, 7(4). Available online at http://epaa.asu.edu/epaa/v7n4/.
Henrysson, S. (1963). Correction for item-total correlations in item analysis. Psychometrika, 28, 211-218.
Hopkins, K.D. (1998) (8th ed.). Educational and Psychological Measurement and Evaluation. Boston: Allyn and Bacon.
Kuder, G.F. & Richardson, M.W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.
Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51, 493-504.
Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Madaus, G. (May 19-20, 1986). Testimony in Allen v Alabama (81-697-N).
Madaus, G. (1990). Legal and professional issues in teacher certification testing: A psychometric snark hunt. In J.V. Mitchell, S. Wise, & B. Plake (Eds.), Assessment of teaching: Purposes, practices, and implications for the profession (pp. 209-260). Hillsdale, NJ: Lawrence Erlbaum Associates.
Massachusetts. (1999). FY 2000-2001 Budget.
Massachusetts Department of Education (February 24, 1997). Massachusetts Teacher Certification Tests of Communication and Literacy Skills and Subject Matter Knowledge: Request for Responses (RFR).
Massachusetts Department of Education (July 1, 1998). Board
of Education Special Meeting Minutes.
http://www.doe.mass.edu/boe/minutes/98/min070198.html.
Massachusetts Department of Education (November 28, 2000).
Board of Education Regular Meeting Minutes.
Massachusetts Department of Education (February 16, 2001).
Massachusetts Educator Certification Tests: Registration
Bulletin.
Melnick, S. & Pullin, D. (1999, April). Teacher education
& testing in Massachusetts: The issues, the facts, and
conclusions for institutions of higher education.
Boston: Association of Independent Colleges and Universities
of Massachusetts.
Millman, J. (June 17, 1986). Testimony in Allen v
Alabama (81-697-N).
National Board on Educational Testing & Public Policy.
(2000). Policy statement. Chestnut Hill, MA: Lynch
School of Education, Boston College.
National Commission on Testing and Public Policy. (1990).
From Gatekeeper to Gateway: Transforming Testing in
America. Chestnut Hill, MA: Lynch School of Education,
Boston College.
National Evaluation Systems. (1999). Massachusetts
Educator Certification Tests Technical Report. Amherst,
MA: National Evaluation Systems.
Nunnally, J. (1967). Psychometric Theory. NY: McGraw-Hill.
Order On Pretrial Hearing, Allen v. Alabama State Board
of Education, No. 81-697-N (M.D. Ala. Dec. 19,
1984).
Pearson, K. (1909). On a new method of determining
correlation between a measured character A and a character
B, of which only the percentage of cases wherein B exceeds
or falls short of a given intensity is recorded for each
grade of A. Biometrika, Vol. VII.
Pressley, D.S. (1998). Dumb struck: Finneran slams
'idiots' who failed teacher tests. Boston
Herald, 6/26/98 pp. 1,28.
Rawls, P. (2000). ACT may design test for Alabama's
future teachers. The Associated Press,
7/11/00
Richardson v. Lamar County Board of Education, 729 F.
Supp. 806. (M.D. Ala 1989) aff'd, 935 F. 2d 1240
(11th Cir. 1991).
Richardson, M.W. & Stalnaker, J.M. (1933). A note on the use
of bi-serial r in test research. Journal of General
Psychology, 8, 463-465.
Thorndike, E.L., Bregman, M.V., Cobb, Woodyard, E. et al.,
(1929) The Measurement of Intelligence. NY: Teachers
College, Columbia University.
U.S. Department of Education, National Center for Education
Statistics. Reference and Reporting Guide for Preparing
State and Institutional Reports on the Quality of Teacher
Preparation: Title II, Higher Education Act, NCES 2000-089.
Washington, DC: 2000.
Wainer, H. (1999). Some comments on the Ad Hoc Committee's
critique of the Massachusetts Teacher Tests. Education
Policy Analysis Archives, 7(5).
Available online at http://epaa.asu.edu/epaa/v7n5.html.
Walden, J.C. & Deaton, W.L. (1988). Alabama's teacher
certification test fails. 42 Ed. Law Rep.1
About the Author
Larry H. Ludlow
Associate Professor
Boston College
Lynch School of Education
Educational Research, Measurement, and Evaluation Department
Larry Ludlow is an Associate Professor in the Lynch School of Education
at Boston College. He teaches courses in research methods, statistics,
and psychometrics. His research interests include teacher testing,
faculty evaluations, applied psychometrics, and the history of
statistics.
Copyright 2001 by the Education Policy Analysis Archives
The World Wide Web address for the Education Policy Analysis Archives is epaa.asu.edu
General questions about appropriateness of topics or particular articles may be addressed to the Editor, Gene V Glass, glass@asu.edu, or reach him at College of Education, Arizona State University, Tempe, AZ 85287-0211 (602-965-9644). The Commentary Editor is Casey D. Cobb: casey.cobb@unh.edu.