~ EPAA Vol. 7 No. 4: Haney, Fowler, Wheelock, Bebell & Malec "Massachusetts Teacher Test" ~

Conclusions and Recommendations


        The Ad Hoc Committee was formed in the summer of 1998 out of concern that important decisions were being based on Massachusetts Teacher Tests (MTT) scores before any reasonable evidence had been produced concerning their reliability and validity. Since the DOE and NES have not made available any documentation on the reliability and validity of the MTT, in clear violation of professional standards concerning testing and despite repeated requests for such documentation, the Ad Hoc Committee set out to study the technical merits of the new tests. Our original idea was to compare individuals' scores on the MTT with scores on post-collegiate tests (such as the Praxis and the GRE) for which technical documentation is available. Toward this end we invited people to send us score reports on both the MTT and other tests.
        As of December we had not received sufficient data to undertake a concurrent validity study comparing MTT scores with those on established tests. In the meantime, however, we examined the reliability of the new tests. Specifically, using data on more than 200 individuals who took the MTT in April and July 1998 (generously provided to us by eight institutions of higher education in the Commonwealth), we studied the test-retest reliability of the MTT. We found the correlations between the April and July tests to be extraordinarily low: about 0.30 for both the MTT Reading and Writing tests. Test-retest correlation coefficients for well-developed standardized tests typically range between 0.80 and 0.90. To examine the possibility that the very low correlations were due to restriction of range (only people who scored below 70 on the April tests had to retake them), we corrected for attenuation due to restriction of range and estimated test-retest correlations for the unrestricted population of test-takers. The results indicated test-retest correlations of 0.50 to 0.70, still well below the reliability of well-developed tests.
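        The range-restriction adjustment described above can be sketched numerically. This section does not state which correction formula was applied, so the sketch assumes the standard correction for direct range restriction (often called Thorndike's Case 2). The function name and the standard-deviation ratios are hypothetical and chosen only for illustration; they are not values from the report.

```python
import math


def correct_for_range_restriction(r_restricted: float, sd_ratio: float) -> float:
    """Estimate the correlation in the unrestricted group.

    Assumes direct range restriction (Thorndike's Case 2). This is an
    assumption for illustration, not necessarily the report's formula.

    r_restricted: correlation observed in the restricted (retest) sample
    sd_ratio:     unrestricted SD divided by restricted SD (hypothetical)
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    if not -1.0 < r_restricted < 1.0:
        raise ValueError("r_restricted must lie strictly between -1 and 1")
    r, k = r_restricted, sd_ratio
    return (r * k) / math.sqrt(1.0 - r * r + r * r * k * k)


if __name__ == "__main__":
    observed = 0.30  # approximate April-July correlation reported above
    for k in (1.0, 2.0, 3.0):  # hypothetical SD ratios, not from the report
        corrected = correct_for_range_restriction(observed, k)
        print(f"SD ratio {k:.1f}: corrected r = {corrected:.2f}")
```

        A ratio of 1.0 leaves the correlation unchanged. Larger ratios raise the estimate, which is the expected direction here because the retest sample excluded everyone who scored 70 or above in April.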
        We used these results to estimate the error of measurement in MTT scores. We found that MTT scores contain unusually high levels of measurement error, with an error of measurement on the new tests in the range of 9 to 17 points. We estimate that MTT Reading and Writing test scores contain two to three times as much error as scores on well-developed tests.
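        The link between reliability and error of measurement can be illustrated with the classical formula SEM = SD * sqrt(1 - reliability). The sketch below assumes that formula; the score standard deviation of 15 points and the function names are hypothetical, so the printed values are illustrative and are not a reproduction of the estimates reported above.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def sem_ratio(reliability_a: float, reliability_b: float) -> float:
    """How many times larger test A's SEM is than test B's, given equal SDs."""
    if not (0.0 <= reliability_a <= 1.0 and 0.0 <= reliability_b < 1.0):
        raise ValueError("reliabilities must lie in [0, 1], with B below 1")
    return math.sqrt((1.0 - reliability_a) / (1.0 - reliability_b))


if __name__ == "__main__":
    hypothetical_sd = 15.0  # assumed score SD; not taken from the report
    for rel in (0.50, 0.70, 0.90):
        sem = standard_error_of_measurement(hypothetical_sd, rel)
        print(f"reliability {rel:.2f}: SEM = {sem:.1f} points")
    print(f"SEM ratio, 0.50 vs 0.90: {sem_ratio(0.50, 0.90):.2f}")
```

        Because the ratio depends only on the two reliabilities, it shows why a test with reliability near 0.50 carries a multiple of the error of a test with reliability near 0.90, whatever the score scale.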
        Next, we compared pass and failure rates on the April and July administrations to consider the rates of misclassification on the MTT. Using both our test-retest sample and a much larger sample of data reported on the DOE web site, we found that the MTT tests have very high rates of misclassification, as indicated by the fact that among those who "failed" either the MTT Reading or Writing test in April, more than 50% "passed" the test in July. Evidence also suggests that a fair number of people who "passed" the MTT did so simply because of error in the tests.
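        One way to see how measurement error produces misclassification is to ask how often a candidate with a given true score would land below the passing score by chance. The sketch below assumes normally distributed measurement error, which the report does not state; the true score of 75 and the function names are hypothetical. The passing score of 70 and the 9 and 17 point errors of measurement are the figures mentioned above.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_observed_below_cut(true_score: float, cut_score: float, sem: float) -> float:
    """Chance that an observed score falls below the cut score.

    Assumes observed = true + error, with error ~ Normal(0, sem). The
    normality assumption is ours, made only for illustration.
    """
    if sem <= 0:
        raise ValueError("sem must be positive")
    return normal_cdf((cut_score - true_score) / sem)


if __name__ == "__main__":
    cut = 70.0               # passing score mentioned in the text
    hypothetical_true = 75.0  # assumed true score; not from the report
    for sem in (9.0, 17.0):  # range of error of measurement reported above
        p = prob_observed_below_cut(hypothetical_true, cut, sem)
        print(f"SEM {sem:.0f}: chance of scoring below {cut:.0f} = {p:.2f}")
```

        Under these assumptions, a candidate whose true score sits exactly at the cut score fails half the time, and the chance of a wrong classification shrinks only slowly as the true score moves away from the cut when the error of measurement is large.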
        We also considered the content and construct validity of the MTT tests. At least one portion of the MTT Writing test (the dictation exercise) raises doubts about the content validity of the tests, and specifically about their job-relatedness. Moreover, the correlation between MTT Reading and Writing test scores, about 0.50, raises serious doubt about their construct validity. Previous research suggests that scores on tests of two related verbal constructs correlate in the range of 0.65 to 0.80.
        Finally, we report on results of interviews with 15 candidates who took the MTT in April, July, or October (7 of whom passed and 8 failed). Since this was a small and self-selected sample, results are merely suggestive. But they indicate that the unreliability and poor validity of MTT scores may result from the lack of a study guide for the new tests, confusion over whether the April results would "count" towards certification, poor conditions of administration (in at least some test sites), simple fatigue resulting from the 8-hour duration of the tests, and test content. Although all those interviewed supported the idea of certification testing for teachers, as is common with other professions, many compared the MTT unfavorably with other teacher certification tests they had taken (e.g. the Praxis or NTE and certification tests in other states).

        Recommendations


        If the Commonwealth wants high standards for its teaching force, it must use assessments that meet similarly high professional standards. The current Massachusetts Teacher Tests fail to meet this criterion. Results from the April and July administrations of the MTT reveal that the new tests are so unreliable and of such poor validity that they are passing candidates who lack the knowledge and skills the MTT are allegedly testing and failing many who do have these skills. Therefore, the Ad Hoc Committee recommends that:

  1. The Massachusetts Board of Education immediately suspend administration of the MTT. No exam at all is better than an unreliable exam that may be mistakenly failing 50% of qualified pre-service teachers while passing unknown numbers of unqualified ones.
  2. The Commonwealth convene an independent panel of testing experts to audit the development, administration and use of the MTT in light of both professional standards for testing and the requirements of the Education Reform Act. These experts should issue a report evaluating how well the first four administrations of the MTT meet accepted professional standards. If they find that the MTT fails to meet these standards, they should propose other approaches that will contribute to high-quality teaching in the Commonwealth.
  3. An investigation be launched into how and why the state has allowed the new MTT tests to be used. An independent investigation into this matter is essential, since even before contracting with NES to develop the MTT, the DOE knew that a federal court had found that same firm to have "violated the minimum requirements for professional test development" with its teacher certification tests for Alabama. That the DOE nevertheless proceeded to allow the new MTT tests to be used, in obvious violation of professional standards on testing, to make important decisions about individuals before the validity and reliability of the new tests had been documented, was a course of action so imprudent as to call out for independent scrutiny.

        As James Madison wrote in 1787, in the passage candidates were asked to transcribe in the April 1998 version of the MTT, "No man is allowed to be a judge in his own cause because his interest would certainly bias his judgment and, not improbably, corrupt his integrity." So too with organizations; the DOE, having implemented new teacher certification tests of undocumented validity and reliability, should not be allowed to judge its own cause.

Notes

  1. In September 1998, the Massachusetts Department of Education (DOE) announced that the name Massachusetts Teacher Tests (MTT) was being changed to Massachusetts Educator Certification Tests (MECT), to reflect the fact that not just teachers but also other professional educators, such as counselors and principals, would be required to pass the new exams. However, throughout this report we refer to the Massachusetts Teacher Tests (MTT), since that is how they are most widely known.
  2. These test standards have been developed by the American Educational Research Association (AERA), the American Psychological Association (APA) and the National Council on Measurement in Education (NCME).
  3. Richardson v. Lamar County Bd. of Educ., 729 F. Supp. 806 (M.D. Ala. 1989), p. 821. A portion of this decision appears in appendix 2 of this report.
  4. NES President William Gorth signed the contract on February 23, and then Commissioner of Education Robert Antonucci on February 26, 1998.
  5. The most common approaches for estimating internal consistency are the Cronbach alpha and split-half techniques.
  6. We are submitting this report for publication and will make available to other investigators the complete set of data on which our reliability analyses have been based, but with the identities of the institutions of higher education removed.
  7. In the remainder of this section of this report, we focus on MTT reading and writing scores. Among the more than 200 candidates whose MTT scores we obtained, there were many different subject matter tests represented. Hence the sizes of samples for any one subject matter test were much smaller than those for the reading and writing tests.
  8. Here we should explain why we devoted considerable attention to these anomalous "outlier" scores. Such unusual cases can have a disproportionate impact on summary statistics, such as means, standard deviations and correlation coefficients. Deletion of one or two extreme cases can change the summary statistics. Hence, as we explain below, we report reliability estimates not only for our entire test-retest sample, but also for a trimmed sample from which outlier cases have been deleted. Therefore, in summary
  9. Candidates whose experiences are described in these vignettes have given their consent to these descriptions. We note, however, that specific details of their cases have been altered to protect their confidentiality.