The Reliability and Validity of the Massachusetts Teacher Tests

Given the publicity that has surrounded the new tests and the questions that have been raised about their validity and reliability, it is not surprising that Massachusetts officials have sought to defend their merits. For example, in his July 7, 1998, editorial in the New York Times, John Silber wrote that the exams had been "validated by teachers and scholars who prepared it . . . [and] again by the panels of distinguished teachers, administrators and college professors who reviewed the questions for fairness and agreed on minimal passing scores." What this defense does not take into account is that a test cannot be validated simply by having people review test questions.

Test validation refers to the meaning of test scores, and that meaning depends not just on test content, but also on a host of other factors, such as the conditions under which tests are administered and how they are scored. A simple example illustrates this point. Suppose that we have a test made of 50 three-digit addition problems such as 231 + 458 = ? On its surface, this would seem to be a test of ability to add three-digit numbers. Perhaps so, if given in a math class with 20 or 30 minutes to solve the 50 problems. But suppose the test was sprung with little warning on aspiring accountants as a condition for getting a job, and they were given only five minutes to solve the 50 problems. Under these conditions, the test would obviously measure the ability not just to add three-digit numbers, but also to work fast under pressure. Or suppose that answers above 999 were scored correct only if they included a comma between the hundreds and thousands positions (such that 1,200 would be scored correct, but 1200 would not).
If examinees were not told of this scoring rule, this would undermine the validity of the test as a measure of addition skills; the scoring rule would in effect test examinees' adherence to a particular convention for writing numbers greater than 999.

This example is directly relevant to the Massachusetts Teacher Tests, for when candidates signed up to take the April exams, they had been told that these were merely practice tests and results would not count toward certification. But less than two weeks before the examination, the DOE announced that the results would count toward certification. Moreover, people taking the MTT in April and July had no access to sample tests or details on how questions (such as exercises in summarization and dictation) would be scored. Hence it is impossible to assess how meaningful the MTT scores are simply by reviewing questions that make up these tests.

The concepts of test validity and reliability

So how does one assess the validity and reliability of test scores? The 1985 test Standards says:

Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. (AERA, APA & NCME, 1985, p. 9, emphasis added)

Traditionally, three types of validity evidence have been recognized: content-related validity evidence, criterion-related validity evidence, and construct validity evidence. Content-related validity refers to the "degree to which the sample of items, tasks or questions on a test are representative of some defined universe or domain of content" (AERA, APA & NCME, 1985, p. 10).
As Emanuel Mason pointed out in his November 22, 1998, article in the Boston Globe, this is the only form of validity evidence to which state and NES officials have even referred, and even here they have provided no technical documentation as required by the 1985 test Standards. But, since validation refers to the meaningfulness of test scores, validation must also consider evidence on criterion-related validity and construct validity. Criterion-related evidence "demonstrates that test scores are systematically related to one or more outcome criteria" (AERA, APA & NCME, 1985, p. 11). The validity of college admissions tests is often evaluated in terms of the extent to which scores predict success in college as measured by freshman-year grade-point average (a form of criterion-related validity evidence referred to as predictive validity evidence). Another form of criterion-related validity evidence is concurrent validity. This refers to how scores on one test relate to those on another test intended to measure the same trait, when both tests are taken at about the same time. This is the sort of validity evidence that the Ad Hoc Committee was seeking when we asked test-takers to send us score reports on both the MTT and other tests designed to measure communications skills and/or teaching competence. As recounted above, we have not been able to acquire enough data to allow us to undertake a concurrent validity study. Construct validity is an over-arching concept referring to evidence that test scores represent "a measure of the psychological characteristic of interest" (AERA, APA & NCME, 1985, p. 9): The process of compiling construct-related evidence for test validity starts with the process of test development and continues until the pattern of empirical relationships between test scores and other variables clearly indicates the meaning of the test score. 
Especially when multiple measures of a construct are available -- as in many practical testing applications -- validating inferences about a construct also requires paying careful attention to aspects of measurement such as test format, administration conditions, or language level, that may affect test meaning and interpretation materially.
The reliability of the Massachusetts Teacher Tests
We are seeking your cooperation in affording us access to data that will allow us to analyze some of the psychometric properties of the Massachusetts Teacher Tests. As of mid-December we had received MTT data from eight institutions, namely Boston College, Bridgewater State College, Elms College, Framingham State College, Lesley College, Salem State College, UMass Boston and Westfield State College. Five of these institutions are public and three are private. In both April and July, students from over 50 different institutions took the MTT. Eight out of 50 represents only 16% of the institutions that had students taking the MTT, but since these eight represent some of the largest teacher training institutions in the Commonwealth, they account for close to one third of the candidates who took the MTT in April. Altogether we collected data on 219 people who took the MTT tests in both April and July, though not all 219 took the reading, writing and subject matter portions of the MTT on both occasions. (Note 6) One of the first things we noted about the April and July MTT scores is that some of the score changes seemed truly bizarre. (Note 7) For example, one individual was reported to have scored 36 on the reading test in April and 75 on the April writing test, but to have scored 89 on the reading test in July and 17 on the writing test. In another case, an individual was reported to have scored 56 on the writing test in April and then got an 11 in July. Such huge score changes seemed so unlikely that we inquired into the accuracy of the reported scores. In both of these cases, the scores reported were verified by institutional representatives. In the first case we were told that the individual had not been prepared for the reading test in April, and that the dramatic increase from 36 in April to 89 in July was explained by the fact that the test taker had known that the latter counted toward certification. 
Why did the writing score plummet from 75 to 17 while the reading score increased from 36 to 89? According to the institutional representative, this happened because the test-taker started taking the July writing test, but then remembered that because she had scored more than 70 in April, she did not have to take the writing test again in July. Hence she simply stopped answering questions. Nonetheless the July score of 17 was reported to the institution as a failure. In the second case, in which writing scores dropped from 56 in April to 11 in July, the institutional representative verified the accuracy of the scores. She had no explanation for the dramatic score decrease, but added that the individual who had received these scores had left Massachusetts to take a teaching job in Arizona. Table 1 presents the summary statistics for the 219 cases of April and July MTT test-takers for which we have data.
Table 1: Summary Descriptive Statistics on April-July MTT Test-Takers
Table 2: Intercorrelations of April and July MTT Scores
Note: Sample sizes shown in parentheses, test-retest correlations in bold.
Figure 1. Scatterplot of April and July MTT Reading Test Scores

Note also how widely retest scores vary among people who had approximately the same test scores in April. For example, in Figure 1, among examinees who had scores of about 60 on the reading test in April, retest scores in July ranged from less than 40 to about 90. And, as is apparent in Figure 2, among test-takers who scored in the 65 to 69 range in April, retest scores range from about 50 to 90. These figures also illustrate some of the huge score changes that initially caught our attention. These cases, often called "outliers" in data analysis, are marked with x's in Figures 1 and 2. In Figure 1, for example, note the three cases in the upper left corner. In all three cases, examinees had scores of less than 20 on the reading test in April but more than 70 in July, increases of more than 3 standard deviations. And in Figure 2, note the case in the lower right corner, representing someone who had a score of 75 in April, but a score of 17 in July. This is the case mentioned previously that was so bizarre that we asked the institutional representative to verify the accuracy of the data--the case of the test-taker who, remembering she did not have to take the writing test again, simply stopped answering questions. (Note 8) The other clear "outlier" in Figure 2 is the lowest x on the figure, representing someone who had a score of 56 on the writing test in April, but 11 in July.
Figure 2. Scatterplot of April and July MTT Writing Test Scores

We have checked these "outlier" cases and all are accurate in terms of scores reported to institutions. Nonetheless, as a more conservative examination of the test-retest reliability, we recalculated the test-retest correlations after deleting the outliers. We refer to these groups, after deleting outliers, as our trimmed samples. Specifically, after deleting the four unusual cases marked in Figure 1 with x's, the test-retest correlation for the reading test rose to 0.49. Similarly, after deleting the two outlying cases shown in Figure 2, the test-retest correlation for the writing test increased to 0.48.

This brings us to one other feature apparent in Figures 1 and 2, and also to a possible explanation for the remarkably low test-retest correlations shown in Table 2. Note that in both Figures 1 and 2, there is only one case for which retest data are available for an examinee who scored 70 or above in April. This is because people who scored 70 or above "passed" the tests and did not have to retake them in order to be provisionally certified. With this one exception, our test-retest data for the MTT are for people who scored below 70 on the April tests. This leads to one possible explanation for the unusually low test-retest correlations, namely attenuation of observed correlation coefficients due to restriction of range.

This concept is easy to explain with an example. People's height tends to be correlated with their weight. Tall people tend to weigh more than short people. Thus, we would find a positive correlation between the heights and weights of adults in general. But suppose that we consider a sample of people who are all exactly five feet tall. If we examine the correlation between their height and weight, we will find a zero correlation for the simple reason that they are all of the same height.
By focusing on people who are exactly five feet tall, we have restricted the range on this variable; hence, the observed correlation between height and weight for this sample has been reduced or attenuated. This is what is meant by attenuation due to restriction of range. If we restrict the range of a variable, the observed correlation between this variable and another will be attenuated, as compared to the correlation that likely would be observed if the range on the variable were not restricted. Hence, before concluding that the MTT reading and writing tests are unreliable, we need to consider the possibility that attenuation due to restriction of range, with most of test-retest data available only for examinees who scored less than 70 on the April tests, may have led to the low test-retest correlations shown in Table 2. Fortunately, the phenomenon of attenuation of correlation coefficients due to restriction of range has been widely recognized in the testing and measurement literature. Formulas and tables are available to allow estimation of "unattenuated" correlation coefficients when restriction of range is taken into account (slightly different formulas are available, for example, in Lord & Novick, 1968; Cronbach, 1971; and Linn, 1982). Lord & Novick (1968) present an extended discussion of attenuation due to restriction of range and tables showing how observed correlations can be corrected for attenuation. If we assume that the relationship between two variables is linear and that the conditional variance of one does not depend on the particular value of the other (the assumption of homoscedasticity), then the following table shows the corrections for observed correlations when the percentage of the sample is restricted to the top (or bottom) 60%, 50%, 40% and 30% of the entire population. As shown in Table 2 above, we found that the observed correlation between the April and July MTT reading tests was 0.29. 
However, 70% of examinees passed the April reading test, so the range of examinees who had to take the July reading test was "restricted" to only the bottom 30% of the population of April examinees. Rounding 0.29 to the nearest tabled value, Table 3 indicates that a correlation of 0.30 observed when range is restricted to 30% of a population would be corrected to 0.519 for the whole population. Similarly, we observed a correlation of 0.37 between scores on the April and July writing test, but since about 60% of examinees passed the April writing test, the group retaking the July writing test was restricted to about 40% of the population. Again taking the nearest tabled value, Table 3 indicates that an observed correlation of 0.40 in a sample restricted to 40% of a population would be corrected to 0.616 for the entire population. For the trimmed samples, the observed correlations of 0.49 and 0.48 would be corrected to about 0.74 and 0.72, again presuming that only the bottom 30% retook the reading test and the bottom 40% retook the writing test.
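The corrections described above can be approximated with a short calculation. The sketch below is not necessarily the formula behind Table 3: it assumes bivariate normal scores (a stronger assumption than linearity and homoscedasticity alone) and applies the standard correction for direct selection on one variable, with the ratio of unrestricted to restricted standard deviations taken from a truncated normal distribution. The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def restricted_sd_ratio(p_bottom: float) -> float:
    """SD of the bottom `p_bottom` share of a standard normal variable.

    Because the full-population SD is 1, this is also the ratio of
    restricted SD to unrestricted SD (assumes normally distributed scores).
    """
    std = NormalDist()
    cut = std.inv_cdf(p_bottom)          # cut score in z units
    lam = std.pdf(cut) / p_bottom        # phi(cut) / Phi(cut)
    return sqrt(1.0 - cut * lam - lam * lam)


def correct_for_restriction(r_obs: float, p_bottom: float) -> float:
    """Estimate the full-population correlation from a restricted-sample one.

    Uses the standard correction for direct range restriction:
    R = r*k / sqrt(1 - r^2 + r^2 * k^2), with k = SD_full / SD_restricted.
    """
    k = 1.0 / restricted_sd_ratio(p_bottom)
    return r_obs * k / sqrt(1.0 - r_obs ** 2 + (r_obs * k) ** 2)


if __name__ == "__main__":
    for r_obs, share in [(0.30, 0.30), (0.40, 0.40), (0.49, 0.30)]:
        corrected = correct_for_restriction(r_obs, share)
        print(f"observed {r_obs:.2f}, bottom {share:.0%}: corrected {corrected:.3f}")
```

Under these assumptions the sketch gives values close to the tabled corrections quoted above (roughly 0.52 for an observed 0.30 at 30%, and roughly 0.62 for an observed 0.40 at 40%); small differences are to be expected if the published table was built from a slightly different formula.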
To verify these corrections for attenuation due to restriction of range, we conducted simulation analyses to address questions such as the following. If the test-retest correlation among a group of test takers was 0.50, what would be the correlation observed if only the bottom 30% on the initial test were considered? We do not attempt to present all of the results of these simulations here, but instead, in Figure 3, present the results of one iteration of the data simulations aimed at addressing the following question. If there were a test-retest correlation between test 1 (t1) and re-test (t2) of 0.50, what would be the observed correlation between test and re-test scores if attention were restricted to only the bottom 30% on the initial test (t1)? What our results show is that if there were a test-retest correlation of 0.50 among the entire population of test-takers, restricting attention to only the bottom 30% of test-takers on the initial test (t1) would reduce (or attenuate) the observed correlation to about 0.30. These results confirm the theoretical results reported above. Given that we observed a test-retest correlation of about 0.30 in the 30-40% of examinees who had to retake the MTT, our estimate of the test-retest correlation for the MTT, if all examinees had retaken the tests, is about 0.50.
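A simulation of this kind can be sketched as follows. This is a minimal illustration, not the authors' simulation code: it assumes standard normal scores, uses a larger sample than the 1000 cases in the reported iteration so that the result is stable from run to run, and all names and the seed are arbitrary.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(rho=0.50, n=100_000, bottom_share=0.30, seed=0):
    """Return (full-sample r, r among the bottom `bottom_share` on t1)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        t1 = rng.gauss(0.0, 1.0)
        # t2 is built so that corr(t1, t2) equals rho in the population
        t2 = rho * t1 + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((t1, t2))
    pairs.sort(key=lambda pair: pair[0])          # order by initial score
    kept = pairs[: int(n * bottom_share)]         # only the lowest scorers "retake"
    r_full = pearson(*zip(*pairs))
    r_restricted = pearson(*zip(*kept))
    return r_full, r_restricted


if __name__ == "__main__":
    r_full, r_restricted = simulate()
    print(f"full sample r = {r_full:.2f}, bottom 30% r = {r_restricted:.2f}")
```

With a population correlation of 0.50, the correlation within the bottom 30% should come out near 0.3, in line with the attenuation described above.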
Figure 3. Example of Simulation Results
Note: Results shown here are for a sample of 1000.

Test-retest correlations in the range of 0.50 (or even 0.70) are unusually low. In comparison, as previously mentioned, test-retest correlations for the SAT have routinely been found to be in the range of 0.85 to 0.90 (Donlon, 1984, p. 54). There are several ways of illustrating the implications of test-retest reliability being as low as 0.50. One way of interpreting a test-retest reliability coefficient $r_{tt}$ is as the ratio of signal to "noise plus signal," or as the ratio of true score variance to observed score variance:

$$r_{tt} = \frac{\sigma_T^2}{\sigma_X^2}$$

Since observed score variance is composed of true score variance plus error score variance (see Anastasi, 1976, pp. 120-22, or many other textbooks on testing, for more detailed explanations), this equation can also be expressed as

$$r_{tt} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$$

where $\sigma_T^2$, $\sigma_E^2$ and $\sigma_X^2$ are the true score, error score and observed score variances. Thus, it is easy to see that when a test-retest reliability coefficient $r_{tt}$ is as low as 0.50, observed scores are composed of as much error score variance as of true score variance. Even a test-retest correlation of 0.70 indicates that MTT scores are composed of 30 percent error variance. A second way of showing the meaning of a test-retest reliability coefficient $r_{tt}$ is to use it to calculate the standard error of measurement, as follows:

$$SEM = SD_x \sqrt{1 - r_{tt}}$$

(Thorndike & Hagen, 1977, p. 85; Anastasi, 1976, p. 128) where $SD_x$ is the standard deviation of observed test scores and $r_{tt}$ is the test-retest reliability coefficient.
As shown in Table 2, in our test-retest sample, we found the
standard deviations of reading and writing test scores to be
about 15 and 11 points respectively. However, these
observed standard deviations were based on the restricted
sample of retest examinees (with only 30% of April examinees
having to retake the reading test and 40% the writing test),
so we need to find a way of estimating the standard
deviations of MTT test scores for the entire population of
test takers.
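To give a rough sense of scale, the sketch below applies the standard textbook relation SEM = SD * sqrt(1 - r_tt) to the restricted-sample standard deviations quoted above (about 15 and 11 points). This is an illustration only: because those standard deviations come from the restricted retest sample, the resulting figures would understate the standard error of measurement for the full population of test takers. Function names are illustrative.

```python
from math import sqrt


def error_variance_share(r_tt: float) -> float:
    """Proportion of observed score variance attributable to error."""
    return 1.0 - r_tt


def standard_error_of_measurement(sd: float, r_tt: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r_tt)."""
    return sd * sqrt(1.0 - r_tt)


if __name__ == "__main__":
    # Restricted-sample SDs from the text; NOT population values.
    restricted_sds = {"reading": 15.0, "writing": 11.0}
    for r_tt in (0.50, 0.70):
        share = error_variance_share(r_tt)
        for test, sd in restricted_sds.items():
            sem = standard_error_of_measurement(sd, r_tt)
            print(f"r_tt={r_tt:.2f} {test}: error share {share:.0%}, SEM about {sem:.1f} points")
```

Substituting an estimate of the population standard deviation for the restricted-sample value is the only change needed once such an estimate is available.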
Nonetheless, data available from the April and July administrations of the MTT allow us to examine the reliability of pass/fail classifications based on the MTT reading and writing tests. In a report entitled "Massachusetts Teacher Tests. Summary of Institution results for Second-Time Test Takers. Test Date: July 11, 1998. Test Summary," the Massachusetts DOE released data bearing directly on this point. The report lists the number, and percent passing, of examinees who retook the MTT on July 11, 1998. Results are reported only for institutions that had more than four candidates. Hence this report showed reading test results for only 18 institutions and writing test results for 23 institutions. These data are shown in Table 4 below.
Table 4: Passing Rates of Second-Time Test Takers on July 11, 1998, MTT Tests
Source: Adapted from Mass DOE, "Massachusetts Teacher Tests. Summary of Institution results for Second-Time Test Takers. Test Date: July 11, 1998. Test Summary". (Available at http://www.doe.mass.edu/teachertest/)
As the data in Table 4 indicate, the mean pass rate (unweighted average across institutions for which data were reported) among second-time MTT test-takers was over 50% on both the reading and writing tests. Though we do not show weighted results in Table 4, these data indicate that 160 of 282 or 57% of examinees taking the reading test for the second time passed, and 207 of 400 or 52% of those taking the writing test passed. This indicates that the misclassification rate among those who "failed" the April tests was over 50% on both the reading test and the writing test. This seems extraordinarily high given that adults' basic skills in reading and writing are unlikely to change much over a three-month period (and as previously mentioned, candidates could not cram for the July test). Note, too, that the misclassification rate was higher on the reading test than on the writing test--exactly what would be predicted from the results of our reliability analysis, which showed the reading test to be less reliable than the writing test.
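The weighted (pooled) pass rates quoted above follow directly from the reported counts. A minimal sketch of the arithmetic, using only the totals given in the text (the variable names are illustrative):

```python
def pass_rate(passed: int, taken: int) -> float:
    """Percentage of examinees who passed."""
    return 100.0 * passed / taken


# (number passing, number retaking) on July 11, 1998, as quoted in the text
second_time_takers = {"reading": (160, 282), "writing": (207, 400)}

rates = {test: pass_rate(passed, taken)
         for test, (passed, taken) in second_time_takers.items()}

if __name__ == "__main__":
    for test, rate in rates.items():
        print(f"{test}: {rate:.1f}% of second-time takers passed")
```

Read as misclassification rates among those who "failed" in April, both figures exceed 50%, and the reading figure is the higher of the two.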
Table 5: Passing Rates of Second-Time Test Takers Reported by DOE compared with Retest Sample
Source: Adapted from Mass DOE, "Massachusetts Teacher Tests. Summary of Institution results for Second-Time Test Takers. Test Date: July 11, 1998. Test Summary". (Available at http://www.doe.mass.edu/teachertest/)
The data summarized in Table 4 also allowed us to check the findings from our test-retest sample against passing rates reported by the DOE; the comparison is summarized in Table 5. This table compares the passing rates reported by the DOE with those apparent in our test-retest sample. Note first that this table shows no results for Elms College; the DOE did not report any results for this institution because it had fewer than five second-time test takers. For four of the remaining seven institutions, the sample sizes (Ns) and passing rates reported by the DOE are exactly the same as those in our test-retest sample. For the remaining three institutions there are slight differences between results reported by the DOE and those apparent in our test-retest sample. For Boston College, the DOE reported seven second-time takers for the reading test, whereas we counted only six in our test-retest sample. We have examined the data for Boston College in detail and suspect that this discrepancy arises from an unusual case in which one student took the MTT in April, and had an April writing test score reported, but had no April reading test score reported in results transmitted to Boston College. Thus, apparently this individual was counted in the DOE results as a second-time test-taker, but was not included in our test-retest sample because no reading test score was reported for April. The other two institutions for which there are slight discrepancies are Bridgewater and Salem. In both cases, the Ns for the test-retest sample are slightly higher than the Ns reported in DOE results. These differences apparently derive from the fact that the DOE results are reported only for individuals whose institutional affiliation was verified by the institution by a particular date. The reason for the slightly larger Ns for Bridgewater and Salem is that the data provided to us apparently included a small number of cases that were treated as unaffiliated examinees by the DOE.
Indeed, the DOE's policy of institutional affiliation of MTT test takers seems to be of doubtful merit and of changing meaning. In reporting institutional results for the April, July, and October administrations of the MTT, the DOE has offered a number of slightly different "Interpretive cautions and notes." But in each instance, the first has read as follows:

1. Information regarding candidate institutional affiliation was obtained from candidates as self-reported information on the registration form during the test registration process. This information was forwarded to institutions of higher education, which were provided with an opportunity to verify the candidates' institutional affiliation. The institutions were informed that if they did not respond to the verification request as explained, the data to be included in their results would be based on candidate-reported institutional affiliation.

The institutional results for the April administration of the MTT were released under a memo from David Driscoll dated July 21, 1998. Together with institutional results for the July and October 1998 administrations, they are also available on the DOE web site. When we examined results for April and July, we noticed that there was a sharp increase in the number of "unaffiliated" candidates, that is, ones whose affiliation was with institutions outside Massachusetts or was not verified by the institutions concerned. Hence, we used the data available from the DOE web site to calculate the percentages of test-takers at each administration who were listed as "unaffiliated." As the results in Table 6 show, between April and July there was a fourfold increase in the percentage of test-takers listed as unaffiliated.
Table 6: First-time Test-Takers Listed as Unaffiliated
http://www.doe.mass.edu/teachertest/7981st.html http://www.doe.mass.edu/teachertest/summary498.html_ http://www.doe.mass.edu/teachertest/1098inst/1test.html (Data summarized in Table 6 were downloaded 1/5/99)
The content validity of the Massachusetts Teacher Tests
The construct validity of the Massachusetts Teacher Tests
Table 7: Correlations of MTT Reading and Writing Scores
Note: Sample sizes shown in parentheses
Note, first, that the correlations between MTT Reading and MTT Writing scores vary somewhat: from 0.42 to 0.57 for April and from 0.20 to 0.65 for July. Part of this variation is surely due to sample size. For example, the most anomalous correlation in Table 7 is for Boston College for July test results (correlation of 0.20). Note, however, that for this sample, there was an N of only 44. If we consider only those cases in which N>100, we see a much more consistent pattern, with MTT Reading x MTT Writing correlations of 0.42, 0.50, and 0.57. This suggests that the average correlation between MTT Reading and MTT Writing test scores is about 0.50.

This finding may be compared with previous research on the intercorrelations between measures of two verbal skills. Cronbach (1970), for example, reports that the Verbal and Spelling subtest scores on the General Aptitude Test Battery (GATB) correlate in the range of 0.66 to 0.72. Donlon reports that Test of Standard Written English (TSWE) scores correlate with SAT Verbal scores in the range of 0.76 to 0.80 and with SAT Reading scores in the range of 0.72 to 0.77 (Donlon, 1984, p. 81). Similarly, Conrad, Trismen and Miller report that GRE Verbal and GRE Analytical scores for the same individuals correlate in the range of 0.76 to 0.77 (Conrad, Trismen & Miller, 1977, p. 19). Indeed, even SAT Verbal and SAT Mathematical scores have been found to correlate in the range of 0.64 to 0.72 (Donlon, 1984, p. 81). These comparisons cast considerable doubt on the construct validity of the MTT Reading and Writing test scores, which correlate only in the range of 0.42 to 0.57, with an average correlation of about 0.50.

Summary
Why the MTT Reading and Writing tests are so unusually unreliable and of such doubtful validity is all the more mysterious because reading and writing are skills for which many reliable and valid tests have been developed over many decades. There are many possible causes for the low reliability and apparently poor validity of the MTT tests. The problems may arise from test content, administration, scoring, scaling, equating or some combination of these factors. Fortunately, another aspect of inquiry by the Ad Hoc Committee offers insight into why these scores are of such low reliability and apparently poor validity.