~ EPAA Vol. 7 No. 4: Haney, Fowler, Wheelock, Bebell & Malec "Massachusetts Teacher Test" ~
page 1 | introduction | background | reliability & validity | interviews | conclusions | references

The Reliability and Validity of the Massachusetts Teacher Tests

        Given the publicity that has surrounded the new tests and the questions that have been raised about their validity and reliability, it is not surprising that Massachusetts officials have sought to defend their merits. For example, in his July 7, 1998, editorial in the New York Times, John Silber wrote that the exams had been "validated by teachers and scholars who prepared it . . . [and] again by the panels of distinguished teachers, administrators and college professors who reviewed the questions for fairness and agreed on minimal passing scores." What this defense does not take into account is that a test cannot be validated simply by having people review test questions.
        Test validation refers to the meaning of test scores and that meaning depends not just on test content, but also on a host of other factors, such as the conditions under which tests are administered and how they are scored. A simple example illustrates this point. Suppose that we have a test made of 50 three-digit addition problems such as 231 + 458 = ? On its surface, this would seem to be a test of ability to add three-digit numbers. Perhaps so, if given in a math class with 20 or 30 minutes to solve the 50 problems. But suppose the test was sprung with little warning on aspiring accountants as a condition for getting a job, and they were given only five minutes to solve the 50 problems. Under these conditions, the test would obviously measure the ability not just to add three-digit numbers, but also to work fast under pressure. Or suppose that answers above 999 were scored correct only if they included a comma between the hundreds and thousands positions (such that 1,200 would be scored correct, but 1200 would not). If examinees were not told of this scoring rule, this would undermine the validity of the test as a measure of addition skills; the scoring rule would in effect test examinees' adherence to a particular convention for writing numbers greater than 999.
        This example is directly relevant to the Massachusetts Teacher Tests, for when candidates signed up to take the April exams, they had been told that these were merely practice tests and results would not count toward certification. But less than two weeks before the examination, the DOE announced that the results would count toward certification. Moreover, people taking the MTT in April and July had no access to sample tests or details on how questions (such as exercises in summarization and dictation) would be scored. Hence it is impossible to assess how meaningful the MTT scores are simply by reviewing questions that make up these tests.

        The concepts of test validity and reliability

        So how does one assess the validity and reliability of test scores? The 1985 test Standards says:

Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. (AERA, APA & NCME, 1985, p. 9, emphasis added)

        Traditionally, three types of validity evidence have been recognized: content-related validity evidence, criterion-related validity evidence, and construct validity evidence. Content-related validity refers to the "degree to which the sample of items, tasks or questions on a test are representative of some defined universe or domain of content" (AERA, APA & NCME, 1985, p. 10). As Emanuel Mason pointed out in his November 22, 1998, article in the Boston Globe, this is the only form of validity evidence to which state and NES officials have even referred, and even here they have provided no technical documentation as required by the 1985 test Standards.
        But, since validation refers to the meaningfulness of test scores, validation must also consider evidence on criterion-related validity and construct validity. Criterion-related evidence "demonstrates that test scores are systematically related to one or more outcome criteria" (AERA, APA & NCME, 1985, p. 11). The validity of college admissions tests is often evaluated in terms of the extent to which scores predict success in college as measured by freshman-year grade-point average (a form of criterion-related validity evidence referred to as predictive validity evidence). Another form of criterion-related validity evidence is concurrent validity. This refers to how scores on one test relate to those on another test intended to measure the same trait, when both tests are taken at about the same time. This is the sort of validity evidence that the Ad Hoc Committee was seeking when we asked test-takers to send us score reports on both the MTT and other tests designed to measure communications skills and/or teaching competence. As recounted above, we have not been able to acquire enough data to allow us to undertake a concurrent validity study.
        Construct validity is an over-arching concept referring to evidence that test scores represent "a measure of the psychological characteristic of interest" (AERA, APA & NCME, 1985, p. 9):
The process of compiling construct-related evidence for test validity starts with the process of test development and continues until the pattern of empirical relationships between test scores and other variables clearly indicates the meaning of the test score. Especially when multiple measures of a construct are available -- as in many practical testing applications -- validating inferences about a construct also requires paying careful attention to aspects of measurement such as test format, administration conditions, or language level, that may affect test meaning and interpretation materially.

Evidence for construct interpretation of a test may be obtained from a variety of sources. Intercorrelations among items may be used to support the assertion that a test measures primarily a single construct. Substantial relationships of a test to other measures that are purportedly of the same construct and the weaknesses of relationships to measures that are purportedly of different constructs support the identification of constructs and the differences among them. Relationships among different methods of measurement and among various nontest variables similarly sharpen and elaborate the meaning and interpretation of constructs.

Another line of evidence derives from analyses of individual responses. Questioning test takers about their performance strategies or responses to particular items or asking raters about their reasons for their ratings can yield hypotheses that enrich the definition of a construct. (AERA, APA & NCME, 1985, p. 10, emphasis added).



        A test that is valid must be reliable. "Reliability refers to the degree to which test scores are free from measurement error" (AERA, APA & NCME, 1985, p. 19). As a basic textbook on testing points out, "The ceiling for possible validity of a test is set by its reliability" (Thorndike & Hagen, 1977, p. 87). In other words, if a test does not measure something reliably, it cannot be a valid measure of anything.

        The reliability of the Massachusetts Teacher Tests


        Absent sufficient data to assess the concurrent validity of the MTT, we decided to inquire into their reliability. Specifically, once we realized that relevant data might be available to us, we sought to examine the reliability of the scores on the April and July administrations. Comparing scores on these two administrations of the MTT might be thought of as a study of either test-retest or alternate-forms reliability. Test-retest reliability refers to the consistency of scores on two administrations of a test. Alternate-forms reliability refers to the consistency of scores on two different forms or versions of the same test. As we do not know to what extent the July MTT tests used the same questions as the April MTT tests, it is unclear whether our study should be termed a test-retest or alternate-forms study. But the essential idea is quite simple: to compare the scores on the MTT tests for people who took them in both April and July. Adults' scores on basic skills tests of reading and writing should not change much over three months, and since no study guides were available, examinees could hardly have crammed for the second administration. So if the MTT tests are reasonably reliable, we would expect individuals' scores on these two administrations to be similar; if they are unreliable, we would expect the scores to vary widely.
        We should acknowledge that there are several other ways in which test reliability might be estimated, such as internal consistency (indicating how much item results on one test administration tend to cohere). (Note 5) But the goal of certification tests such as the MTT is to estimate not simply examinees' competence on one test-taking occasion, but their competence in general. Alternate-forms reliability is thus more appropriate for assessing reliability. As Thorndike and Hagen (1977, p. 79) point out, "evidence based on equivalent test forms should usually be given the most weight in evaluating the reliability of a test."
        After the July administration of the MTT, the Department of Education distributed lists of results for individual test-takers to institutions of higher education in the Commonwealth. When we realized that these lists contained individuals' scores for both the April and July tests, we decided to try to gather enough data to undertake a test-retest or alternate-forms reliability study. We contacted some institutions individually via phone, and some through the Commonwealth Education Deans' Council. Three institutions with large numbers of students retaking the MTT tests in July were contacted via letter, which read in part:

We are seeking your cooperation in affording us access to data that will allow us to analyze some of the psychometric properties of the Massachusetts Teacher Tests.

In particular we seek access to scores of examinees who took Massachusetts Teacher Tests in April and again in July. Access to these data will allow us to analyze the test-retest reliability of the Massachusetts Teacher Tests. Your institution received a computer printout labeled "Institution Roster Report By Test: Verified Institutional Affiliation" for test date July 11, 1998. We would like to receive either a copy of this computer printout or a data file containing relevant data. To preserve the confidentiality of examinees we seek these data with the names and SSN's of candidates removed. (Letter to institutional representatives, September 22, 1998).


        As of mid-December we had received MTT data from eight institutions, namely Boston College, Bridgewater State College, Elms College, Framingham State College, Lesley College, Salem State College, UMass Boston and Westfield State College. Five of these institutions are public and three are private. In both April and July, students from over 50 different institutions took the MTT. Eight out of 50 represents only 16% of the institutions that had students taking the MTT, but since these eight represent some of the largest teacher training institutions in the Commonwealth, they account for close to one third of the candidates who took the MTT in April.
        Altogether we collected data on 219 people who took the MTT tests in both April and July, though not all 219 took the reading, writing and subject matter portions of the MTT on both occasions. (Note 6) One of the first things we noted about the April and July MTT scores is that some of the score changes seemed truly bizarre. (Note 7) For example, one individual was reported to have scored 36 on the reading test in April and 75 on the April writing test, but to have scored 89 on the reading test in July and 17 on the writing test. In another case, an individual was reported to have scored 56 on the writing test in April and then got an 11 in July. Such huge score changes seemed so unlikely that we inquired into the accuracy of the reported scores. In both of these cases, the scores reported were verified by institutional representatives. In the first case we were told that the individual had not been prepared for the reading test in April, and that the dramatic increase from 36 in April to 89 in July was explained by the fact that the test taker had known that the latter counted toward certification. Why did the writing score plummet from 75 to 17 while the reading score increased from 36 to 89? According to the institutional representative, this happened because the test-taker started taking the July writing test, but then remembered that because she had scored more than 70 in April, she did not have to take the writing test again in July. Hence she simply stopped answering questions. Nonetheless the July score of 17 was reported to the institution as a failure. In the second case, in which writing scores dropped from 56 in April to 11 in July, the institutional representative verified the accuracy of the scores. She had no explanation for the dramatic score decrease, but added that the individual who had received these scores had left Massachusetts to take a teaching job in Arizona.
        Table 1 presents the summary statistics for the 219 cases of April and July MTT test-takers for which we have data.

Table 1: Summary Descriptive Statistics on April-July MTT Test-Takers

                      Reading 4/98   Writing 4/98   Reading 7/98   Writing 7/98
Count                 215            218            130            173
Mean                  65.2           63.1           69.4           70.7
Median                66             63.5           70             71
Standard deviation    14.7           10.3           15.2           11.8
Minimum               3              36             21             11
Maximum               93             87             96             96


        These data suggest that this sample is not unlike the April MTT test takers in general. On average, they fell below the passing score of 70 on both the reading and writing tests. Initially, these results would appear to make the MTT results seem reasonably reliable. Among repeat test-takers, the average reading scores increased from 65.2 to 69.4, and the average writing test scores from 63.1 to 70.7, apparently modest changes. But note the differences in the count of people in this sample who took the April and July tests. While more than 200 took both reading and writing tests in April, fewer than 180 took the tests in July. This reflects the fact that test-takers had to retake tests in July only if they had scored less than 70 on either the reading or writing tests.
        Hence, in order to assess the reliability of the MTT tests, we need to examine the correlations between April and July tests for examinees who took the same portions of the MTT tests on both occasions. Table 2 shows the intercorrelations of reading and writing test scores for people who took both tests. For those who took the reading test in April and July, the correlation of scores was 0.29; for writing, 0.37.

Table 2: Intercorrelations of April and July MTT Scores

                Reading 4/98   Writing 4/98   Reading 7/98   Writing 7/98
Reading 4/98    1.00 (215)     0.07 (215)     0.29* (127)    0.24 (169)
Writing 4/98                   1.00 (218)     0.47 (129)     0.37* (172)
Reading 7/98                                  1.00 (130)     0.06 (94)
Writing 7/98                                                 1.00 (173)

Note: Sample sizes shown in parentheses; test-retest correlations (bold in the original) are marked with an asterisk.
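
The sample sizes in Table 2 differ from cell to cell because each correlation can use only the examinees who have both of the scores involved. A minimal sketch of that pairwise calculation follows. It is not the code used for the study: the examinee ids and scores are hypothetical, and `pearson_r` and `retest_correlation` are illustrative names.

```python
from math import sqrt


def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) score pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def retest_correlation(april, july):
    """Correlate April and July scores for examinees who have both.

    `april` and `july` map an anonymous examinee id to a score; anyone
    missing from either dict (did not sit that test) is dropped.
    Returns (correlation, number of examinees used).
    """
    both = sorted(set(april) & set(july))
    pairs = [(april[i], july[i]) for i in both]
    return pearson_r(pairs), len(pairs)


# Hypothetical scores for illustration only, NOT data from the study.
april_reading = {"a": 52, "b": 61, "c": 66, "d": 68, "e": 75}
july_reading = {"a": 71, "b": 58, "c": 80, "d": 64}  # "e" passed in April, no retest

r, n = retest_correlation(april_reading, july_reading)
print(f"test-retest r = {r:.2f} based on n = {n} examinees")
```

Dropping examinees who lack one of the two scores is why the test-retest cells rest on 127 and 172 cases rather than on all 219.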

 


        These test-retest intercorrelations are extraordinarily low. Correlation coefficients can range from -1.00 to +1.00. Test-retest correlation coefficients for well-developed standardized tests typically range between 0.80 and 0.90. For example, test-retest correlations for the SAT have been reported to range between 0.86 and 0.90 (Donlon, 1984, p. 54). Similarly, Thorndike & Hagen (1977, p. 321) report alternate-form reliability coefficients in the range of .85 to .95 for the Stanford Binet. In contrast, the scores of examinees who took the MTT reading test in April and July correlated only 0.29, and those of examinees who took the writing test on both occasions correlated only 0.37.
        To provide a more detailed picture of the relationship between April and July scores, Figures 1 and 2 show scatterplots of test scores for individuals in our sample who took the reading and writing tests, respectively, on both occasions. Several patterns are apparent in comparing these figures. First, note that the "scatter" in reading test scores is greater than that in writing scores. This simply reflects the findings shown in Table 2 above; namely, that the correlation between reading scores in April and July (0.29) was smaller than that for writing test scores (0.37).

Figure 1. Scatterplot of April and July MTT Reading Test Scores


        Note also how widely retest scores vary among people who had approximately the same test scores in April. For example, in Figure 1, among examinees who had scores of about 60 on the reading test in April, retest scores in July ranged from less than 40 to about 90. And, as is apparent in Figure 2, among test-takers who scored in the 65 to 69 range in April, retest scores range from about 50 to 90.
        These figures also illustrate some of the huge score changes that initially caught our attention. These cases, often called "outliers" in data analysis, are marked with x's in Figures 1 and 2. In Figure 1, for example, note the three cases in the upper left corner. In all three cases, examinees had scores of less than 20 on the reading test in April but more than 70 in July, increases of more than 3 standard deviations. And in Figure 2, note the case in the lower right corner, representing someone who had a score of 75 in April, but a score of 17 in July. This is the case mentioned previously that was so bizarre that we asked the institutional representative to verify the accuracy of the data--the case of the test-taker who, remembering she did not have to take the writing test again, simply stopped answering questions. (Note 8) The other clear "outlier" in Figure 2 is the lowest x on the figure, representing someone who had a score of 56 on the writing test in April, but 11 in July.

Figure 2. Scatterplot of April and July MTT Writing Test Scores


        We have checked these "outlier" cases and all are accurate in terms of scores reported to institutions. Nonetheless, as a more conservative examination of the test-retest reliability, we recalculated the test-retest correlations after deleting the outliers. We refer to these groups, after deleting outliers, as our trimmed samples. Specifically, after deleting the four unusual cases marked in Figure 1 with x's, the test-retest correlation for the reading test rose to 0.49. Similarly, after deleting the two outlying cases shown in Figure 2, the test-retest correlation for the writing test increased to 0.48.
        This brings us to one other feature apparent in Figures 1 and 2, and also to a possible explanation for the remarkably low test-retest correlations shown in Table 2. Note that in both Figures 1 and 2, there is only one case for which retest data are available for an examinee who scored 70 or above in April. This is because people who scored 70 or above "passed" the tests and did not have to retake them in order to be provisionally certified. With this one exception, our test-retest data for the MTT are for people who scored below 70 on the April tests. This leads to one possible explanation for the unusually low test-retest correlations, namely attenuation of observed correlation coefficients due to restriction of range. This concept is easy to explain with an example. People's height tends to be correlated with their weight. Tall people tend to weigh more than short people. Thus, we would find a positive correlation between the heights and weights of adults in general. But suppose that we consider a sample of people who are all exactly five feet tall. If we examine the correlation between their height and weight, we will find a zero correlation for the simple reason that they are all of the same height. By focusing on people who are exactly five feet tall, we have restricted the range on this variable; hence, the observed correlation between height and weight for this sample has been reduced or attenuated. This is what is meant by attenuation due to restriction of range. If we restrict the range of a variable, the observed correlation between this variable and another will be attenuated, as compared to the correlation that likely would be observed if the range on the variable were not restricted.
        Hence, before concluding that the MTT reading and writing tests are unreliable, we need to consider the possibility that attenuation due to restriction of range, with most of test-retest data available only for examinees who scored less than 70 on the April tests, may have led to the low test-retest correlations shown in Table 2. Fortunately, the phenomenon of attenuation of correlation coefficients due to restriction of range has been widely recognized in the testing and measurement literature. Formulas and tables are available to allow estimation of "unattenuated" correlation coefficients when restriction of range is taken into account (slightly different formulas are available, for example, in Lord & Novick, 1968; Cronbach, 1971; and Linn, 1982).
        Lord & Novick (1968) present an extended discussion of attenuation due to restriction of range and tables showing how observed correlations can be corrected for attenuation. If we assume that the relationship between two variables is linear and that the conditional variance of one does not depend on the particular value of the other (the assumption of homoscedasticity), then the following table shows the corrections for observed correlations when the sample is restricted to the top (or bottom) 60%, 50%, 40% and 30% of the entire population. As shown in Table 2 above, we found that the observed correlation between the April and July MTT reading tests was 0.29. However, 70% of examinees passed the April reading test, so the range of examinees who had to take the July reading test was "restricted" to only the bottom 30% of the population of April examinees. Table 3 indicates that a correlation of 0.30 observed when range is restricted to 30% of a population would be corrected to 0.519 for the whole population. Similarly, we observed a correlation of 0.37 between scores on the April and July writing test, but since about 60% of examinees passed the April writing test, the group retaking the July writing test was restricted to about 40% of the population. Table 3 indicates that an observed correlation of 0.40 in a sample restricted to 40% of a population would be corrected to 0.616 for the entire population. For the trimmed samples, the observed correlations of 0.49 and 0.48 would be corrected to about 0.74 and 0.72, again presuming that only the bottom 30% retook the reading test and the bottom 40% retook the writing test.

 

Table 3: Corrections for Attenuation Due to Restriction of Range

                                                                 Observed correlation in restricted
Normal    Percent selected   Standard         Ratio of SD in     sample corrected to:
deviate   in restricted      deviation in     unselected to SD
z         sample             selected group   in selected (K)    r = 0.30   r = 0.40   r = 0.50
-0.25     59.9               0.65             1.54               0.436      0.558      0.644
0         50                 0.6              1.64               0.458      0.582      0.688
0.25      40.1               0.56             1.79               0.491      0.616      0.719
0.5       30.8               0.52             1.93               0.519      0.644      0.744

Source: Adapted from Lord & Novick, 1968, pp. 140-142.
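
The corrections in Table 3 can be approximated with the standard formula for direct selection on one variable (often cited as Thorndike's Case 2), which needs only the observed correlation and the ratio K. The sketch below assumes this is the formula behind the table; the function name is illustrative and not taken from the cited sources.

```python
from math import sqrt


def correct_for_range_restriction(r_restricted, k):
    """Estimate the whole-population correlation from a restricted-sample one.

    r_restricted: correlation observed in the selected (restricted) group
    k: ratio of the SD in the unselected group to the SD in the selected
       group (the column labelled K in Table 3)
    Assumes linearity and homoscedasticity, as stated in the text.
    """
    rk = r_restricted * k
    return rk / sqrt(1.0 - r_restricted ** 2 + rk ** 2)


# Reading: observed r of about 0.30, bottom 30% retested (K = 1.93)
print(round(correct_for_range_restriction(0.30, 1.93), 3))  # 0.519
# Writing: observed r of about 0.40, bottom 40% retested (K = 1.79)
print(round(correct_for_range_restriction(0.40, 1.79), 3))  # 0.616
```

With K = 1 (no restriction) the function returns the observed correlation unchanged, which is a quick sanity check on the formula.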

 


        To verify these corrections for attenuation due to restriction of range, we conducted simulation analyses to address questions such as the following. If the test-retest correlation among a group of test takers was 0.50, what would be the correlation observed if only the bottom 30% on the initial test were considered? We do not attempt to present all of the results of these simulations here, but instead, in Figure 3, present the results of one iteration of the data simulations aimed at addressing the following question. If there were a test-retest correlation between test 1 (t1) and re-test (t2) of 0.50, what would be the observed correlation between test and re-test scores if attention were restricted to only the bottom 30% on the initial test (t1)? What our results show is that if there were a test-retest correlation of 0.50 among the entire population of test-takers, restricting attention to only the bottom 30% of test-takers on the initial test (t1) would reduce (or attenuate) the observed correlation to about 0.30. These results confirm the theoretical results reported above. Given that we observed a test-retest correlation of about 0.30 in the 30-40% of examinees who had to retake the MTT, our estimate of the test-retest correlation for the MTT, if all examinees had retaken the tests, is about 0.50.

Figure 3. Example of Simulation Results

Note: Results shown here are for a sample of 1000
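
A simulation of this kind can be sketched in a few lines. The version below is not the code used for the study, and it rests on several assumptions of its own: scores are drawn from a bivariate normal distribution, the sample is much larger than the 1000 cases noted for Figure 3 (to reduce sampling noise), and the function names and seed are arbitrary.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_restriction(rho=0.50, n=50_000, bottom_share=0.30, seed=1):
    """Return (full-sample r, r among the bottom `bottom_share` on t1)."""
    rng = random.Random(seed)
    t1, t2 = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # t2 shares rho of its standard deviation with t1; the rest is noise
        y = rho * x + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        t1.append(x)
        t2.append(y)
    cutoff = sorted(t1)[int(bottom_share * n)]
    kept = [(x, y) for x, y in zip(t1, t2) if x < cutoff]
    restricted_r = pearson_r([x for x, _ in kept], [y for _, y in kept])
    return pearson_r(t1, t2), restricted_r


full_r, restricted_r = simulate_restriction()
print(f"all examinees: r = {full_r:.2f}; bottom 30% on t1: r = {restricted_r:.2f}")
```

Under these assumptions the full-sample correlation stays near 0.50, while the correlation among the bottom 30% on t1 falls to roughly 0.3, in line with the attenuation described above.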

        Test-retest correlations in the range of 0.50 (or even 0.70) are unusually low. In comparison, as previously mentioned, test-retest correlations for the SAT have routinely been found to be in the range of 0.85 to 0.90 (Donlon, 1984, p. 54). There are several ways of illustrating the implications of test-retest reliability being as low as 0.50. One way of interpreting a test-retest reliability coefficient rtt is as the ratio of signal to "noise plus signal," or as the ratio of true score variance to observed score variance.

rtt = signal / (signal + noise) = true score variance / observed score variance

        Since observed score variance is composed of true score variance plus error score variance (see Anastasi, 1976, pp. 120-22, or many other textbooks on testing, for more detailed explanations), this equation can also be expressed as
rtt = (true score variance) / (true score variance + error score variance)

        Thus, it is easy to see that when a test-retest reliability coefficient rtt is as low as 0.50, observed scores are composed of as much error score variance as true score variance. In other words, a test-retest correlation of 0.50 indicates that half of the variance in MTT scores is error. Even a test-retest correlation of 0.70 indicates that MTT scores are composed of 30 percent error variance.
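
The variance-ratio reading of rtt can be illustrated with simulated scores in which each observed score is a true score plus independent error. This is only a sketch under the classical assumptions just described; the standard deviations, sample size, seed and function names are illustrative choices, not values from the MTT.

```python
import random
import statistics


def theoretical_reliability(true_sd, error_sd):
    """True score variance divided by (true + error) score variance."""
    return true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)


def simulated_variance_ratio(true_sd, error_sd, n=50_000, seed=2):
    """Ratio of true score variance to observed score variance in simulated data."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n)]
    # Observed score = true score + independent measurement error
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


# Equal true and error SDs: half of the observed variance is error
print(theoretical_reliability(10, 10))             # 0.5
print(round(simulated_variance_ratio(10, 10), 2))  # close to 0.5
```

When the error standard deviation equals the true score standard deviation, the ratio comes out at about 0.50, the case in which scores contain as much error variance as true score variance.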
        A second way of showing the meaning of a test-retest reliability coefficient rtt is to use it to calculate the standard error of measurement, as follows:

        e = st x sqrt(1 - rtt)      (Thorndike & Hagen, 1977, p. 85; Anastasi, 1976, p. 128)

where:

e = standard error of measurement
st = standard deviation of test scores, and
rtt = test-retest reliability coefficient.

As shown in Table 1, in our test-retest sample, we found the standard deviations of reading and writing test scores to be about 15 and 11 points respectively. However, these observed standard deviations were based on the restricted sample of retest examinees (with only 30% of April examinees having to retake the reading test and 40% the writing test), so we need to find a way of estimating the standard deviations of MTT test scores for the entire population of test takers.
        As we have pointed out, even after the MTT have been administered four times, over a period of a year, no technical report on these new tests has been issued. Hence, we must rely on data shared with us by cooperating institutions to estimate the standard deviations of MTT reading and writing test scores among the entire population of examinees. We have available two different avenues for pursuing this end: using theoretical adjustments of data on our test-retest sample, and using data institutions shared with us on all their students who took the MTT in April and July.



        Using the theoretical approach (and the data shown in Table 3 above), we can multiply the restricted sample standard deviations by 1.93 and 1.79 to estimate the standard deviations in the full population of April examinees. Since 15 x 1.93 = 28.95, and 11 x 1.79 = 19.69, we may use these figures as one set of estimates of the standard deviations of the MTT reading and writing tests. A second approach is to examine the standard deviations of the April tests for the institutions which gave us data on all of their April test-takers. We found the within-institution standard deviations to be as high as 19 points for the April reading test and 16 for the April writing test.
        Hence, as summary estimates of the standard deviations of the April tests, we averaged these two estimates, which yielded 24 [(29 + 19)/2] and 18 [(20 + 16)/2] as ballpark estimates of the standard deviations of the April tests for the full population of test takers. We then estimate the standard error of measurement for the MTT reading and writing tests as follows:


        Even if we use the more conservative estimations of test-retest correlations, based on the trimmed samples (that is, with outliers deleted) and adjusting for attenuation due to restriction of range, namely 0.74 and 0.72 for the MTT reading and writing tests respectively, these would still imply standard errors of measurement of 12.2 and 9.5. In other words, our results suggest that the standard errors of measurement in the April MTT Reading and Writing tests were about 17 and 11 points respectively (or at best 12 and 9). While neither the Massachusetts DOE nor NES has yet released any technical information on the scaling of the MTT, we have found MTT scores to range from near zero to almost 100. If indeed the scores for the MTT are on a 100 point scale, this means that standard errors of measurement of 9 and 17 points represent some 9% to 17% of the entire score range. This means that examinees scoring in the range of 50 to 69 may easily have "failed" the MTT simply because of measurement error, and, conversely, ones scoring in the range of 70 to 90 may well have "passed" simply because of the large degree of measurement error in the MTT tests.
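
The standard errors quoted above can be reproduced, to rounding, from the usual formula e = st x sqrt(1 - rtt). The sketch below uses the ballpark standard deviations of 24 and 18 from the text. Pairing the figures of about 17 and 11 with the corrected correlations 0.519 and 0.616 from Table 3 is an assumption made here for illustration, not something the text states; the function name is likewise illustrative.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """e = st x sqrt(1 - rtt): SD of test scores times sqrt(1 - reliability)."""
    return sd * sqrt(1.0 - reliability)


# Ballpark SDs from the text: average of the theoretical and
# within-institution estimates for the full population.
reading_sd = (29 + 19) / 2  # 24.0
writing_sd = (20 + 16) / 2  # 18.0

# Trimmed-sample, range-corrected reliabilities reported in the text
print(round(standard_error_of_measurement(reading_sd, 0.74), 1))  # 12.2
print(round(standard_error_of_measurement(writing_sd, 0.72), 1))  # 9.5

# ASSUMPTION: untrimmed reliabilities taken from Table 3 (0.519, 0.616)
print(round(standard_error_of_measurement(reading_sd, 0.519)))  # 17
print(round(standard_error_of_measurement(writing_sd, 0.616)))  # 11
```

Because the standard error grows with the square root of 1 - rtt, even the more favorable reliabilities of 0.74 and 0.72 leave it at roughly half of the score standard deviation.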
        These errors of measurement on the MTT may be compared with the standard error of measurement on well-known tests for which technical documentation is available. The SAT (originally, the Scholastic Aptitude Test, briefly renamed the Scholastic Assessment Test, and now just the SAT) is reported on a scale that ranges from 200 to 800, or 600 points. The standard errors of measurement of the SAT verbal and quantitative scores have been reported to be in the range of 29-34 points (Donlon, 1984, pp. 33-34), or 4.4 to 5.7% of the score range. The standard errors of measurement for the Graduate Record Examination have been reported to be 33, 38 and 36 points for the GRE Verbal, Quantitative and Analytical subtests respectively (Conrad, Trismen & Miller, 1977, p. 19). Since these scores are reported on scales ranging from 600 to 670 points, these standard errors of measurement are all less than 6% of scale range. If the standard errors of measurement of MTT scores are 9 to 17 points, as we have estimated, representing 9% to 17% of the MTT score range, this means that MTT scores have almost double to triple the degree of error of the SAT and the GRE (as estimated by SEM relative to scaled score range).
        The reliability of classifications based on the MTT


        This brings us to a point mentioned in the introduction of this report. As the 1985 Standards for Educational and Psychological Testing point out, for licensure or certification tests on which people are rated as passing or failing, it is important to provide data not just on test scores, but also on the reliability of classification decisions based on those scores (Standard 11.3, p. 65). As Ron Hambleton has pointed out, "there were serious problems with the setting of passing scores on the reading literacy, writing and subject matter [MTT] tests":

  1. A detailed description of what it means to be qualified was not developed;
  2. Panelists who set passing scores did not have an opportunity to discuss their recommendations with each other prior to finalizing their recommendations to the Board;
  3. Technical data arising from the process of setting passing scores was not presented to the Board for their consideration. (Hambleton, 1999, pp. 20-21)

        Nonetheless, data available from the April and July administrations of the MTT allow us to examine the reliability of pass/fail classifications based on the MTT reading and writing tests. In a report entitled "Massachusetts Teacher Tests. Summary of Institution results for Second-Time Test Takers. Test Date: July 11, 1998. Test Summary," the Massachusetts DOE released data providing direct evidence on this point. The report lists the number, and percent passing, of examinees who retook the MTT on July 11, 1998. Results are reported only for institutions that had more than four candidates. Hence this table showed reading test results for only 18 institutions and writing test results for 23 institutions. These data are shown in Table 4 below.

 

Table 4: Passing Rates of Second-Time Test Takers on July 11, 1998, MTT Tests

Institution                         Reading N  Reading % pass  Writing N  Writing % pass
American International College             10            70.0         13            30.8
Anna Maria College                          5            20.0          5            40.0
Assumption College                         --              --          6            50.0
Boston College                              7            85.7          5            80.0
Bridgewater State College                  41            58.5         59            55.2
Curry College                               6            33.3          6            16.7
Fitchburg State College                    25            56.0         31            45.7
Framingham State College                   16            31.3         19            52.6
Lesley College                             12            66.7         14            64.3
Mass. College of Liberal Arts               6            33.3         10            50.0
Merrimack College                           5            40.0          9            66.7
Salem State College                        17            76.5         26            76.9
Simmons College                            --              --          6           100.0
Springfield College                        22            36.4         27            48.1
Stonehill College                          10            90.0         13            53.8
Univ. of Massachusetts/ Amherst            16            68.8         20            35.0
Univ. of Massachusetts/ Boston             --              --          6            50.0
Univ. of Massachusetts/ Dartmouth          --              --          6            50.0
Univ. of Massachusetts/ Lowell              5            60.0         13            76.9
Westfield State College                    27            48.1         32            34.4
Wheelock College                            6            50.0         12            50.0
Worcester State College                    11            72.7         15            60.0
Unaffiliated candidates                    35            60.0         47            44.7
Mean                                       --            55.6         --            53.6
Median                                     --            58.5         --            50.0

Note: -- indicates results not reported or not applicable.

Source: Adapted from Mass DOE, "Massachusetts Teacher Tests. Summary of Institution results for Second-Time Test Takers. Test Date: July 11, 1998. Test Summary."; (Available at http://www.doe.mass.edu/teachertest/)

 


        As the data in Table 4 indicate, the mean pass rate (unweighted average across institutions for which data were reported) among second-time MTT test-takers was over 50% on both the reading and writing tests. Though we do not show weighted results in Table 4, these data indicate that 160 of 282 or 57% of examinees taking the reading test for the second time passed, and 207 of 400 or 52% of those taking the writing test passed. This indicates that the misclassification rate among those who "failed" the April tests was over 50% on both the reading test and the writing test. This seems extraordinarily high given that adults' basic skills in reading and writing are unlikely to change much over a three-month period (and, as previously mentioned, candidates could not cram for the July test). Note, too, that the misclassification rate was higher on the reading test than on the writing test, exactly what would be predicted from the results of our reliability analysis, which showed the reading test to be less reliable than the writing test.
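The weighted and unweighted figures cited above can be recovered from the Table 4 entries alone. The following is a minimal sketch, not the authors' own procedure: it assumes that each institution's pass count can be reconstructed by rounding N times the reported percentage to the nearest whole number, and the names TABLE_4 and summarize are hypothetical.

```python
from statistics import mean, median

# (institution, reading N, reading % pass, writing N, writing % pass)
# None marks results that the DOE did not report.
TABLE_4 = [
    ("American International College", 10, 70.0, 13, 30.8),
    ("Anna Maria College", 5, 20.0, 5, 40.0),
    ("Assumption College", None, None, 6, 50.0),
    ("Boston College", 7, 85.7, 5, 80.0),
    ("Bridgewater State College", 41, 58.5, 59, 55.2),
    ("Curry College", 6, 33.3, 6, 16.7),
    ("Fitchburg State College", 25, 56.0, 31, 45.7),
    ("Framingham State College", 16, 31.3, 19, 52.6),
    ("Lesley College", 12, 66.7, 14, 64.3),
    ("Mass. College of Liberal Arts", 6, 33.3, 10, 50.0),
    ("Merrimack College", 5, 40.0, 9, 66.7),
    ("Salem State College", 17, 76.5, 26, 76.9),
    ("Simmons College", None, None, 6, 100.0),
    ("Springfield College", 22, 36.4, 27, 48.1),
    ("Stonehill College", 10, 90.0, 13, 53.8),
    ("Univ. of Massachusetts/ Amherst", 16, 68.8, 20, 35.0),
    ("Univ. of Massachusetts/ Boston", None, None, 6, 50.0),
    ("Univ. of Massachusetts/ Dartmouth", None, None, 6, 50.0),
    ("Univ. of Massachusetts/ Lowell", 5, 60.0, 13, 76.9),
    ("Westfield State College", 27, 48.1, 32, 34.4),
    ("Wheelock College", 6, 50.0, 12, 50.0),
    ("Worcester State College", 11, 72.7, 15, 60.0),
    ("Unaffiliated candidates", 35, 60.0, 47, 44.7),
]


def summarize(rows, n_idx, pct_idx):
    """Summarize one test (reading or writing) across the rows that report it."""
    reported = [(r[n_idx], r[pct_idx]) for r in rows if r[n_idx] is not None]
    total = sum(n for n, _ in reported)
    # Assumption: pass counts are whole numbers recoverable by rounding.
    passed = sum(round(n * pct / 100) for n, pct in reported)
    pcts = [pct for _, pct in reported]
    return {
        "total": total,
        "passed": passed,
        "weighted_pct": 100 * passed / total,
        "mean_pct": mean(pcts),      # unweighted mean across rows
        "median_pct": median(pcts),  # unweighted median across rows
    }


reading = summarize(TABLE_4, 1, 2)
writing = summarize(TABLE_4, 3, 4)
print(reading)
print(writing)
```

The weighted rate counts each examinee once, whereas the unweighted mean counts each row once, so the two need not agree when institutions differ greatly in size.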



        What these results do not show, of course, is the rate of misclassification of those who passed the MTT tests. Nonetheless, it seems certain, given the results of our analyses, that a substantial proportion of those who "passed" the MTT reading and writing tests by 10 to 20 points did so simply because of test unreliability. People who scored above 69 on the MTT reading and writing tests, and thus "passed" these tests in April, did not have to retake them in July. Hence we have no direct way of estimating the false "pass" rate. But to get a rough idea of this kind of misclassification we examined the percentage of April examinees scoring in the 50-69 point range whose scores decreased. We found that some 4% to 10% of first-time test-takers in these score ranges had decreased scores upon retest. Thus, a very conservative estimate of the percentage of April examinees who "passed" simply because of measurement error would be 2% to 5%.

Table 5: Passing Rates of Second-Time Test Takers Reported by DOE Compared with Retest Sample

                            Reported by Mass DOE                      Ad Hoc Retest Sample
Institution                 Rdg N  Rdg % pass  Wrtg N  Wrtg % pass    Rdg N  Rdg % pass  Wrtg N  Wrtg % pass
Boston College                  7        85.7       5         80.0        6        83.3       5         80.0
Bridgewater State College      41        58.5      59         55.2       42        59.5      60         56.7
Framingham State College       16        31.3      19         52.6       16        31.3      19         52.6
Lesley College                 12        66.7      14         64.3       12        66.7      14         64.3
Salem State College            17        76.5      26         76.9       21        66.7      34         79.4
U. Mass/Boston                 --          --       6         50.0       --          --       6         50.0
Westfield State College        27        48.1      32         34.4       27        48.1      32         34.4

Note: -- indicates results not reported.

Source: Adapted from Mass DOE, "Massachusetts Teacher Tests. Summary of Institution results for Second-Time Test Takers. Test Date: July 11, 1998. Test Summary". (Available at http://www.doe.mass.edu/teachertest/)

 


        The data summarized in Table 4 also allowed us to check the findings from our test-retest sample against passing rates reported by the DOE, as summarized in Table 5. This table presents the passing rates reported by the DOE alongside those apparent in our test-retest sample. Note first that this table shows no results for Elms College; the DOE did not report any results for this institution because it had fewer than five second-time test takers. For four of the remaining seven institutions, the sample sizes (Ns) and passing rates reported by the DOE are exactly the same as those in our test-retest sample. For the remaining three institutions there are slight differences between results reported by the DOE and those apparent in our test-retest sample. For Boston College, the DOE reported seven second-time takers for the reading test, whereas we counted only six in our test-retest sample. We have examined the data for Boston College in detail and suspect that this discrepancy arises from an unusual case in which one student took the MTT in April, and had an April writing test score reported, but had no April reading test score reported in results transmitted to Boston College. Thus, apparently this individual was counted in the DOE results as a second-time test-taker, but was not included in our test-retest sample because no reading test score was reported for April. The other two institutions for which there are slight discrepancies are Bridgewater and Salem. In both cases, the Ns for the test-retest sample are slightly higher than the Ns reported in DOE results. These differences apparently derive from the fact that the DOE results are reported only for individuals whose institutional affiliation was verified by the institution by a particular date. The reason for the slightly larger Ns for Bridgewater and Salem is that the data provided to us apparently included a small number of cases that were treated as unaffiliated examinees by the DOE.
        Indeed, the DOE's policy of institutional affiliation of MTT test takers seems to be of doubtful merit and of changing meaning. In reporting institutional results for the April, July, and October administrations of the MTT, the DOE has offered a number of slightly different "Interpretive cautions and notes." But in each instance, the first has read as follows:
1. Information regarding candidate institutional affiliation was obtained from candidates as self-reported information on the registration form during the test registration process. This information was forwarded to institutions of higher education, which were provided with an opportunity to verify the candidates' institutional affiliation. The institutions were informed that if they did not respond to the verification request as explained, the data to be included in their results would be based on candidate-reported institutional affiliation.

        The institutional results for the April administration of the MTT were released under a memo from David Driscoll dated July 21, 1998. Together with institutional results for the July and October 1998 administrations, they are also available on the DOE web site. When we examined results for April and July, we noticed that there was a sharp increase in the number of "unaffiliated" candidates, that is, ones whose affiliation was with institutions outside Massachusetts or was not verified by the institutions concerned. Hence, we used the data available from the DOE web site to calculate the percentages of test-takers at each administration who were listed as "unaffiliated." As the results in Table 6 show, between April and July there was a fourfold increase in the percentage of test-takers listed as unaffiliated.

Table 6: First-time Test-Takers Listed as Unaffiliated

MTT Test   April                 July                  October
Reading    227 / 1794 = 12.7%    891 / 1702 = 52.4%    778 / 1533 = 50.8%
Writing    227 / 1808 = 12.5%    898 / 1707 = 52.6%    783 / 1544 = 50.7%

Sources:
http://www.doe.mass.edu/teachertest/7981st.html
http://www.doe.mass.edu/teachertest/summary498.html_
http://www.doe.mass.edu/teachertest/1098inst/1test.html
(Data summarized in Table 6 were downloaded 1/5/99)
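The size of the April-to-July jump can be checked directly from the counts in Table 6. This is a minimal sketch using only those counts; the names TABLE_6, share, and fold_increase are hypothetical.

```python
# (unaffiliated first-time test-takers, all first-time test-takers), from Table 6
TABLE_6 = {
    "Reading": {"April": (227, 1794), "July": (891, 1702), "October": (778, 1533)},
    "Writing": {"April": (227, 1808), "July": (898, 1707), "October": (783, 1544)},
}


def share(counts):
    """Proportion of first-time test-takers listed as unaffiliated."""
    unaffiliated, total = counts
    return unaffiliated / total


def fold_increase(test, start="April", end="July"):
    """Ratio of the unaffiliated share at `end` to the share at `start`."""
    return share(TABLE_6[test][end]) / share(TABLE_6[test][start])


for test in TABLE_6:
    print(test, f"{fold_increase(test):.1f}x increase from April to July")
```

Comparing shares rather than raw counts matters here because the total number of first-time test-takers also changed between administrations.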


        One possible explanation for this sharp increase is that students enrolled in out-of-state colleges were more readily able to come to Massachusetts to take the MTT in July than they were in April. This is not confirmed, however, by the fact that the proportion of first-time test-takers listed as unaffiliated remained very high, more than 50%, in the October administration. Thus, what appears to have happened is that Massachusetts institutions of higher education with teacher preparation programs changed the manner in which they verified the affiliation of students after the first administration of the MTT. Indeed, on September 26, 1998, an article in the Boston Globe, "BU test to screen teacher hopefuls, disclaim failures," reported that Boston University had instituted a policy of not verifying students' affiliation with BU unless they had passed a literacy screening test before taking the MTT (Zernike, 1998).
        What the BU policy and the fourfold increase in the percentages of unaffiliated candidates indicate, however, is that even if the MTT test scores were reliable and valid, the DOE's practice of publishing "institutional" results may be highly misleading unless some better and uniform method of "verifying" candidates' affiliation with institutions is developed. And even if that problem were solved, the ranking of schools based on student test scores is of doubtful merit.
        Ranking schools and school districts (and even states and countries) on student test results seems to be increasingly popular with the media in recent years. This is both unfair and ineffective in improving education. It is unfair for the simple reason that judging the effectiveness of educational institutions should be based not on end-of-school test scores, but on "value-added" as a result of experience in the school. If a value-added perspective is not adopted, then highly selective institutions (such as Harvard, Boston College, and Boston University among teacher preparation institutions in Massachusetts) may come out looking good in such rankings, simply because they only admit students who are good test-takers to begin with, not because of how much students learn while attending them.
        And ranking of schools based on student test scores can be both unfair and ineffective unless attention is paid to not just test scores as "outcomes," but also to the educational processes that produced those outcomes. In recent years, for example, numerous cases have been revealed in which schools cheated on tests in order to make their rankings look better. And even absent such manipulation of test scores, lack of attention to processes provides little leverage for improvement (Haney & Raczek, 1993).

        The content validity of the Massachusetts Teacher Tests


        Validity refers to the appropriateness and meaningfulness of inferences drawn from test scores. As explained previously, three types of evidence have been recognized in this connection: content-, criterion- and construct-related validity evidence. The Ad Hoc Committee originally intended to gather one form of criterion-related evidence, namely concurrent validity evidence that would compare scores on the MTT with those on well-established tests for college graduates. Having failed to gather enough comparable scores to allow reasonable statistical analysis, we undertook the reliability studies described above. Our findings that the MTT reading and writing tests are unreliable and have led to remarkably high rates of misclassification obviously cast doubt on their validity, but it is useful also to comment directly on the content and construct validity of the MTT tests.
        In general, content validity refers to whether test questions cover the right material, that is, some defined domain of content. For licensure and certification tests, this translates into whether test questions clearly and correctly span the domain of knowledge necessary to protect the public from people who are not competent. Since competence in most professional fields is hard to define and measure precisely, content validation studies of licensure and certification tests are usually based on the expert judgment of people in the field being tested. In the case of the MTT the relevant fields were those of education and teaching. Typically in content validation studies, test developers ask practitioners to judge whether test questions are job-related and whether they match a particular content domain (often defined in terms of test objectives).
        Several Massachusetts officials have said publicly that such content validation studies have been done by panels of educators across the state in the development of the MTT. However, no relevant reports or documentation on these reviews have yet been released, even though the MTT have now been administered four times.
        At the same time, a particular portion of the MTT gives us pause about the way in which the content validation and job-relatedness studies have been used in the development of the new MTT. On the first MTT, administered in April 1998, as part of the writing test, examinees were asked to transcribe a 156-word text drawn from the Federalist Papers (written by James Madison in 1787) as the text was read three times by a narrator on audiotape. (According to an August 5 story in the Boston Globe, the dictation exercise was suggested by Massachusetts Board of Education members Edward Delattre and John Silber; Hart, 8/5/98.)
        It seems to us highly implausible that such an exercise would be judged a valid and job-related measure of writing competence by a majority of panelists reviewing content validity. Also, though we have reviewed more than 50 years of teacher competency testing in the United States (the NTE, for example, was created in 1940; Haney, Madaus and Kreitzer, 1987), we have found no other instance in which a dictation exercise has been used as a measure of teacher competence.
        How then could such an unusual exercise have shown up on the MTT? We cannot be sure. But it is worth noting that in the Alabama teacher testing case referred to previously (which is reproduced in part in appendix 2), Judge Myron Thompson found that on the Alabama Initial Certification Test "a significant number of items appearing on the examinations failed to reflect accurately the collective judgment of curriculum committee members. In some cases changes to actual test items were not implemented. In other cases, items that had never been reviewed by a curriculum committee appeared on examinations. . . . [Also,] many items appeared on the examinations even after they had been rated content invalid by the requisite number of Alabama panelists" (Richardson v. Lamar County Bd. of Educ. 729 F. Supp 806, 821-822). It may be recalled that the developer of the Alabama test is the same company that developed the MTT.

        The construct validity of the Massachusetts Teacher Tests


        A third and more general form of validity evidence of the meaning of test scores relates to the "constructs" that the scores represent. As the 1985 Standards point out, "Substantial relationships of a test to other measures that are purportedly of the same construct and the weaknesses of relationships to measures that are purportedly of different constructs support the identification of constructs and the differences among them" (AERA, APA & NCME, 1985, p. 10).
        In carrying out test-retest analyses, we were surprised to find that in our sample of people who took the MTT in both April and July, there was a correlation of less than 0.10 between MTT reading and writing test scores in both April and July (see Table 2). Since reading and writing are both verbal or literacy skills, we would expect to find substantial correlations between test scores of these related constructs. But, as we have noted, the test-retest sample was restricted (with one odd exception) to people who had failed either the reading or writing test in April. Thus the group of repeat test-takers represents a highly restricted or attenuated sample of MTT test-takers in general.
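Why restricting a sample to examinees who failed at least one test should depress the reading-writing correlation can be illustrated with a small simulation. The sketch below uses simulated data only, not MTT scores: the underlying correlation of 0.5, the cut at the middle of the score distribution, the sample size, and the seed are all hypothetical choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(rho=0.5, cut=0.0, n=20000, seed=1):
    """Return (r for everyone, r for those below the cut on either score)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # y is built to correlate rho with x in the full population.
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    # Keep only simulated examinees who "failed" at least one of the two tests.
    failed = [(x, y) for x, y in pairs if x < cut or y < cut]
    return pearson(*zip(*pairs)), pearson(*zip(*failed))


r_all, r_failed = simulate()
print(f"all simulated examinees:       r = {r_all:.2f}")
print(f"failed at least one test only: r = {r_failed:.2f}")
```

Dropping everyone who passed both tests removes exactly the cases in which high reading scores accompany high writing scores, so the correlation in the remaining group is much weaker than in the full group.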
        To examine the relationship between MTT reading and writing scores on less restricted groups of test-takers, we returned to the data obtained for analyzing test-retest reliability. Several institutions had provided us with data on all of their students who had taken the MTT in April and in July. These data allowed us to examine the MTT Reading x MTT Writing correlations on larger samples than when we confined ourselves to individuals who took the MTT in both April and July. Table 7 presents results of these analyses.

Table 7: Correlations of MTT Reading and Writing Scores

Institution        April          July
Boston College     0.42 (111)     0.20 (44)
Lesley College     0.56 (62)      0.65 (60)
Westfield State    0.57 (101)     0.50 (107)
Total N            274            211
Median r           0.56           0.50

Note: Sample sizes shown in parentheses

 


        Note, first, that the correlations between MTT Reading and MTT Writing scores vary somewhat: from 0.42 to 0.57 for April and from 0.20 to 0.65 for July. Part of this is due surely to sample size. For example, the most anomalous correlation in Table 7 is for Boston College for July test results (correlation of 0.20). Note, however, that for this sample, there was an N of only 44. If we consider only those cases in which N>100, we see a much more consistent pattern, with MTT Reading x MTT Writing correlations of 0.42, 0.50, and 0.57. This suggests that the average correlation between MTT Reading and MTT Writing test scores is about 0.50.
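One way to check the figure of about 0.50 is to pool the Table 7 correlations with a sample-size-weighted average on Fisher's z scale. This is a standard pooling technique offered here as an illustration; the text does not say how the authors arrived at their average, and the names TABLE_7 and fisher_weighted_mean_r are hypothetical.

```python
import math


def fisher_weighted_mean_r(samples):
    """Pool correlations given as (r, n) pairs.

    Each r is transformed to z = atanh(r), weighted by n - 3,
    and the weighted mean is transformed back with tanh.
    """
    num = sum((n - 3) * math.atanh(r) for r, n in samples)
    den = sum(n - 3 for _, n in samples)
    return math.tanh(num / den)


# Table 7 entries as (correlation, sample size)
TABLE_7 = {
    ("Boston College", "April"): (0.42, 111),
    ("Boston College", "July"): (0.20, 44),
    ("Lesley College", "April"): (0.56, 62),
    ("Lesley College", "July"): (0.65, 60),
    ("Westfield State", "April"): (0.57, 101),
    ("Westfield State", "July"): (0.50, 107),
}

# Restrict to the samples with N > 100, as in the text.
large = [(r, n) for r, n in TABLE_7.values() if n > 100]
avg_large = fisher_weighted_mean_r(large)
avg_all = fisher_weighted_mean_r(list(TABLE_7.values()))

print(f"pooled r, samples with N > 100: {avg_large:.2f}")
print(f"pooled r, all six samples:      {avg_all:.2f}")
```

Averaging on the z scale avoids the slight bias that comes from averaging raw correlations, and weighting by n - 3 gives larger samples more influence.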
        This finding may be compared with previous research on the intercorrelations between measures of two verbal skills. Cronbach (1970), for example, reports that the Verbal and Spelling subtest scores on the General Aptitude Test Battery (GATB) correlate in the range of 0.66 to 0.72. Donlon reports that Test of Standard Written English (TSWE) scores correlate with SAT Verbal scores in the range of 0.76 to 0.80 and with SAT Reading scores in the range of 0.72 to 0.77 (Donlon, 1984, p. 81). Similarly, Conrad, Trismen and Miller report that GRE Verbal and GRE Analytical scores for the same individuals correlate in the range of 0.76 to 0.77 (Conrad, Trismen & Miller, 1977, p. 19). Indeed, even SAT Verbal and SAT Mathematical scores have been found to correlate in the range of 0.64 to 0.72 (Donlon, 1984, p. 81).
        These comparisons cast considerable doubt on the construct validity of the MTT Reading and Writing test scores, which correlate only in the range of 0.42 to 0.57, with an average correlation of about 0.50.

          Summary


        In sum, our results indicate that the MTT Reading and Writing test scores are unreliable and of doubtful validity. Specifically, we found that the scores:

  • Are unreliable as indicated by our calculations of test-retest reliability (in the range of 0.50 to 0.70);
  • Contain almost two to three times the degree of error of well-developed tests (with an error of measurement in the range of 9 to 17 points);
  • Have high rates of misclassification (as indicated by the fact that among those who "failed" either the MTT Reading or Writing test in April, more than 50% "passed" that test in July);
  • Are of questionable content validity and doubtful construct validity, as indicated by the low correlation (about 0.50) between reading and writing test scores.

        Why the MTT Reading and Writing tests are so unusually unreliable and of such doubtful validity is all the more mysterious because the skills of reading and writing are ones for which many reliable and valid tests have been developed over many decades. There are many possible causes for the low reliability and apparently poor validity of the MTT tests. The problems may arise from test content, administration, scoring, scaling, equating or some combination of these factors. Fortunately, another aspect of inquiry by the Ad Hoc Committee offers insight into why these scores are of such low reliability and apparently poor validity.
