Education Policy Analysis Archives

Volume 8 Number 41

The Texas Miracle in Education

Walt Haney

4. Problems with TAAS

          Two years ago when I agreed to help MALDEF on the TAAS case, I had no way of foreseeing the extent to which education reform in Texas would come to be touted as a model to be emulated elsewhere. Nonetheless, as I studied what had been happening with TAAS in Texas, I quickly came to think that it was not a model worth emulating. Before summarizing what I think is wrong with TAAS and how it is being misused in Texas, I should mention that some of what I recount in the remainder of this article is based on two unpublished reports that I prepared in connection with the TAAS case: a preliminary report in December 1998 and a supplementary report in July 1999 (Haney, 1998; 1999). However, it also draws on additional evidence acquired and analyses undertaken since completion of the supplementary report in summer 1999.
          The problems with TAAS and the way it is being used in Texas may be summarized under five sub-headings: 1) the TAAS is having a continuing adverse impact on Black and Hispanic students; 2) the use of the TAAS test in isolation to control award of high school diplomas is contrary to professional standards concerning test use; 3) the passing score on TAAS is arbitrary and discriminatory; 4) a variety of evidence casts doubt on the validity of TAAS scores; and 5) more appropriate use of test results would have more validity and less adverse impact.

4.1 Adverse impact

          In previous research and law, three standards have been recognized for determining whether observed differences constitute discriminatory disparate impact: 1) the 80 percent (or four-fifths) rule; 2) tests of the statistical significance of observed differences; and 3) evaluation of the practical significance of differences. The "80 percent" or four-fifths rule refers to a provision of the 1978 Uniform Guidelines on Employee Selection Procedures (43 F.R. No. 166, 38290-38296, 1978) which reads:
Sec. 6D. Adverse impact and the "four-fifths rule." A selection rate for any race, sex or ethnic group which is less than four-fifths (or eighty percent) of the rate for the group with the highest rate will be generally regarded by Federal enforcement agencies as evidence of adverse impact, while a greater than four-fifths rate will generally not be regarded by Federal enforcement agencies as evidence of adverse impact. (As quoted in Fienberg, 1989, p. 91).
          As a result of its standing in federal regulations, the 80 percent rule as a test of adverse or disparate impact has been widely recognized. Nonetheless, simple differences in percentage rates have some undesirable properties. The simple difference, for example, "is inevitably small when the two percentages are close to zero" (David H. Kaye and David A. Freedman, Reference guide on statistics, Federal Judicial Center, 1994). Hence, most observers and considerable case law now hold that in assessing disparate impact, it is important to apply not just the 80% or four-fifths rule but also to consider the practical and statistical significance of differences in selection or pass rates (Fienberg, 1989; Kaye & Freedman, 1994; see also, Office of Civil Rights, 1999). In previous reports regarding the TAAS case (Haney, 1998; 1999), I applied these three tests of adverse impact to a variety of TAAS results. However, for economy of presentation here, I provide only illustrative results.
          Eighty Percent or Four-Fifths Rule. To apply this test of adverse impact, we simply multiply the pass rates on TAAS for White students by 80% and check to see whether the pass rates for Blacks and Hispanics fall below these levels. Table 4.1 presents the application of the 80% rule to the TAAS results previously presented in Table 3.2 above. As can be seen, even though grade 10 pass rates for all three TAAS tests for Black and Hispanic students have improved between 1994 and 1998, these pass rates still lag below 80% of the White pass rates. According to this standard of adverse impact, the TAAS grade 10 tests continue to show adverse impact on Black and Hispanic students. (Note 5)

Table 4.1
Eighty Percent Rule and TAAS Grade 10 Pass Rates: Percent Passing All Tests by Race 1994-1998 All Students Not in Special Education
(Does Not Include Year-Round Education Results)

              1994    1995    1996    1997    1998
White          67%     70%     74%     81%     85%
White*80%    53.6%   56.0%   59.2%   64.8%   68.0%
Black          29%     32%     38%     48%     55%
Hispanic       35%     37%     44%     52%     59%
Source: Selected State AEIS Data: A Multi-Year History
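
The four-fifths check applied in Table 4.1 can be sketched in a few lines of Python. This is an illustration only, not part of the original analysis: the function name `four_fifths_check` and the dictionary layout are assumptions, the pass rates are transcribed from Table 4.1, and White students are used as the reference group because they have the highest pass rate in every year shown.

```python
# Percent passing all tests, transcribed from Table 4.1.
PASS_RATES = {
    "White":    {1994: 67, 1995: 70, 1996: 74, 1997: 81, 1998: 85},
    "Black":    {1994: 29, 1995: 32, 1996: 38, 1997: 48, 1998: 55},
    "Hispanic": {1994: 35, 1995: 37, 1996: 44, 1997: 52, 1998: 59},
}


def four_fifths_check(rates, reference="White", ratio=0.80):
    """Flag each (group, year) whose pass rate falls below ratio * reference rate.

    `reference` is assumed to be the group with the highest pass rate.
    Returns {group: {year: True if below the threshold}}.
    """
    flagged = {}
    for group, by_year in rates.items():
        if group == reference:
            continue
        flagged[group] = {
            year: rate < ratio * rates[reference][year]
            for year, rate in by_year.items()
        }
    return flagged


result = four_fifths_check(PASS_RATES)
for group, by_year in result.items():
    print(group, by_year)
```

With the Table 4.1 figures, every Black and Hispanic entry from 1994 through 1998 is flagged as falling below 80% of the White pass rate.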

          Statistical Significance of Differences in Pass Rates. As mentioned, comparisons of simple percentages passing have some weaknesses from a statistical point of view. For example, differences in pass rates, particularly if small numbers of examinees are involved, may result from random variation in the particular sample of candidates who take an examination in a particular year. To check against this possibility, a second kind of standard for evaluating discriminatory disparate impact is generally employed; namely, a test of the statistical significance of observed differences. A test of statistical significance is used to assess the probability that a particular outcome (such as differences in proportions passing a test) might have occurred simply by chance or random sampling.
          The obvious statistical significance test to apply in a case such as that of proportions of candidates passing the TAAS is the test of the difference in proportions of two populations. As explained in most statistics textbooks, such as Paul Hoel's Introduction to mathematical statistics (1971, pp. 134-137), if p1 and p2 refer to the proportions of successes in two samples, q1 and q2 refer to the proportions of failures in the two samples, and n1 and n2 refer to the sizes of the samples, the standard error of the difference in proportions is calculated as follows:

$$SE_{diff} = \left( \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2} \right)^{1/2}$$

          Using this formula we may calculate the standard error of the difference in proportions for each comparison we wish to make and then divide the standard error of the difference into the observed difference to calculate the number of standard errors equivalent to the observed difference. Table 4.2 shows the results of such calculations for the Spring 1998 TAAS results.

Table 4.2
Statistical Significance of Differences in 1998 Grade 10 Pass Rates

                       TAAS Reading           TAAS Math              TAAS Writing
                    No. Tested  % Pass    No. Tested  % Pass    No. Tested  % Pass
Black                    26790     81%         27434     61%         26717     84%
Hispanic                 70666     79%         71747     67%         70481     82%
White                   108887     95%        109595     88%        108935     96%
Source: TAAS Summary Report—Test Performance All Students Not In Special Ed. Grade 10—Exit Level Report Date April 98 Date of Testing: March 1998 (www.tea.state.tx.us/student.assessment/results/summary/sum98/gxen98.htm)
White-Black Differences
SE of difference        0.0025                0.0031                0.0023
Obs'd Difference                   14%                   27%                   12%
Obs'd Diff/SE                   56.312                86.982                51.721
White-Hispanic Differences
SE of difference        0.0017                0.0020                0.0016
Obs'd Difference                   16%                   21%                   14%
Obs'd Diff/SE                   95.894                104.41                89.503

          As can be seen from Table 4.2, the differences in pass rates for both White-Black and White-Hispanic comparisons are easily statistically significant, with observed differences equivalent to some fifty to over 100 standard errors. (Other statistical tests on TAAS results also yield results of this magnitude; see Haney, 1998; 1999).
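
For readers who wish to check the Table 4.2 figures, the calculation can be sketched as follows. This is an illustrative sketch, not the original computation: the names `RESULTS_1998`, `se_diff` and `diff_in_se_units` are assumptions, the counts and pass rates are transcribed from Table 4.2, and q is taken as 1 - p.

```python
from math import sqrt

# (number tested, proportion passing), transcribed from Table 4.2.
RESULTS_1998 = {
    "Reading": {"White": (108887, 0.95), "Black": (26790, 0.81), "Hispanic": (70666, 0.79)},
    "Math":    {"White": (109595, 0.88), "Black": (27434, 0.61), "Hispanic": (71747, 0.67)},
    "Writing": {"White": (108935, 0.96), "Black": (26717, 0.84), "Hispanic": (70481, 0.82)},
}


def se_diff(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def diff_in_se_units(test, group, reference="White"):
    """Observed difference in pass rates divided by its standard error."""
    n1, p1 = RESULTS_1998[test][reference]
    n2, p2 = RESULTS_1998[test][group]
    return (p1 - p2) / se_diff(p1, n1, p2, n2)


for test in RESULTS_1998:
    for group in ("Black", "Hispanic"):
        print(f"{test:8s} White-{group:8s} {diff_in_se_units(test, group):8.3f}")
```

Run on the Table 4.2 inputs, the ratios agree with the tabled values to within rounding, ranging from roughly 52 to 104 standard errors.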
          Practical significance of observed differences. What of the practical significance of the observed differences in the 1998 grade 10 TAAS pass rates? Later in this report, I discuss the apparent consequences of the TAAS for grade retention and dropping out of school, but for the moment let us simply examine the numbers of students involved in the differential pass rates.
          On the TAAS writing test in 1998, 96% of White students passed, compared with 84% of Black students and 82% of Hispanic students. While these differences do not trigger the 80% rule (96%*0.80 = 76.8%, and both minority pass rates exceed that level), let us consider the numbers of students involved. Specifically, we may consider the numbers of Black and Hispanic students who would have passed the 1998 grade 10 writing test had the passing rates for Black and Hispanic students been the same as that for White students. These numbers are approximately 3,200 Black students and 9,900 Hispanic students, for a total of about 13,000 (comparable calculations show that on the TAAS math test for 1998, about 22,000 more Black and Hispanic students would have passed had their pass rates been the same as for White students). Do the differential results on the 1998 grade 10 TAAS writing test, on which approximately 13,000 more Black and Hispanic students failed than would have been the case had the Black and Hispanic pass rates been the same as that of White students, constitute practical adverse impact? Do the differential results on all of the 1998 grade 10 TAAS tests, on which close to 34,000 more Black and Hispanic students failed (10,700 Black and 23,200 Hispanic students) than would have been the case had the Black and Hispanic pass rates been the same as that for White students, constitute practical adverse impact? The answer, especially when results are also suspect under both the 80% rule and tests of statistical significance, seems clear, at least to me. A test that fails tens of thousands more minority students than would have failed had their passing rates equaled those of non-minority students surely has practical adverse impact. Hence, the validity and educational necessity of such a test deserve close scrutiny.
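
The per-test counts cited above for writing and math can be reproduced from Table 4.2 with a short sketch. It is offered only as an illustration: the function name `additional_passers` and the data layout are assumptions, and the calculation simply multiplies each group's number tested by the gap between the White pass rate and that group's pass rate.

```python
# (number tested, percent passing), transcribed from Table 4.2.
WRITING = {"White": (108935, 96), "Black": (26717, 84), "Hispanic": (70481, 82)}
MATH = {"White": (109595, 88), "Black": (27434, 61), "Hispanic": (71747, 67)}


def additional_passers(results, group, reference="White"):
    """Students in `group` who would have passed at the reference pass rate."""
    _, ref_pct = results[reference]
    n_tested, group_pct = results[group]
    return n_tested * (ref_pct - group_pct) / 100


writing_black = additional_passers(WRITING, "Black")
writing_hispanic = additional_passers(WRITING, "Hispanic")
math_total = additional_passers(MATH, "Black") + additional_passers(MATH, "Hispanic")

print(f"Writing, Black:    {writing_black:,.0f}")
print(f"Writing, Hispanic: {writing_hispanic:,.0f}")
print(f"Writing, total:    {writing_black + writing_hispanic:,.0f}")
print(f"Math, total:       {math_total:,.0f}")
```

The sketch covers only the writing and math tests, for which Table 4.2 supplies all the inputs.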
          Before turning to those issues, however, I should mention that in his opinion in the TAAS case on January 7, 2000, Judge Prado ruled that "Plaintiffs have made a prima facie showing of significant adverse impact" (p. 23, though it should be added that the opinion has a discussion of disparate impact in two places, pp. 15-17 and 20-23).

4.2 TAAS Use in Isolation Violates Professional Standards

          The use of TAAS scores in isolation to control award of high school diplomas (or for that matter use of any test results alone to make high stakes decisions about individuals or institutions) is contrary both to professional standards regarding testing and to sound professional practice.
          The standards to which I refer are the Standards for Educational and Psychological Testing published by the American Educational Research Association (AERA), the American Psychological Association (APA) and the National Council on Measurement in Education (NCME). These standards have been in existence for nearly 50 years (in current and previous editions; AERA, APA & NCME, 1985; 1999), and have been relied upon in numerous legal proceedings concerning testing in state and federal courts. (Note 6) One specific provision of these standards reads as follows:
Standard 13.7 In educational settings, a decision or characterization that will have a major impact on a student should not be made on the basis of a single test score. Other relevant information should be taken into account if it will enhance the overall validity of the decision.
          . . . It is important that in addition to test scores, other relevant information (e.g., school record, classroom observation, parent report) is taken into account by the professionals responsible for making the decision.
(AERA, APA & NCME, 1999, pp. 146-47) (Note 7)
          It seems clear that the practice in Texas of controlling award of high school diplomas on the basis of TAAS test scores in isolation without weighing other relevant information such as students' grades in high school (HSGPA) is contrary to this provision of the 1999 Standards for Educational and Psychological Testing (and the corresponding provision of the 1985 Standards).
          Witnesses for the state of Texas during the TAAS trial (Susan Phillips and William Mehrens) disputed my interpretation of this standard. Here is how Judge Prado summarized and resolved the dispute in his decision:
There was little dispute at trial over whether this standard exists and applies to the TAAS exit-level examination. What was disputed was whether the TAAS test is actually the sole criterion for graduation. As the TEA points out, in addition to passing the TAAS test, Texas students must also pass each required course by 70 percent. See Texas Admin. Code § 74.26(c). Graduation in Texas, in fact, hinges on three separate and independent criteria: the two objective criteria of attendance and success on the TAAS examination, and the arguably objective/subjective criterion of course success. However, as the Plaintiffs note, these factors are not weighed with and against each other; rather, failure to meet any single criterion results in failure to graduate. Thus, the failure to pass the exit-level exam does serve as a bar to graduation, and the exam is properly called a "high-stakes" test.
          On the other hand, students are given at least eight opportunities to pass the examination prior to their scheduled graduation date. In this regard, a single TAAS score does not serve as the sole criterion for graduation. The TEA presented persuasive evidence that the number of testing opportunities severely limits the possibility of "false negative" results and actually increases the possibility of "false positives," a fact that arguably advantages all students whose scores hover near the borderline between passing and failing. (Prado 2000, pp. 14-15)
          Nonetheless, I believe that my interpretation of this standard is more in keeping with the preponderance of professional opinion than are the narrow interpretations offered by the witnesses for the state of Texas. This may be illustrated by reference to the 1999 report from the Board on Testing and Assessment of the Commission on Behavioral and Social Sciences of the National Research Council.
          As a result of increasing controversy over high stakes testing, the U.S. Congress passed legislation in 1997 requesting that the National Academy of Sciences undertake a study and make recommendations regarding the appropriate use of tests for student grade promotion, tracking and graduation (Heubert & Hauser, 1999, p. 1). The resulting report High Stakes: Testing for Tracking, Promotion, and Graduation specifically cites Standard 8.12 of the 1985 joint standards and clearly points out that a compensatory or sliding scale approach to using test scores in combination with grades would be "more compatible with current professional standards" than using an absolute cut-off score on a test to control high school graduation (Heubert & Hauser, 1999, pp. 165-66). More generally, this National Research Council report recommends:
High stakes decisions such as tracking, promotion, and graduation should not automatically be made on the basis of a single test score but should be buttressed by other relevant information about students' knowledge and skills such as grades, teacher recommendations and extenuating circumstances. (Heubert & Hauser, 1999, p. 279) (Note 8)
          Ironically enough, reliance on TAAS scores in isolation to control award of high school diplomas in Texas is even contrary to the following passage from the TEA's own Texas Student Assessment Program Technical Digest:
All test result uses regarding individual students or groups should incorporate as much data as possible. . . . Student test scores should also be used in conjunction with other performance indicators to assist in making placement decisions, such as whether a student should take a reading improvement course, be placed in a gifted and talented program or exit a bilingual program. (pp. 2-3)
          In sum, the state of Texas's use of TAAS scores in isolation, without regard to students' high school grades, to control award of high school diplomas, is contrary not only to both professional standards regarding test use and the advice of the recent NRC report, but also to the TEA's own advice on the need to use test results in conjunction with other performance indicators.

4.3 Passing scores on TAAS Arbitrary and Discriminatory

          The problem of using TAAS scores in isolation to control award of high school diplomas is exacerbated by the fact that the passing scores set for TAAS are arbitrary and discriminatory. This is important because when a pass or cut score is set on a test, the validity of the test depends not just on test content, administration and scoring, but also on the manner in which the passing score is set.
          The 1999 Standards for Educational and Psychological Testing state:
Standard 4.19 When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be clearly documented. (AERA, APA & NCME, 1999, p. 59)
          Also, Standard 2.14 says that "Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score" (AERA, APA & NCME, 1999, p. 35). (Note 9)
          Considerable technical and professional literature has been published on alternative methods for setting passing scores on tests. Glass (1978) wrote an early critique of methods of setting passing scores that questioned the very advisability of even attempting to make this use of tests. In 1986, Ronald Berk published "A consumer's guide to setting performance standards on criterion-referenced tests" (Review of Educational Research, 56:1, 137-172) in which he reviewed 38 different methods for setting standards (or pass or cut-scores) on standardized tests. (Note 10)
          I sought to learn exactly how the passing scores were set on the TAAS in 1990 and to obtain copies of any data that were used in the process of setting passing scores on the TAAS exit test. The most complete account of the process by which the passing scores were set is provided in Appendix 9 of the Texas Student Assessment Program Technical Digest for the Academic Year 1996-1997 (TEA, 1997, pp. 337-354). Specifically contained in this appendix are 1) a memo dated July 14, 1990, from Texas Education Commissioner Kirby to members of the state Board of Education (including a summary of results from a field test of the TAAS) and 2) minutes of the State Board of Education meeting in July 1990 at which the passing scores on the grade 10 TAAS were established.
          In his memo, Commissioner Kirby recommended a passing score of 70% correct for the exit level of TAAS, but also recommended that this standard be phased in over a period of three years, with the passing score of 60% proposed for Fall 1990. After considerable discussion, the State Board voted unanimously to adopt the recommendations of the commissioner regarding the Texas Assessment of Academic Skills, specifically that: "For the Academic Skills Level, a minimum standard of 70% of the test items must be answered correctly."
          Following a statement by a Dr. Crawford about the importance of giving "notice regarding the standard required for graduation from high school . . . to those students who will be taking the exit level test" (p. 6/353), the Board also voted 11 to 3 in favor of an amendment to the original proposal to "give notice that the 1991-92 standard will be 70" (p. 7/354).
          What struck me about this record of how the passing score on the TAAS exit test was set are the following:
  1. The process was not based on any of the professionally recognized methods for setting passing standards on tests;
  2. It appears to have failed completely to take the standard error of measurement into account; and,
  3. As I explain below, the process yielded a passing score that effectively maximized the adverse impact of the TAAS exit test on Black and Hispanic students.
          Before I elaborate on the latter point, let me emphasize that from the available record I have done my utmost to understand the rationale that motivated the Board to set the passing score where it did, namely at 70% correct. As best I can tell from the record, the main reasons for setting the passing score at 70% correct appear to have been that this is where the passing score had been set on TEAMS and this level was suggested by the Texas Education Code. The minutes of the Board meeting report that "the Commissioner cited the portion of the Texas Education code that requires 70 percent as passing (Attachment A), explaining that there is a rationale for aiming at 70 percent of test items as the mastery standard" (p. 1/348).
          In my view this is simply not a reasonable or professionally sound basis for setting a passing standard on an important test such as the TAAS exit test. Indeed, from the available record it is not even clear that the Texas code cited by the Commissioner was actually referring to anything more than the passing standard for course grades. Moreover, the minutes of the July 12, 1990, meeting also report the following remarks by Dr. Crawford: "Testing is driving a curricular program, which means that the curriculum is not at the place where you want it to be when you start out." She commented that "70 only has whatever value that is given to it, and in testing 70 is not the automatic passing standard on every test" (p. 4/351).
          In sum, the process used in setting the passing scores on the TAAS exit test in 1990 did not adhere to prevailing professional standards regarding the setting of passing scores on standardized tests. For example, from the record available, it is clear that the process used to set the passing score on the TAAS exit test in 1990 failed to meet all six criteria of "technical adequacy" described in Berk's (1986) review of criteria for setting performance standards on criterion-referenced tests—a review published in a prominent education research journal, and of which TEA officials surely should have been aware in 1990.
          TAAS cut score study. To understand more fully the process by which the TAAS passing scores were set in 1990, I requested a copy of the TAAS field test data that were presented to the Board of Education in the meeting at which it set the passing score on the TAAS-X. Using these data, I undertook a study (with the assistance of Boston College doctoral student Cathy Horn) which came to be called our "TAAS cut score study." In this study, we asked individuals, after reviewing the data available to the Texas Board of Education in July 1990, to select the passing scores (or cut scores) students would need to attain in order to pass the TAAS reading and math tests. For both the reading and math tests, each research subject was presented with a graph showing the percentage of students, separately for White, Hispanic and Black ethnic groups, passing at each number of correct answers on the field or pilot test of the TAAS exit test in 1990. These graphs are shown in Figures 4.1 and 4.2 below.

          Each person in the cut score study was then presented with the following instructions:
The following graph presents the percentage of students passing the reading / math section of the Texas Assessment of Academic Skills (TAAS) at each number of questions answered correctly. Choose the number of questions correct that most clearly differentiates White students (represented by a black line) from Black and Hispanic students.
          Respondents could then ask clarifying questions before selecting a response. After a pilot test of the cut score study in 1998, Ms. Horn (a native of Texas and secondary school teacher there before she came to Boston College for graduate studies) extended the cut score study to nine Texans. The exercise was administered, by phone or in person, to 9 individuals residing in the state of Texas. (Those individuals who were interviewed by phone had paper copies of the Figure 4.1 and 4.2 graphs and the prompt for the exercise in front of them when they selected cut points.) The professions of the nine respondents are listed below.
Respondents (all currently living in Texas):
2 teachers
3 engineers
2 college students
1 financial analyst
1 director of communications
          The cut or passing scores selected by these nine individuals as most clearly differentiating between White students and Black/Hispanic students are shown in Table 4.3 below.

Table 4.3
Results of Cut Score Study with Nine Texans

 
            Reading   Math
Person 1       34      34
Person 2       35      37
Person 3       35      38
Person 4       34      37
Person 5       36      40
Person 6       33      40
Person 7       34      37
Person 8       36      43
Person 9       44      44
Summary
Minimum        33      34
Maximum        44      44
Mean         35.7    38.9
Median         35      38

          As shown, respondents selected passing scores ranging from 33 to 44 on the reading test and from 34 to 44 for the math test. The median value across all nine respondents was 35 for the reading test and 38 for the math test.
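
The summary rows of Table 4.3 can be checked against the nine individual responses with a brief sketch using Python's standard `statistics` module. The variable names and the `summarize` helper are assumptions introduced for illustration; the values are transcribed from Table 4.3.

```python
from statistics import mean, median

# Cut scores chosen by Persons 1-9, transcribed from Table 4.3.
READING = [34, 35, 35, 34, 36, 33, 34, 36, 44]
MATH = [34, 37, 38, 37, 40, 40, 37, 43, 44]


def summarize(scores):
    """Return the summary statistics reported at the foot of Table 4.3."""
    return {
        "minimum": min(scores),
        "maximum": max(scores),
        "mean": round(mean(scores), 1),
        "median": median(scores),
    }


reading_summary = summarize(READING)
math_summary = summarize(MATH)
print("Reading:", reading_summary)
print("Math:   ", math_summary)
```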
          The passing scores of 70% correct for the TAAS exit test recommended by Commissioner Kirby and accepted by the Board of Education in July 1990 were 34 for the reading test and 42 for the math test. The results of our cut score study show that if the intent in setting passing scores based on the TAAS field test results in July 1990 had been discriminatory, i.e., to set the passing scores so that they would most clearly differentiate between White students and Black/Hispanic students, then the passing scores would have been set just about where the Board of Education did in fact set them.
          At the same time, there is no evidence of which I know, in the record of the process of setting passing scores on the TAAS in 1990, that the explicit intent of either Commissioner Kirby or the Board was discriminatory. However, the available record shows no indication that Commissioner Kirby, the TEA or the Board relied on any professionally recognized method for setting passing scores on the test, and the passing scores set were indeed consistent with those that would have been set, based on the TAAS field test results, if the intent had been discriminatory.
          Use of measurement error in setting passing scores. The reason the setting of passing scores on a high stakes test such as the TAAS is so important is that the passing score divides a continuum of scores into just two categories, pass and fail. Doing so is hazardous because all standardized test scores contain some degree of measurement error. Hence, the 1985 Standards for Educational and Psychological Testing and other professional literature clearly indicate the importance of considering measurement error and consequent classification errors in the process of setting passing scores on tests.
          Before discussing this topic further, two introductory explanations may be helpful. First, from the available record of the July 1990 meeting of the Board of Education, there is no indication that consideration of measurement error entered into the Board's deliberations. Second, the issue of measurement and classification errors regarding TAAS was addressed, as far as I know, at least in the 1993-94 and 1996-97 editions of the Texas Student Assessment Program Technical Digest. Unfortunately, there are two serious errors in the manner in which these issues are addressed. Before explaining the nature of these errors, let me first summarize what the 1996-97 edition of the Texas Student Assessment Program Technical Digest says about test reliability, standard error of measurement and classification errors.
          Chapter 8 of the 1996-97 Technical Digest, entitled "reliability," provides a brief discussion of internal consistency estimates and formulas for calculating internal consistency reliability estimates (p. 41). This is followed (p. 42) by a discussion of (and formulas for) calculating standard errors of measurement from reliability estimates. These discussions provide references to appendix 7, which shows data indicating that for the Spring 1997 administration of TAAS at grade 10 (administered to 214,000 students) the internal consistency estimates for the TAAS math, reading and writing sub-tests were 0.934, 0.878 and 0.838, respectively; and the corresponding standard errors of measurement were 2.876, 2.352 and 2.195.
          This represents the first serious error in the technical report's handling of measurement and classification error. Specifically, while the technical report bases the calculation of standard error of measurement on internal consistency reliability estimates, it clearly should have been based on test-retest or alternate-forms reliability estimates. Test-retest reliability refers to the consistency of scores on two administrations of a test. Alternate-forms reliability refers to the consistency of scores on two different forms or versions of the same test. Since the purpose of TAAS testing is not simply to estimate students' performance on one version of the TAAS test, but to estimate their competence in reading, math and writing, in general, as might be measured by any version of the relevant TAAS tests, alternate-forms reliability is more appropriate for assessing reliability than is internal consistency reliability. As Thorndike and Hagen (1977, p. 79) point out in their textbook on measurement and evaluation, "evidence based on equivalent test forms should usually be given the most weight in evaluating the reliability of a test."
          In general, alternate forms test reliability tends to be lower than internal consistency reliability. Hence, it seems clear to me that the figures reported in the 1996-97 Technical Digest overestimate the relevant reliability of grade 10 TAAS test scores and underestimate the standard error of measurement associated with TAAS scores.
          I have attempted to estimate the alternate-forms reliability of TAAS test scores using two independent sources of data. First I employed the cross-tabulations reported by Linton & Debeic (1992) of test-retest data on students in several large Texas districts who took the TAAS exit level test in October 1990 and again in April 1991. Using the Linton & Debeic cross tabular results, I calculated the following test-retest correlations: TAAS-Reading 0.536; TAAS-Math 0.643; and TAAS-Writing 0.555. Second, as part of the background work for the TAAS case, Mark Fassold developed a remarkable longitudinal database of all 1995 sophomore students in Texas and their TAAS scores on up to ten different administrations of TAAS:

     1     March 1995
     2     May 1995
     3     July 1995
     4     October 1995
     5     March 1996
     6     May 1996
     7     July 1996
     8     October 1996
     9     February 1997
    10    April 1997
          At my request Mr. Fassold ran an analysis of all test-retest correlations on this cohort of students (total N of about 230,000). Correlations were calculated separately by ethnic group and for TAAS Reading and Math tests. Given 16 different test-retest possibilities this yielded 214 different coefficients (2 x 16 x 6 ethnic groups). Results varied widely (in part because in some comparisons sample sizes were very small). Overall, however, the observed test-retest correlations tended to cluster in the 0.30 to 0.50 range.
          These test-retest correlations based on both the Linton-Debeic and Fassold data are, however, attenuated in that in both data sets only students who failed a TAAS test took it again. There are methods for correcting observed test-retest correlations for such attenuation (see Haney, Fowler and Wheelock, 1999, for an example), but as a more conservative approach here, let me simply discuss what previously published literature suggests about the relationships between test-retest and internal consistency reliability.
          As mentioned above, the 1996-97 Technical Digest cites internal consistency reliability estimates for the three grade 10 TAAS sub-tests of 0.934, 0.878 and 0.838, and standard errors of measurement of 2.876, 2.352 and 2.195. It is common for tests which show internal consistency reliability of about 0.90 to show alternate forms reliability of 0.85 or 0.80 (see for example, Thorndike & Hagen, 1977, p. 92). On page 42 of the 1996-97 Technical Digest, an example is shown in which a test with an internal consistency reliability of 0.90 (and a standard deviation of 6.3) is estimated to have a standard error of measurement of 2.0. However, if instead of an internal consistency reliability of 0.90, we were to use in these calculations an alternate forms reliability of 0.85 or 0.80, the resulting standard errors of measurement would be 2.44 and 2.82. This suggests that the appropriate standard errors of measurement for the TAAS tests may be on the order of 20 to 40% greater than the estimates reported in the TAAS 1996-97 Technical Digest.
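
The arithmetic behind these figures can be sketched as follows, assuming the conventional relationship in which the standard error of measurement equals the standard deviation times the square root of one minus the reliability. That formula reproduces the example values quoted above, but the function name `standard_error_of_measurement` and the variable names are assumptions made for illustration.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Conventional SEM formula: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


SD = 6.3  # standard deviation used in the Technical Digest example

sem_090 = standard_error_of_measurement(SD, 0.90)  # internal consistency
sem_085 = standard_error_of_measurement(SD, 0.85)  # assumed alternate-forms
sem_080 = standard_error_of_measurement(SD, 0.80)  # assumed alternate-forms

for label, sem in (("0.90", sem_090), ("0.85", sem_085), ("0.80", sem_080)):
    increase = 100 * (sem / sem_090 - 1)
    print(f"reliability {label}: SEM = {sem:.2f} ({increase:+.0f}% vs. 0.90)")
```

Under these assumptions, lowering the reliability from 0.90 to 0.85 or 0.80 raises the standard error of measurement by roughly 22% and 41%, consistent with the "20 to 40%" range stated above.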
          The second serious error in the technical report's handling of measurement and classification error occurs on pages 30 and 31 in a section labeled "Exit level testing standards and the standard error of measurement." Here the authors of the 1996-97 Technical Digest point out that a student with a "true achievement level at the passing standard would be likely to pass on the first attempt only 50% of the time" (p. 31). This passage then goes on to assert that "if such a student has attempted that test eight times, the student's passing is almost assured (probability of passing is 99.6%)" (p. 31). In other words, the chance of a minimally qualified student failing the TAAS eight times and being misclassified as not having the requisite skills is only 0.4% (0.50 to the 8th power is 0.0039).
          This calculation strikes me as erroneous, or at least potentially badly misleading, because the authors have presented absolutely no evidence to show the probability that a student who fails the TAAS will continue to take the test seven more times. As I explain later, available evidence suggests that students who fail the TAAS grade 10 test more than once or twice are likely to be held back in grade and to drop out of school long before they reach grade 12, by which time they would have had a chance to take the TAAS exit test eight times. Since 0.50 to the second power is 0.25, and to the third power is 0.125, a student with a "true achievement level at the passing standard" who takes the TAAS two or three times before becoming discouraged and not taking the test again has a 25% or 12.5% chance, respectively, of being misclassified as failing.
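          The dependence of the misclassification rate on the number of attempts can be made explicit with a short sketch. It assumes, as the Digest's own 99.6% figure implicitly does, that each attempt is an independent event with a 50% chance of passing; the function name is hypothetical.

```python
def prob_never_passing(attempts: int, p_pass: float = 0.5) -> float:
    """Chance that a student at the passing standard fails every attempt.

    Assumes attempts are independent and each has the same pass probability.
    """
    return (1.0 - p_pass) ** attempts


for attempts in (1, 2, 3, 8):
    rate = prob_never_passing(attempts)
    print(f"{attempts} attempt(s): {100 * rate:.1f}% misclassified as failing")
```

          The 0.4% figure applies only to a student who actually sits for all eight administrations; a student who stops after two or three attempts faces the far larger rates shown.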
          Before proceeding to present evidence bearing on this point, let me discuss how the standard error of measurement might usefully have been taken into account in adjusting passing scores. Because of the error of measurement in test scores, when scores are used to make pass-fail decisions about students, two kinds of classification errors can occur. A truly unqualified student can pass the test (a false pass) or a truly qualified student can fail the test (a false failure). How one thinks about the balance of these two misclassification errors depends on the risks (or benefits and costs) associated with each type of misclassification. If one were confident that a student failing TAAS would receive special attention and support educationally, one might be inclined to weigh false passes as more serious than false failures. If, on the other hand, one thought that students failing TAAS were unlikely to receive effective instruction and were instead likely merely to be retained in grade 10 and stigmatized as failures, then one would probably feel that false failures would be more harmful than false passes.
          Here is how Berk (1986) discussed this point:
Assessing the relative seriousness of these consequences, is a judgmental process. It is possible to assign plusses (benefits) and minuses (costs or losses) to the consequences so that the cutoff scores can be set in favor of a specific error reduction rate. A loss ratio (benefits: losses) can be specified for each decision application with the cutoff score adjusted accordingly. (Berk, 1986, p. 139).
          To study the relative risks associated with the two kinds of classification errors associated with a high school graduation test, with the assistance of Kelly Shasby (a doctoral student in the Educational Research, Measurement and Evaluation program at Boston College), I undertook what came to be known as our "risk analysis" study.
          The survey form used in the risk analysis study was entitled "Survey of risk associated with classification decisions" and opened with the following introduction:
When classifying large numbers of individuals using standardized exams, two different kinds of mistakes are made. Some people will be falsely classified as "qualified" or "passing" while others will be falsely classified as "unqualified" or "failing." There is a degree of risk associated with mistakes of this kind, both for the individual who is incorrectly classified and for the society in which that individual lives. We would like your help in assessing the severity of the risk, or possible harm, caused to individuals and to society when mistakes are made on a number of different types of standardized tests.
          The purpose of this survey is to assess the public's perception of misclassifications of individuals. These misclassifications can have an impact on the individual and on the society in which that individual lives. This impact has the potential to be harmful, and we are interested in determining how harmful the public thinks different misclassifications can be.
          On a scale from 1 to 10, 1 being "minimum harm" and 10 being "maximum harm," rate each scenario with respect to the degree of harm it would cause that individual and then the degree of harm it would cause society. Then circle the number, which corresponds, to the rating you chose.
          After this introduction, respondents were asked to rate, on a 1 to 10 scale of harm, the risk associated with 16 different misclassifications that might result from classifying people pass-fail based on standardized test results. Respondents were asked to rate separately the harm to individuals and to society. To give credit where it is due, this distinction, a clear improvement over the initial version of our survey, was suggested by Ms. Shasby. Specifically, survey respondents were asked to rate the degree of harm, separately for individuals and society, associated with the following kinds of misclassification:
  1. A kindergartner who is ready to enter school is denied entrance.
  2. A kindergartner who is not ready to enter school is granted entrance.
  3. An airline pilot who is not qualified is given a license to fly.
  4. An airline pilot who is qualified is denied a license to fly.
  5. A qualified high school student is denied a diploma.
  6. An unqualified high school student is granted a diploma.
  7. A qualified accountant is denied certification.
  8. An unqualified accountant is granted certification.
  9. A qualified student is denied promotion from grade eight to grade nine.
  10. An unqualified student is granted promotion from grade eight to grade nine.
  11. A qualified doctor is denied a license to practice.
  12. An unqualified doctor is granted a license to practice.
  13. A qualified candidate is denied admission into college.
  14. An unqualified candidate is granted admission into college.
  15. A qualified teacher is denied certification.
  16. An unqualified teacher is granted certification.
          The risk survey form was sent to a random sample of 500 secondary teachers in Texas (specifically only math and English/Language Arts teachers) on May 23, 1999. As of June 30, 1999, we had received 66 responses (representing a response rate of 13.2%). (Note 11)
          Table 4.4 below summarizes the results of the risk analysis survey.

Table 4.4
Results of Risk Analysis Survey with Secondary Teachers in Texas

 
                                                                                  For individual    For society
Misclassification                                                                 Mean    SD        Mean    SD
1. A kindergartner who is ready to enter school is denied entrance.               6.45    2.67      3.94    2.64
2. A kindergartner who is not ready to enter school is granted entrance.          7.20    2.23      5.06    2.71
3. An airline pilot who is not qualified is given a license to fly.               8.36    2.32      9.55    1.00
4. An airline pilot who is qualified is denied a license to fly.                  7.74    2.37      4.39    2.99
5. A qualified high school student is denied a diploma.                           9.11    1.69      6.39    2.58
6. An unqualified high school student is granted a diploma.                       6.85    2.72      7.74    2.26
7. A qualified accountant is denied certification.                                8.65    1.50      5.32    2.62
8. An unqualified accountant is granted certification.                            8.65    1.50      5.32    2.62
9. A qualified student is denied promotion from grade eight to grade nine.        8.89    1.52      6.15    2.39
10. An unqualified student is granted promotion from grade eight to grade nine.   8.15    2.01      7.80    2.12
11. A qualified doctor is denied a license to practice.                           8.80    1.68      7.32    2.64
12. An unqualified doctor is granted a license to practice.                       7.15    2.87      9.37    1.72
13. A qualified candidate is denied admission into college.                       8.83    1.73      6.30    2.43
14. An unqualified candidate is granted admission into college.                   6.08    2.66      6.08    2.66
15. A qualified teacher is denied certification.                                  8.64    1.76      8.38    2.13
16. An unqualified teacher is granted certification.                              6.62    2.84      9.15    1.60

          As this table shows, the risk associated with denying a high school diploma to a qualified student is for individuals the most severe risk associated with any of the misclassification scenarios we asked respondents to rate. The only scenarios showing higher average risks are the risks for society associated with licensing an unqualified pilot (mean = 9.55), licensing an unqualified doctor (9.37) and licensing an unqualified teacher (9.15).
          Particularly germane to our discussion of the setting of passing scores on the TAAS graduation test are the relative risks associated with denying a diploma to a qualified high school student (mean = 9.11) and granting a diploma to an unqualified student (6.85). These results indicate that the risk of denying a diploma to a qualified student is much more severe than granting a diploma to an unqualified student (the difference, by the way, is statistically significant).
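          A rough plausibility check on the size of this difference can be made from the summary statistics in Table 4.4 alone. The sketch below is only an approximation under stated assumptions: it treats the two sets of ratings as independent samples (a Welch-type t statistic), although the same respondents rated both scenarios and a paired test on the raw responses would be the appropriate analysis; it also assumes all 66 respondents rated both items. It is not a reconstruction of the test actually used.

```python
import math


def welch_t(mean1: float, sd1: float, n1: int,
            mean2: float, sd2: float, n2: int) -> float:
    """Welch t statistic computed from summary statistics only."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


N_RESPONDENTS = 66  # assumption: every respondent rated both scenarios

# Harm to the individual: qualified student denied a diploma (item 5)
# versus unqualified student granted a diploma (item 6).
t_stat = welch_t(9.11, 1.69, N_RESPONDENTS, 6.85, 2.72, N_RESPONDENTS)
print(f"approximate t = {t_stat:.2f}")
```

          A t statistic of this magnitude is well beyond conventional critical values, which is at least consistent with the significance reported above.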
          These results indicate that if a rational passing score had been established on the TAAS exit test, the passing or cutoff scores should be adjusted downward in order to minimize overall risk. A common practice in setting passing scores on important tests is to reduce an empirically established passing score by one or two standard errors of measurement. While I want to stress that the passing scores of 70% correct on the TAAS are arbitrary, unjustified and discriminatory, we can see from Figures 4.1 and 4.2 what the consequences would be for Black and Hispanic pass rates (on the TAAS field test) if the passing scores of 70% had been corrected for error of measurement. Recall that the passing scores set by the Board on the field test administration of the TAAS were 34 items correct on the reading test and 42 on the math test. Recall also that the standard errors for the reading and math tests reported in the Technical Digest were in the range of 2.5 to 3.0 raw score points. Suppose that to take error of measurement into account, the initially selected passing scores of 34 and 42 were lowered 5 points, to 29 and 37 on the reading and math tests, respectively. What can be easily seen from Figures 4.1 and 4.2 is that these adjustments would have increased the passing rates for Black and Hispanic students about 12% on the math test and 20% on the reading test.
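          The adjustment described here amounts to subtracting a chosen number of standard errors from each passing score. A minimal sketch follows, assuming a standard error of 2.5 raw score points (the low end of the range cited) and a two-SEM adjustment, which together reproduce the 5-point reduction in the text; the names are hypothetical.

```python
def adjusted_cut_score(cut: float, sem: float, n_sem: float) -> float:
    """Lower a passing score by n_sem standard errors of measurement."""
    return cut - n_sem * sem


FIELD_TEST_CUTS = {"reading": 34, "math": 42}  # items correct, as set by the Board
ASSUMED_SEM = 2.5   # assumption: low end of the 2.5 to 3.0 range
N_SEM = 2           # assumption: a two-SEM adjustment

adjusted = {
    test: adjusted_cut_score(cut, ASSUMED_SEM, N_SEM)
    for test, cut in FIELD_TEST_CUTS.items()
}
print(adjusted)
```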
          The foregoing results were presented in a written report before the TAAS trial (Haney, 1999) and also discussed during testimony at trial. Judge Prado (2000) apparently did not find these points persuasive for he commented merely that in setting the passing score on the TAAS tests, "the State Board of Education looked at the passing standard for the TEAMS test, which was also 70 percent, and also considered input from educator committees" (p. 11). Regarding the disparate impact of the passing score, he commented simply, "The TEA understood the consequences of setting the cut score at 70 percent" (p. 11).

4.4 Doubtful Validity of TAAS Scores

          The Technical Digest on TAAS (TEA, 1997) contains an extremely short section (pp. 45-47) discussing test validity. Though this three-page passage mentions content, construct and criterion-related validity, it maintains that "the primary evidence for the validity of the TAAS and end-of-course tests lies in the content validity of the test" (TEA, 1997, p. 47). This discussion, it seems to me, is woefully inadequate because test validation should never rest primarily on test content. Test validation refers to the interpretation and meaning of test scores and these depend not just on test content, but also on a host of other factors, such as the conditions under which tests are administered, and how results are scored and interpreted (e.g., in terms of a passing score, as discussed in the previous section).
          Nonetheless, the TEA has previously undertaken a number of studies examining the relationship between TAAS scores and course grades. In one study, for example, it was reported that in one large urban district, 50% of the students who had received a grade of B in their math courses failed the TAAS math test (TEA, 1996 Comprehensive Report on Texas Public Schools, pp. 14-15). Another summary finding was that when "TEA correlated exit level students' TAAS mathematics scores with the same students' course grades for several different mathematics courses in the 1992-93 school year . . . the correlation between TAAS scale scores and students' end-of-year grades was only moderately positive (0.32). . . " (TEA, 1997, Technical Digest, p. 47). Inasmuch as this correlation is remarkably low in light of previous research that has generally shown test scores to correlate with high school grades in the range of 0.45 to 0.60 (see Haney, 1993, p. 58), as part of work on the TAAS case I sought to acquire the actual data set on which this TEA finding was based.
          The data set in question contains records for 3,281 students in three districts that TEA documentation describes as "large urban district," "mid-sized suburban district," and "small rural district." The TEA has previously reported analyses of these data in "Section V: A study of correlation of course grades with Exit Level TAAS Reading and Writing Tests" pp. 189-197 in Student Performance Results 1994-95, Texas Student Assessment Program, TAAS and End-of-Course Examinations and Other Studies (Texas Education Agency, Austin, Texas, ND, but presumably 1995).
          After opening the file and verifying its structure, I sought to confirm that the results reported by the TEA could be replicated. This was impossible to do precisely because TEA did not report results with great precision. Nonetheless, initial results corresponded reasonably well with what TEA reported. Also, it should be noted that while the data file included records on a number of grade 11 students, I restricted most analyses to grade 10 students pooled across the three districts, though the bulk of this sample (> 2,400 cases out of 3,300) comes from the one large urban district. Then we calculated basic descriptive statistics on variables of interest, in particular scores for the TAAS reading and writing tests administered in March 1995 and grades for the English II courses completed in May 1995 (these data were provided by the districts to the Student Assessment Division of TEA). Next we calculated relationships between variables. Table 4.5 shows the intercorrelations between the three TAAS test scores (writing, reading and math) and English II course grades. Given the size of this sample (>3,000), all of these correlation coefficients are statistically significant at the 0.01 level.

Table 4.5
Correlations between
TAAS Scores (Standard scores) and English II Grades

 
             Write SS   Read SS   Math SS   Grade
Write SS     1.00
Read SS      0.50       1.00
Math SS      0.51       0.69      1.00
Grade        0.32       0.34      0.37      1.00

          Note the magnitudes of the correlations between English II course grades and TAAS scores. They are all in the range of 0.32 to 0.37. As indicated above, previous studies have generally shown test scores to correlate with high school grades in the range of 0.45 to 0.60. Contrary to expectations, English II grades correlate more highly with TAAS math scores (0.37) than with writing (0.32) or reading (0.34) scores. Note also the odd intercorrelations among TAAS scores. The TAAS math scores correlate at the level of 0.69 with the TAAS reading scores, while the TAAS reading scores correlate at the level of 0.50 with the TAAS writing scores. This is contrary to the expectation that scores of two verbal measures (of reading and writing) should correlate more highly with one another than with a measure of quantitative skills. These results cast doubt on the validity and the reliability of TAAS scores.
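          One common way to gauge correlations of this size is to square them, which gives the proportion of variance in one variable that is linearly associated with the other. The sketch below does so for the grade correlations in Table 4.5 and for the 0.45 to 0.60 benchmark range, and also computes the usual t statistic for a correlation. The sample size of 3,000 is an assumption based on the ">3,000" figure given for the sample; the names are hypothetical.

```python
import math

# Correlations of English II grades with TAAS scores, from Table 4.5.
TAAS_GRADE_CORRELATIONS = {"writing": 0.32, "reading": 0.34, "math": 0.37}
BENCHMARK_RANGE = (0.45, 0.60)  # range reported in previous research
ASSUMED_N = 3000                # assumption: the text says only ">3,000"


def shared_variance(r: float) -> float:
    """Proportion of variance shared by two variables with correlation r."""
    return r * r


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


for test, r in TAAS_GRADE_CORRELATIONS.items():
    print(f"{test}: r = {r:.2f}, shared variance = {100 * shared_variance(r):.1f}%, "
          f"t = {t_statistic(r, ASSUMED_N):.1f}")

for r in BENCHMARK_RANGE:
    print(f"benchmark r = {r:.2f}: shared variance = {100 * shared_variance(r):.1f}%")
```

          On this reading, the TAAS correlations correspond to roughly 10 to 14% shared variance, against roughly 20 to 36% for the benchmark range. With a sample this large, statistical significance says little about the strength of the relationship.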
          People unfamiliar with social science research doubtless find it hard to make sense of correlation coefficients in the range of 0.32 to 0.37. Hence, to provide a visual representation, Figure 4.3 shows a scatterplot of the relationship between TAAS reading scores and English II grades. As can be seen from this figure, the relationship between these two variables is quite weak. Students with grades in the 70 to 100 range have TAAS reading scores from well below 40 to well over 80. Conversely, students with TAAS reading scores in the 80 to 100 range have English II grades from well below 40 to well over 80.

Figure 4.3 Scatterplot of TAAS Reading Scores and English II Grades


          I next examined whether there were differences in the relationships between TAAS scores and English II grades across ethnic groups. Table 4.6 provides an example of the relationship between passing and failing TAAS and passing or failing in terms of English II course grades for Hispanics, Blacks and Whites. As can be seen from this table, of those students who passed their English II courses in the spring of 1995, 27-29% of Black and Hispanic students failed the TAAS reading test taken the same semester as their English courses compared with 10% of White students. In other words, of grade 10 students in these three districts who are passing their English II courses, the rate of failure on the TAAS reading test for Black and Hispanic students is close to triple that of White students. A similar, but slightly smaller, disparity is apparent on the TAAS writing sub-test.

Table 4.6
Rates of Passing and Failing TAAS and English II Course

 
TAAS-Exit Test Results: Reading
                      Black students      Hispanic students   White students
English II Course     Failed    Passed    Failed    Passed    Failed    Passed
Failed   N            39        23        242       189       17        34
         (%)          10.1%     5.9%      11.0%     8.6%      3.1%      6.3%
Passed   N            111       214       596       1181      55        436
         (%)          28.7%     55.3%     27.0%     53.5%     10.1%     80.4%

TAAS-Exit Test Results: Writing
                      Black students      Hispanic students   White students
English II Course     Failed    Passed    Failed    Passed    Failed    Passed
Failed   N            33        29        173       258       20        31
         (%)          8.5%      7.5%      7.8%      11.7%     3.7%      5.7%
Passed   N            69        256       366       1411      50        441
         (%)          17.8%     66.1%     16.6%     63.9%     9.2%      81.4%

          Such a disparity can result from several causes. First, if the TAAS reading test is in fact a valid and unbiased test of reading skills, the fact that close to 30% of Black and Hispanic students who are passing their sophomore English courses failed the TAAS reading test, as compared with only 10% of White students, must indicate that minority students in these three districts are simply not receiving the same quality of education as their White counterparts, especially when one realizes, as I will show in Part 5 of this article, that by 1995 Black and Hispanic students in Texas statewide were being retained in grade 9 at much higher rates than White students. The only other explanation for the sharp disparity is that the TAAS tests and the manner in which they are being used (with a passing score of 70% correct) are simply less valid and fair measures of what Black and Hispanic students have had an opportunity to learn, as compared with White students.
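          The rates drawn from Table 4.6 can be recomputed directly from the reading counts. The percentages printed in the table are shares of all students in each ethnic group; the sketch below reproduces those and also computes the TAAS failure rate among only those students who passed English II, which comes out somewhat higher. The data structure and function names are hypothetical.

```python
# Reading panel of Table 4.6: counts keyed by
# (English II course result, TAAS reading result).
READING_COUNTS = {
    "Black": {("failed", "failed"): 39, ("failed", "passed"): 23,
              ("passed", "failed"): 111, ("passed", "passed"): 214},
    "Hispanic": {("failed", "failed"): 242, ("failed", "passed"): 189,
                 ("passed", "failed"): 596, ("passed", "passed"): 1181},
    "White": {("failed", "failed"): 17, ("failed", "passed"): 34,
              ("passed", "failed"): 55, ("passed", "passed"): 436},
}


def share_of_all_students(counts: dict, cell: tuple) -> float:
    """Percent of the whole group in one cell (the basis used in Table 4.6)."""
    return 100 * counts[cell] / sum(counts.values())


def taas_fail_rate_among_course_passers(counts: dict) -> float:
    """Percent failing TAAS reading among only those who passed English II."""
    failed_taas = counts[("passed", "failed")]
    passed_taas = counts[("passed", "passed")]
    return 100 * failed_taas / (failed_taas + passed_taas)


for group, counts in READING_COUNTS.items():
    cell = share_of_all_students(counts, ("passed", "failed"))
    conditional = taas_fail_rate_among_course_passers(counts)
    print(f"{group}: {cell:.1f}% of all students, "
          f"{conditional:.1f}% of those passing English II")
```

          On either basis the rate for Black and Hispanic students is close to three times that for White students.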
          These analyses were reported in the July 1999 report (Haney, 1999) and discussed in direct testimony and cross-examination during the TAAS trial in September 1999. Here is how Judge Prado interpreted these findings in his January 7 ruling:

The Plaintiffs provided evidence that, in many cases, success or failure in relevant subject-matter classes does not predict success or failure in that same area on the TAAS test. See Supplemental Report of Dr. Walter Haney, Plaintiff's expert, at 29-32. In other words, a student may perform reasonably well in a ninth-grade English class, for example, and still fail the English portion of the exit-level TAAS exam. The evidence suggests that the disparities are sharper for ethnic minorities. Id. at 33. However, the TEA has argued that a student's classroom grade cannot be equated to TAAS performance, as grades can measure a variety of factors, ranging from effort and improvement to objective mastery. The TAAS test is a solely objective measurement of mastery. The Court finds that, based on the evidence presented at trial, the test accomplishes what it sets out to accomplish, which is to provide an objective assessment of whether students have mastered a discrete set of skills and knowledge. (Prado, 2000, p. 24)
          With due respect to Judge Prado, I believe there are two flaws in this reasoning. First, Judge Prado interprets the disparities in the rates at which, among students who pass their English II courses, minorities fail the "English portion" of TAAS far more frequently than White students, as evidence of the need for "objective assessment" of student skills. Though he did not explicitly say so, his reasoning seems to be that an objective test is necessary because the grades of minority students are inflated. This interpretation, however, takes one specific finding out of the context in which I presented it, both in the Supplementary report (Haney, 1999, pp. 29-33) and in testimony at trial. In both cases, and as described above, it was shown that even if one ignores the question of possibly inflated grades, the intercorrelations among TAAS scores themselves (i.e., that reading and math scores correlate more highly than reading and writing scores) raise serious doubts about their validity.
          Second, even if we assume the validity of TAAS tests and accept Judge Prado's reasoning that the lack of correspondence between English grades and TAAS reading and writing scores demonstrates the need for objective assessment of student mastery, the fact that "the disparities are sharper for ethnic minorities," represents prima facie evidence of inequality in opportunity to learn. Even if Black and Hispanic students' teachers are covering the same academic content as White students' teachers, that 27-29% of Black and Hispanic students who passed their English II course failed the TAAS reading test (as compared with 10% of White students) obviously must indicate that their teachers are not holding them to the same academic standards as the teachers of White students.

4.5 More appropriate use possible

          This discussion leads naturally to a simple solution for avoiding reliance on test scores in isolation to make high stakes decisions about students. As previously mentioned, the recent High Stakes report of the National Research Council (Heubert & Hauser, 1999) states clearly that using a sliding scale or compensatory model combining test scores and grades would be "more compatible with current professional testing standards" than relying on a single arbitrary passing score on a test (Heubert & Hauser, 1999, pp. 165-66). Moreover, this is exactly how test scores are typically used in informing college admissions decisions, such that students with higher high school grade point averages (GPA) need lower test scores to be eligible for admission, and conversely students with lower GPA need higher test scores. Ironically enough, this is exactly how institutions of higher education in Texas use admissions test scores in combination with GPA. For example, in 1998, the University of Houston required that in order to be eligible for admission, high school students who had a grade point average of 3.15 or better needed to have SAT I total scores of at least 820, but if their high school GPA was only 2.50, they needed to have SAT I total scores of 1080 (University of Houston, 1998).
          Literally decades of research on the validity of college admissions test scores show that such an approach, using test scores and grades in sliding scale combination, produces more valid results than relying on either GPA or admissions test scores alone (Linn, 1982; Willingham, Lewis, Morgan & Ramist, 1990). Moreover, such a sliding scale approach generally has been shown to have less disparate impact on ethnic minorities (and women) than relying on test scores alone (Haney, 1993).
          The tendency for a sliding scale approach to have smaller adverse impact on minorities can be illustrated with the data on TAAS scores and English II grades discussed in the last section. Texas now effectively uses a double-cut or conjunctive model of decision-making, whereby students currently must have a grade of 70 in their academic courses (such as English II) and a score of 70 on TAAS to graduate from high school. These requirements are illustrated in Figure 4.4 (which is the same as Figure 4.3 except that a vertical line has been added to represent the 70-grade requirement and a horizontal line has been added to represent the TAAS 70-score requirement).

Figure 4.4 Scatterplot of TAAS Reading Scores and English II Grades with 70 Minima Shown


          Note also that the data shown in Figure 4.4 are the same as those summarized in the top portion of Table 4.6. As indicated there, 80.4% of White students in this sample passed both the English II course and the TAAS reading test, while only 10.1% of White students passed English II and failed the TAAS reading test. In contrast, 53-55% of Black and Hispanic students passed both the course and the test, but 27-29% of Black and Hispanic students passed English II but failed the TAAS test.
          Suppose now that instead of applying a double cut rule so that students have to have scores of 70 in both the course and the test to pass, they need to have a minimum of 140 combined. This circumstance is illustrated in Figure 4.5, below.

Figure 4.5 Scatterplot of TAAS Reading Scores and English II Grades with Sliding Scale Shown


          As can be seen, under such a sliding scale approach, higher grades can compensate for lower test scores and vice versa (that is why the sliding scale approach is sometimes called a compensatory model). Under this approach, the number of Black and Hispanic students passing would increase from 1,395 to 1,765—a 27% increase. Under a sliding scale approach, the number of White students passing would also increase slightly (from 436 to 487), but since the latter increase is smaller proportionately, the disparate impact on Black and Hispanic students would be reduced.
          The sliding scale decision rule illustrated here (TAAS-R + Eng II grade > 140) was chosen merely for illustrative purposes. As with college admissions tests, in practice such a sliding scale approach ought to be based on empirical validation studies. But the example illustrates the way in which an approach more in accord with professional standards would significantly reduce adverse impact. The literature on college admissions testing strongly suggests it would yield more valid decisions too.
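          The two decision rules compared in this section can be written out as a short sketch. It assumes that the sliding scale requires a combined total of at least 140, following the "minimum of 140 combined" wording above; the student records shown are hypothetical and serve only to show how the rules diverge, while the passing counts are those reported in the text.

```python
def passes_conjunctive(grade: float, taas: float, cut: float = 70) -> bool:
    """Double-cut rule: both the course grade and the TAAS score must reach the cut."""
    return grade >= cut and taas >= cut


def passes_compensatory(grade: float, taas: float, total: float = 140) -> bool:
    """Sliding scale rule (assumed form): grade plus TAAS score must reach the total."""
    return grade + taas >= total


def pct_increase(before: int, after: int) -> float:
    """Percentage growth in the number of students passing."""
    return 100 * (after - before) / before


# Hypothetical (English II grade, TAAS reading score) pairs.
students = [(85, 62), (75, 75), (72, 66), (65, 90)]
for grade, taas in students:
    print(grade, taas, passes_conjunctive(grade, taas), passes_compensatory(grade, taas))

# Passing counts reported in the text under each rule.
print(f"Black and Hispanic students: +{pct_increase(1395, 1765):.1f}%")
print(f"White students: +{pct_increase(436, 487):.1f}%")
```

          The first hypothetical student shows the compensatory effect: a grade of 85 offsets a TAAS score of 62. The last case shows that, without an added floor, such a rule would also let a high test score offset a course grade below 70, one of the design choices an empirical validation study would need to settle.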

