The Revised SAT Score and Its Potential Benefits for the Admission of Minority Students to Higher Education

This paper investigates the predictive validity of the Revised SAT (R-SAT) score, proposed by Freedle (2003) as an alternative to compensate minority students for the potential harm caused by the relationship between item difficulty and ethnic differential item functioning (DIF) observed in the SAT. The R-SAT score is the score minority students would have received if only the hardest questions from the test had been considered, and it was computed using a formula score and a regression approach. In this article we examine the potential effects of using the R-SAT of minority students in admissions decisions at selective institutions, its capacity to predict short- and long-term academic outcomes, and its potential benefits regarding differential prediction of college grades for minority students. To do so, we compared the performance of the R-SAT score with that of the standard SAT score in a sample of graduates from California public schools and in a subsample of students who enrolled in the University of California. We found that, in terms of the potential for college admissions for minority students, predictive power, and the issue of overprediction, the R-SAT score did not perform significantly better than the SAT score.

Education Policy Analysis Archives Vol. 23 No. 113


Introduction¹
Admission examinations are often assessed by how well they predict college outcomes. Predictive validity studies analyze the degree of association between admissions test scores, like SAT scores, and college outcomes, such as college grades and graduation. These sorts of academic outcomes are relatively easy to collect and are also related to other important behaviors linked to success in college. Some studies have also addressed the ability of admission examination scores to predict nonacademic outcomes such as earnings, leadership, job satisfaction, satisfaction with life and civic participation (Allen, Robbins, & Sawyer, 2010; Bowen & Bok, 1998; Oswald, Schmitt, Kim, Ramsay, & Gillespie, 2004; Willingham, 1985).
In this study, we examine a measure of academic preparedness that has been proposed to complement the SAT. The "Revised-SAT," or R-SAT, was proposed by Roy Freedle (2003) with the goal of correcting what he considered to be unfair results, which he found by applying the Standardization method for DIF to SAT data (Dorans & Holland, 1992; Dorans & Kulick, 1983, 1986). The R-SAT is based exclusively on a subset of the SAT questions, specifically the more difficult items.
There is a substantial body of literature on the validity of standardized test scores to predict college outcomes; however, we found no consensus among researchers about the predictive power of the SAT (Geiser & Santelices, 2007; Geiser & Studley, 2002; Ramist, Lewis, & McCamley-Jenkins, 1994; Zwick, 2002). The problem of differential prediction, that is, the differing power of SAT scores to predict the college grades of students from different ethnic groups, has also been extensively documented (Burton & Ramist, 2001; Geiser & Studley, 2002; Ramist et al., 1994; Zwick, Brown, & Sklar, 2004). In this context, the research presented in this article is important because it explores the potential benefits of considering the R-SAT for the admission of minorities into higher education, especially into selective institutions, and compares it to the current use of the standard SAT score. If, on the one hand, Freedle's findings and hypotheses about the R-SAT hold, the R-SAT would strengthen the validity of the use of test scores for admissions decisions of minority students, and there would be room for debate about the most appropriate score to use for both White and minority students. On the other hand, a finding of little or no support for the R-SAT would weaken the arguments for its consideration.

The Revised-SAT
Freedle observed a systematic relationship between item difficulty and differential item functioning in the SAT. This relationship is known as the "Freedle phenomenon": harder items tend to show DIF in favor of minority students, while easier items tend to show DIF in favor of White students (Freedle, 2003). Differential item functioning (DIF) studies are used as the first step of fairness studies and refer to how items function after differences in score distributions between groups have been statistically removed. Any remaining differences indicate that the items function differently for the two groups. Typically, the groups examined are derived from classifications such as gender, race, ethnicity, or socioeconomic status. The performance of the group of interest (focal group) on a given test item is compared to that of a reference or comparison group. White examinees are often used as the reference group, while minority students are often the focal groups (Holland & Wainer, 1993). The results of a DIF study need to be complemented with an analysis of whether the source of the difference in difficulty is relevant or irrelevant to the test construct in order to judge fairness for specific groups of students (Camilli, 2006).
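To make the idea of comparing groups "after differences in score distributions have been removed" concrete, the following is a minimal sketch of a standardized p-difference for a single item, in the spirit of the Standardization method (Dorans & Kulick, 1986). It assumes matching on total score and weighting by the number of focal-group examinees at each score level; the function name `std_p_dif` and the toy data are hypothetical, and the sketch ignores omits, not-reached items, and formula scoring, which operational analyses would need to handle.

```python
import numpy as np

def std_p_dif(total_scores, correct, is_focal):
    """Standardized p-difference for one item (illustrative sketch).

    At each total-score level, compare the proportion correct in the
    focal group with that in the reference group, then average the
    differences using focal-group counts as weights (an assumption of
    this sketch).
    """
    total_scores = np.asarray(total_scores)
    correct = np.asarray(correct, dtype=float)
    is_focal = np.asarray(is_focal, dtype=bool)

    numerator, denominator = 0.0, 0.0
    for level in np.unique(total_scores):
        at_level = total_scores == level
        focal = at_level & is_focal
        reference = at_level & ~is_focal
        # Only score levels where both groups are present can be compared.
        if focal.sum() == 0 or reference.sum() == 0:
            continue
        n_focal = focal.sum()
        numerator += n_focal * (correct[focal].mean() - correct[reference].mean())
        denominator += n_focal
    return numerator / denominator if denominator > 0 else float("nan")

# Hypothetical toy data: two score levels, four examinees at each level.
scores = [10, 10, 10, 10, 20, 20, 20, 20]
item = [1, 0, 0, 0, 1, 1, 1, 0]
focal = [True, True, False, False, True, True, False, False]
print(std_p_dif(scores, item, focal))  # 0.5 (positive values favor the focal group)
```

A positive value means that focal-group examinees answered the item correctly more often than reference-group examinees with the same total score; a negative value means the opposite.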
Freedle proposed a new way to calculate the score to correct for the potential unfairness caused by the systematic relationship between item difficulty and DIF observed in SAT items. The new score would capture how students perform on the hard half of the SAT and is called the Revised-SAT or R-SAT (Freedle, 2003). The R-SAT would be provided to colleges as a complement to the SAT for minority students and would be the score that African American students would get if only hard questions were considered. Freedle described this score as a more valid assessment of African Americans' knowledge. According to him, the R-SAT would increase the SAT verbal score by as much as 200 to 300 points for individual minority test-takers, it would reduce the mean score difference between White and minority test-takers by a third, and it would produce a score that is a better indicator of the academic ability of minority students.
Freedle, citing the work of Diaz-Guerrero and Szalay (1991), interprets the difference between a student's R-SAT and his/her regular SAT score as a measure of the degree to which the examinee's cultural background diverges from White, middle-class culture. In his paper, Freedle recommends exploring the validity of the R-SAT index (a) by examining the correlation between the observed R-SAT index and college grades relative to the correlation between the observed SAT score and college grades, and (b) by looking at how many admissions decisions would change if the R-SAT were used instead of the SAT (assuming that, say, a score of over 600 indicates that a student qualifies for college). Freedle recognizes that such predictive validity analyses will necessarily be of limited interpretability because of restriction of range, as many of the students who would potentially be admitted by the R-SAT will be absent from the college grades data; nevertheless, he considers it relevant to examine these predictions.
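Recommendation (b) can be illustrated with a small counting exercise. The sketch below assumes a single cutoff rule (a score strictly above 600 qualifies, following Freedle's illustrative threshold) and compares who qualifies under each score; the function name `admission_changes` and the scores are hypothetical, and real admissions decisions rest on many more factors than one test score.

```python
import numpy as np

def admission_changes(sat, rsat, cutoff=600):
    """Count students whose qualification status differs between scores.

    Returns (gained, lost): students who qualify under the R-SAT but not
    the SAT, and students who qualify under the SAT but not the R-SAT.
    """
    sat = np.asarray(sat, dtype=float)
    rsat = np.asarray(rsat, dtype=float)
    qualifies_sat = sat > cutoff
    qualifies_rsat = rsat > cutoff
    gained = int(np.sum(qualifies_rsat & ~qualifies_sat))
    lost = int(np.sum(qualifies_sat & ~qualifies_rsat))
    return gained, lost

# Hypothetical verbal scores for four students.
sat_scores = [580, 610, 550, 600]
rsat_scores = [620, 590, 560, 640]
print(admission_changes(sat_scores, rsat_scores))  # (2, 1)
```

Reporting both counts separately makes it visible that replacing one score with the other can remove some students from the qualifying pool as well as add others.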
His work was criticized by these researchers on technical grounds: (i) for the way Freedle had implemented the Standardization Approach to DIF and (ii) for using a dataset that preceded the ETS implementation of the bias and DIF sensitivity review for all items in the SAT. Freedle implemented the standardization approach using a non-standard denominator that did not consider omits and not-reached items, and he ignored the fact that the SAT is a formula-scored test. The sensitivity review was formalized at ETS in 1980.² The official response from the College Board (Camara & Sathy, 2004) to Freedle's 2003 paper stressed the role of guessing in the phenomenon that Freedle described. This report attributed the systematic pattern to low-ability students simply guessing the correct response to harder questions. This is also the contention of Bridgeman and Burton (2005), who illustrated it using ad hoc examples and the results from computerized testing. In addition, Bridgeman and Burton commented on the R-SAT, questioning its validity and reliability as an indicator of students' knowledge. Wainer (2009) also appealed to guessing to explain the phenomenon described by Freedle. He claims that the two parts of the standardization methodology (stratifying on total SAT score and drawing inferences from a division of items into two parts, easy and hard) are contradictory once one considers that students can answer a particular item correctly not only based on ability but also based on chance. Central to the argument is the assumption that, on average, White students have a higher ability level than African American students and that both groups have the same probability of guessing correctly. Under those assumptions, the observed relationship between item difficulty and DIF is to be expected, he says, as a "statistical artifact." Dorans (2004) and Dorans and Zeller (2004a) also criticized the methods Freedle used for calculating the necessary components of the R-SAT: the use of proportion correct rather than formula score, his consideration of different (ethnic) samples for the half-test, and his application of inverse regression. Furthermore, Dorans and Zeller (2004b) explored the fairness of Freedle's R-SAT using Score Equity Assessment (SEA), a new methodology presented as a complement to the existing procedures for fairness assessment, namely DIF analysis and differential prediction. Using SEA, Dorans and Zeller (2004b) found that the half-test to total-test linking may be population-dependent and that, therefore, the scores produced on the hard-half test cannot be used interchangeably with scores produced on the full-length SAT verbal test.
The Freedle phenomenon has been assessed with a large new data set in a series of recent papers (Santelices & Wilson, 2010a, 2010b, 2012). The analyses reported in those papers use a modified R-SAT approach that incorporates the changes recommended by the ETS researchers, and the data sets all post-date the changes that were made to the ETS review procedures. Although the results show the phenomenon to be less prevalent than Freedle originally reported, its existence was generally supported.
The research presented in this article sets aside the discussion about the "Freedle phenomenon," which has largely centered on item functioning and its relationship with item difficulty, and focuses on the use of the R-SAT, highlighting its predictive validity. This paper examines the potential changes in admissions decisions for minorities if the R-SAT were used in combination with the SAT, its overall predictive validity, and its potential benefit in differential prediction. In doing so, this study closely follows the recommendations made by both Freedle (2003) and his critic Dorans (2010). In his report, Dorans (2010) explicitly argues for the need to conduct predictive validity studies, not just DIF analysis, in order to address the questions raised by Freedle (2003):
The fairness questions raised … about access to higher education are score-use questions that cannot be addressed by a DIF analysis … Differential prediction addresses score use. These studies typically assess whether test scores, alone or with other information such as high school grades, predict first-year grade point averages equally well for different subgroups (p. 2).

The Role of SAT Scores in the Prediction of College Outcomes
The argument for using standardized scores in admissions decisions, along with other indicators, relies heavily on their contribution to the prediction of college outcomes. Student-level variables such as motivation, academic performance and social integration have been identified by researchers as key factors in explaining college academic success (Bean & Mentzer, 1985; Pascarella & Terenzini, 1991; Tinto, 2006). There is a substantial body of literature on the power of standardized test scores in particular to predict a variety of college outcomes (Bowen & Bok, 1998; Camara & Echternacht, 2000; Kobrin, Patterson, Shaw, Mattern, & Barbuti, 2008; Willingham, 1985; Willingham, Lewis, Morgan, & Ramist, 1990), but we will focus our attention on the prediction of (i) college grades and (ii) graduation rates. Although these outcomes offer only a partial portrayal of students' educational achievement, the convenience of their collection and their frequent and systematic reporting make them the outcomes most commonly used in predictive validity studies. Most often, researchers have examined the predictive validity of standardized test scores and high school grades using short-term academic outcomes, especially grades. Long-term outcomes are often assumed to be affected most significantly by financial aid and previous experience in college (Reason, 2009; Wilson, 1983); however, the importance of graduation as the milestone and main reason for students pursuing higher education convinced us of the need to examine its prediction.
The findings from the literature are contentious: there is no consensus on the merits of the SAT for predicting either short- or long-term outcomes (Geiser & Santelices, 2007; Geiser & Studley, 2002; Ramist et al., 1994; Zwick, 2002). Furthermore, researchers have found that SAT scores do not predict equally well for students from different ethnic groups and, in particular, tend to overpredict the performance of Hispanic and African American students (Burton & Ramist, 2001; Geiser & Studley, 2002; Ramist et al., 1994; Zwick et al., 2004).

College Grade Point Average
The relationship between high school grade point average, SAT scores and freshman grade point average has been widely examined by researchers at the College Board and research units within higher education institutions (e.g., Geiser & Studley, 2002; Ramist et al., 1994). In general, the College Board studies find that SAT scores make a substantial contribution to predicting cumulative college GPAs and that the combination of SAT scores and high school records provides better predictions than either grades or test scores alone (Burton & Ramist, 2001; Hezlett et al., 2001). College Board researchers have studied the validity of the SAT mostly using correlational analysis and have taken into consideration the technical issues of range restriction, differences in grading across colleges and unreliability of college grades as a measure of success in college (Camara & Echternacht, 2000; Willingham et al., 1990). Typical correlations between first-year grades and the SAT I (Verbal and Math scores combined) range between 0.3 and 0.6, depending on the characteristics of the studies, with an average of 0.4 (Ramist et al., 1994; Zwick, 2002). Bridgeman, Pollack, and Burton (2004), for example, report a correlation of 0.55 between freshman grades and the SAT I composite score; the SAT Verbal score correlates 0.50 with freshman grades and the SAT Math score 0.52.³
In 2005 the SAT I was revised in a number of ways (Kobrin, Patterson, Shaw, Mattern, & Barbuti, 2008), but ETS researchers still recommend using the test (especially the Writing test) in combination with high school grades when making admissions decisions, since that combination maximizes the predictability of first-year college grades (unadjusted r = 0.46; r adjusted for range restriction = 0.62). Researchers and advocates outside ETS, however, have stressed the low power of the SAT to predict college grades (FairTest, 2003; Geiser & Studley, 2002; Rothstein, 2004). Their arguments are based on the results of multivariate analyses that consider multiple academic predictors, including student variables and school-level sociodemographics. For example, Geiser and Studley (2002), after taking the SAT II and high school GPA into consideration, reported that the SAT I scores improved the overall prediction rate by a negligible 0.1% (from 21.0% to 21.1%). The standardized coefficient of the SAT I, after controlling for SAT II and high school GPA, was 0.07, but it was statistically significant due, at least in part, to the large number of observations used.

Differential Prediction
Notable differences in the validity and predictive power of SAT scores and high school grades by race have been substantiated through numerous studies (e.g., Young, 2004). These two variables often overpredict the grades of African American and Hispanic students and underpredict women's performance (Burton & Ramist, 2001; Geiser & Studley, 2002; Ramist et al., 1994; Zwick et al., 2004). Overprediction means that a group's average predicted first-year grade point average (GPA) is greater than its average actual first-year GPA. Ramist, Lewis, and McCamley-Jenkins (1994) find that overprediction occurs even more strongly when using high school GPA alone (i.e., without the SAT) to predict first-year college grades.
Analyses of differential prediction are used to examine the bias of a test according to Cleary's definition (1968), which defines bias against a specific subgroup as predictions of the criterion score obtained from a common regression line that are consistently too high or too low for members of that subgroup. There are a number of theories about the reasons for over- and underprediction. Some have attributed the phenomenon to statistical artifacts (unreliability of the measures); others believe it is related to the differing college experiences of various student groups. Others hypothesize that students differ in ways that are not fully captured by either their test scores or high school grades. The observed facts still remain a matter of debate (Steele & Aronson, 1998; Zwick, 2002, 2006; Zwick et al., 2004).⁴ Researchers have looked at the differential prediction of test scores and high school grades among students from different language backgrounds (Zwick & Schlemer, 2004; Zwick & Sklar, 2005) and from schools with different financial and teaching resources (Zwick & Himelfarb, 2011) as a way to investigate possible explanations for over- and underprediction. Results show a reduction of prediction error for Hispanic and African American students, but not a complete elimination (from -0.15 to -0.08 and from -0.13 to -0.03, respectively), when school resources are considered (the second approach), and no change when first language is considered.
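The notion of over- and underprediction under a common regression line can be sketched numerically. The example below fits one least-squares line to all students and then averages the residuals (actual minus predicted GPA) within each group, so that a negative group mean corresponds to overprediction. The function name `mean_prediction_error`, the group labels, and the data are hypothetical, and a single predictor is used only to keep the illustration short.

```python
import numpy as np

def mean_prediction_error(predictor, outcome, group):
    """Mean residual (actual - predicted) per group under a common line.

    A negative mean indicates overprediction for that group; a positive
    mean indicates underprediction.
    """
    x = np.asarray(predictor, dtype=float)
    y = np.asarray(outcome, dtype=float)
    g = np.asarray(group)

    # Common regression line estimated on all students together.
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef

    return {str(label): float(residuals[g == label].mean())
            for label in np.unique(g)}

# Hypothetical data: group B earns slightly lower grades than group A
# at every level of the predictor.
x = [0, 1, 2, 3, 0, 1, 2, 3]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
y = [2.1, 2.6, 3.1, 3.6, 1.9, 2.4, 2.9, 3.4]
errors = mean_prediction_error(x, y, groups)
print(errors)  # group B has a negative mean residual, i.e. it is overpredicted
```

In this toy case the common line sits halfway between the two groups, so it overpredicts group B and underpredicts group A by the same amount.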

College Graduation
From an economic perspective, the immediate goal of attending post-secondary education at the individual student level is college graduation (Hout, 2012). Studies exploring the role of SAT scores in college persistence and college graduation find only a moderate relationship (Astin, Tsui, & Avalos, 1996; Burton & Ramist, 2001; Mattern & Patterson, 2009, 2011a, 2011b). Wilson (1983) observes that the best predictors of college graduation are persistence to sophomore year and first-year GPA. This information is closest in time and in content to what is being predicted, but it is not available at admission. Studies attempting to predict interim persistence (return for sophomore year and five-semester persistence) have obtained low correlations, which range from 0.01 (high school grade point average) to 0.17 (SAT Math).
Although the traditional variables included in multivariate regression models explain a small proportion of the variance in graduation, Geiser and Santelices (2007) found high school grades to be the strongest predictor, followed by the SAT II Writing scores. Zwick and Sklar (2005) corroborated the importance of high school grades. Sociodemographic variables play a minor role in explaining college graduation (Geiser & Santelices, 2007); nevertheless, Bowen and Bok (1998) found these variables to be more important in predicting graduation for African American students than for White students.
The lower correlation between college graduation and preadmission characteristics is to be expected, since persistence in college and ultimate graduation are more substantially influenced by nonacademic factors than college GPA is. Some of the nonacademic variables that research has identified as playing an important role in determining persistence and graduation are finances, motivation, social adjustment, family and health problems, and the institution's selectivity and size (Bowen, Chingos, & McPherson, 2009; Reason, 2009).

Non-Academic Predictors of College Success
A number of studies looking into the importance of non-academic variables for predicting college success have called for expanding the definition of college success to include longer-term outcomes, such as persistence and graduation, as well as less-researched outcomes, such as leadership and civic participation (Camara & Kimmel, 2005; Kyllonen, 2008; Robbins, Lauver, Le, Davis, & Langley, 2004; Sternberg, 1999, 2003). Doing so allows college success to be predicted using a broader range of indicators and thus avoids exclusive reliance on cognitive criteria and predictors. This seems a suitable recommendation in light of universities' broader missions, which include social and personal outcomes for their students (Perfetto, 1999; Stemler, 2012), and of the potential for reduced adverse impact on the admission of traditional minority students (Breland, Maxey, Gernard, Cumming, & Trapani, 2001; Oswald et al., 2004; Sinha, Oswald, Imus, & Schmitt, 2011; Sternberg, Gabora, & Bonney, 2012).

Why Are Standardized Tests Used in Admissions?
Despite the contentious arguments about the value of standardized tests in the prediction of college grades and graduation, higher education institutions continue to rely on standardized tests to make admissions decisions. Zwick (2002) justifies the use of standardized test scores in admissions to large institutions by noting the cost of interviewing candidates or reviewing applications in elaborate detail. The cost to the school of collecting and processing the scores, she says, is very small compared to the cost of these alternatives. Tests give all applicants the opportunity to perform under the same testing conditions, instructions and time constraints. Standardized test scores allow the comparison of students who come from different schools in which grading standards can vary significantly.
The continued reliance of higher education institutions on standardized tests makes alternative instruments and complementary scores especially relevant. The mixed conclusions from the research regarding the contribution of the SAT to the prediction of college grades and graduation, the overprediction of African American and Hispanic students' performance in college, and the observed relationship between DIF and item difficulty all call into question the validity of the use of SAT scores in admissions decisions. These validity issues should be considered in addition to the disparate effects of the SAT on minorities and their access to higher education.

Research Questions
The current paper explores the potential benefits of the R-SAT score for minority students. Rather than addressing the criticisms of the design of the R-SAT (e.g., differential item functioning), we address the questions that bear on the use of the R-SAT, i.e., those that are most relevant for admissions officers:

1) Would use of the R-SAT score increase the number of minority students admitted to selective institutions?
2) Does the R-SAT score better predict the college outcomes of minority students than the SAT score?
3) Does the R-SAT score help ameliorate the issue of overprediction for African American and Hispanic students?
To answer these questions we first calculated the R-SAT and then studied how beneficial it would be for minority students if the R-SAT were considered in admissions decisions at selective institutions. We compared the predictive power of the R-SAT, relative to the original SAT, both considering all students and then differentially by race. Finally, we analyzed whether the R-SAT score would help ameliorate the issue of overprediction for African American and Hispanic students. The predictive validity analyses considered the maximum of the SAT Verbal score and the R-SAT Verbal score for minority students, not just the revised SAT score, in light of Freedle's recommendation to report both scores and to treat the difference between them as a measure of the extent of cultural differences between White and minority students.⁵ In addition, the regression models included sociodemographic variables based on the results of Rothstein (2004), who finds that most of the SAT's predictive power comes from its correlation with sociodemographic variables. Although parental income and education play a modest role in the prediction of college performance when controlling for additional academic indicators such as high school grades and standardized tests (Geiser & Studley, 2002),⁶ Rothstein's (2004) estimates show that the predictive contribution of the SAT I score is 60% lower than would be indicated by traditional methods that only consider academic variables.

Methodology
Data Sources
To investigate the first research question we drew on the College Board datafile of California public high school seniors who took SAT forms DX and QI in 1994 or SAT forms IZ and VD in 1999 and whose best language was English. We only considered groups and forms in which the Freedle phenomenon has been observed and reported before (Santelices & Wilson, 2010a, 2010b, 2012). In particular, the R-SAT was calculated for African Americans in forms IZ, QI and DX and for Hispanics in forms IZ and VD (see Table 1). The College Board datafile allowed us to explore the research questions in a sample of considerable size, especially for minority students, as it combines students from all public high schools in California. The College Board datafiles contained students' item-level responses and individual scores, as well as their responses to a Student Data Questionnaire (43 questions), which included self-reported demographic and academic information such as parents' education, family income, and high school grade point average. Restricting the sample to students whose best language is English is a standard requirement in DIF studies of the SAT such as this one; it yields a group of students with a common, mainstream educational experience and avoids confounding DIF results with other educational needs (see Table 2).
In order to answer the second and third research questions, the information from the College Board just described was complemented with data from the University of California Corporate Data System, which contains systemwide admissions and performance data for all students who applied and then enrolled at UC. Through their applications to UC, students provide academic and demographic information that is subsequently verified and standardized. For those students who enroll at UC, this information contains their academic history as well, including college grades, number of courses, number of units completed and graduation. Information about parental education level and family income is also available for students who attended. An indicator of school performance on a state standardized test (Academic Performance Index) from the California Department of Education (2014) was also added to the file. The school academic performance index was not available for the students who took the SAT in 1994 because the index was calculated for the first time in 1998; thus only results for students taking SAT forms IZ and VD in 1999 are presented. This dataset allows us to explore the research questions in a sample of considerable size, especially for minority students, as it combines students from nine University of California campuses.
As a result of the eligibility criteria and of enrollment decisions, the sample used for the predictive validity analyses has a higher mean SAT score, high school grade point average, family income and parental education than the College Board sample of all high school juniors from California public high schools who took SAT forms DX and QI in 1994 and SAT forms IZ and VD in 1999, which was used to answer the first research question (see Table 3).

Analyses
This section presents the details of how the R-SAT score was calculated and how the relative predictive power of these scores was assessed. Since previous studies found stronger evidence of the relationship between DIF estimates and item difficulty in the Verbal test than in the Mathematics test (Santelices & Wilson, 2010a, 2010b, 2012), all the analyses focus on the Verbal test, although always controlling for the Mathematics scores.
The analyses exploring the impact of Freedle's R-SAT on admissions decisions, and the subsequent analyses of the R-SAT's predictive validity and differential prediction, consider the maximum of the SAT Verbal score and the R-SAT Verbal score for minority students, and not just the revised SAT score. This is done in consideration of Freedle's own recommendations: "the solution is to recognize that this is pervasive phenomenon that can be easily remedied by reporting two scores, the usual SAT and the R-SAT" (Freedle, 2003). Because Freedle recommends reporting both scores for minority students and interprets the difference between them as an indication of the magnitude of the difference between the White majority's culture and the cultural background of minority groups, we consider the maximum of the two scores.
Calculation of the revised SAT score. The R-SAT was obtained by calculating the corresponding formula score⁸ in the hardest half of the test for all students who took each test form and then assigning African American/Hispanic students the total score obtained by White students who performed similarly in the hard half of that specific test form. Specifically, in order to obtain the revised score for African American/Hispanic students, a linear regression was first estimated only among the White students who took each form. This regression predicted their SAT scores from the formula score they obtained in the hard half of the test. A constant and a slope coefficient were estimated, and those parameter estimates were subsequently applied to the formula score obtained in the hard half of the test by each African American and Hispanic student. This methodology is the same as the one originally used by Freedle (2003), with the exception that we incorporated Dorans and Zeller's recommendation to use formula scores rather than the proportion-correct scores that Freedle used (Dorans, 2004; Dorans & Zeller, 2004a). The R-SAT thus expresses the number of correct responses (adjusted for random guessing) on a score metric that ranges from 200 to 800, just like the regular SAT Verbal score. The scores of White students are used as the reference because they have been considered the reference group in previous DIF analyses.
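The calculation described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: it assumes five-option multiple-choice items, so that the formula score is the number right minus one quarter of the number wrong, and it assumes one form and one hard half per call. The names `formula_score` and `revised_sat` and the toy data are hypothetical.

```python
import numpy as np

def formula_score(n_right, n_wrong, n_choices=5):
    """Formula score: rights minus a penalty for wrong answers.

    Assumes n_choices response options per item (five in this sketch);
    omitted items contribute nothing.
    """
    return n_right - n_wrong / (n_choices - 1)

def revised_sat(hard_half_fs, sat_verbal, is_white):
    """Sketch of the regression-based R-SAT for one test form.

    1. Regress SAT Verbal on the hard-half formula score among White
       students only.
    2. Apply the estimated intercept and slope to every student's
       hard-half formula score.
    Returns the predicted (revised) score for all students; callers
    decide which students it applies to.
    """
    hard_half_fs = np.asarray(hard_half_fs, dtype=float)
    sat_verbal = np.asarray(sat_verbal, dtype=float)
    is_white = np.asarray(is_white, dtype=bool)

    slope, intercept = np.polyfit(hard_half_fs[is_white],
                                  sat_verbal[is_white], deg=1)
    return intercept + slope * hard_half_fs

# Hypothetical data: three White students define the regression line;
# the fourth student is a minority examinee.
fs = np.array([0.0, 10.0, 20.0, 15.0])
sat = np.array([300.0, 500.0, 700.0, 550.0])
white = np.array([True, True, True, False])

rsat = revised_sat(fs, sat, white)
# Score used in the analyses for minority students: max of SAT and R-SAT.
best = np.where(white, sat, np.maximum(sat, rsat))
print(round(float(rsat[3])), round(float(best[3])))
```

In the toy data the White students' line is SAT = 300 + 20 × formula score, so the minority examinee with a hard-half formula score of 15 receives a revised score of about 600, which exceeds the observed 550 and is therefore the score carried forward.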
Predictive validity analyses. The predictive power of the regular SAT verbal score and the R-SAT score were compared for African American, Hispanic, and White students. Linear regression was used for GPA prediction and logistic regression was used for the prediction of graduation (because UC GPA is a continuous numerical variable and graduation is a dichotomous outcome variable). Similar to the findings of Rothstein (2004), and contrary to the results reported by Bridgeman et al. (2004), visual inspection of scatterplots and the examination of linear, logarithmic, and exponential trends supported a linear relationship. The ordinary least squares method was used for estimating the linear regressions, and the maximum likelihood technique was used for estimating the logistic regressions. The college outcomes examined were the first- through fourth-year annual UC GPAs, the cumulative fourth-year UC GPA, and whether students graduated by their fourth year at UC. All explanatory variables presented in models (1), (2), and (3) were introduced at once. No stepwise procedure was used. The academic outcomes included in this study are of particular interest because they are not limited to grade point averages and span four years of the college career of students taking the SAT in 1999.
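As a rough illustration of the two estimation approaches named above, the sketch below fits a linear model by ordinary least squares and a logistic model by maximum likelihood using Newton-Raphson iterations, with all predictors entered at once. It is not the authors' code: the function names `fit_ols` and `fit_logit`, the two generic predictors, and the simulated data are assumptions made for the example, and a statistics package would normally be used to obtain standard errors and fit statistics.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def fit_logit(X, y, n_iter=25):
    """Logistic regression by maximum likelihood (Newton-Raphson)."""
    design = np.column_stack([np.ones(len(y)), X])
    y = np.asarray(y, dtype=float)
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        weights = p * (1.0 - p)
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Simulated (hypothetical) data: two standardized predictors, a
# continuous GPA-like outcome, and a binary graduation-like outcome.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
gpa = 3.0 + 0.4 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.3, size=n)
p_grad = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * X[:, 0])))
graduated = (rng.random(n) < p_grad).astype(float)

ols_coef = fit_ols(X, gpa)            # intercept and two slopes
logit_coef = fit_logit(X, graduated)  # coefficients on the log-odds scale
print(ols_coef, logit_coef)
```

Comparing two versions of such a model that differ only in the verbal score entered (the SAT Verbal score versus the maximum of the SAT and R-SAT scores) is one way to contrast the predictive contribution of the two indices.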
Although sociodemographic covariates are not used in admissions, the analyses controlled for academic and sociodemographic variables found to be significant in previous college prediction research (Geiser & Studley, 2002; Zwick et al., 2004) because they change the estimated prediction power of test scores (Rothstein, 2004). The sociodemographic variables included parents' education and income level from the UC systemwide admissions and performance data. The academic variables included a weighted high school GPA, calculated with up to eight honors-level courses, the SAT Math score, and the school Academic Performance Index expressed as quintile ranks for students who took the SAT in 1999.

7 The correlation between the SAT Verbal score and the maximum of the SAT Verbal score and the R-SAT Verbal score among minority students in the 1999 cohort is 0.948. Appendix A shows regression results using models that compare the predictive power of the original SAT Verbal score to that of the R-SAT score, in addition to those using the maximum of the SAT and the R-SAT score. These models also exclude the API rank as an explanatory variable. Results do not provide stronger support for using the modified admission scores.

8 Formula scoring adjusts scores for the possibility of random guessing (Frary, 1988; Rogers, 1999).
Equations 1, 2, and 3 show the general regression models for the prediction of annual UC GPA, cumulative fourth-year UC GPA, and fourth-year UC graduation, respectively.

$$\mathrm{UCGPA}_{ikjs} = \beta_0 + \beta_1 \mathrm{APIQ}_{ik} + \beta_2 \mathrm{Educ}_{ik} + \beta_3 \mathrm{Inc}_{ik} + \beta_4 \mathrm{HSGPA}_{ik} + \beta_5 \mathrm{SATM}_{ik} + \beta_6 Z_{iks} + e_{ikjs} \tag{1}$$

$$\mathrm{CUMUCGPA4}_{iks} = \theta_0 + \theta_1 \mathrm{APIQ}_{ik} + \theta_2 \mathrm{Educ}_{ik} + \theta_3 \mathrm{Inc}_{ik} + \theta_4 \mathrm{HSGPA}_{ik} + \theta_5 \mathrm{SATM}_{ik} + \theta_6 Z_{iks} + e_{iks} \tag{2}$$

$$\mathrm{LOGIT}(\mathrm{GRAD4}_{iks}) = \alpha_0 + \alpha_1 \mathrm{APIQ}_{ik} + \alpha_2 \mathrm{Educ}_{ik} + \alpha_3 \mathrm{Inc}_{ik} + \alpha_4 \mathrm{HSGPA}_{ik} + \alpha_5 \mathrm{SATM}_{ik} + \alpha_6 Z_{iks} \tag{3}$$

In model (1), $\mathrm{UCGPA}_{ikjs}$ is the grade point average that student $i$ of ethnicity $k$ (where $k$ = 1 for African American, 2 for Hispanic, and 3 for White students) had in year $j$ of college, considering verbal ability index $s$, where $j$ ranges between 1 and 4 and $s$ is either the SAT Verbal score ($s$ = 1) or, for minority students, the higher of the R-SAT Verbal score and the original SAT Verbal score ($s$ = 2). $\mathrm{APIQ}_{ik}$ refers to the ranking of the school attended by student $i$ of ethnicity $k$ in the California Academic Performance Index; $\mathrm{Educ}_{ik}$ is the maximum number of years of education achieved by the parents of student $i$ of ethnicity $k$ as reported in the UC application; $\mathrm{Inc}_{ik}$ refers to the family income of student $i$ of ethnicity $k$ reported in the UC application (expressed in dollars); $\mathrm{HSGPA}_{ik}$ is the weighted high school GPA of student $i$ of ethnicity $k$, considering up to eight honors-level courses (which was the index used by UC at that time); $\mathrm{SATM}_{ik}$ is the score that student $i$ of ethnicity $k$ obtained in the SAT Mathematics test; and $Z_{iks}$ refers to the different indices of verbal ability of student $i$ of ethnicity $k$. In the first version of model (1) the verbal ability indicator is the SAT Verbal score ($s$ = 1). The second version of model (1) uses the higher of the R-SAT Verbal score and the original SAT Verbal score for minority students ($s$ = 2). Thus, for each academic year, there are two versions of model (1) for African American students and two versions for Hispanic students, which differ in the verbal ability index included ($s$ = 1 or $s$ = 2). Because the R-SAT was not available for White students, there was only one version of model (1) for them in each academic year, using just the SAT Verbal score ($s$ = 1).⁹ Finally, $e_{ikjs}$ is a random error with expected value equal to 0 and variance equal to $\sigma^2_e$.
In model (2), $\mathrm{CUMUCGPA4}_{iks}$ refers to the cumulative grade point average at the fourth college year of student $i$ of ethnicity $k$, considering verbal ability index $s$. In model (3), $\mathrm{GRAD4}_{iks}$ is a binary variable indicating whether student $i$ of ethnicity $k$ graduated by the fourth year of college, considering verbal ability index $s$. For African American and Hispanic students, just as in model (1), there were two versions of models (2) and (3), which differed in the verbal ability index included ($s$ = 1 or $s$ = 2). For White students there was only one version of models (2) and (3), considering only the SAT Verbal score.

9 The model presented in the text includes only SAT I Verbal (both the original score and the maximum of the SAT and the R-SAT scores) and SAT I Math scores as explanatory variables, and not SAT II scores, because (a) students took different SAT II tests, and the characteristics of these different tests vary considerably, and (b) most higher education institutions require only the SAT I exam and hence results from these models will be more generalizable to other institutions. As a check, we conducted regressions including SAT II test scores as explanatory variables and found that they did not offer stronger evidence in support of the R-SAT Verbal test score, either through larger and statistically significant coefficients or through positive changes in the R². Details are available from the authors upon request.
All campus data are aggregated in the regression analyses and there is no control for the effect of discipline or campus on the dependent variable due to the small sample size of minority groups (Brown & Zwick, 2006).Student sample size also limited our ability to consider the within and between school variation in high-school GPA and API quintile (Zwick & Green, 2007), therefore no multilevel modeling was conducted.
The linear regression analyses compared the explained variance across models, measured by the standard R² (Singer & Willett, 2003). In logistic regression we used a pseudo-R² computed from the likelihood of the intercept-only model, the likelihood of the specified model, and the sample size n. The standardized coefficients for the prediction of first-year GPA, fourth-year cumulative GPA, and fourth-year graduation are shown in Appendix B.
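For readers who want to see how a likelihood-based pseudo-R² of this kind can be computed, the sketch below uses the Cox and Snell form, 1 - (L0/LM)^(2/n), which depends only on the intercept-only likelihood L0, the model likelihood LM, and the sample size n. That this is the exact index used in the study is an assumption; the function names and the data are hypothetical.

```python
import math

def cox_snell_r2(loglik_null, loglik_model, n):
    """Cox-Snell pseudo-R^2, 1 - (L0 / LM)^(2/n), computed from log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)

def bernoulli_loglik(outcomes, probs):
    """Log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(outcomes, probs))

# Made-up graduation indicators and fitted probabilities from some logistic model.
graduated = [1, 0, 1, 1, 0, 1, 0, 1]
fitted = [0.8, 0.3, 0.7, 0.9, 0.4, 0.6, 0.2, 0.7]

# The intercept-only model predicts the overall graduation rate for everyone.
base_rate = sum(graduated) / len(graduated)
ll_null = bernoulli_loglik(graduated, [base_rate] * len(graduated))
ll_model = bernoulli_loglik(graduated, fitted)
r2 = cox_snell_r2(ll_null, ll_model, len(graduated))
```

Working with log-likelihoods rather than raw likelihoods avoids numerical underflow when n is large, which is why the ratio is written as an exponentiated difference.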
Differential prediction of freshmen grades. Underprediction or overprediction is usually assessed by fitting one general prediction model for college students from all ethnic groups and then summing the regression residuals for a particular ethnic group. The average individual overprediction or underprediction is calculated by adding the residuals and then dividing the sum by the number of students in each ethnic group. Here, regression models 1.1 and 1.2 were estimated on the pooled sample, without distinguishing among ethnicities, and the average residual was then compared across ethnic groups. The two regressions differed only in the verbal ability indicator: the first considers the SAT Verbal score (1.1) and the second considers the maximum of the SAT Verbal score and the R-SAT Verbal score (1.2). All explanatory variables included in these models were described above.
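The residual-averaging procedure can be illustrated with a short Python sketch. It is a simplification under stated assumptions: a single predictor stands in for the full set of explanatory variables in models 1.1 and 1.2, the group labels and data are made up, and the function names are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def mean_residual_by_group(rows):
    """rows: (group, predictor, outcome). Fit one pooled line, then average
    the residuals (actual minus predicted) within each group."""
    intercept, slope = fit_line([r[1] for r in rows], [r[2] for r in rows])
    residuals = {}
    for group, x, y in rows:
        residuals.setdefault(group, []).append(y - (intercept + slope * x))
    return {g: mean(v) for g, v in residuals.items()}

# Made-up records: (group label, test score, first-year GPA).
rows = [("A", 500, 3.0), ("A", 600, 3.4), ("B", 500, 2.8), ("B", 600, 3.2)]
avg = mean_residual_by_group(rows)
# A positive mean residual means grades are underpredicted for that group;
# a negative mean residual means they are overpredicted.
```

In this toy data set the pooled line sits between the two groups, so group A is underpredicted and group B is overpredicted by the same amount.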

Results
This section presents the results of this research in three parts: the calculation of the R-SAT, its predictive validity compared to the SAT, and finally the R-SAT's performance on the issue of overprediction and underprediction.

Freedle's Revised SAT Verbal Score
The adjusted scores were calculated for a total of 3,922 Hispanic examinees and 2,234 African American examinees who graduated from California public high schools. The mean R-SAT Verbal score is higher than the mean original SAT Verbal score in all of the ethnic groups and test forms that we examined. On average, the R-SAT Verbal score increases the mean score of African American students from 382.5 to 407 (a 6.4 percent increase) and the mean score of Hispanic students from 471.6 to 484.0 (a 2.6 percent increase). It is important to consider that the average SAT Verbal score for African American and Hispanic students is 439 and the standard deviation is 109 points (see Table 2). The increase of 17 points between the mean SAT and R-SAT Verbal scores amounts to 16% of a standard deviation. Table 4 contrasts the original mean SAT Verbal score and the mean R-SAT Verbal score for Hispanic and African American examinees. These data are presented for the overall sample of Hispanic and African American examinees, as well as for each test form in which the Freedle phenomenon could not be rejected. Table 5 provides greater detail about the degree to which the R-SAT Verbal score benefits minority students. Note that the bottom three rows display the students who benefit from the use of the R-SAT Verbal score. We observe that 68% of African American examinees (a total of 1,537 out of 2,234) improve their scores when the R-SAT Verbal score is considered in place of the SAT Verbal score. The same occurs for 58% of the Hispanic sample (a total of 2,271 out of 3,922). In addition, the R-SAT Verbal score tends to benefit mostly students in the low end of the original SAT Verbal score distribution. While most examinees increase their scores by between 0 and 50 points, the increment reaches as high as 202 points in a number of cases. On average, however, the score increase is not as large as Freedle described it to be (Freedle, 2003) and would be of little benefit to African Americans especially, who tend to start from a lower score in comparison to Hispanics. The scatterplot of the SAT Verbal score and the R-SAT Verbal score (Figure 1) shows the same phenomenon. It is important to note the relatively lower variance of R-SAT scores compared to SAT scores.

In order to assess the impact of the revised SAT score on the admissions decisions of minority students, Freedle estimated and compared the number of African American students who would be offered admission at competitive colleges when considering each score. Freedle hypothesized that receiving an R-SAT score of at least 600 would be sufficiently meritorious to interest many colleges in an applicant who received such a score. Freedle chose to consider an SAT score of 600 or above as meritorious because students whose high school grade point average is between the 97th and 100th percentile receive an average SAT Verbal score of 610 and, in addition, a score of 600 reflects a level of test performance that only a small proportion of the test-taking population achieves (Freedle, 2003). He reported that by considering the revised SAT score instead of the original SAT score, the number of African Americans scoring over 600 in two of the forms he analyzed increased from 166 to 235 (Form 4I) and from 117 to 167 (Form OB023), which he reported as equivalent to an increase in admission to selective colleges of 342% and 334%, respectively (p. 15).
The analyses reported here show an effect in the same direction Freedle described. However, the impact on the number of African American students whose admissions are likely to have changed is smaller. When using the maximum of the SAT and the R-SAT Verbal scores, the number of African American students scoring over 600 increases from 79 to 86. This represents an increase of 8.9% over the original number of African American students in the sample scoring over 600 (see Table 6), or an increase from 3.5% of all African Americans to 3.8%. When considering both African American and Hispanic students, the number of students scoring over 600 increases from 458 (7.4% of all minority students) to 516 (8.3% of all minority students), which is equivalent to an increase of 12.6%. Overall, 8.3% of minority examinees now score over 600. In comparison, 3,889 White students, or 19.7% of all White examinees, score 600 or above, with an average score of 653. Considering an admission cut-off score other than 600 would result in a significant benefit for minorities only if the cut-off were drastically reduced. More than 60% of the African American and Hispanic students considered in this analysis would receive an R-SAT Verbal score below 450; therefore, only an admission cut-off score around or below this level would result in a different admission decision. This sort of change in an admission score level, however, does not seem consistent with the assumption of being admitted to highly competitive colleges.

Predictive Validity of the Revised SAT Verbal Score
Table 7 shows the standard R² for the multivariate models estimated within each ethnic group.¹⁰ The overall predictive power of the models examined varies depending on the academic outcome and ethnic group. In general, the models predict college grades better for White students than for minority students. While the capacity to predict annual college grades for White and Hispanic students tends to decline over time in college, the overall prediction of cumulative fourth-year grade point average is unexpectedly high for these same groups. In addition, the prediction of fourth-year graduation is weaker than the prediction of annual college grades for both White and Hispanic students.

Note: Pseudo R² is reported for the logistic regression used to predict fourth-year graduation.

10 In order to increase the sample size, results for the R-SAT Verbal score were combined across all SAT forms. This aggregation was possible because the performance in each form was previously scaled by ETS. Scaling refers to a psychometric process conducted to achieve comparability among test scores from different test forms. The aggregation also assumes that the four SAT forms were equated during test development. Equating is a process different from scaling and aims to adjust for differences in difficulty among test forms. For an introduction to traditional scaling and equating methods see Kolen (1988).
Table 7 shows that the capacity to predict college outcomes using the R-SAT Verbal score is close to, but slightly lower than, the predictive power achieved when using the original SAT score. The model using the R-SAT Verbal score predicts better than the model using the original SAT score in only two out of twelve cases: the African American group's fourth-year college grade point average and fourth-year cumulative grade point average. The differences in predictive power are also of little practical significance, as they range between 0.1 and 0.5 percentage points.

Differential Prediction of Freshmen Grades
We found underprediction of White students' first-year grades (0.01) and overprediction of Hispanic (-0.025) and African American students' first-year grades (-0.098) when using the SAT, just as previous research did (Ramist et al., 1994; Ramist, Lewis, & McCamley-Jenkins, 2001). We found no improvement in prediction from using the R-SAT Verbal score for minority groups. On the contrary, the average prediction errors for minorities became more negative (that is, larger in absolute terms) when using the maximum of the SAT and R-SAT Verbal scores: -0.032 for Hispanic students and -0.114 for African American students. The same analysis was conducted for fourth-year cumulative UC GPA, and the average overprediction for African American and Hispanic students increased in absolute terms as well (from -0.181 to -0.194 and from -0.033 to -0.040, respectively).¹¹ These differences are larger and thus more important than the differences in the prediction of first-year GPA.

Discussion and Conclusion
The analyses presented above show that in this sample the R-SAT score does result in increased scores for minority students, although not by as much as Freedle expected. On average, it increases scores by 24 points (6%) for African American students and by 12 points (2.5%) for Hispanic students. Using Freedle's assumptions, the consideration of the R-SAT would change admissions decisions for minority students admitted into selective colleges by about 10%. This is much less than Freedle's prediction of an increase of approximately 300%, but it should be given some consideration since such an increase could be educationally significant in some contexts, especially at the most selective institutions. The small increases in R-SAT scores reported in our research are consistent with the magnitude of score increase reported by Dorans (2004) and Dorans and Zeller (2004a). In addition, the predictive validity analyses show virtually no difference in the capacity to predict short- and long-term outcomes when using either the original or the revised SAT score. The R² values obtained using the SAT Verbal score for the prediction of college grades for African American, Hispanic, and White students are consistent with the results reported by similar studies (Geiser & Studley, 2002; Zwick et al., 2004). Geiser and Studley (2002), for example, reported R² values close to 10% for African American students (p. 15). When predicting graduation, however, the models predict better for African Americans and Hispanics than for White students. The limited incremental predictive power of the maximum of the R-SAT Verbal and SAT Verbal scores may be explained by the lower variance observed in R-SAT scores when compared to SAT Verbal scores, which is related to the fact that R-SAT scores are actually regression predictions.
Also, results show that the traditional problem of overprediction and underprediction would remain approximately the same when using the revised SAT score. On average, the overprediction estimated in this study lies in the range between the overprediction reported for African American

11 Regression results are available from the authors upon request.
This research has several limitations. Among them is the fact that the predictive validity analyses were conducted on a group of students who were already accepted to college and therefore present a significant restriction of range in some of the explanatory variables. In addition, many students who did not attend selective colleges might have matriculated at such schools if their R-SAT scores had been used in the admission process, which limits to some extent the validity of our findings. This limitation, however, has also been present in other predictive validity studies (Geiser & Studley, 2002; Zwick, 2002; Zwick et al., 2004; Zwick & Sklar, 2005). The aggregation of different ethnic groups in order to obtain the R-SAT scores is still subject to Dorans and Zeller's original criticisms (Dorans & Zeller, 2004a, 2004b). Recent changes to the content of the SAT and the inclusion of a Writing test may limit the generalizability of the findings presented here, since they were based on somewhat older test forms. A larger sample size for each minority group, especially for African American students, would be desirable in future research. Increasing the sample size, however, remains a daunting task, as it requires data from an even greater number of colleges and universities than the nine campuses of the University of California examined here. Nevertheless, despite the limited sample size of African American and Hispanic students, we were still able to observe results similar to those reported by previous research, such as the statistical significance and practical importance of high school grades for predicting college grades and graduation (see Appendix B). These results provide support for the validity of our results for these particular samples.
We think it is important to highlight the consistency of the results obtained in the numerous and diverse analyses implemented in this research: no strong evidence in favor of the R-SAT score is observed (a) when recalculating the scores using only the most difficult items for minorities, (b) when using the maximum of the R-SAT and the SAT Verbal score to directly predict short- and long-term outcomes for African American and Hispanic students using models that did not consider SAT II scores, and (c) when evaluating the overprediction and underprediction problem for African American and Hispanic students. Although not presented here, we also found results that did not support the use of either the R-SAT score or the maximum of the R-SAT and the SAT Verbal score in models that considered SAT II scores, and in models that did not control for school quality and so allowed larger sample sizes.
The findings presented in this article consistently reveal that there are only minimal benefits associated with Freedle's R-SAT and suggest that, rather than using measures aimed to complement the SAT, efforts and energy should be directed to studying the phenomenon behind the systematic relationship between item difficulty and DIF estimates and to directly addressing those issues during test development. The investigation of potential causes should explore Freedle's proposed explanation, the influence of academic versus home language (Freedle, 2010), including examination of the cognitive processes of students while taking the test as well as quantitative analyses and modeling techniques (De Boeck, 2010). In addition, further research should investigate the relationship between Freedle's phenomenon and alternative forms of guessing, such as differential guessing strategies between White students and students from other ethnic groups.
These results also suggest that alternative policy options should be considered if the goal is to increase the representation of minority groups in higher education, especially at highly selective institutions (Bowen et al., 2009).Those options may include the use of school quality indices as input in the admissions processes (Zwick & Himelfarb, 2011) as well as explicitly considering nonacademic outcomes as college goals and therefore adjusting the weight of admission indicators accordingly (Sinha et al., 2011).

Figure 1. Scatterplot of R-SAT and SAT Verbal Score for all Hispanic and African American Students included in Sample.

Table 1
Number of Students for Whom the Revised Score Was Calculated

Table 2
Sample Descriptive Statistics by Ethnicity. College Bound Students from California Public High Schools who Took the SAT Forms DX and QI in 1994 or Forms IZ and VD in 1999 and for Whom DIF was Observed

Table 3
Sample Descriptive Statistics by Ethnicity. Students who Took Forms IZ and VD in 1999 and for Whom DIF was Observed and who Enrolled at the University of California

Table 4
Mean Score for Minority Groups. Mean SAT Verbal Score Versus Mean R-SAT Verbal Score

Table 5
Distribution of Score Difference by Ethnic Groups and Corresponding Mean SAT Verbal Score. Overall Sample

Table 6
Number of Examinees Scoring 600, or Above, in the Sample and their Mean Scores

Table 7
Standard R² Multivariate Regression Model. 1999 Cohort. SAT Verbal Scores and the Maximum between the SAT Verbal Scores and the Revised SAT Verbal Scores

Table 2
Prediction Power of Cumulative Fourth-Year UC GPA. Standardized Estimates and Statistical Significance (p-values) by Ethnic Group

Table 3
Prediction Power of Fourth-Year UC Graduation. Standardized Estimates and Statistical Significance (p-values) by Ethnic Group
Note: Pseudo R² is reported for the logistic regression used to predict fourth-year graduation.