Butler Journal of Undergraduate Research

Faculty Sponsor

Erin Castro


The following work examines the Random Forest (RF) algorithm as a tool for predicting student outcomes and interrogating the equity of postsecondary education pipelines. The RF model, created using longitudinal data of 41,303 students from Utah's 2008 high school graduation cohort, is compared to logistic and linear models, which are commonly used to predict college access and success. Substantially, this work finds High School GPA to be the best predictor of postsecondary GPA, whereas commonly used ACT and AP test scores are not nearly as important. Each model identified several demographic disparities in higher education access, most significantly the effects of individual-level economic disadvantage. District- and school-level factors such as the proportion of Low Income students and the proportion of Underrepresented Racial Minority (URM) students were important and negatively associated with postsecondary success. Methodologically, the RF model was able to capture non-linearity in the predictive power of school- and district-level variables, a key finding which was undetectable using linear models. The RF algorithm outperforms logistic models in prediction of student enrollment, performs similarly to linear models in prediction of postsecondary GPA, and excels both models in its descriptions of non-linear variable relationships. RF provides novel interpretations of data, challenges conclusions from linear models, and has enormous potential to further the literature around equity in postsecondary pipelines.