library(tidyverse)
library(tidymodels)
library(knitr)
library(openintro)

Modeling loans - Solutions
eCOTS 2022 - Modernizing the undergraduate regression analysis course
Introduction
The exercises below are drawn from an exam review. Students would have already completed readings, some assignments, and labs prior to attempting these questions.
Some code chunks below have been pre-populated for you. These chunks carry the flag eval = FALSE. Remove this flag before running the relevant chunk to avoid errors when rendering the document.
Data
In today’s workshop, we will explore using the tidymodels framework for modeling along with the tidyverse framework for data wrangling and visualization. We will start with some exploratory data analysis, walk through how to create the key components of a predictive model (models, recipes, and workflows), and how to perform cross-validation. Throughout we will be using the loans_full_schema dataset from the openintro package [1] and featured in the OpenIntro textbooks [2].
The data has one peculiarity: the application_type variable is a factor with an empty level.
levels(loans_full_schema$application_type)
[1] "" "individual" "joint"
Let’s clean up this variable using the droplevels() function first. And let’s apply that to the whole dataset, in case there are other variables with similar issues.
loans_full_schema <- droplevels(loans_full_schema)

The variables we’ll use in this analysis are:
- interest_rate: Interest rate of the loan the applicant received.
- debt_to_income: Debt-to-income ratio.
- term: The number of months of the loan the applicant received.
- inquiries_last_12m: Inquiries into the applicant’s credit during the last 12 months.
- public_record_bankrupt: Number of bankruptcies listed in the public record for this applicant.
- application_type: The type of application: either individual or joint.
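To see what droplevels() does in isolation, here is a minimal sketch on a toy factor. The objects app_type and toy are hypothetical stand-ins built for illustration, not part of loans_full_schema.

```r
# Toy factor with an unused empty-string level, mimicking application_type
app_type <- factor(c("individual", "joint", "individual"),
                   levels = c("", "individual", "joint"))
levels(app_type)            # the empty level "" is still listed

# droplevels() removes levels that no observation uses
app_type_clean <- droplevels(app_type)
levels(app_type_clean)

# Applied to a data frame, it cleans every factor column at once
toy <- data.frame(app_type = app_type, amount = c(100, 250, 175))
toy_clean <- droplevels(toy)
levels(toy_clean$app_type)
```

Calling droplevels() on the whole data frame is what lets a single line handle any other factor columns with the same issue.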
Exercises
Exercise 1: Train-test data split
Split the data into a training and test set with a 75%-25% split. Don’t forget to set a seed!
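To illustrate why the seed matters, the sketch below splits a different dataset twice with the same seed and checks that both splits pick the same rows. It assumes the rsample package (attached by tidymodels) and uses the built-in mtcars data as a stand-in; the names split_a and split_b are hypothetical.

```r
library(rsample)  # attached as part of tidymodels

# Same seed, same call: the two splits select identical rows
set.seed(210)
split_a <- initial_split(mtcars, prop = 0.75)
set.seed(210)
split_b <- initial_split(mtcars, prop = 0.75)

identical(training(split_a), training(split_b))

# The training and test pieces together account for every row
nrow(training(split_a)) + nrow(testing(split_a)) == nrow(mtcars)
```

Without the seed, each render of the document would produce a different split, and every downstream result would change with it.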
set.seed(210)
loans_split <- initial_split(___)
loans_train <- training(___)
loans_test <- ___(___)

Exercise 2: The Model
Write the model for predicting interest rate (interest_rate) from debt-to-income ratio (debt_to_income), the term of the loan (term), the number of inquiries (credit checks) into the applicant’s credit during the last 12 months (inquiries_last_12m), whether there are any bankruptcies listed in the public record for this applicant (bankrupt), and the type of application (application_type). The model should allow the effect of debt-to-income ratio on interest rate to vary by application type.
\[ \begin{aligned} \widehat{\texttt{interest\_rate}} = b_0 &+ b_1\times\texttt{debt\_to\_income} \\ &+ b_2 \times \texttt{term} \\ &+ b_3 \times \texttt{inquiries\_last\_12m} \\ &+ b_4 \times \texttt{bankrupt} \\ &+ b_5 \times \texttt{application\_type} \\ &+ b_6 \times \texttt{debt\_to\_income:application\_type} \end{aligned} \]
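To see how the interaction term lets the slope vary, it can help to write the model separately for each application type. The sketch below assumes that application_type enters as an indicator equal to 0 for individual and 1 for joint applications (the default coding when individual is the first factor level); that coding is an assumption, not something stated in the exercise.

\[ \begin{aligned} \text{individual:}\quad \widehat{\texttt{interest\_rate}} &= b_0 + b_1\times\texttt{debt\_to\_income} + b_2 \times \texttt{term} + b_3 \times \texttt{inquiries\_last\_12m} + b_4 \times \texttt{bankrupt} \\ \text{joint:}\quad \widehat{\texttt{interest\_rate}} &= (b_0 + b_5) + (b_1 + b_6)\times\texttt{debt\_to\_income} + b_2 \times \texttt{term} + b_3 \times \texttt{inquiries\_last\_12m} + b_4 \times \texttt{bankrupt} \end{aligned} \]

Under this coding, b_5 shifts the intercept for joint applications and b_6 is the difference in the debt-to-income slope between joint and individual applications.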
Exercise 3: EDA
Explore characteristics of the variables you’ll use for the model using the training data only.
First, take a peek at the relevant variables in the data.
loans_train %>%
  select(interest_rate, debt_to_income, term,
         inquiries_last_12m, public_record_bankrupt, application_type) %>%
  glimpse()

Create univariate, bivariate, and multivariate plots, and make sure to think about which plots are the most appropriate and effective given the data types.
- Interest rate:
ggplot(loans_train, aes(x = interest_rate)) +
  geom_histogram(binwidth = 1) +
  labs(
    x = "Interest rate", y = "Count",
    title = "Distribution of loan interest rates"
  )

- Interest rate vs. debt-to-income ratio by application type:
ggplot(loans_train,
       aes(x = debt_to_income, y = interest_rate,
           color = application_type, shape = application_type)) +
  geom_point() +
  labs(
    x = "Debt-to-income ratio", y = "Interest rate",
    color = "Application type", shape = "Application type",
    title = "Interest rate vs. Debt-to-income by application type"
  )

- Interest rate by bankruptcy:
loans_train %>%
  mutate(bankrupt = if_else(public_record_bankrupt == 0, "no", "yes")) %>%
  ggplot(aes(x = interest_rate, fill = bankrupt)) +
  geom_density(alpha = 0.5) +
  labs(
    x = "Interest rate", y = "Density",
    fill = "Past bankruptcy status",
    title = "Interest rate by past bankruptcy status"
  )

Exercise 4: Model specification
Specify a linear regression model. Call it loans_spec.
loans_spec <- ___

Exercise 5: Recipe and formula building
- Predict interest_rate from debt_to_income, term, inquiries_last_12m, public_record_bankrupt, and application_type.
- Mean center debt_to_income.
- Make term a factor.
- Create a new variable, bankrupt, that takes on the value “no” if public_record_bankrupt is 0 and the value “yes” if public_record_bankrupt is 1 or higher. Then, remove public_record_bankrupt.
- Interact application_type with debt_to_income.
- Create dummy variables where needed and drop any zero variance variables.
loans_rec <- recipe(interest_rate ~ ___, data = ___) %>%
  step_center(___) %>%
  step_mutate(term = ___) %>%
  step_mutate(bankrupt = ___) %>%
  step_rm(___) %>%
  step_dummy(___) %>%
  step_interact(terms = ~ ___) %>%
  step_zv(___)

Exercise 6: Creating a workflow
Create the workflow that brings together the model specification and recipe.
loans_wflow <- workflow() %>%
  add_model(___) %>%
  add_recipe(___)
loans_wflow

Exercise 7: Cross-validation and summary
Conduct 10-fold cross-validation.
set.seed(210)
loans_folds <- vfold_cv(loans_train, v = 10)
loans_fit_rs <- ___ %>%
  fit_resamples(___)
loans_fit_rs

Summarize metrics from your CV resamples.
collect_metrics(___)

We can also visualize the metrics across folds.
collect_metrics(loans_fit_rs, summarize = FALSE) %>%
  ...

Breakout 1
Go to https://docs.google.com/presentation/d/1xAesWFvErmAGqsEUwABJThJjaxxlNFwFPbo9mZgNfrY/edit.
Writing Exercise
In this exercise, we will synthesize our work above to create a reader-friendly version of our conclusions. In the classroom, these sorts of writing exercises appear throughout homework and lab assignments as well as exams. They give students an opportunity to demonstrate their understanding while gaining an appreciation that communication is a crucial part of using statistics.
Exploratory data analysis
Using your plots above (along with any other metrics you compute), describe your initial findings about the training data. Discuss why we perform EDA only on the training data and not on the entire data set.
Model fitting and fit evaluation
Although our primary aim is prediction and not inference, it may be of interest to view the model fit nonetheless to make sure nothing looks out of the ordinary. Create a neatly organized table of the model output, and describe your observations, such as which parameters are significant. Make sure to interpret some coefficients appropriately.
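As one possible way to produce such a table, the sketch below tidies a fitted model and passes it to kable(). It uses a stand-in lm() fit on the built-in mtcars data rather than the loans workflow; toy_fit and toy_table are hypothetical names, and broom is assumed to be available (it is attached by tidymodels).

```r
library(broom)   # tidy(); attached by library(tidymodels)
library(knitr)   # kable()

# Stand-in model on a built-in dataset, not the loans model
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

# One row per term: estimate, std.error, statistic, p.value
toy_table <- tidy(toy_fit)

# Render as a formatted table, rounding to three digits
kable(toy_table, digits = 3)
```

The same tidy() and kable() pattern should carry over to a fitted tidymodels workflow, since tidy() has methods for those objects as well.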
Cross-validation
Explain what 10-fold CV does, and why it’s useful. Display a neat table with the outputs of your CV summary, and describe your observations. Make sure to discuss why we are focusing on R-squared and RMSE instead of adjusted R-squared, AIC, and BIC.
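One way to see what 10-fold CV does is to write it out by hand. The sketch below is not the tidymodels approach used in the exercises (vfold_cv() and fit_resamples()); it is a base R version on the built-in mtcars data with a stand-in model, mpg ~ wt + hp, and every name in it is an assumption made for illustration.

```r
set.seed(210)
k <- 10
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

rmse_by_fold <- sapply(seq_len(k), function(i) {
  analysis   <- mtcars[fold_id != i, ]   # k - 1 folds used for fitting
  assessment <- mtcars[fold_id == i, ]   # held-out fold used for scoring
  fit  <- lm(mpg ~ wt + hp, data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$mpg - pred)^2))
})

rmse_by_fold        # one RMSE per held-out fold
mean(rmse_by_fold)  # the cross-validated estimate is their average
```

Each observation is used for assessment exactly once, so the averaged RMSE is computed entirely on data the corresponding model did not see during fitting.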
Breakout 2
Go to https://docs.google.com/presentation/d/1xAesWFvErmAGqsEUwABJThJjaxxlNFwFPbo9mZgNfrY/edit.
Solutions
See here for solutions to this activity.
Footnotes
[1] Mine Çetinkaya-Rundel, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno and Christopher Barr. 2022. openintro: Data Sets and Supplemental Functions from ‘OpenIntro’ Textbooks and Labs. R package version 2.3.0. https://CRAN.R-project.org/package=openintro.
[2] Mine Çetinkaya-Rundel and Johanna Hardin. 2021. OpenIntro::Introduction to Modern Statistics. https://openintro-ims.netlify.app.