library(tidyverse)
library(tidymodels)
library(knitr)
library(openintro)
Modeling loans - Solutions
eCOTS 2022 - Modernizing the undergraduate regression analysis course
Introduction
The exercises below are drawn from an exam review. Students would have already completed readings, some assignments, and labs prior to attempting these questions.
You may notice some code below has already been pre-populated for you. In these cases, there is a flag set as eval = FALSE
. Make sure to remove this flag prior to running the relevant code chunk to avoid any errors when rendering the document.
Data
In today’s workshop, we will explore using the tidymodels framework for modeling along with the tidyverse framework for data wrangling and visualization. We will start with some exploratory data analysis, walk through how to create the key components of a predictive model (models, recipes, and workflows), and how to perform cross-validation. Throughout we will be using the loans_full_schema
dataset from the openintro package1 and featured in the OpenIntro textbooks2 .
The data has a bit of peculiarity about it, specifically the application_type
variable is a factor variable with an empty level.
levels(loans_full_schema$application_type)
[1] "" "individual" "joint"
Let’s clean up this variable using the droplevels()
function first. And let’s apply that to the whole dataset, in case there are other variables with similar issues.
<- droplevels(loans_full_schema) loans_full_schema
The variables we’ll use in this analysis are:
interest_rate
: Interest rate of the loan the applicant received.debt_to_income
: Debt-to-income ratio.term
: The number of months of the loan the applicant received.inquiries_last_12m
: Inquiries into the applicant’s credit during the last 12 months.public_record_bankrupt
: Number of bankruptcies listed in the public record for this applicant.application_type
: The type of application: eitherindividual
orjoint
.
Exercises
Exercise 1: Train-test data split
Split the data into a training and test set with a 75%-25% split. Don’t forget to set a seed!
set.seed(210)
<- initial_split(___)
loans_split
<- training(___)
loans_train <- ___(___) loans_test
Exercise 2: The Model
Write the model for predicting interest rate (interest_rate
) from debt to income ratio (debt_to_income
), the term of loan (term
), the number of inquiries (credit checks) into the applicant’s credit during the last 12 months (inquiries_last_12m
), whether there are any bankruptcies listed in the public record for this applicant (bankrupt
), and the type of application (application_type
). The model should allow for the effect of to income ratio on interest rate to vary by application type.
\[ \begin{aligned} \widehat{\texttt{interest\_rate}} = b_0 &+ b_1\times\texttt{debt\_to\_income} \\ &+ b_2 \times \texttt{term} \\ &+ b_3 \times \texttt{inquiries\_last\_12m} \\ &+ b_4 \times \texttt{bankrupt} \\ &+ b_5 \times \texttt{application\_type} \\ &+ b_6 \times \texttt{debt\_to\_income:application\_type} \end{aligned} \]
Exercise 3: EDA
Explore characteristics of the variables you’ll use for the model using the training data only.
First, take a peek at the relevant variables in the data.
%>%
loans_train select(interest_rate, debt_to_income, term,
%>%
inquiries_last_12m, public_record_bankrupt, application_type) glimpse()
Create univariate, bivariate, and multivariate plots, and make sure to think about which plots are the most appropriate and effective given the data types.
- Interest rate:
ggplot(loans_train, aes(x = interest_rate)) +
geom_histogram(binwidth = 1) +
labs(
x = "Interest rate", y = "Count",
title = "Distribution of loan interest rates"
)
- Interest rate vs. debt to income ratio by application type:
ggplot(loans_train,
aes(x = debt_to_income, y = interest_rate,
color = application_type, shape = application_type)) +
geom_point() +
labs(
x = "Debt-to-income ratio", y = "Interest rate",
color = "Application type", shape = "Application type",
title = "Interest rate vs. Debt-to-income by application type"
)
- Interest rate by bankruptcy:
%>%
loans_train mutate(bankrupt = if_else(public_record_bankrupt == 0, "no", "yes")) %>%
ggplot(aes(x = interest_rate, fill = bankrupt)) +
geom_density(alpha = 0.5) +
labs(
x = "Interest rate", y = "Density",
fill = "Past bankrupcy status",
title = "Interest rate by past bankruptcy status"
)
Exercise 4: Model specification
Specify a linear regression model. Call it loans_spec
.
<- ___ loans_spec
Exercise 5: Recipe and formula building
- Predict
interest_rate
fromdebt_to_income
,term
,inquiries_last_12m
,public_record_bankrupt
, andapplication_type
. - Mean center
debt_to_income
. - Make
term
a factor. - Create a new variable:
bankrupt
that takes on the value “no” ifpublic_record_bankrupt
is 0 and the value “yes” ifpublic_record_bankrupt
is 1 or higher. Then, removepublic_record_bankrupt
. - Interact
application_type
withdebt_to_income
. - Create dummy variables where needed and drop any zero variance variables.
<- recipe(interest_rate ~ ___, data = ___) %>%
loans_rec step_center(___) %>%
step_mutate(term = ___) %>%
step_mutate(bankrupt = ___) %>%
step_rm(___) %>%
step_dummy(___) %>%
step_interact(terms = ~ ___) %>%
step_zv(___)
Exercise 6: Creating a workflow
Create the workflow that brings together the model specification and recipe.
<- workflow() %>%
loans_wflow add_model(___) %>%
add_recipe(___)
loans_wflow
Exercise 7: Cross-validation and summary
Conduct 10-fold cross validation.
set.seed(210)
<- vfold_cv(loans_train, v = 10)
loans_folds
<- ___ %>%
loans_fit_rs fit_resamples(___)
loans_fit_rs
Summarize metrics from your CV resamples.
collect_metrics(___)
We can also visualize the metrics across folds.
collect_metrics(loans_fit_rs, summarize = FALSE) %>%
...
Breakout 1
Go to https://docs.google.com/presentation/d/1xAesWFvErmAGqsEUwABJThJjaxxlNFwFPbo9mZgNfrY/edit.
Writing Exercise
In this exercise, we will synthesize our work above to create a reader-friendly version of our conclusions. In the classroom, these sorts of writing exercises appear throughout homework and lab assignments as well as exams. They give students an opportunity to demonstrate their understanding while gaining an appreciation that communication is a crucial part of using statistics.
Exploratory data analysis
Using your plots above (along with any other metrics you compute), describe your initial findings about the training data. Discuss why we perform EDA only on the training data and not on the entire data set.
Model fitting and fit evaluation
Although our primary aim is prediction and not inference, it may be of interest to view the model fit nonetheless to make sure nothing looks out of the ordinary. Create a neatly organized table of the model output, and describe your observations, such as which parameters are significant. Make sure to interpret some coefficients appropriately.
Cross-validation
Explain what 10-fold CV does, and why it’s useful. Display a neat table with the outputs of your CV summary, and describe your observations. Make sure to discuss why we are focusing on R-squared and RMSE instead of adjusted R-squared, AIC, and BIC.
Breakout 2
Go to https://docs.google.com/presentation/d/1xAesWFvErmAGqsEUwABJThJjaxxlNFwFPbo9mZgNfrY/edit.
Solutions
See here for solutions to this activity.
Footnotes
Mine Çetinkaya-Rundel, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno and Christopher Barr (2022). openintro: Data Sets and Supplemental Functions from ‘OpenIntro’ Textbooks and Labs. R package version 2.3.0. https://CRAN.R-project.org/package=openintro.↩︎
Mine Çetinkaya-Rundel and Johanna Hardin. 2021. OpenIntro::Introduction to Modern Statistics. https://openintro-ims.netlify.app.↩︎