eCOTS 2022 - Modernizing the undergraduate regression analysis course

library(tidyverse)
library(tidymodels)
library(knitr)
library(openintro)

# Introduction

The exercises below are drawn from an exam review. Students would have already completed readings, some assignments, and labs prior to attempting these questions.

Some code chunks below have been pre-populated for you. These chunks carry the chunk option eval = FALSE. Make sure to remove this option before running the relevant code chunk to avoid errors when rendering the document.

## Data

In today’s workshop, we will explore the tidymodels framework for modeling along with the tidyverse framework for data wrangling and visualization. We will start with some exploratory data analysis, walk through how to create the key components of a predictive model (models, recipes, and workflows), and then perform cross-validation. Throughout, we will be using the loans_full_schema dataset from the openintro package [1], which is also featured in the OpenIntro textbooks [2].

The data has one peculiarity: the application_type variable is a factor with an empty level.

levels(loans_full_schema$application_type)
[1] ""           "individual" "joint"     

Let’s first clean up this variable using the droplevels() function, which drops unused factor levels. We’ll apply it to the whole dataset in case other variables have similar issues.

loans_full_schema <- droplevels(loans_full_schema)
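As a minimal sketch of what droplevels() does, the toy data frame below (hypothetical data, not the loans dataset) has a factor with an unused empty-string level, similar to the situation described above. Only levels that no row uses are dropped; the rows themselves are untouched.

```r
# Toy data frame (hypothetical) with an unused "" level in a factor column
toy <- data.frame(
  application_type = factor(
    c("individual", "joint", "individual"),
    levels = c("", "individual", "joint")
  )
)

levels(toy$application_type)        # includes the unused "" level

# droplevels() on a data frame drops unused levels from every factor column
toy_clean <- droplevels(toy)
levels(toy_clean$application_type)  # the "" level is gone
```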

The variables we’ll use in this analysis are:

• interest_rate: Interest rate of the loan the applicant received.
• debt_to_income: Debt-to-income ratio.
• term: The number of months of the loan the applicant received.
• inquiries_last_12m: Inquiries into the applicant’s credit during the last 12 months.
• public_record_bankrupt: Number of bankruptcies listed in the public record for this applicant.
• application_type: The type of application: either individual or joint.

# Exercises

## Exercise 1: Train-test data split

Split the data into a training and test set with a 75%-25% split. Don’t forget to set a seed!

set.seed(210)

loans_split <- initial_split(___)

loans_train <- training(___)
loans_test <- ___(___)
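For reference, here is a hedged illustration of the same splitting pattern on the built-in mtcars data rather than the loans data. The object names (cars_split, cars_train, cars_test) are made up for this sketch, and it is not meant as the exercise solution.

```r
library(tidymodels)

set.seed(210)

# 75% of rows go to training, the remaining 25% to testing
cars_split <- initial_split(mtcars, prop = 0.75)

cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

# Every row lands in exactly one of the two sets
nrow(cars_train) + nrow(cars_test) == nrow(mtcars)
```

Setting the seed before initial_split() makes the random assignment reproducible across renders.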

## Exercise 2: The Model

Write the model for predicting interest rate (interest_rate) from debt-to-income ratio (debt_to_income), the term of the loan (term), the number of inquiries (credit checks) into the applicant’s credit during the last 12 months (inquiries_last_12m), whether there are any bankruptcies listed in the public record for this applicant (bankrupt), and the type of application (application_type). The model should allow for the effect of debt-to-income ratio on interest rate to vary by application type.

$$
\begin{aligned}
\widehat{\texttt{interest\_rate}} = b_0 &+ b_1 \times \texttt{debt\_to\_income} \\
&+ b_2 \times \texttt{term} \\
&+ b_3 \times \texttt{inquiries\_last\_12m} \\
&+ b_4 \times \texttt{bankrupt} \\
&+ b_5 \times \texttt{application\_type} \\
&+ b_6 \times \texttt{debt\_to\_income:application\_type}
\end{aligned}
$$
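To see what the interaction term does, it can help to write the fitted line separately for each application type. The sketch below assumes application_type enters the model as an indicator equal to 1 for joint applications and 0 for individual ones (a coding assumption, not something stated in the exercise). The dots stand for the term, inquiries_last_12m, and bankrupt terms, which are the same in both lines.

```latex
% Assumption: application_type = 1 for joint, 0 for individual
\begin{aligned}
\text{individual: } \widehat{\texttt{interest\_rate}} &= b_0 + b_1 \times \texttt{debt\_to\_income} + \cdots \\
\text{joint: } \widehat{\texttt{interest\_rate}} &= (b_0 + b_5) + (b_1 + b_6) \times \texttt{debt\_to\_income} + \cdots
\end{aligned}
```

Under this coding, b_5 shifts the intercept for joint applications and b_6 is the difference in the debt-to-income slope between joint and individual applications.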

## Exercise 3: EDA

Explore characteristics of the variables you’ll use for the model using the training data only.

First, take a peek at the relevant variables in the data.

loans_train %>%
  select(interest_rate, debt_to_income, term,
         inquiries_last_12m, public_record_bankrupt, application_type) %>%
  glimpse()

Create univariate, bivariate, and multivariate plots, and make sure to think about which plots are the most appropriate and effective given the data types.

• Interest rate:
ggplot(loans_train, aes(x = interest_rate)) +
  geom_histogram(binwidth = 1) +
  labs(
    x = "Interest rate", y = "Count",
    title = "Distribution of loan interest rates"
  )
• Interest rate vs. debt to income ratio by application type:
ggplot(loans_train,
       aes(x = debt_to_income, y = interest_rate,
           color = application_type, shape = application_type)) +
  geom_point() +
  labs(
    x = "Debt-to-income ratio", y = "Interest rate",
    color = "Application type", shape = "Application type",
    title = "Interest rate vs. debt-to-income ratio by application type"
  )
• Interest rate by bankruptcy:
loans_train %>%
  mutate(bankrupt = if_else(public_record_bankrupt == 0, "no", "yes")) %>%
  ggplot(aes(x = interest_rate, fill = bankrupt)) +
  geom_density(alpha = 0.5) +
  labs(
    x = "Interest rate", y = "Density",
    fill = "Past bankruptcy status",
    title = "Interest rate by past bankruptcy status"
  )

## Exercise 4: Model specification

Specify a linear regression model. Call it loans_spec.

loans_spec <- ___
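As a hedged illustration of what a parsnip model specification looks like, the sketch below builds one under the made-up name lm_spec. A specification only describes the type of model and the computational engine; no data is involved until it is fit.

```r
library(tidymodels)

# A specification: the model type plus the engine that will fit it
lm_spec <- linear_reg() %>%
  set_engine("lm")

lm_spec
```

Because the specification is separate from the data, the same object can be reused in different workflows or with different recipes.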

## Exercise 5: Recipe and formula building

• Predict interest_rate from debt_to_income, term, inquiries_last_12m, public_record_bankrupt, and application_type.
• Mean center debt_to_income.
• Make term a factor.
• Create a new variable: bankrupt that takes on the value “no” if public_record_bankrupt is 0 and the value “yes” if public_record_bankrupt is 1 or higher. Then, remove public_record_bankrupt.
• Interact application_type with debt_to_income.
• Create dummy variables where needed and drop any zero variance variables.
loans_rec <- recipe(interest_rate ~ ___, data = ___) %>%
  step_center(___) %>%
  step_mutate(term = ___) %>%
  step_mutate(bankrupt = ___) %>%
  step_rm(___) %>%
  step_dummy(___) %>%
  step_interact(terms = ~ ___) %>%
  step_zv(___)
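The steps listed above can be sketched on the built-in mtcars data. This is an illustration under stated assumptions, not the exercise solution: the variables (wt, cyl, carb, am), the derived variable many_carb, and the object names are all chosen for this example only.

```r
library(tidymodels)

# Make transmission type a labelled factor so it needs a dummy variable
cars <- mtcars %>%
  mutate(am = factor(am, labels = c("automatic", "manual")))

cars_rec <- recipe(mpg ~ wt + cyl + carb + am, data = cars) %>%
  step_center(wt) %>%                                  # mean-center a numeric predictor
  step_mutate(cyl = as.factor(cyl)) %>%                # turn a numeric column into a factor
  step_mutate(many_carb = as.factor(if_else(carb >= 4, "yes", "no"))) %>%
  step_rm(carb) %>%                                    # drop the column the new variable came from
  step_dummy(all_nominal_predictors()) %>%             # dummy-code all factor predictors
  step_interact(terms = ~ starts_with("am_"):wt) %>%   # interact the am dummy with wt
  step_zv(all_predictors())                            # remove zero-variance predictors

# prep() estimates the steps; bake(new_data = NULL) returns the processed training data
cars_baked <- cars_rec %>%
  prep() %>%
  bake(new_data = NULL)

names(cars_baked)
```

In this sketch the interaction is created after step_dummy(), so it refers to the dummy column through starts_with("am_") rather than to the original factor.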

## Exercise 6: Creating a workflow

Create the workflow that brings together the model specification and recipe.

loans_wflow <- workflow() %>%
  ___(___) %>%
  ___(___)

loans_wflow
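A workflow bundles a model specification and a preprocessing recipe into one object that can be fit in a single call. The sketch below shows the general pattern on mtcars; the names cars_spec, cars_rec, cars_wflow, and cars_fit are hypothetical and it is not the exercise solution.

```r
library(tidymodels)

cars_spec <- linear_reg() %>%
  set_engine("lm")

cars_rec <- recipe(mpg ~ wt + hp, data = mtcars) %>%
  step_center(wt)

# Combine the specification and the recipe
cars_wflow <- workflow() %>%
  add_model(cars_spec) %>%
  add_recipe(cars_rec)

# Fitting the workflow preps the recipe and fits the model in one step
cars_fit <- fit(cars_wflow, data = mtcars)

# Pull out the fitted parsnip model and tidy its coefficients
cars_coefs <- cars_fit %>%
  extract_fit_parsnip() %>%
  tidy()

cars_coefs
```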

## Exercise 7: Cross-validation and summary

Conduct 10-fold cross-validation.

set.seed(210)
loans_folds <- vfold_cv(loans_train, v = 10)

loans_fit_rs <- ___ %>%
  fit_resamples(___)

loans_fit_rs

Summarize metrics from your CV resamples.

collect_metrics(___)
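To illustrate the resampling pattern end to end, here is a minimal sketch on mtcars. It uses 5 folds instead of 10 because mtcars has only 32 rows, and it uses add_formula() in place of a recipe to keep the example short; both choices, and all object names, are assumptions made for this illustration.

```r
library(tidymodels)

set.seed(210)
cars_folds <- vfold_cv(mtcars, v = 5)

cars_wflow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(mpg ~ wt + hp)

# Fit the workflow once per fold and evaluate on each held-out fold
cars_fit_rs <- cars_wflow %>%
  fit_resamples(resamples = cars_folds)

# Average RMSE and R-squared across the folds
cars_metrics <- collect_metrics(cars_fit_rs)
cars_metrics
```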

We can also visualize the metrics across folds.

collect_metrics(loans_fit_rs, summarize = FALSE) %>%
...

# Writing Exercise

In this exercise, we will synthesize our work above to create a reader-friendly version of our conclusions. In the classroom, these sorts of writing exercises appear throughout homework and lab assignments as well as exams. They give students an opportunity to demonstrate their understanding while gaining an appreciation that communication is a crucial part of using statistics.

## Exploratory data analysis

Using your plots above (along with any other metrics you compute), describe your initial findings about the training data. Discuss why we perform EDA only on the training data and not on the entire dataset.

## Model fitting and fit evaluation

Although our primary aim is prediction and not inference, it may be of interest to view the model fit nonetheless to make sure nothing looks out of the ordinary. Create a neatly organized table of the model output, and describe your observations, such as which parameters are significant. Make sure to interpret some coefficients appropriately.
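One possible way to produce such a table is to tidy the fitted model and pass the result to kable(). The sketch below does this for a small model on mtcars; the model, the object names, and the choice to include confidence intervals and round to three digits are assumptions made for illustration.

```r
library(tidymodels)
library(knitr)

cars_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

# One row per coefficient, with estimates, p-values, and confidence intervals
cars_coefs <- tidy(cars_fit, conf.int = TRUE)

kable(cars_coefs, digits = 3)
```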

## Cross-validation

Explain what 10-fold CV does, and why it’s useful. Display a neat table with the outputs of your CV summary, and describe your observations. Make sure to discuss why we are focusing on R-squared and RMSE instead of adjusted R-squared, AIC, and BIC.
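When explaining what k-fold CV does, it may help to see the procedure written out by hand. The base R sketch below is a simplified illustration on mtcars with k = 5, not how tidymodels implements it internally: each row is assigned to one fold, the model is fit on the other folds, and RMSE is computed on the held-out fold.

```r
set.seed(210)

k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of roughly equal size
fold_id <- sample(rep(1:k, length.out = n))

rmse_by_fold <- sapply(1:k, function(i) {
  analysis   <- mtcars[fold_id != i, ]  # rows used to fit the model
  assessment <- mtcars[fold_id == i, ]  # held-out rows used to evaluate it
  fit  <- lm(mpg ~ wt + hp, data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$mpg - pred)^2))
})

rmse_by_fold        # one out-of-sample RMSE per fold
mean(rmse_by_fold)  # the cross-validated RMSE estimate
```

Each observation is used for assessment exactly once, so the averaged RMSE reflects performance on data the model did not see during fitting.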

# Solutions

See here for solutions to this activity.

## Footnotes

1. Mine Çetinkaya-Rundel, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno and Christopher Barr (2022). openintro: Data Sets and Supplemental Functions from ‘OpenIntro’ Textbooks and Labs. R package version 2.3.0. https://CRAN.R-project.org/package=openintro.

2. Mine Çetinkaya-Rundel and Johanna Hardin (2021). OpenIntro::Introduction to Modern Statistics. https://openintro-ims.netlify.app.