Teaching data science with R

ASA/AMATYC: Introduction to data science technology workshop

July 23, 2024

The toolkit

R, RStudio, and Quarto

R and RStudio

  • R is a statistical programming language

  • RStudio is a convenient interface for R (an integrated development environment, IDE)

RStudio IDE

Data science with R

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

- tidyverse.org
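As a minimal sketch of what a shared grammar and shared data structures look like in practice, the example below pipes a small tibble through two dplyr verbs. The `games` data and the `win_share` name are invented for illustration and are not part of the workshop materials; the native pipe `|>` assumes R 4.1 or later.

```r
library(dplyr)  # dplyr re-exports tibble()

# Hypothetical toy data, made up for this sketch
games <- tibble(
  season = c(2022, 2022, 2023, 2023),
  result = c("win", "loss", "win", "win")
)

# Each verb takes a data frame and returns a data frame,
# so the steps chain together with the pipe
win_share <- games |>
  group_by(season) |>
  summarize(prop_win = mean(result == "win"))

win_share
```

Because every step consumes and returns the same kind of object, students can read the pipeline top to bottom as "take the games, group them by season, then summarize".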

Data visualization

Create a stacked bar plot showing the distribution of game results within each season for the NC Courage soccer team.

library(ggplot2)

# position = "fill" rescales each bar to height 1, so the y-axis shows proportions
ggplot(data = courage, aes(x = season, fill = result)) +
  geom_bar(position = "fill") +
  labs(title = "Distribution of NC Courage game outcomes",
       subtitle = "by Season",
       y = "Proportion of games") +
  scale_fill_viridis_d()

Data wrangling

Calculate the probability a painting contained at least one tree conditioned on whether the painting was created by Bob Ross or a guest painter.


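library(dplyr)

# Within each painter type (guest), count paintings with and without a tree,
# then convert the counts to conditional proportions
bob_ross |>
  group_by(guest) |>
  count(tree) |>
  mutate(prob_tree = n / sum(n)) |>
  filter(tree == 1)

# A tibble: 2 × 4
# Groups:   guest [2]
  guest  tree     n prob_tree
  <int> <int> <int>     <dbl>
1     0     1   345     0.906
2     1     1    16     0.727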

Modeling

Fit a linear model with height as the response and sex and age as predictors. Display the model output.


library(broom)  # tidy()
library(knitr)  # kable()

m_sex_age <- lm(height ~ sex + age, data = kids)
tidy(m_sex_age) |>
  kable(digits = 3)

term         estimate  std.error  statistic  p.value
(Intercept)    87.400      1.366     63.979    0.000
sexfemale      -1.162      0.464     -2.502    0.013
age             0.433      0.009     47.536    0.000

Quarto

Create fully reproducible reports and other documents

Teaching with R

Accessing R/RStudio

  • Students install R/RStudio on a laptop or desktop

  • Students use R/RStudio through a web interface

    • A centralized RStudio server maintained within the institution

    • Posit Cloud, built and maintained externally by Posit

Why teach R?

  • Students develop computing skills to…

    • work with complex, messy, and non-standard data

    • produce professional data science reports and implement best practices for a reproducible workflow

    • practice data science in academia and industry

  • Students can access R/RStudio after the course

  • Instructors can use R/RStudio and Quarto to create teaching materials, course websites, books, etc.

Resources