Tidyverse

The tidyverse is a collection of R packages designed for data science with a shared philosophy: code should be readable, consistent, and focused on the data transformations you’re actually trying to accomplish. If you’ve struggled with base R’s sometimes cryptic syntax, the tidyverse will feel like a relief.

Many computational social science workflows rely on tidyverse packages for data manipulation, visualization, and modeling. Learning this ecosystem means learning the tools your collaborators use, the examples you’ll find online, and the approaches that scale from quick exploration to publication-ready analysis.

What’s in the Tidyverse?

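The tidyverse is a meta-package: running library(tidyverse) attaches its core packages in one step. This chapter draws on the eight listed below; tidyverse 2.0.0 and later also attach lubridate for working with dates and times.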

  • ggplot2: Data visualization with a grammar of graphics
  • dplyr: Data manipulation (filtering, selecting, summarizing, joining)
  • tidyr: Reshaping data between wide and long formats
  • readr: Reading rectangular data (CSV, TSV) efficiently
  • purrr: Functional programming tools for iteration
  • tibble: Modern reimagining of data frames
  • stringr: String manipulation
  • forcats: Working with categorical variables (factors)

This chapter relies mostly on dplyr and ggplot2, with the other packages playing supporting roles as needed.

Installing and Loading the Tidyverse

Install the tidyverse the same way as any other R package.

Installation (one time):

install.packages("tidyverse")

Loading (at the start of each session or script):

library(tidyverse)

When you load the tidyverse, you’ll see a message listing which packages were attached and any conflicts, meaning tidyverse functions that mask functions of the same name from packages R loads by default. These conflicts are normal and intentional: in a tidyverse workflow, the tidyverse version is almost always the one you want.
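
Two conflicts you will typically see in that message are dplyr::filter() and dplyr::lag(), which mask functions of the same name in the stats package. As a small sketch (the toy tibble is invented for illustration), the package::function() form lets you call either version explicitly:

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(x = 1:5)

# dplyr's filter(): keep the rows where a condition holds
kept <- dplyr::filter(toy, x > 3)

# The masked stats version is still reachable: a 3-point moving average
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

kept
smoothed
```

Writing the prefix is rarely necessary in everyday scripts, but it removes any doubt about which function runs.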

Working with Real Data

Let’s load actual data to see the tidyverse in action. We’ll use student assessment data from the Open University Learning Analytics Dataset (OULAD), which contains information about student demographics and course outcomes.

# Read the data
students <- read_csv("data/oulad-students-and-assessments.csv")

# Take a quick look
glimpse(students)
Rows: 32,593
Columns: 17
$ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
$ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
$ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
$ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
$ region                     <chr> "East Anglian Region", "Scotland", "North W…
$ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
$ imd_band                   <dbl> 10, 3, 4, 6, 6, 9, 4, 10, 8, NA, 8, 3, 7, 6…
$ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
$ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
$ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
$ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
$ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
$ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
$ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pass                       <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ mean_weighted_score        <dbl> 780, 700, NA, 720, 690, 790, 700, 720, 720,…

The glimpse() function shows you the structure of your data: how many rows and columns, what type each column is (<chr> for character, <dbl> for numeric, and so on), and the first few values. This dataset contains information about students (gender, age_band, region, disability) along with their course performance (final_result, pass, mean_weighted_score).

This is the dataset we’ll use in the examples below. It represents the kind of educational data you’ll work with in computational social science: thousands of observations with both categorical and continuous variables.
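
read_csv() guesses each column’s type from the file contents. If you would rather state the types yourself, the col_types argument accepts a specification. The sketch below writes a tiny made-up CSV to a temporary file so that it runs without the OULAD file; the column names mirror the real data, but the values are invented.

```r
library(readr)

# Write a small made-up CSV to a temporary file (not the real OULAD data)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id_student,gender,mean_weighted_score",
  "1,F,780",
  "2,M,NA",
  "3,F,700"
), tmp)

# State the column types explicitly instead of relying on guessing
toy <- read_csv(
  tmp,
  col_types = cols(
    id_student          = col_double(),
    gender              = col_character(),
    mean_weighted_score = col_double()
  )
)

toy
```

By default, read_csv() treats empty fields and the string NA as missing values, which is why the second score is read as NA here.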

Essential dplyr Functions

The dplyr package provides functions for the most common data manipulation tasks. Here are the ones you’ll use repeatedly.

filter() - Selecting Rows

Use filter() to keep only rows that meet certain conditions:

# Keep only students who passed
passed_students <- students |>
  filter(pass == 1)

# Multiple conditions with AND (comma means AND)
passed_females <- students |>
  filter(pass == 1, gender == "F")

# OR condition using |
good_outcomes <- students |>
  filter(pass == 1 | mean_weighted_score > 800)

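The first filter keeps only the students with pass equal to 1, roughly 38% of the original 32,593 rows, in line with the pass rate of 0.379 that summarize() reports later in this section. Use == for equality tests, > and < for comparisons, and combine conditions with commas (AND) or | (OR).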
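
One detail worth knowing: filter() keeps a row only when its condition is TRUE, so rows where the condition evaluates to NA are dropped. Since mean_weighted_score contains missing values, this matters for the examples above. A minimal sketch with invented toy data:

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(
  id    = 1:5,
  pass  = c(1, 0, 1, 0, 0),
  score = c(820, 640, NA, 900, NA)
)

# Rows with a missing score are dropped: NA > 800 is NA, not TRUE
high <- toy |> filter(score > 800)

# With OR, a row survives if either side is TRUE, even when the other is NA
either <- toy |> filter(pass == 1 | score > 800)

# Ask for the missing rows explicitly with is.na()
missing_score <- toy |> filter(is.na(score))

high$id
either$id
missing_score$id
```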

select() - Choosing Columns

Use select() to pick which columns to keep or remove:

# Select specific columns
student_basics <- students |>
  select(id_student, gender, age_band, final_result)

# Remove columns with minus sign
student_no_dates <- students |>
  select(-date_registration, -date_unregistration)

# Select by pattern
student_demographics <- students |>
  select(starts_with("id"), gender, region)

The select() function is useful when you have many columns but only need a few for analysis. Helper functions like starts_with(), ends_with(), and contains() make it easy to select groups of related columns.
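
The helpers can be tried on a small invented tibble whose column names mirror the OULAD columns. The last line uses where(), a further selection helper not mentioned above, which picks columns by type rather than by name:

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(
  id_student          = c(1, 2),
  gender              = c("F", "M"),
  date_registration   = c(-30, -10),
  date_unregistration = c(NA, 12),
  mean_weighted_score = c(780, 640)
)

dates   <- toy |> select(starts_with("date"))   # both date_ columns
scores  <- toy |> select(contains("score"))     # any name containing "score"
numbers <- toy |> select(where(is.numeric))     # every numeric column

names(dates)
names(scores)
names(numbers)
```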

mutate() - Creating or Modifying Columns

Use mutate() to add new columns or change existing ones:

# Create new column
students <- students |>
  mutate(high_achiever = mean_weighted_score > 800)

# Modify existing column
students <- students |>
  mutate(age_band = factor(age_band))

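# Conditional categories with case_when()
students <- students |>
  mutate(
    score_category = case_when(
      is.na(mean_weighted_score) ~ NA_character_,  # keep missing scores missing
      mean_weighted_score > 800 ~ "High",
      mean_weighted_score > 650 ~ "Medium",
      TRUE ~ "Low"
    )
  )

The case_when() function inside mutate() handles conditional logic. Conditions are checked in order and the first one that is TRUE wins: a missing score stays missing; a score over 800 becomes “High”; a score over 650 (but not over 800) becomes “Medium”; everything else falls through to the catch-all TRUE and becomes “Low”. Without the is.na() line, students with no score would also be labeled “Low”, because a comparison with NA is never TRUE.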
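
A single mutate() call can also create several columns at once, and a later column can use one defined earlier in the same call. A minimal sketch with invented toy data (the column names score_centered and above_average are made up for this example):

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(
  id    = 1:4,
  score = c(900, 700, 500, NA)
)

toy <- toy |>
  mutate(
    # Distance of each score from the mean of the non-missing scores
    score_centered = score - mean(score, na.rm = TRUE),
    # Uses the column created on the previous line
    above_average  = score_centered > 0
  )

toy
```

The missing score stays missing in both new columns, because arithmetic and comparisons with NA return NA.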

summarize() - Calculating Summaries

Use summarize() to collapse your data into summary statistics:

# Overall statistics
students |>
  summarize(
    mean_score = mean(mean_weighted_score, na.rm = TRUE),
    pass_rate = mean(pass, na.rm = TRUE),
    n_students = n()
  )
# A tibble: 1 × 3
  mean_score pass_rate n_students
       <dbl>     <dbl>      <int>
1       545.     0.379      32593

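The na.rm = TRUE argument tells R to ignore missing values when calculating means; without it, a single NA makes the whole mean NA. Because pass is coded 0 or 1, its mean is the share of students who passed. The n() function counts the number of rows, including rows with missing scores.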
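
To see what na.rm changes, and how n() differs from a count of non-missing values, here is a small sketch on invented toy data:

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(
  pass  = c(1, 0, 0, 1),
  score = c(800, 600, NA, 700)
)

toy_summary <- toy |>
  summarize(
    mean_with_na = mean(score),                # NA: one missing value is enough
    mean_score   = mean(score, na.rm = TRUE),  # mean of 800, 600, and 700
    pass_rate    = mean(pass),                 # mean of a 0/1 column is a proportion
    n_rows       = n(),                        # counts every row
    n_scored     = sum(!is.na(score))          # counts only rows with a score
  )

toy_summary
```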

group_by() - Operations by Group

Use group_by() before summarize() to calculate statistics for each group separately:

# Pass rates by gender
students |>
  group_by(gender) |>
  summarize(
    pass_rate = mean(pass, na.rm = TRUE),
    n = n()
  )
# A tibble: 2 × 3
  gender pass_rate     n
  <chr>      <dbl> <int>
1 F          0.390 14718
2 M          0.371 17875

# Multiple grouping variables
students |>
  group_by(gender, disability) |>
  summarize(mean_score = mean(mean_weighted_score, na.rm = TRUE))
# A tibble: 4 × 3
# Groups:   gender [2]
  gender disability mean_score
  <chr>  <chr>           <dbl>
1 F      N                481.
2 F      Y                456.
3 M      N                603.
4 M      Y                597.

After group_by(), subsequent operations happen separately for each group. This is one of the most powerful patterns in data analysis: split your data into groups, apply a function to each group, then combine the results.
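
Notice the “# Groups: gender [2]” line in the two-variable output above: summarize() removes only the last grouping variable, so the result is still grouped by gender. If you want a plain, ungrouped result, one option is the .groups argument. A minimal sketch with invented toy data:

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(
  gender     = c("F", "F", "M", "M", "M"),
  disability = c("N", "Y", "N", "N", "Y"),
  score      = c(700, 650, 720, 680, 600)
)

# Default: the result stays grouped by gender (dplyr prints a message saying so)
still_grouped <- toy |>
  group_by(gender, disability) |>
  summarize(mean_score = mean(score))

# .groups = "drop" returns an ungrouped tibble
ungrouped <- toy |>
  group_by(gender, disability) |>
  summarize(mean_score = mean(score), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Calling ungroup() at the end of the pipeline has the same effect. Leftover grouping can change how a later mutate() or filter() behaves, so dropping it once you are done with the groups is a safe habit.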

arrange() - Sorting Rows

Use arrange() to reorder rows:

# Sort by score (ascending by default)
students |>
  arrange(mean_weighted_score)
# A tibble: 32,593 × 19
   code_module code_presentation id_student gender region      highest_education
   <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
 1 BBB         2013B                 521081 F      Yorkshire … Lower Than A Lev…
 2 BBB         2013B                 554986 F      London Reg… Lower Than A Lev…
 3 BBB         2013B                2423078 M      London Reg… A Level or Equiv…
 4 BBB         2013J                 467396 F      Wales       Lower Than A Lev…
 5 BBB         2013J                 559344 M      South Regi… A Level or Equiv…
 6 BBB         2013J                 577965 F      Yorkshire … A Level or Equiv…
 7 BBB         2013J                 581016 F      South East… Lower Than A Lev…
 8 BBB         2013J                 590867 M      London Reg… A Level or Equiv…
 9 BBB         2013J                 591986 M      West Midla… Lower Than A Lev…
10 BBB         2013J                 596988 F      West Midla… Lower Than A Lev…
# ℹ 32,583 more rows
# ℹ 13 more variables: imd_band <dbl>, age_band <fct>,
#   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
#   final_result <chr>, module_presentation_length <dbl>,
#   date_registration <dbl>, date_unregistration <dbl>, pass <dbl>,
#   mean_weighted_score <dbl>, high_achiever <lgl>, score_category <chr>

# Sort descending
students |>
  arrange(desc(mean_weighted_score))
# A tibble: 32,593 × 19
   code_module code_presentation id_student gender region      highest_education
   <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
 1 BBB         2013B                2280038 M      Yorkshire … Lower Than A Lev…
 2 BBB         2013B                 497088 F      South Regi… A Level or Equiv…
 3 BBB         2014B                 633570 M      South East… A Level or Equiv…
 4 BBB         2013J                 595570 F      South East… A Level or Equiv…
 5 BBB         2014B                1626021 F      London Reg… Lower Than A Lev…
 6 BBB         2014B                 555994 M      London Reg… Lower Than A Lev…
 7 BBB         2013J                 595509 M      South Regi… A Level or Equiv…
 8 BBB         2013B                 558903 F      East Midla… HE Qualification 
 9 BBB         2014B                  25997 F      London Reg… A Level or Equiv…
10 FFF         2013B                 267602 M      East Midla… A Level or Equiv…
# ℹ 32,583 more rows
# ℹ 13 more variables: imd_band <dbl>, age_band <fct>,
#   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
#   final_result <chr>, module_presentation_length <dbl>,
#   date_registration <dbl>, date_unregistration <dbl>, pass <dbl>,
#   mean_weighted_score <dbl>, high_achiever <lgl>, score_category <chr>

# Multiple sort keys
students |>
  arrange(gender, desc(mean_weighted_score))
# A tibble: 32,593 × 19
   code_module code_presentation id_student gender region      highest_education
   <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
 1 BBB         2013B                 497088 F      South Regi… A Level or Equiv…
 2 BBB         2013J                 595570 F      South East… A Level or Equiv…
 3 BBB         2014B                1626021 F      London Reg… Lower Than A Lev…
 4 BBB         2013B                 558903 F      East Midla… HE Qualification 
 5 BBB         2014B                  25997 F      London Reg… A Level or Equiv…
 6 FFF         2013B                2387054 F      Yorkshire … Lower Than A Lev…
 7 FFF         2014J                 397671 F      East Angli… Lower Than A Lev…
 8 FFF         2013J                1948159 F      North Regi… A Level or Equiv…
 9 FFF         2014B                 506038 F      West Midla… Lower Than A Lev…
10 FFF         2014J                1837138 F      East Midla… A Level or Equiv…
# ℹ 32,583 more rows
# ℹ 13 more variables: imd_band <dbl>, age_band <fct>,
#   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
#   final_result <chr>, module_presentation_length <dbl>,
#   date_registration <dbl>, date_unregistration <dbl>, pass <dbl>,
#   mean_weighted_score <dbl>, high_achiever <lgl>, score_category <chr>

This is useful when you want to see the highest or lowest values, or when you need data in a particular order for plotting or reporting.
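
In the printed output above, the sorted column is hidden among the “more variables”, so you cannot see the scores themselves. One way around this is to select() the columns of interest first. The sketch below uses invented toy data and also shows slice_max(), a dplyr function not covered above that keeps the rows with the largest values:

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(
  id_student          = 1:4,
  mean_weighted_score = c(700, NA, 900, 500)
)

# Keep only the columns of interest so the sorted column is visible
sorted <- toy |>
  select(id_student, mean_weighted_score) |>
  arrange(desc(mean_weighted_score))

# Keep the two rows with the highest scores
top_two <- toy |>
  slice_max(mean_weighted_score, n = 2)

sorted
top_two
```

arrange() places missing values at the end whether you sort ascending or descending, so the student without a score appears last in sorted.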

count() - Quick Frequency Tables

Use count() as a shortcut for group_by() + summarize(n = n()):

# Count final results
students |>
  count(final_result)
# A tibble: 4 × 2
  final_result     n
  <chr>        <int>
1 Distinction   3024
2 Fail          7052
3 Pass         12361
4 Withdrawn    10156

# Count with percentages
students |>
  count(final_result) |>
  mutate(percent = n / sum(n) * 100)
# A tibble: 4 × 3
  final_result     n percent
  <chr>        <int>   <dbl>
1 Distinction   3024    9.28
2 Fail          7052   21.6 
3 Pass         12361   37.9 
4 Withdrawn    10156   31.2 

The count() function is perfect for quickly understanding the distribution of categorical variables.
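
To make the shortcut concrete, the sketch below checks on invented toy data that count() matches the longer group_by() and summarize() version, then shows two common variations: sort = TRUE to put the largest group first, and counting combinations of two variables.

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(
  gender       = c("F", "F", "M", "M", "M"),
  final_result = c("Pass", "Fail", "Pass", "Pass", "Withdrawn")
)

# The shortcut and the long form give the same table
by_count <- toy |> count(final_result)
by_hand  <- toy |> group_by(final_result) |> summarize(n = n())

# Largest group first
sorted <- toy |> count(final_result, sort = TRUE)

# One row per observed combination of the two variables
two_way <- toy |> count(gender, final_result)

by_count
sorted
two_way
```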

The Pipe Operator: |> and %>%

The tidyverse popularized a distinctive style: piping operations together to create readable data transformation pipelines. Instead of nesting functions inside each other, you “pipe” the output of one function into the next.

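R now has a native pipe operator |> (introduced in R 4.1), while the tidyverse originally used %>% from the magrittr package. They behave the same in the simple left-to-right pipelines used in this chapter; they differ mainly in less common cases such as placeholder syntax (_ for the native pipe, . for %>%).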

Here’s a concrete example using the dplyr functions we just learned:

# Without pipes (nested, hard to read)
summarize(
  group_by(
    filter(students, pass == 1),
    gender
  ),
  mean_score = mean(mean_weighted_score, na.rm = TRUE)
)

# With pipes (clear, step-by-step)
students |>
  filter(pass == 1) |>
  group_by(gender) |>
  summarize(mean_score = mean(mean_weighted_score, na.rm = TRUE))
# A tibble: 2 × 2
  gender mean_score
  <chr>       <dbl>
1 F            499.
2 M            638.

The piped version reads like instructions: “Take students, then keep only those who passed, then group by gender, then calculate mean scores.” Each step is on its own line, making it easy to follow the logic.

We’ll use the native pipe |> throughout this book because it’s now built into R, but you’ll encounter %>% in older code and documentation. For practical purposes, they’re interchangeable in most tidyverse workflows.

How to read it: Think of |> as “then.” The pipe takes the result from the left and passes it as the first argument to the function on the right.
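
The sketch below runs the same small pipeline with both pipes on invented toy data. It also shows the main place where they differ: when the piped value should go somewhere other than the first argument, the native pipe uses the placeholder _ (which requires R 4.2 or later and must be passed to a named argument), while %>% uses a dot.

```r
library(dplyr)  # also makes %>% available

# Toy data, invented for this illustration
toy <- tibble(
  hours = c(1, 2, 3, 4),
  score = c(520, 610, 700, 790)
)

# Same pipeline, two pipes
with_native   <- toy |> filter(hours > 1) |> summarize(mean_score = mean(score))
with_magrittr <- toy %>% filter(hours > 1) %>% summarize(mean_score = mean(score))

# lm() takes the data as a later argument, so a placeholder is needed
fit_native   <- toy |> lm(score ~ hours, data = _)   # assumes R >= 4.2
fit_magrittr <- toy %>% lm(score ~ hours, data = .)

with_native
coef(fit_native)
```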

Data Visualization with ggplot2

The ggplot2 package uses a “grammar of graphics” approach where you build plots in layers. You start with your data, specify how variables map to visual properties (x-axis, y-axis, colors), then add geometric objects like points, bars, or lines.

Basic Structure

Every ggplot follows this template:

ggplot(data = DATA, aes(x = X_VAR, y = Y_VAR)) +
  geom_FUNCTION()

The aes() function (short for “aesthetics”) maps your data columns to visual properties. The geom_ functions specify what kind of plot to draw.
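
Here is the template filled in with invented toy data (the hours and score columns are made up for this example). The plot is stored in an object; printing that object draws it.

```r
library(ggplot2)

# Toy data, invented for this illustration
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(520, 610, 650, 700, 790)
)

# DATA = toy, X_VAR = hours, Y_VAR = score, geom_FUNCTION = geom_point
p <- ggplot(data = toy, aes(x = hours, y = score)) +
  geom_point()

# layer_data() returns the values ggplot2 computed for the layer:
# one row per point here
point_data <- layer_data(p)
nrow(point_data)
```

Typing p at the console (or calling print(p) in a script) renders the plot. Note that ggplot2 layers are combined with +, not with the pipe.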

Bar Charts - Categorical Data

Use geom_bar() to visualize counts of categorical variables:

# Simple bar chart
students |>
  ggplot(aes(x = final_result)) +
  geom_bar()
Figure 4.1: Distribution of student final results

This automatically counts how many students have each final result (Pass, Fail, Distinction, Withdrawn) and displays the counts as bars.

# Stacked bar chart by group
students |>
  ggplot(aes(x = gender, fill = final_result)) +
  geom_bar()
Figure 4.2: Stacked bar chart showing final results by gender

The fill aesthetic colors the bars by final_result, creating stacked bars that show how outcomes differ between genders.

# Side-by-side bars
students |>
  ggplot(aes(x = gender, fill = final_result)) +
  geom_bar(position = "dodge")
Figure 4.3: Side-by-side bar chart showing final results by gender

Setting position = "dodge" puts bars next to each other instead of stacking them, making it easier to compare counts directly.
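
geom_bar() does the counting for you. If you have already counted with count(), the related geom_col() draws bars from values you supply. The sketch below, on invented toy data, builds the same chart both ways and uses layer_data() to confirm the bar heights match.

```r
library(dplyr)
library(ggplot2)

# Toy data, invented for this illustration
toy <- tibble(final_result = c("Pass", "Fail", "Pass", "Withdrawn", "Pass"))

# geom_bar() counts the rows in each category itself
p_bar <- ggplot(toy, aes(x = final_result)) +
  geom_bar()

# geom_col() plots counts that were computed beforehand
toy_counts <- toy |> count(final_result)
p_col <- ggplot(toy_counts, aes(x = final_result, y = n)) +
  geom_col()

# Bar heights from each approach
heights_bar <- layer_data(p_bar)$y
heights_col <- layer_data(p_col)$y
heights_bar
heights_col
```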

Histograms - Distributions

Use geom_histogram() to see the distribution of continuous variables:

# Distribution of scores
students |>
  ggplot(aes(x = mean_weighted_score)) +
  geom_histogram(binwidth = 50)
Figure 4.4: Distribution of mean weighted scores

The binwidth argument controls how wide each bar is. Smaller values show more detail, larger values show broader patterns.
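
To see the effect of binwidth without looking at the picture, you can count the bins ggplot2 computes. This sketch uses invented, evenly spaced toy scores and layer_data() to inspect each histogram:

```r
library(ggplot2)

# Toy data, invented for this illustration: 21 evenly spaced scores
toy <- data.frame(score = seq(400, 900, by = 25))

p_narrow <- ggplot(toy, aes(x = score)) + geom_histogram(binwidth = 50)
p_wide   <- ggplot(toy, aes(x = score)) + geom_histogram(binwidth = 100)

# One row per bin in the computed layer data
bins_narrow <- nrow(layer_data(p_narrow))
bins_wide   <- nrow(layer_data(p_wide))

bins_narrow
bins_wide
```

The narrower binwidth produces more bins, each holding fewer observations; the total count across bins is the same in both plots.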

# Separate distributions by group
students |>
  filter(!is.na(mean_weighted_score)) |>
  ggplot(aes(x = mean_weighted_score, fill = factor(pass))) +
  geom_histogram(binwidth = 50, position = "identity", alpha = 0.5)
Figure 4.5: Overlaid histograms of scores by pass status

Using position = "identity" overlays the histograms, and alpha = 0.5 makes them semi-transparent so you can see both distributions.

Boxplots - Comparing Distributions

Use geom_boxplot() to compare distributions across groups:

# Scores by final result
students |>
  filter(!is.na(mean_weighted_score)) |>
  ggplot(aes(x = final_result, y = mean_weighted_score)) +
  geom_boxplot()
Figure 4.6: Score distributions by final result

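Boxplots show the median (middle line), the first and third quartiles (box edges), whiskers that reach to the most extreme values within 1.5 times the interquartile range of the box, and outliers beyond the whiskers (individual points). This makes it easy to see that Distinction students have higher median scores than Pass students, who score higher than those who Fail or Withdraw.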
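
The numbers behind a boxplot can be computed by hand with summarize(). The sketch below uses invented toy scores and assumes the default 1.5 × IQR rule for flagging outliers; the fence and column names are made up for this example.

```r
library(dplyr)

# Toy data, invented for this illustration; 2000 is a deliberate outlier
toy <- tibble(score = c(100, 200, 300, 400, 500, 600, 700, 800, 2000))

box_stats <- toy |>
  summarize(
    q1          = quantile(score, 0.25, names = FALSE),  # lower box edge
    median      = median(score),                         # middle line
    q3          = quantile(score, 0.75, names = FALSE),  # upper box edge
    iqr         = q3 - q1,                               # height of the box
    lower_fence = q1 - 1.5 * iqr,                        # points below are outliers
    upper_fence = q3 + 1.5 * iqr,                        # points above are outliers
    n_outliers  = sum(score < lower_fence | score > upper_fence)
  )

box_stats
```

Running the same summary after group_by(final_result) on the real data would give one row of box statistics per outcome.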

# With colors
students |>
  filter(!is.na(mean_weighted_score)) |>
  ggplot(aes(x = final_result, y = mean_weighted_score, fill = final_result)) +
  geom_boxplot()
Figure 4.7: Score distributions by final result with color

Mapping fill to final_result colors each box by category, making the groups easier to distinguish.

Scatter Plots - Relationships

Use geom_point() to explore relationships between two continuous variables:

# Registration timing vs. scores
students |>
  filter(!is.na(mean_weighted_score)) |>
  ggplot(aes(x = date_registration, y = mean_weighted_score)) +
  geom_point(alpha = 0.3)
Figure 4.8: Registration timing versus mean weighted scores

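The alpha argument makes points semi-transparent, which helps when many points overlap. In OULAD, date_registration is measured in days relative to the start of the module, so negative values mean the student registered before the course began. The plot shows whether students who register earlier tend to score higher or lower.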

# Add a trend line
students |>
  filter(!is.na(mean_weighted_score)) |>
  ggplot(aes(x = date_registration, y = mean_weighted_score)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm")
Figure 4.9: Registration timing versus scores with trend line

Adding geom_smooth(method = "lm") fits a linear model and displays the trend line with a confidence band, making patterns easier to see.
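
The trend line shows the direction of the relationship but not its size. To read off the slope, you can fit the same kind of model directly with lm(). The sketch below uses invented toy numbers, chosen to lie exactly on a line, so the result says nothing about the real data:

```r
# Toy data, invented for this illustration (not the real OULAD values)
toy <- data.frame(
  date_registration   = c(-100, -50, -20, 0),
  mean_weighted_score = c(750, 700, 670, 650)
)

# The same straight-line model that geom_smooth(method = "lm") draws by default
fit <- lm(mean_weighted_score ~ date_registration, data = toy)

# Intercept and slope (change in score per day of later registration)
coef(fit)
```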

Improving Plots with Labels and Themes

Good plots need clear labels:

students |>
  ggplot(aes(x = final_result)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Distribution of Student Outcomes",
    x = "Final Result",
    y = "Number of Students"
  ) +
  theme_minimal()
Figure 4.10: Distribution of student outcomes

The labs() function adds titles and axis labels. The theme_minimal() function applies a clean, simple theme. Other themes include theme_bw(), theme_classic(), and theme_light().

Faceting - Small Multiples

Use faceting to create separate plots for different groups:

# Separate plot for each course module
students |>
  ggplot(aes(x = final_result)) +
  geom_bar() +
  facet_wrap(~code_module)
Figure 4.11: Final results by course module

This creates a grid of small bar charts, one for each module. It’s useful when you want to compare patterns across many categories.

# Grid by two variables
students |>
  filter(!is.na(mean_weighted_score)) |>
  ggplot(aes(x = mean_weighted_score)) +
  geom_histogram(binwidth = 50) +
  facet_grid(gender ~ disability)
Figure 4.12: Score distributions by gender and disability status

facet_grid() creates a matrix of plots: rows for one variable (gender), columns for another (disability). This lets you see how score distributions vary across combinations of factors.
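
By default every panel shares the same axes, which can flatten the bars for small groups. facet_wrap() accepts scales and ncol arguments to adjust this. A minimal sketch with invented toy data:

```r
library(ggplot2)

# Toy data, invented for this illustration
toy <- data.frame(
  code_module  = c("AAA", "AAA", "BBB", "BBB", "BBB", "CCC"),
  final_result = c("Pass", "Fail", "Pass", "Pass", "Withdrawn", "Pass")
)

# One panel per module, two panels per row, each with its own y-axis range
p_free <- ggplot(toy, aes(x = final_result)) +
  geom_bar() +
  facet_wrap(~code_module, ncol = 2, scales = "free_y")

# Number of panels that ggplot2 created
n_panels <- length(unique(layer_data(p_free)$PANEL))
n_panels
```

Free scales make each panel easier to read on its own but harder to compare with its neighbors, so the shared default is usually the better choice when the comparison across panels is the point.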

Next Steps

You’ve now seen core tidyverse functions in action with real data. The seven dplyr functions covered here—filter(), select(), mutate(), summarize(), group_by(), arrange(), and count()—handle the majority of everyday data manipulation tasks. The ggplot2 visualizations—bar charts, histograms, boxplots, scatter plots, and faceting—cover most exploratory analysis needs.

The subsequent chapters in this book use these functions extensively, and we’ll introduce additional tidyverse capabilities as they become relevant. You’ll continue learning by doing, working with real data and real research questions.

When you want to go deeper:

The tidyverse community is large and welcoming. When you get stuck, someone has likely asked your question before.