Mastering Tidy R: Data Visualization with tidyverse and ggplot2

Tags: R, ggplot2, tidyverse, data visualization, tutorial

Published: 05 Jun, 2025

This post shows how tidy R principles and the tidyverse ecosystem come together to create compelling data visualizations with ggplot2. We’ll explore data manipulation techniques and build publication-ready plots step by step.

The Tidy Data Philosophy

The tidyverse is built around the concept of tidy data, where:

  • Each variable forms a column
  • Each observation forms a row
  • Each value has its own cell

This structure makes data analysis more intuitive and code more readable.
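
For example, a hypothetical wide gradebook with one column per subject breaks the first rule, because subject is itself a variable; tidyr::pivot_longer() (part of the tidyverse) reshapes it into tidy form. A minimal sketch with made-up values:

# Hypothetical wide-format gradebook: one column per subject
wide_scores <- tibble::tibble(
  student_id  = 1:3,
  Mathematics = c(81.5, 74.0, 90.2),
  Science     = c(77.3, 69.8, 85.1)
)

# Pivot to tidy form: each row is one student-subject observation
tidy_scores <- wide_scores |>
  tidyr::pivot_longer(
    cols = c(Mathematics, Science),
    names_to = "subject",
    values_to = "score"
  )

tidy_scores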

Setup and Data Preparation

First, we load the necessary libraries and create some example data to demonstrate key concepts:

# Load the tidyverse (includes dplyr, ggplot2, tidyr, readr, and more)
library(tidyverse)
library(scales)

# Set a custom theme for our plots
theme_set(theme_minimal(base_size = 12))

# Create example dataset: Student performance across different subjects and semesters
set.seed(123)
student_data <- tibble(
  student_id = rep(1:100, each = 6),
  # The rep() patterns interleave so each student gets every semester x subject pair exactly once
  semester = rep(c("Fall 2023", "Spring 2024", "Fall 2024"), times = 200),
  subject = rep(c("Mathematics", "Science"), times = 300),
  score = round(rnorm(600, mean = 78, sd = 12), 1),
  study_hours = round(runif(600, min = 5, max = 25), 1),
  attendance = round(runif(600, min = 75, max = 100), 1)
) |>
  # Add some realistic constraints
  mutate(
    score = pmax(0, pmin(100, score)),
    # Students who study more tend to score higher
    score = score + (study_hours - mean(study_hours)) * 0.8,
    # Better attendance correlates with better scores
    score = score + (attendance - mean(attendance)) * 0.3,
    score = round(pmax(0, pmin(100, score)), 1)
  )

# Display the structure of our data
glimpse(student_data)
Rows: 600
Columns: 6
$ student_id  <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4…
$ semester    <chr> "Fall 2023", "Spring 2024", "Fall 2024", "Fall 2023", "Spr…
$ subject     <chr> "Mathematics", "Science", "Mathematics", "Science", "Mathe…
$ score       <dbl> 74.5, 83.9, 94.4, 83.6, 80.2, 100.0, 79.9, 58.0, 77.0, 73.…
$ study_hours <dbl> 22.2, 22.7, 14.8, 19.4, 14.7, 24.8, 6.3, 8.2, 20.7, 15.8, …
$ attendance  <dbl> 78.9, 96.1, 80.4, 91.7, 90.4, 76.2, 98.7, 89.6, 96.5, 88.0…

Data Exploration with dplyr

Let’s explore our data using tidy data manipulation techniques:

# Calculate summary statistics by subject and semester
summary_stats <- student_data |>
  group_by(subject, semester) |>
  summarise(
    n_students = n_distinct(student_id),
    avg_score = mean(score, na.rm = TRUE),
    median_score = median(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE),
    avg_study_hours = mean(study_hours, na.rm = TRUE),
    avg_attendance = mean(attendance, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(subject, semester)

# Display the summary
summary_stats |>
  knitr::kable(digits = 2, caption = "Summary Statistics by Subject and Semester")
Summary Statistics by Subject and Semester

|subject     |semester    | n_students| avg_score| median_score| sd_score| avg_study_hours| avg_attendance|
|:-----------|:-----------|----------:|---------:|------------:|--------:|---------------:|--------------:|
|Mathematics |Fall 2023   |        100|     78.82|        79.00|    11.08|           14.75|          87.96|
|Mathematics |Fall 2024   |        100|     79.59|        80.70|    13.19|           15.70|          86.86|
|Mathematics |Spring 2024 |        100|     78.29|        80.20|    12.32|           14.96|          87.64|
|Science     |Fall 2023   |        100|     75.90|        76.70|    12.02|           14.28|          87.61|
|Science     |Fall 2024   |        100|     77.69|        77.95|    13.27|           14.87|          87.17|
|Science     |Spring 2024 |        100|     77.23|        77.30|    11.29|           15.91|          87.07|

Creating Effective Visualizations

1. Distribution of Scores by Subject

# Create a violin plot with box plots overlay
p1 <- student_data |>
  ggplot(aes(x = subject, y = score, fill = subject)) +
  geom_violin(alpha = 0.7, trim = FALSE) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.8) +
  scale_fill_viridis_d(option = "plasma", begin = 0.3, end = 0.8) +
  labs(
    title = "Distribution of Student Scores by Subject",
    subtitle = "Violin plots show the full distribution shape with box plots for summary statistics",
    x = "Subject",
    y = "Score (%)",
    caption = "Data: Simulated student performance data"
  ) +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "grey40"),
    panel.grid.minor = element_blank()
  )

print(p1)

2. Trend Analysis Across Semesters

# Calculate means and confidence intervals for trend analysis
trend_data <- student_data |>
  group_by(subject, semester) |>
  summarise(
    mean_score = mean(score),
    se = sd(score) / sqrt(n()),
    ci_lower = mean_score - 1.96 * se,
    ci_upper = mean_score + 1.96 * se,
    .groups = "drop"
  ) |>
  mutate(semester = factor(semester, levels = c("Fall 2023", "Spring 2024", "Fall 2024")))

p2 <- trend_data |>
  ggplot(aes(x = semester, y = mean_score, color = subject, group = subject)) +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = subject), 
              alpha = 0.2, color = NA) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_viridis_d(option = "plasma", begin = 0.3, end = 0.8) +
  scale_fill_viridis_d(option = "plasma", begin = 0.3, end = 0.8) +
  labs(
    title = "Student Performance Trends Across Semesters",
    subtitle = "Mean scores with 95% confidence intervals",
    x = "Semester",
    y = "Average Score (%)",
    color = "Subject",
    fill = "Subject",
    caption = "Shaded areas represent 95% confidence intervals"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "grey40"),
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
print(p2)

3. Relationship Between Study Time and Performance

# Create a sophisticated scatter plot with trend lines
p3 <- student_data |>
  ggplot(aes(x = study_hours, y = score)) +
  geom_point(aes(color = subject, alpha = attendance), size = 2) +
  geom_smooth(aes(color = subject), method = "lm", se = TRUE, linewidth = 1.2) +
  scale_color_viridis_d(option = "plasma", begin = 0.3, end = 0.8) +
  scale_alpha_continuous(range = c(0.3, 0.8), name = "Attendance %") +
  labs(
    title = "Relationship Between Study Hours and Academic Performance",
    subtitle = "Point transparency indicates attendance rate",
    x = "Weekly Study Hours",
    y = "Score (%)",
    color = "Subject",
    caption = "Linear trend lines with 95% confidence intervals"
  ) +
  facet_wrap(~semester, ncol = 3) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "grey40"),
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 12, face = "bold")
  )

print(p3)
`geom_smooth()` using formula = 'y ~ x'

4. Advanced: Heat Map of Performance Patterns

# Create performance bins and calculate percentages
heatmap_data <- student_data |>
  mutate(
    study_bins = cut(study_hours, 
                     breaks = c(0, 10, 15, 20, 25),
                     labels = c("Low (5-10h)", "Medium (10-15h)", 
                               "High (15-20h)", "Very High (20-25h)"),
                     include.lowest = TRUE),
    score_grade = case_when(
      score >= 90 ~ "A (90-100)",
      score >= 80 ~ "B (80-89)",
      score >= 70 ~ "C (70-79)",
      score >= 60 ~ "D (60-69)",
      TRUE ~ "F (<60)"
    )
  ) |>
  count(study_bins, score_grade, subject) |>
  group_by(study_bins, subject) |>
  mutate(percentage = n / sum(n) * 100) |>
  ungroup()

p4 <- heatmap_data |>
  ggplot(aes(x = study_bins, y = score_grade, fill = percentage)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            color = "white", fontface = "bold", size = 3) +
  scale_fill_viridis_c(option = "plasma", name = "Percentage\nof Students") +
  labs(
    title = "Grade Distribution by Study Time Investment",
    subtitle = "Percentage of students achieving each grade level by study hours",
    x = "Weekly Study Hours",
    y = "Grade Level",
    caption = "Higher study hours clearly correlate with better grades"
  ) +
  facet_wrap(~subject) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "grey40"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid = element_blank(),
    strip.text = element_text(size = 12, face = "bold")
  )

print(p4)

Tidy Code Best Practices Demonstrated

Throughout this analysis, we’ve followed key tidy R principles:

1. Pipe Operator (|>) for Readable Code Flow

# Instead of nesting function calls inside one another:
# hist(pull(filter(student_data, subject == "Mathematics"), score))

# Use pipes so each step reads left to right, top to bottom:
student_data |>
  filter(subject == "Mathematics") |>
  pull(score) |>
  hist(main = "Mathematics scores")

2. Consistent Grammar of Graphics

  • Data (data =)
  • Aesthetics (aes())
  • Geometries (geom_*())
  • Scales (scale_*())
  • Themes (theme())
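
To make the layering concrete, here is a minimal sketch (reusing the student_data tibble from the setup above) in which each grammar component maps onto one line:

student_data |>
  ggplot(aes(x = attendance, y = score)) +      # data + aesthetics
  geom_point(alpha = 0.4) +                     # geometry
  scale_y_continuous(limits = c(0, 100)) +      # scale
  labs(x = "Attendance (%)", y = "Score (%)") + # labels
  theme_minimal()                               # theme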

3. Meaningful Variable Names and Grouping

# Group operations clearly
student_data |>
  group_by(subject, semester) |>
  summarise(avg_score = mean(score)) |>
  ungroup()

4. Functional Approach with Consistent Styling

  • Use snake_case for variable names
  • Keep line length reasonable
  • Add meaningful labels and captions
  • Use consistent color schemes
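
Put together, these conventions look something like the following sketch (the semester_summary object is illustrative, built from student_data):

# snake_case names, short lines, explicit labels, one shared palette
semester_summary <- student_data |>
  group_by(semester, subject) |>
  summarise(avg_score = mean(score), .groups = "drop")

semester_summary |>
  ggplot(aes(x = semester, y = avg_score, fill = subject)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(option = "plasma", begin = 0.3, end = 0.8) +
  labs(
    title = "Average Score by Semester and Subject",
    x = "Semester",
    y = "Average Score (%)",
    fill = "Subject"
  )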

Key Takeaways

  1. Tidy data structure makes analysis intuitive and code readable
  2. dplyr verbs (filter, mutate, summarise, group_by) provide powerful data manipulation
  3. ggplot2’s grammar of graphics enables building complex visualizations systematically
  4. Consistent styling and theming makes plots publication-ready
  5. Pipe operator creates readable analysis workflows

The tidyverse ecosystem provides a coherent framework for data science that emphasizes code readability, reproducibility, and elegant solutions to common data challenges.
