Data Manipulation with dplyr and tidyr

Introduction to the Tidyverse

The tidyverse is a collection of R packages designed for data science that share a common philosophy and design. In this lesson, we’ll focus on two core packages:

dplyr: For data manipulation
tidyr: For data tidying and reshaping

Let’s start by loading the required packages:

# Install packages if not already installed
if (!require("tidyverse")) install.packages("tidyverse")

# Load the tidyverse (includes dplyr and tidyr)
library(tidyverse)

Helpful Resources

Before we begin, here are some valuable resources to keep handy:

RStudio Cheat Sheets:
- Data Transformation with dplyr
- Data Tidying with tidyr These cheat sheets provide quick references for the functions we’ll be using.
Keyboard Shortcuts:
- Pipe operator (%>%): Ctrl/Cmd + Shift + M
- Assignment operator (<-): Alt + - (Windows/Linux) or Option + - (Mac)

Keep these resources open in another tab while working through the examples - they’re incredibly helpful for remembering function names and arguments!

Sample Dataset

We’ll use a gene expression dataset to demonstrate these concepts:

# Load required packages for data manipulation
library(tidyverse)  # Includes dplyr, tidyr, and other data manipulation tools

# Create sample gene expression dataset
gene_data <- tibble(
  gene_id = c("BRCA1", "TP53", "EGFR", "KRAS", "HER2"),        # Gene identifiers
  control_1 = c(100, 150, 80, 200, 120),                        # Expression values for control replicate 1
  control_2 = c(110, 140, 85, 190, 125),                        # Expression values for control replicate 2
  treated_1 = c(200, 300, 90, 180, 240),                        # Expression values for treatment replicate 1
  treated_2 = c(190, 280, 95, 185, 230),                        # Expression values for treatment replicate 2
  chromosome = c("17", "17", "7", "12", "17"),                  # Chromosome locations
  pathway = c("DNA repair", "Cell cycle", "Growth", "Signaling", "Growth")  # Associated biological pathways
)

# Basic data filtering using dplyr
filtered_genes <- gene_data %>%
  filter(chromosome == "17") %>%                # Select only genes on chromosome 17
  select(gene_id, control_1, treated_1) %>%     # Keep only gene ID and first replicates
  arrange(desc(control_1))                      # Sort by control_1 values in descending order

# Calculate mean expression for each condition
mean_expression <- gene_data %>%
  rowwise() %>%                                # Operate on each row independently
  mutate(
    mean_control = mean(c(control_1, control_2)),    # Calculate mean of control replicates
    mean_treated = mean(c(treated_1, treated_2))     # Calculate mean of treated replicates
  ) %>%
  select(gene_id, mean_control, mean_treated)        # Keep only relevant columns

# Calculate fold change between conditions
fold_changes <- mean_expression %>%
  mutate(
    fold_change = mean_treated / mean_control,       # Calculate fold change
    log2_fold_change = log2(fold_change)            # Calculate log2 fold change
  )

# Convert data from wide to long format
long_data <- gene_data %>%
  pivot_longer(
    cols = c(control_1, control_2, treated_1, treated_2),  # Columns to convert
    names_to = "sample",                                   # New column for sample names
    values_to = "expression"                              # New column for expression values
  ) %>%
  separate(sample,                                        # Split sample column
          into = c("condition", "replicate"),             # New column names
          sep = "_")                                      # Separator in sample names

# Calculate summary statistics by condition
summary_stats <- long_data %>%
  group_by(condition) %>%                                # Group by experimental condition
  summarise(
    mean_expr = mean(expression),                        # Calculate mean expression
    sd_expr = sd(expression),                           # Calculate standard deviation
    n_samples = n(),                                    # Count number of samples
    sem = sd_expr / sqrt(n_samples)                     # Calculate standard error of mean
  )

# Join gene information with pathway data
pathway_analysis <- gene_data %>%
  group_by(pathway) %>%                                 # Group by biological pathway
  summarise(
    n_genes = n(),                                     # Count genes in each pathway
    mean_control = mean(control_1),                    # Calculate mean control expression
    mean_treated = mean(treated_1)                     # Calculate mean treated expression
  ) %>%
  mutate(
    pathway_fc = mean_treated / mean_control           # Calculate pathway-level fold change
  )

# Create summary table with multiple statistics
gene_summary <- gene_data %>%
  rowwise() %>%
  mutate(
    control_mean = mean(c(control_1, control_2)),      # Mean control expression
    treated_mean = mean(c(treated_1, treated_2)),      # Mean treated expression
    control_sd = sd(c(control_1, control_2)),          # Control standard deviation
    treated_sd = sd(c(treated_1, treated_2)),          # Treated standard deviation
    fold_change = treated_mean / control_mean          # Calculate fold change
  ) %>%
  select(gene_id, chromosome, pathway,                 # Select columns for final table
         control_mean, control_sd,
         treated_mean, treated_sd,
         fold_change)

# Filter significant changes
significant_changes <- gene_summary %>%
  filter(fold_change >= 1.5 | fold_change <= 0.67) %>%  # Select genes with 1.5-fold change
  arrange(desc(fold_change))                            # Sort by fold change magnitude

# Export results to CSV file
write_csv(significant_changes, "significant_genes.csv")  # Save results to file

Understanding the Pipe Operator (`%>%`)

Before diving into data manipulation, let’s understand the pipe operator (%>%), which is fundamental to writing clear and readable code in R.

What is Piping?

The pipe operator (%>%) takes the output from one function and passes it as the first argument to the next function. It helps write code that can be read from left to right, making it more intuitive and easier to understand.

Traditional vs. Piped Syntax

# Traditional nested syntax
mean(sqrt(abs(log(c(1:10)))))

# Same operation with pipes
c(1:10) %>%
  log() %>%
  abs() %>%
  sqrt() %>%
  mean()

# Example with our gene data
# Traditional syntax
head(arrange(filter(gene_data, chromosome == "17"), desc(control_1)))

# Same operation with pipes
gene_data %>%
  filter(chromosome == "17") %>%
  arrange(desc(control_1)) %>%
  head()

Why Use Pipes?

Readability: Pipes make code easier to read by showing the sequence of operations from left to right
Maintainability: Each step in the analysis is clearly separated, making it easier to modify or debug
Reduced Nesting: Eliminates the need for nested function calls or multiple intermediate objects
Code Organization: Makes it clear how data flows through a series of transformations

Pro Tips for Using Pipes

Start with the data object
Add one operation per line
Indent lines after the first pipe
Use pipes for 2 or more operations
In RStudio, type Ctrl/Cmd + Shift + M to insert the pipe operator

# Example of well-formatted piped operations
result <- gene_data %>%
  filter(chromosome == "17") %>%
  select(gene_id, control_1, treated_1) %>%
  mutate(fold_change = treated_1 / control_1) %>%
  arrange(desc(fold_change))

Data Manipulation with dplyr

1. Selecting Columns (select)

The select() function helps you choose which columns to keep or remove:

# Select specific columns
gene_data %>%
  select(gene_id, control_1, treated_1)

# Select columns by pattern
gene_data %>%
  select(starts_with("control"))

# Remove columns
gene_data %>%
  select(-ends_with("2"))

2. Filtering Rows (filter)

Use filter() to subset rows based on conditions:

# Filter genes on chromosome 17
gene_data %>%
  filter(chromosome == "17")

# Multiple conditions
gene_data %>%
  filter(chromosome == "17", 
         control_1 > 100)

# Complex conditions
gene_data %>%
  filter(pathway %in% c("Growth", "Signaling"))

3. Creating New Columns (mutate)

mutate() adds new columns based on calculations:

# Calculate mean expression for controls and treated
gene_data %>%
  mutate(
    mean_control = (control_1 + control_2) / 2,
    mean_treated = (treated_1 + treated_2) / 2,
    fold_change = mean_treated / mean_control
  )

# Log transform expression values
gene_data %>%
  mutate(across(
    .cols = c(control_1, control_2, treated_1, treated_2),
    .fns = log2,
    .names = "log2_{.col}"
  ))

4. Summarizing Data (summarize/summarise)

summarize() creates summary statistics:

# Calculate mean expression per condition
gene_data %>%
  summarise(
    mean_control_1 = mean(control_1),
    mean_treated_1 = mean(treated_1),
    n_genes = n()
  )

# Group by pathway and summarize
gene_data %>%
  group_by(pathway) %>%
  summarise(
    n_genes = n(),
    mean_control = mean(control_1),
    mean_treated = mean(treated_1)
  )

Data Tidying with tidyr

1. Reshaping Data (pivot_longer and pivot_wider)

Data often comes in different formats, and we frequently need to convert between them for different types of analysis:

Wide format: Each sample/condition is in a separate column (e.g., control_1, control_2, treated_1, treated_2)
- Good for: Viewing data in spreadsheets, manual data entry
- Example use: When you want to see all measurements for a gene in one row
Long format: Each observation is in a separate row
- Good for: Statistical analysis, plotting with ggplot2, most modeling functions
- Example use: When you need to compare conditions or create box plots

Common reasons for converting formats:

Visualization: Many plotting functions (especially ggplot2) prefer long format
Statistical Analysis: Functions like t.test() and ANOVA expect data in long format
Data Manipulation: Some calculations are easier in one format vs. another
Data Export: Different tools might require specific formats

Let’s see this in practice:

# Convert to long format
gene_data_long <- gene_data %>%
  pivot_longer(
    cols = c(control_1, control_2, treated_1, treated_2),
    names_to = "sample",
    values_to = "expression"
  )

# Convert back to wide format
gene_data_wide <- gene_data_long %>%
  pivot_wider(
    names_from = sample,
    values_from = expression
  )

2. Separating and Uniting Columns

Sometimes we need to split or combine columns:

# Add a column with gene info to split
gene_data_info <- gene_data %>%
  mutate(gene_info = paste(gene_id, chromosome, sep = "_chr"))

# Separate the gene_info column
gene_data_info %>%
  separate(gene_info, 
          into = c("gene", "chromosome"),
          sep = "_chr")

# Unite columns back together
gene_data_info %>%
  separate(gene_info, 
          into = c("gene", "chromosome"),
          sep = "_chr") %>%
  unite("gene_chr", gene, chromosome, sep = "_chr")

Advanced dplyr Operations

Joining Data Frames

Often we need to combine information from multiple data frames. Let’s create a second dataset with additional gene information:

# Create a second dataset with gene annotations
gene_annotations <- tibble(
  gene_id = c("BRCA1", "TP53", "EGFR", "KRAS", "HER2", "PTEN"),
  full_name = c("Breast Cancer 1", "Tumor Protein 53", "Epidermal Growth Factor Receptor",
                "KRAS Proto-Oncogene", "Human Epidermal Growth Factor Receptor 2", 
                "Phosphatase and Tensin Homolog"),
  is_oncogene = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Inner join - only keep genes present in both datasets
gene_data %>%
  inner_join(gene_annotations, by = "gene_id")

# Left join - keep all genes from gene_data
gene_data %>%
  left_join(gene_annotations, by = "gene_id")

# Full join - keep all genes from both datasets
gene_data %>%
  full_join(gene_annotations, by = "gene_id")

Set Operations

dplyr provides functions for set operations between data frames:

# Create two datasets with some overlapping genes
set1 <- gene_data %>% filter(chromosome == "17")
set2 <- gene_data %>% filter(pathway == "Growth")

# Union - combine unique rows
bind_rows(set1, set2) %>% distinct()

# Intersect - find common rows
inner_join(set1, set2, by = names(set1))

# Setdiff - find rows in set1 not in set2
anti_join(set1, set2, by = names(set1))

Advanced Grouping Operations

# Group by multiple columns
gene_data %>%
  group_by(chromosome, pathway) %>%
  summarise(
    n_genes = n(),
    mean_expression = mean(control_1),
    .groups = "drop"
  )

# Grouped mutations
gene_data %>%
  group_by(chromosome) %>%
  mutate(
    rel_expression = control_1 / mean(control_1),
    rank = min_rank(desc(control_1))
  ) %>%
  ungroup()

Creating Custom Functions

Basic Function Creation

# Create a function to calculate fold change
calculate_fold_change <- function(treated, control) {
  if (any(control <= 0)) {
    warning("Control values should be positive")
    return(NA)
  }
  return(treated / control)
}

# Use the function with mutate
gene_data %>%
  mutate(
    fc_1 = calculate_fold_change(treated_1, control_1),
    fc_2 = calculate_fold_change(treated_2, control_2)
  )

Functions with Multiple Arguments

# Function to filter and summarize expression data
analyze_expression <- function(data, chr = NULL, min_expression = 0) {
  # Start with the data
  result <- data
  
  # Filter by chromosome if specified
  if (!is.null(chr)) {
    result <- result %>% filter(chromosome == chr)
  }
  
  # Apply expression threshold
  result <- result %>%
    filter(control_1 > min_expression) %>%
    mutate(
      mean_control = (control_1 + control_2) / 2,
      mean_treated = (treated_1 + treated_2) / 2,
      fold_change = mean_treated / mean_control
    ) %>%
    select(gene_id, chromosome, pathway, mean_control, mean_treated, fold_change)
  
  return(result)
}

# Use the custom function
gene_data %>%
  analyze_expression(chr = "17", min_expression = 100)

Creating Pipeline Functions

# Function to perform standard analysis pipeline
standard_analysis <- function(data, group_col) {
  group_col <- enquo(group_col)  # Quote the grouping column
  
  data %>%
    group_by(!!group_col) %>%
    summarise(
      n_genes = n(),
      mean_expression = mean(control_1),
      up_regulated = sum(treated_1 > control_1),
      down_regulated = sum(treated_1 < control_1),
      .groups = "drop"
    ) %>%
    mutate(
      pct_up = up_regulated / n_genes * 100,
      pct_down = down_regulated / n_genes * 100
    )
}

# Use the pipeline function
gene_data %>%
  standard_analysis(pathway)

Practice Exercises

Exercise 1: Data Manipulation

Using the gene_data dataset:

Filter for genes on chromosome 17
Calculate the fold change between treated and control conditions
Select only gene_id and fold change columns

gene_data %>%
  filter(chromosome == "17") %>%
  mutate(
    mean_control = (control_1 + control_2) / 2,
    mean_treated = (treated_1 + treated_2) / 2,
    fold_change = mean_treated / mean_control
  ) %>%
  select(gene_id, fold_change)

Exercise 2: Data Reshaping

Convert the expression data to long format
Calculate mean expression per condition
Create a summary of fold changes per pathway

# Convert to long format and summarize
gene_data %>%
  pivot_longer(
    cols = c(control_1, control_2, treated_1, treated_2),
    names_to = "sample",
    values_to = "expression"
  ) %>%
  group_by(gene_id, pathway) %>%
  summarise(
    mean_expression = mean(expression),
    .groups = "drop"
  )

Exercise 3: Custom Functions

Create a function to normalize expression values
Apply the function to both control and treated samples
Calculate statistics on the normalized values

# Create normalization function
normalize_expression <- function(x) {
  (x - mean(x)) / sd(x)
}

# Apply to data
gene_data %>%
  mutate(across(
    .cols = c(control_1, control_2, treated_1, treated_2),
    .fns = normalize_expression,
    .names = "norm_{.col}"
  )) %>%
  select(gene_id, starts_with("norm_"))

Next Steps

After mastering these basics, you can move on to:

Advanced data manipulation techniques
Working with grouped data
Complex data transformations
Combining multiple operations with pipes