Introduction to the Tidyverse

The tidyverse is a collection of R packages designed for data science that share a common philosophy and design. These packages are essential for modern data analysis in R, especially for bioinformatics and genomics research.

Key packages we’ll focus on: - dplyr: For efficient data manipulation and transformation - tidyr: For cleaning and reshaping messy data into tidy format

Let’s start by loading the required packages:

# First, check if tidyverse is installed and install if needed
# This is good practice for reproducible code
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ─────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the tidyverse meta-package
# This includes: dplyr, tidyr, ggplot2, and other essential packages
library(tidyverse)

Helpful Resources

Before we begin, here are some valuable resources to keep handy:

  1. RStudio Cheat Sheets:
  2. Keyboard Shortcuts:
    • Pipe operator (%>%): Ctrl/Cmd + Shift + M
    • Assignment operator (<-): Alt + - (Windows/Linux) or Option + - (Mac) These shortcuts will speed up your coding significantly!

Sample Dataset

We’ll use a gene expression dataset that mimics real RNA-seq data:

# Create a sample gene expression dataset with:
# - Gene identifiers
# - Expression values for control and treated samples (with replicates)
# - Genomic information (chromosome location)
# - Biological pathway information
gene_data <- tibble(
  gene_id = c("BRCA1", "TP53", "EGFR", "KRAS", "HER2"),     # Important cancer genes
  control_1 = c(100, 150, 80, 200, 120),                     # Control replicate 1
  control_2 = c(110, 140, 85, 190, 125),                     # Control replicate 2
  treated_1 = c(200, 300, 90, 180, 240),                     # Treatment replicate 1
  treated_2 = c(190, 280, 95, 185, 230),                     # Treatment replicate 2
  chromosome = c("17", "17", "7", "12", "17"),               # Chromosome locations
  pathway = c("DNA repair", "Cell cycle", "Growth", "Signaling", "Growth")  # Biological pathways
)

# Display the dataset
print(gene_data)
## # A tibble: 5 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 BRCA1         100       110       200       190 17         DNA repair
## 2 TP53          150       140       300       280 17         Cell cycle
## 3 EGFR           80        85        90        95 7          Growth    
## 4 KRAS          200       190       180       185 12         Signaling 
## 5 HER2          120       125       240       230 17         Growth

Understanding the Pipe Operator (%>%)

The pipe operator is a fundamental concept in modern R programming, especially for data analysis workflows.

What is Piping?

The pipe operator (%>%) takes the output from one function and passes it as the first argument to the next function. Think of it like a pipeline where data flows through a series of operations.

Benefits: 1. Makes code more readable (left-to-right flow) 2. Reduces need for intermediate variables 3. Makes complex operations more manageable

Traditional vs. Piped Syntax

# Traditional nested syntax (hard to read)
mean(sqrt(abs(log(c(1:10)))))
## [1] 1.150203
# Same operation with pipes (easier to follow)
# Each step is clearly visible
c(1:10) %>%
  log() %>%      # First take the log
  abs() %>%      # Then the absolute value
  sqrt() %>%     # Then the square root
  mean()         # Finally calculate the mean
## [1] 1.150203
# Example with our gene data
# Traditional syntax (nested, harder to understand)
head(arrange(filter(gene_data, chromosome == "17"), desc(control_1)))
## # A tibble: 3 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 TP53          150       140       300       280 17         Cell cycle
## 2 HER2          120       125       240       230 17         Growth    
## 3 BRCA1         100       110       200       190 17         DNA repair
# Same operation with pipes (clear sequence of operations)
gene_data %>%
  filter(chromosome == "17") %>%         # First, keep only chromosome 17 genes
  arrange(desc(control_1)) %>%           # Then sort by control_1 values
  head()                                 # Finally, show first few rows
## # A tibble: 3 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 TP53          150       140       300       280 17         Cell cycle
## 2 HER2          120       125       240       230 17         Growth    
## 3 BRCA1         100       110       200       190 17         DNA repair

Why Use Pipes?

  1. Readability: Code flows naturally from left to right
  2. Maintainability: Easy to add/remove/modify steps
  3. Reduced Nesting: No more deeply nested function calls
  4. Code Organization: Clear data transformation workflow

Pro Tips for Using Pipes

  1. Start with the data object
  2. Add one operation per line
  3. Indent lines after the first pipe
  4. Use pipes for 2 or more operations
  5. Use keyboard shortcut: Ctrl/Cmd + Shift + M
# Example of a well-formatted analysis pipeline
result <- gene_data %>%
  filter(chromosome == "17") %>%                    # Filter for chromosome 17
  select(gene_id, control_1, treated_1) %>%        # Keep only needed columns
  mutate(fold_change = treated_1 / control_1) %>%  # Calculate fold change
  arrange(desc(fold_change))                       # Sort by fold change

print("Well-formatted pipe example result:")
## [1] "Well-formatted pipe example result:"
print(result)
## # A tibble: 3 × 4
##   gene_id control_1 treated_1 fold_change
##   <chr>       <dbl>     <dbl>       <dbl>
## 1 BRCA1         100       200           2
## 2 TP53          150       300           2
## 3 HER2          120       240           2

Data Manipulation with dplyr

1. Selecting Columns (select)

The select() function helps you choose which columns to keep or remove. This is essential for focusing on relevant data:

# Select specific columns of interest
gene_data %>%
  select(gene_id, control_1, treated_1)  # Keep only gene IDs and one replicate each
## # A tibble: 5 × 3
##   gene_id control_1 treated_1
##   <chr>       <dbl>     <dbl>
## 1 BRCA1         100       200
## 2 TP53          150       300
## 3 EGFR           80        90
## 4 KRAS          200       180
## 5 HER2          120       240
# Select columns by pattern matching
# Useful when you have many similarly named columns
gene_data %>%
  select(starts_with("control"))  # Get all control samples
## # A tibble: 5 × 2
##   control_1 control_2
##       <dbl>     <dbl>
## 1       100       110
## 2       150       140
## 3        80        85
## 4       200       190
## 5       120       125
# Remove unwanted columns
# Useful for excluding replicate 2 data
gene_data %>%
  select(-ends_with("2"))  # Remove all replicate 2 columns
## # A tibble: 5 × 5
##   gene_id control_1 treated_1 chromosome pathway   
##   <chr>       <dbl>     <dbl> <chr>      <chr>     
## 1 BRCA1         100       200 17         DNA repair
## 2 TP53          150       300 17         Cell cycle
## 3 EGFR           80        90 7          Growth    
## 4 KRAS          200       180 12         Signaling 
## 5 HER2          120       240 17         Growth

2. Filtering Rows (filter)

Use filter() to subset rows based on conditions. Essential for focusing on genes of interest:

# Filter genes on chromosome 17
# Common in cancer studies (many cancer genes on chr17)
gene_data %>%
  filter(chromosome == "17")
## # A tibble: 3 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 BRCA1         100       110       200       190 17         DNA repair
## 2 TP53          150       140       300       280 17         Cell cycle
## 3 HER2          120       125       240       230 17         Growth
# Multiple conditions
# Find highly expressed genes on chr17
gene_data %>%
  filter(chromosome == "17",    # Must be on chromosome 17
         control_1 > 100)       # Must have high expression
## # A tibble: 2 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 TP53          150       140       300       280 17         Cell cycle
## 2 HER2          120       125       240       230 17         Growth
# Complex conditions using %in%
# Find genes in specific pathways
gene_data %>%
  filter(pathway %in% c("Growth", "Signaling"))  # Growth or signaling genes
## # A tibble: 3 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway  
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>    
## 1 EGFR           80        85        90        95 7          Growth   
## 2 KRAS          200       190       180       185 12         Signaling
## 3 HER2          120       125       240       230 17         Growth

3. Creating New Columns (mutate)

mutate() adds new columns based on calculations. Essential for data analysis:

# Calculate summary statistics
gene_data %>%
  mutate(
    mean_control = (control_1 + control_2) / 2,     # Average control expression
    mean_treated = (treated_1 + treated_2) / 2,     # Average treated expression
    fold_change = mean_treated / mean_control       # Fold change calculation
  )
## # A tibble: 5 × 10
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 BRCA1         100       110       200       190 17         DNA repair
## 2 TP53          150       140       300       280 17         Cell cycle
## 3 EGFR           80        85        90        95 7          Growth    
## 4 KRAS          200       190       180       185 12         Signaling 
## 5 HER2          120       125       240       230 17         Growth    
## # ℹ 3 more variables: mean_control <dbl>, mean_treated <dbl>, fold_change <dbl>
# Log transform expression values
# Common in RNA-seq analysis
gene_data %>%
  mutate(across(
    .cols = c(control_1, control_2, treated_1, treated_2),  # All expression columns
    .fns = log2,                                           # Apply log2 transform
    .names = "log2_{.col}"                                 # Name new columns
  ))
## # A tibble: 5 × 11
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 BRCA1         100       110       200       190 17         DNA repair
## 2 TP53          150       140       300       280 17         Cell cycle
## 3 EGFR           80        85        90        95 7          Growth    
## 4 KRAS          200       190       180       185 12         Signaling 
## 5 HER2          120       125       240       230 17         Growth    
## # ℹ 4 more variables: log2_control_1 <dbl>, log2_control_2 <dbl>,
## #   log2_treated_1 <dbl>, log2_treated_2 <dbl>

4. Summarizing Data (summarize/summarise)

summarize() creates summary statistics. Perfect for getting overview of your data:

# Calculate mean expression per condition
gene_data %>%
  summarise(
    mean_control_1 = mean(control_1),     # Average control expression
    mean_treated_1 = mean(treated_1),     # Average treated expression
    n_genes = n()                         # Number of genes analyzed
  )
## # A tibble: 1 × 3
##   mean_control_1 mean_treated_1 n_genes
##            <dbl>          <dbl>   <int>
## 1            130            202       5
# Group by pathway and summarize
# Useful for pathway-level analysis
gene_data %>%
  group_by(pathway) %>%                   # Group genes by pathway
  summarise(
    n_genes = n(),                        # Genes per pathway
    mean_control = mean(control_1),       # Average control expression
    mean_treated = mean(treated_1)        # Average treated expression
  )
## # A tibble: 4 × 4
##   pathway    n_genes mean_control mean_treated
##   <chr>        <int>        <dbl>        <dbl>
## 1 Cell cycle       1          150          300
## 2 DNA repair       1          100          200
## 3 Growth           2          100          165
## 4 Signaling        1          200          180

Data Tidying with tidyr

1. Reshaping Data (pivot_longer and pivot_wider)

Data reshaping is crucial in bioinformatics, especially when preparing data for different analyses:

Wide format: Each sample in a separate column - Good for: Spreadsheet viewing, manual data entry - Example: RNA-seq count matrix

Long format: Each observation in a separate row - Good for: Statistical tests, plotting - Example: DESeq2 results

# Convert from wide to long format
# This is often needed for visualization and statistical testing
gene_data_long <- gene_data %>%
  pivot_longer(
    cols = c(control_1, control_2, treated_1, treated_2),  # Expression columns
    names_to = "sample",                                   # Column for sample names
    values_to = "expression"                               # Column for expression values
  )

print("Long format (ideal for plotting and statistics):")
## [1] "Long format (ideal for plotting and statistics):"
print(gene_data_long)
## # A tibble: 20 × 5
##    gene_id chromosome pathway    sample    expression
##    <chr>   <chr>      <chr>      <chr>          <dbl>
##  1 BRCA1   17         DNA repair control_1        100
##  2 BRCA1   17         DNA repair control_2        110
##  3 BRCA1   17         DNA repair treated_1        200
##  4 BRCA1   17         DNA repair treated_2        190
##  5 TP53    17         Cell cycle control_1        150
##  6 TP53    17         Cell cycle control_2        140
##  7 TP53    17         Cell cycle treated_1        300
##  8 TP53    17         Cell cycle treated_2        280
##  9 EGFR    7          Growth     control_1         80
## 10 EGFR    7          Growth     control_2         85
## 11 EGFR    7          Growth     treated_1         90
## 12 EGFR    7          Growth     treated_2         95
## 13 KRAS    12         Signaling  control_1        200
## 14 KRAS    12         Signaling  control_2        190
## 15 KRAS    12         Signaling  treated_1        180
## 16 KRAS    12         Signaling  treated_2        185
## 17 HER2    17         Growth     control_1        120
## 18 HER2    17         Growth     control_2        125
## 19 HER2    17         Growth     treated_1        240
## 20 HER2    17         Growth     treated_2        230
# Now ready for visualization with ggplot2
library(ggplot2)  # For plotting

Advanced dplyr Operations

Joining Data Frames

Often we need to combine information from multiple data frames. Let’s create a second dataset with additional gene information:

# Create a second dataset with gene annotations
gene_annotations <- tibble(
  gene_id = c("BRCA1", "TP53", "EGFR", "KRAS", "HER2", "PTEN"),
  full_name = c("Breast Cancer 1", "Tumor Protein 53", "Epidermal Growth Factor Receptor",
                "KRAS Proto-Oncogene", "Human Epidermal Growth Factor Receptor 2", 
                "Phosphatase and Tensin Homolog"),
  is_oncogene = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Inner join - only keep genes present in both datasets
gene_data %>%
  inner_join(gene_annotations, by = "gene_id")
## # A tibble: 5 × 9
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   full_name
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     <chr>    
## 1 BRCA1         100       110       200       190 17         DNA repa… Breast C…
## 2 TP53          150       140       300       280 17         Cell cyc… Tumor Pr…
## 3 EGFR           80        85        90        95 7          Growth    Epiderma…
## 4 KRAS          200       190       180       185 12         Signaling KRAS Pro…
## 5 HER2          120       125       240       230 17         Growth    Human Ep…
## # ℹ 1 more variable: is_oncogene <lgl>
# Left join - keep all genes from gene_data
gene_data %>%
  left_join(gene_annotations, by = "gene_id")
## # A tibble: 5 × 9
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   full_name
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     <chr>    
## 1 BRCA1         100       110       200       190 17         DNA repa… Breast C…
## 2 TP53          150       140       300       280 17         Cell cyc… Tumor Pr…
## 3 EGFR           80        85        90        95 7          Growth    Epiderma…
## 4 KRAS          200       190       180       185 12         Signaling KRAS Pro…
## 5 HER2          120       125       240       230 17         Growth    Human Ep…
## # ℹ 1 more variable: is_oncogene <lgl>
# Full join - keep all genes from both datasets
gene_data %>%
  full_join(gene_annotations, by = "gene_id")
## # A tibble: 6 × 9
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   full_name
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     <chr>    
## 1 BRCA1         100       110       200       190 17         DNA repa… Breast C…
## 2 TP53          150       140       300       280 17         Cell cyc… Tumor Pr…
## 3 EGFR           80        85        90        95 7          Growth    Epiderma…
## 4 KRAS          200       190       180       185 12         Signaling KRAS Pro…
## 5 HER2          120       125       240       230 17         Growth    Human Ep…
## 6 PTEN           NA        NA        NA        NA <NA>       <NA>      Phosphat…
## # ℹ 1 more variable: is_oncogene <lgl>

Set Operations

dplyr provides functions for set operations between data frames:

# Create two datasets with some overlapping genes
set1 <- gene_data %>% filter(chromosome == "17")
set2 <- gene_data %>% filter(pathway == "Growth")

# Union - combine unique rows
bind_rows(set1, set2) %>% distinct()
## # A tibble: 4 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 BRCA1         100       110       200       190 17         DNA repair
## 2 TP53          150       140       300       280 17         Cell cycle
## 3 HER2          120       125       240       230 17         Growth    
## 4 EGFR           80        85        90        95 7          Growth
# Intersect - find common rows
inner_join(set1, set2, by = names(set1))
## # A tibble: 1 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>  
## 1 HER2          120       125       240       230 17         Growth
# Setdiff - find rows in set1 not in set2
anti_join(set1, set2, by = names(set1))
## # A tibble: 2 × 7
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 BRCA1         100       110       200       190 17         DNA repair
## 2 TP53          150       140       300       280 17         Cell cycle

Advanced Grouping Operations

# Group by multiple columns
gene_data %>%
  group_by(chromosome, pathway) %>%
  summarise(
    n_genes = n(),
    mean_expression = mean(control_1),
    .groups = "drop"
  )
## # A tibble: 5 × 4
##   chromosome pathway    n_genes mean_expression
##   <chr>      <chr>        <int>           <dbl>
## 1 12         Signaling        1             200
## 2 17         Cell cycle       1             150
## 3 17         DNA repair       1             100
## 4 17         Growth           1             120
## 5 7          Growth           1              80
# Grouped mutations
gene_data %>%
  group_by(chromosome) %>%
  mutate(
    rel_expression = control_1 / mean(control_1),
    rank = min_rank(desc(control_1))
  ) %>%
  ungroup()
## # A tibble: 5 × 9
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway   
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>     
## 1 BRCA1         100       110       200       190 17         DNA repair
## 2 TP53          150       140       300       280 17         Cell cycle
## 3 EGFR           80        85        90        95 7          Growth    
## 4 KRAS          200       190       180       185 12         Signaling 
## 5 HER2          120       125       240       230 17         Growth    
## # ℹ 2 more variables: rel_expression <dbl>, rank <int>

Creating Custom Functions

Basic Function Creation

# Create a function to calculate fold change
calculate_fold_change <- function(treated, control) {
  if (any(control <= 0)) {
    warning("Control values should be positive")
    return(NA)
  }
  return(treated / control)
}

# Use the function with mutate
gene_data %>%
  mutate(
    fc_1 = calculate_fold_change(treated_1, control_1),
    fc_2 = calculate_fold_change(treated_2, control_2)
  )
## # A tibble: 5 × 9
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway  fc_1  fc_2
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>   <dbl> <dbl>
## 1 BRCA1         100       110       200       190 17         DNA re…  2    1.73 
## 2 TP53          150       140       300       280 17         Cell c…  2    2    
## 3 EGFR           80        85        90        95 7          Growth   1.12 1.12 
## 4 KRAS          200       190       180       185 12         Signal…  0.9  0.974
## 5 HER2          120       125       240       230 17         Growth   2    1.84

Functions with Multiple Arguments

# Function to filter and summarize expression data
analyze_expression <- function(data, chr = NULL, min_expression = 0) {
  # Start with the data
  result <- data
  
  # Filter by chromosome if specified
  if (!is.null(chr)) {
    result <- result %>% filter(chromosome == chr)
  }
  
  # Apply expression threshold
  result <- result %>%
    filter(control_1 > min_expression) %>%
    mutate(
      mean_control = (control_1 + control_2) / 2,
      mean_treated = (treated_1 + treated_2) / 2,
      fold_change = mean_treated / mean_control
    ) %>%
    select(gene_id, chromosome, pathway, mean_control, mean_treated, fold_change)
  
  return(result)
}

# Use the custom function
gene_data %>%
  analyze_expression(chr = "17", min_expression = 100)
## # A tibble: 2 × 6
##   gene_id chromosome pathway    mean_control mean_treated fold_change
##   <chr>   <chr>      <chr>             <dbl>        <dbl>       <dbl>
## 1 TP53    17         Cell cycle         145           290        2   
## 2 HER2    17         Growth             122.          235        1.92

Creating Pipeline Functions

# Function to perform standard analysis pipeline
standard_analysis <- function(data, group_col) {
  group_col <- enquo(group_col)  # Quote the grouping column
  
  data %>%
    group_by(!!group_col) %>%
    summarise(
      n_genes = n(),
      mean_expression = mean(control_1),
      up_regulated = sum(treated_1 > control_1),
      down_regulated = sum(treated_1 < control_1),
      .groups = "drop"
    ) %>%
    mutate(
      pct_up = up_regulated / n_genes * 100,
      pct_down = down_regulated / n_genes * 100
    )
}

# Use the pipeline function
gene_data %>%
  standard_analysis(pathway)
## # A tibble: 4 × 7
##   pathway    n_genes mean_expression up_regulated down_regulated pct_up pct_down
##   <chr>        <int>           <dbl>        <int>          <int>  <dbl>    <dbl>
## 1 Cell cycle       1             150            1              0    100        0
## 2 DNA repair       1             100            1              0    100        0
## 3 Growth           2             100            2              0    100        0
## 4 Signaling        1             200            0              1      0      100

Practice Exercises

Exercise 1: Basic Data Manipulation

Using the gene_data dataset: 1. Filter for genes on chromosome 17 2. Calculate the fold change between treated and control conditions 3. Select only gene_id and fold change columns

gene_data %>%
  filter(chromosome == "17") %>%
  mutate(
    mean_control = (control_1 + control_2) / 2,
    mean_treated = (treated_1 + treated_2) / 2,
    fold_change = mean_treated / mean_control
  ) %>%
  select(gene_id, fold_change)
## # A tibble: 3 × 2
##   gene_id fold_change
##   <chr>         <dbl>
## 1 BRCA1          1.86
## 2 TP53           2   
## 3 HER2           1.92

Exercise 2: Data Reshaping

  1. Convert the expression data to long format
  2. Calculate mean expression per condition
  3. Create a summary of fold changes per pathway
# Convert to long format and summarize
gene_data %>%
  pivot_longer(
    cols = c(control_1, control_2, treated_1, treated_2),
    names_to = "sample",
    values_to = "expression"
  ) %>%
  group_by(gene_id, pathway) %>%
  summarise(
    mean_expression = mean(expression),
    .groups = "drop"
  )
## # A tibble: 5 × 3
##   gene_id pathway    mean_expression
##   <chr>   <chr>                <dbl>
## 1 BRCA1   DNA repair           150  
## 2 EGFR    Growth                87.5
## 3 HER2    Growth               179. 
## 4 KRAS    Signaling            189. 
## 5 TP53    Cell cycle           218.

Exercise 3: Custom Functions

  1. Create a function to normalize expression values
  2. Apply the function to both control and treated samples
  3. Calculate statistics on the normalized values
# Create normalization function
normalize_expression <- function(x) {
  (x - mean(x)) / sd(x)
}

# Apply to data
gene_data %>%
  mutate(across(
    .cols = c(control_1, control_2, treated_1, treated_2),
    .fns = normalize_expression,
    .names = "norm_{.col}"
  )) %>%
  select(gene_id, starts_with("norm_"))
## # A tibble: 5 × 5
##   gene_id norm_control_1 norm_control_2 norm_treated_1 norm_treated_2
##   <chr>            <dbl>          <dbl>          <dbl>          <dbl>
## 1 BRCA1           -0.640         -0.510        -0.0258        -0.0881
## 2 TP53             0.426          0.255         1.26           1.23  
## 3 EGFR            -1.07          -1.15         -1.44          -1.48  
## 4 KRAS             1.49           1.53         -0.284         -0.161 
## 5 HER2            -0.213         -0.128         0.490          0.499

Common Mistakes and Tips

  1. Chain Order Matters
# This works (filter then summarize)
gene_data %>%
  filter(chromosome == "17") %>%
  summarise(mean_exp = mean(control_1))
## # A tibble: 1 × 1
##   mean_exp
##      <dbl>
## 1     123.
# This fails (summarize then filter)
# gene_data %>%
#   summarise(mean_exp = mean(control_1)) %>%
#   filter(chromosome == "17")  # Error: object 'chromosome' not found
  1. Grouping Persistence
# Grouping remains until explicitly removed
grouped_data <- gene_data %>%
  group_by(pathway) %>%
  summarise(
    mean_control = mean(control_1),
    .groups = "drop"  # Explicitly remove grouping
  )
  1. Missing Values
# Create data with NA
gene_data_na <- gene_data %>%
  mutate(control_1 = ifelse(gene_id == "BRCA1", NA, control_1))

# Handle NA values
gene_data_na %>%
  summarise(
    mean_with_na = mean(control_1),
    mean_without_na = mean(control_1, na.rm = TRUE)
  )
## # A tibble: 1 × 2
##   mean_with_na mean_without_na
##          <dbl>           <dbl>
## 1           NA            138.

Next Steps

After mastering these advanced concepts, you can explore: - Data visualization with ggplot2

Create example visualization

ggplot(gene_data_long, aes(x = gene_id, y = expression, fill = sample)) + # Map variables to plot geom_bar(stat = “identity”, position = “dodge”) + # Create grouped bars theme_minimal() + # Use clean theme labs(title = “Gene Expression by Sample”, # Add informative labels x = “Gene”, y = “Expression Level”) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate gene names

2. Separating and Uniting Columns (tidyr)

Sometimes we need to work with complex identifiers or combine/split information:

# Create complex identifiers (common in bioinformatics)
gene_data_info <- gene_data %>%
  mutate(gene_info = paste(gene_id, chromosome, sep = "_chr"))  # Create combined ID

# Separate the gene_info column into components
# Useful when working with complex identifiers
gene_data_split <- gene_data_info %>%
  separate(gene_info,                    # Column to split
          into = c("gene", "chr_num"),   # New column names
          sep = "_chr")                  # Split at "_chr"

print("Data with separated columns:")
## [1] "Data with separated columns:"
print(gene_data_split)
## # A tibble: 5 × 9
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway    gene 
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>      <chr>
## 1 BRCA1         100       110       200       190 17         DNA repair BRCA1
## 2 TP53          150       140       300       280 17         Cell cycle TP53 
## 3 EGFR           80        85        90        95 7          Growth     EGFR 
## 4 KRAS          200       190       180       185 12         Signaling  KRAS 
## 5 HER2          120       125       240       230 17         Growth     HER2 
## # ℹ 1 more variable: chr_num <chr>
# Unite columns back together
# Useful for creating standardized identifiers
gene_data_united <- gene_data_split %>%
  unite("gene_chr",                      # Name of new column
        gene, chr_num,                   # Columns to combine
        sep = "_chr")                    # Separator to use

print("\nData with united columns:")
## [1] "\nData with united columns:"
print(gene_data_united)
## # A tibble: 5 × 8
##   gene_id control_1 control_2 treated_1 treated_2 chromosome pathway    gene_chr
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl> <chr>      <chr>      <chr>   
## 1 BRCA1         100       110       200       190 17         DNA repair BRCA1_c…
## 2 TP53          150       140       300       280 17         Cell cycle TP53_ch…
## 3 EGFR           80        85        90        95 7          Growth     EGFR_ch…
## 4 KRAS          200       190       180       185 12         Signaling  KRAS_ch…
## 5 HER2          120       125       240       230 17         Growth     HER2_ch…

Practice Exercises

Exercise 1: Basic Data Manipulation

Try these operations with the gene expression data:

# 1. Calculate log2 fold change for each gene
gene_data %>%
  mutate(
    mean_control = (control_1 + control_2) / 2,     # Average control
    mean_treated = (treated_1 + treated_2) / 2,     # Average treated
    log2_fold_change = log2(mean_treated / mean_control)  # Log2 fold change
  )

# 2. Find genes with absolute log2 fold change > 1
# This indicates a 2-fold or greater change
gene_data %>%
  mutate(
    mean_control = (control_1 + control_2) / 2,
    mean_treated = (treated_1 + treated_2) / 2,
    log2_fold_change = log2(mean_treated / mean_control)
  ) %>%
  filter(abs(log2_fold_change) > 1)  # |log2 FC| > 1 means >2-fold change

Exercise 2: Grouped Analysis

Analyze expression patterns by pathway:

# 1. Calculate average expression and standard deviation by pathway
gene_data %>%
  group_by(pathway) %>%
  summarise(
    mean_control = mean(control_1),           # Average control expression
    sd_control = sd(control_1),               # Variability in control
    mean_treated = mean(treated_1),           # Average treated expression
    sd_treated = sd(treated_1),               # Variability in treated
    n_genes = n()                             # Number of genes in pathway
  )

# 2. Find pathways with the highest average fold change
gene_data %>%
  group_by(pathway) %>%
  summarise(
    mean_fc = mean(treated_1 / control_1),    # Average fold change
    max_fc = max(treated_1 / control_1)       # Maximum fold change
  ) %>%
  arrange(desc(mean_fc))                      # Sort by average fold change

Exercise 3: Data Reshaping

Practice reshaping data for different analyses:

# 1. Convert to long format and calculate statistics
gene_data %>%
  pivot_longer(
    cols = c(control_1, control_2, treated_1, treated_2),
    names_to = c("condition", "replicate"),
    names_pattern = "(.+)_(.+)",              # Split names into condition and replicate
    values_to = "expression"
  ) %>%
  group_by(condition) %>%
  summarise(
    mean_expr = mean(expression),             # Average by condition
    sd_expr = sd(expression)                  # Variability by condition
  )

# 2. Create a correlation matrix between samples
gene_data %>%
  select(control_1, control_2, treated_1, treated_2) %>%  # Select expression columns
  cor()                                                   # Calculate correlations

Tips for Effective Data Manipulation

  1. Start with Raw Data
    • Keep original data unchanged
    • Document all transformations
    • Make analysis reproducible
  2. Use Appropriate Data Structures
    • Wide format for viewing/manual entry
    • Long format for analysis/plotting
    • Consider memory usage for large datasets
  3. Check Your Work
    • Verify transformations with small examples
    • Use summary statistics to catch errors
    • Plot data to visualize changes
  4. Optimize Performance
    • Use appropriate data types
    • Minimize redundant calculations
    • Consider using data.table for large datasets

Next Steps

After mastering these basics, explore:

  1. Advanced dplyr Functions
    • across() for applying functions to multiple columns
    • rowwise() for row-wise operations
    • Window functions for ranking and comparisons
  2. Complex Data Transformations
    • Nested data frames
    • List columns
    • Custom functions for repeated operations
  3. Integration with Other Tools
    • Bioconductor packages
    • Statistical analysis packages
    • Advanced visualization tools

Remember: The key to effective data manipulation is choosing the right tools for your specific needs and ensuring your code is clear and reproducible.