The tidyverse is a collection of R packages designed for data science that share a common philosophy and design. These packages are essential for modern data analysis in R, especially for bioinformatics and genomics research.
Key packages we’ll focus on: - dplyr: For efficient data manipulation and transformation - tidyr: For cleaning and reshaping messy data into tidy format
Let’s start by loading the required packages:
# First, check if tidyverse is installed and install if needed
# This is good practice for reproducible code
if (!require("tidyverse")) install.packages("tidyverse")## Loading required package: tidyverse
## ── Attaching core tidyverse packages ─────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Before we begin, here are some valuable resources to keep handy:
%>%): Ctrl/Cmd + Shift + M<-): Alt + - (Windows/Linux) or
Option + - (Mac) These shortcuts will speed up your coding
significantly!We’ll use a gene expression dataset that mimics real RNA-seq data:
# Create a sample gene expression dataset with:
# - Gene identifiers
# - Expression values for control and treated samples (with replicates)
# - Genomic information (chromosome location)
# - Biological pathway information
gene_data <- tibble(
gene_id = c("BRCA1", "TP53", "EGFR", "KRAS", "HER2"), # Important cancer genes
control_1 = c(100, 150, 80, 200, 120), # Control replicate 1
control_2 = c(110, 140, 85, 190, 125), # Control replicate 2
treated_1 = c(200, 300, 90, 180, 240), # Treatment replicate 1
treated_2 = c(190, 280, 95, 185, 230), # Treatment replicate 2
chromosome = c("17", "17", "7", "12", "17"), # Chromosome locations
pathway = c("DNA repair", "Cell cycle", "Growth", "Signaling", "Growth") # Biological pathways
)
# Display the dataset
print(gene_data)## # A tibble: 5 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair
## 2 TP53 150 140 300 280 17 Cell cycle
## 3 EGFR 80 85 90 95 7 Growth
## 4 KRAS 200 190 180 185 12 Signaling
## 5 HER2 120 125 240 230 17 Growth
%>%)The pipe operator is a fundamental concept in modern R programming, especially for data analysis workflows.
The pipe operator (%>%) takes the output from one
function and passes it as the first argument to the next function. Think
of it like a pipeline where data flows through a series of
operations.
Benefits: 1. Makes code more readable (left-to-right flow) 2. Reduces need for intermediate variables 3. Makes complex operations more manageable
## [1] 1.150203
# Same operation with pipes (easier to follow)
# Each step is clearly visible
c(1:10) %>%
log() %>% # First take the log
abs() %>% # Then the absolute value
sqrt() %>% # Then the square root
mean() # Finally calculate the mean## [1] 1.150203
# Example with our gene data
# Traditional syntax (nested, harder to understand)
head(arrange(filter(gene_data, chromosome == "17"), desc(control_1)))## # A tibble: 3 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 TP53 150 140 300 280 17 Cell cycle
## 2 HER2 120 125 240 230 17 Growth
## 3 BRCA1 100 110 200 190 17 DNA repair
# Same operation with pipes (clear sequence of operations)
gene_data %>%
filter(chromosome == "17") %>% # First, keep only chromosome 17 genes
arrange(desc(control_1)) %>% # Then sort by control_1 values
head() # Finally, show first few rows## # A tibble: 3 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 TP53 150 140 300 280 17 Cell cycle
## 2 HER2 120 125 240 230 17 Growth
## 3 BRCA1 100 110 200 190 17 DNA repair
# Example of a well-formatted analysis pipeline
result <- gene_data %>%
filter(chromosome == "17") %>% # Filter for chromosome 17
select(gene_id, control_1, treated_1) %>% # Keep only needed columns
mutate(fold_change = treated_1 / control_1) %>% # Calculate fold change
arrange(desc(fold_change)) # Sort by fold change
print("Well-formatted pipe example result:")## [1] "Well-formatted pipe example result:"
## # A tibble: 3 × 4
## gene_id control_1 treated_1 fold_change
## <chr> <dbl> <dbl> <dbl>
## 1 BRCA1 100 200 2
## 2 TP53 150 300 2
## 3 HER2 120 240 2
The select() function helps you choose which columns to
keep or remove. This is essential for focusing on relevant data:
# Select specific columns of interest
gene_data %>%
select(gene_id, control_1, treated_1) # Keep only gene IDs and one replicate each## # A tibble: 5 × 3
## gene_id control_1 treated_1
## <chr> <dbl> <dbl>
## 1 BRCA1 100 200
## 2 TP53 150 300
## 3 EGFR 80 90
## 4 KRAS 200 180
## 5 HER2 120 240
# Select columns by pattern matching
# Useful when you have many similarly named columns
gene_data %>%
select(starts_with("control")) # Get all control samples## # A tibble: 5 × 2
## control_1 control_2
## <dbl> <dbl>
## 1 100 110
## 2 150 140
## 3 80 85
## 4 200 190
## 5 120 125
# Remove unwanted columns
# Useful for excluding replicate 2 data
gene_data %>%
select(-ends_with("2")) # Remove all replicate 2 columns## # A tibble: 5 × 5
## gene_id control_1 treated_1 chromosome pathway
## <chr> <dbl> <dbl> <chr> <chr>
## 1 BRCA1 100 200 17 DNA repair
## 2 TP53 150 300 17 Cell cycle
## 3 EGFR 80 90 7 Growth
## 4 KRAS 200 180 12 Signaling
## 5 HER2 120 240 17 Growth
Use filter() to subset rows based on conditions.
Essential for focusing on genes of interest:
# Filter genes on chromosome 17
# Common in cancer studies (many cancer genes on chr17)
gene_data %>%
filter(chromosome == "17")## # A tibble: 3 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair
## 2 TP53 150 140 300 280 17 Cell cycle
## 3 HER2 120 125 240 230 17 Growth
# Multiple conditions
# Find highly expressed genes on chr17
gene_data %>%
filter(chromosome == "17", # Must be on chromosome 17
control_1 > 100) # Must have high expression## # A tibble: 2 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 TP53 150 140 300 280 17 Cell cycle
## 2 HER2 120 125 240 230 17 Growth
# Complex conditions using %in%
# Find genes in specific pathways
gene_data %>%
filter(pathway %in% c("Growth", "Signaling")) # Growth or signaling genes## # A tibble: 3 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 EGFR 80 85 90 95 7 Growth
## 2 KRAS 200 190 180 185 12 Signaling
## 3 HER2 120 125 240 230 17 Growth
mutate() adds new columns based on calculations.
Essential for data analysis:
# Calculate summary statistics
gene_data %>%
mutate(
mean_control = (control_1 + control_2) / 2, # Average control expression
mean_treated = (treated_1 + treated_2) / 2, # Average treated expression
fold_change = mean_treated / mean_control # Fold change calculation
)## # A tibble: 5 × 10
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair
## 2 TP53 150 140 300 280 17 Cell cycle
## 3 EGFR 80 85 90 95 7 Growth
## 4 KRAS 200 190 180 185 12 Signaling
## 5 HER2 120 125 240 230 17 Growth
## # ℹ 3 more variables: mean_control <dbl>, mean_treated <dbl>, fold_change <dbl>
# Log transform expression values
# Common in RNA-seq analysis
gene_data %>%
mutate(across(
.cols = c(control_1, control_2, treated_1, treated_2), # All expression columns
.fns = log2, # Apply log2 transform
.names = "log2_{.col}" # Name new columns
))## # A tibble: 5 × 11
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair
## 2 TP53 150 140 300 280 17 Cell cycle
## 3 EGFR 80 85 90 95 7 Growth
## 4 KRAS 200 190 180 185 12 Signaling
## 5 HER2 120 125 240 230 17 Growth
## # ℹ 4 more variables: log2_control_1 <dbl>, log2_control_2 <dbl>,
## # log2_treated_1 <dbl>, log2_treated_2 <dbl>
summarize() creates summary statistics. Perfect for
getting overview of your data:
# Calculate mean expression per condition
gene_data %>%
summarise(
mean_control_1 = mean(control_1), # Average control expression
mean_treated_1 = mean(treated_1), # Average treated expression
n_genes = n() # Number of genes analyzed
)## # A tibble: 1 × 3
## mean_control_1 mean_treated_1 n_genes
## <dbl> <dbl> <int>
## 1 130 202 5
# Group by pathway and summarize
# Useful for pathway-level analysis
gene_data %>%
group_by(pathway) %>% # Group genes by pathway
summarise(
n_genes = n(), # Genes per pathway
mean_control = mean(control_1), # Average control expression
mean_treated = mean(treated_1) # Average treated expression
)## # A tibble: 4 × 4
## pathway n_genes mean_control mean_treated
## <chr> <int> <dbl> <dbl>
## 1 Cell cycle 1 150 300
## 2 DNA repair 1 100 200
## 3 Growth 2 100 165
## 4 Signaling 1 200 180
Data reshaping is crucial in bioinformatics, especially when preparing data for different analyses:
Wide format: Each sample in a separate column - Good for: Spreadsheet viewing, manual data entry - Example: RNA-seq count matrix
Long format: Each observation in a separate row - Good for: Statistical tests, plotting - Example: DESeq2 results
# Convert from wide to long format
# This is often needed for visualization and statistical testing
gene_data_long <- gene_data %>%
pivot_longer(
cols = c(control_1, control_2, treated_1, treated_2), # Expression columns
names_to = "sample", # Column for sample names
values_to = "expression" # Column for expression values
)
print("Long format (ideal for plotting and statistics):")## [1] "Long format (ideal for plotting and statistics):"
## # A tibble: 20 × 5
## gene_id chromosome pathway sample expression
## <chr> <chr> <chr> <chr> <dbl>
## 1 BRCA1 17 DNA repair control_1 100
## 2 BRCA1 17 DNA repair control_2 110
## 3 BRCA1 17 DNA repair treated_1 200
## 4 BRCA1 17 DNA repair treated_2 190
## 5 TP53 17 Cell cycle control_1 150
## 6 TP53 17 Cell cycle control_2 140
## 7 TP53 17 Cell cycle treated_1 300
## 8 TP53 17 Cell cycle treated_2 280
## 9 EGFR 7 Growth control_1 80
## 10 EGFR 7 Growth control_2 85
## 11 EGFR 7 Growth treated_1 90
## 12 EGFR 7 Growth treated_2 95
## 13 KRAS 12 Signaling control_1 200
## 14 KRAS 12 Signaling control_2 190
## 15 KRAS 12 Signaling treated_1 180
## 16 KRAS 12 Signaling treated_2 185
## 17 HER2 17 Growth control_1 120
## 18 HER2 17 Growth control_2 125
## 19 HER2 17 Growth treated_1 240
## 20 HER2 17 Growth treated_2 230
Often we need to combine information from multiple data frames. Let’s create a second dataset with additional gene information:
# Create a second dataset with gene annotations
gene_annotations <- tibble(
gene_id = c("BRCA1", "TP53", "EGFR", "KRAS", "HER2", "PTEN"),
full_name = c("Breast Cancer 1", "Tumor Protein 53", "Epidermal Growth Factor Receptor",
"KRAS Proto-Oncogene", "Human Epidermal Growth Factor Receptor 2",
"Phosphatase and Tensin Homolog"),
is_oncogene = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)
# Inner join - only keep genes present in both datasets
gene_data %>%
inner_join(gene_annotations, by = "gene_id")## # A tibble: 5 × 9
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway full_name
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repa… Breast C…
## 2 TP53 150 140 300 280 17 Cell cyc… Tumor Pr…
## 3 EGFR 80 85 90 95 7 Growth Epiderma…
## 4 KRAS 200 190 180 185 12 Signaling KRAS Pro…
## 5 HER2 120 125 240 230 17 Growth Human Ep…
## # ℹ 1 more variable: is_oncogene <lgl>
# Left join - keep all genes from gene_data
gene_data %>%
left_join(gene_annotations, by = "gene_id")## # A tibble: 5 × 9
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway full_name
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repa… Breast C…
## 2 TP53 150 140 300 280 17 Cell cyc… Tumor Pr…
## 3 EGFR 80 85 90 95 7 Growth Epiderma…
## 4 KRAS 200 190 180 185 12 Signaling KRAS Pro…
## 5 HER2 120 125 240 230 17 Growth Human Ep…
## # ℹ 1 more variable: is_oncogene <lgl>
# Full join - keep all genes from both datasets
gene_data %>%
full_join(gene_annotations, by = "gene_id")## # A tibble: 6 × 9
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway full_name
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repa… Breast C…
## 2 TP53 150 140 300 280 17 Cell cyc… Tumor Pr…
## 3 EGFR 80 85 90 95 7 Growth Epiderma…
## 4 KRAS 200 190 180 185 12 Signaling KRAS Pro…
## 5 HER2 120 125 240 230 17 Growth Human Ep…
## 6 PTEN NA NA NA NA <NA> <NA> Phosphat…
## # ℹ 1 more variable: is_oncogene <lgl>
dplyr provides functions for set operations between data frames:
# Create two datasets with some overlapping genes
set1 <- gene_data %>% filter(chromosome == "17")
set2 <- gene_data %>% filter(pathway == "Growth")
# Union - combine unique rows
bind_rows(set1, set2) %>% distinct()## # A tibble: 4 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair
## 2 TP53 150 140 300 280 17 Cell cycle
## 3 HER2 120 125 240 230 17 Growth
## 4 EGFR 80 85 90 95 7 Growth
## # A tibble: 1 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 HER2 120 125 240 230 17 Growth
## # A tibble: 2 × 7
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair
## 2 TP53 150 140 300 280 17 Cell cycle
# Group by multiple columns
gene_data %>%
group_by(chromosome, pathway) %>%
summarise(
n_genes = n(),
mean_expression = mean(control_1),
.groups = "drop"
)## # A tibble: 5 × 4
## chromosome pathway n_genes mean_expression
## <chr> <chr> <int> <dbl>
## 1 12 Signaling 1 200
## 2 17 Cell cycle 1 150
## 3 17 DNA repair 1 100
## 4 17 Growth 1 120
## 5 7 Growth 1 80
# Grouped mutations
gene_data %>%
group_by(chromosome) %>%
mutate(
rel_expression = control_1 / mean(control_1),
rank = min_rank(desc(control_1))
) %>%
ungroup()## # A tibble: 5 × 9
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair
## 2 TP53 150 140 300 280 17 Cell cycle
## 3 EGFR 80 85 90 95 7 Growth
## 4 KRAS 200 190 180 185 12 Signaling
## 5 HER2 120 125 240 230 17 Growth
## # ℹ 2 more variables: rel_expression <dbl>, rank <int>
# Create a function to calculate fold change
calculate_fold_change <- function(treated, control) {
if (any(control <= 0)) {
warning("Control values should be positive")
return(NA)
}
return(treated / control)
}
# Use the function with mutate
gene_data %>%
mutate(
fc_1 = calculate_fold_change(treated_1, control_1),
fc_2 = calculate_fold_change(treated_2, control_2)
)## # A tibble: 5 × 9
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway fc_1 fc_2
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 BRCA1 100 110 200 190 17 DNA re… 2 1.73
## 2 TP53 150 140 300 280 17 Cell c… 2 2
## 3 EGFR 80 85 90 95 7 Growth 1.12 1.12
## 4 KRAS 200 190 180 185 12 Signal… 0.9 0.974
## 5 HER2 120 125 240 230 17 Growth 2 1.84
# Function to filter and summarize expression data
analyze_expression <- function(data, chr = NULL, min_expression = 0) {
# Start with the data
result <- data
# Filter by chromosome if specified
if (!is.null(chr)) {
result <- result %>% filter(chromosome == chr)
}
# Apply expression threshold
result <- result %>%
filter(control_1 > min_expression) %>%
mutate(
mean_control = (control_1 + control_2) / 2,
mean_treated = (treated_1 + treated_2) / 2,
fold_change = mean_treated / mean_control
) %>%
select(gene_id, chromosome, pathway, mean_control, mean_treated, fold_change)
return(result)
}
# Use the custom function
gene_data %>%
analyze_expression(chr = "17", min_expression = 100)## # A tibble: 2 × 6
## gene_id chromosome pathway mean_control mean_treated fold_change
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 TP53 17 Cell cycle 145 290 2
## 2 HER2 17 Growth 122. 235 1.92
# Function to perform standard analysis pipeline
standard_analysis <- function(data, group_col) {
group_col <- enquo(group_col) # Quote the grouping column
data %>%
group_by(!!group_col) %>%
summarise(
n_genes = n(),
mean_expression = mean(control_1),
up_regulated = sum(treated_1 > control_1),
down_regulated = sum(treated_1 < control_1),
.groups = "drop"
) %>%
mutate(
pct_up = up_regulated / n_genes * 100,
pct_down = down_regulated / n_genes * 100
)
}
# Use the pipeline function
gene_data %>%
standard_analysis(pathway)## # A tibble: 4 × 7
## pathway n_genes mean_expression up_regulated down_regulated pct_up pct_down
## <chr> <int> <dbl> <int> <int> <dbl> <dbl>
## 1 Cell cycle 1 150 1 0 100 0
## 2 DNA repair 1 100 1 0 100 0
## 3 Growth 2 100 2 0 100 0
## 4 Signaling 1 200 0 1 0 100
Using the gene_data dataset: 1. Filter for genes on chromosome 17 2. Calculate the fold change between treated and control conditions 3. Select only gene_id and fold change columns
gene_data %>%
filter(chromosome == "17") %>%
mutate(
mean_control = (control_1 + control_2) / 2,
mean_treated = (treated_1 + treated_2) / 2,
fold_change = mean_treated / mean_control
) %>%
select(gene_id, fold_change)## # A tibble: 3 × 2
## gene_id fold_change
## <chr> <dbl>
## 1 BRCA1 1.86
## 2 TP53 2
## 3 HER2 1.92
# Convert to long format and summarize
gene_data %>%
pivot_longer(
cols = c(control_1, control_2, treated_1, treated_2),
names_to = "sample",
values_to = "expression"
) %>%
group_by(gene_id, pathway) %>%
summarise(
mean_expression = mean(expression),
.groups = "drop"
)## # A tibble: 5 × 3
## gene_id pathway mean_expression
## <chr> <chr> <dbl>
## 1 BRCA1 DNA repair 150
## 2 EGFR Growth 87.5
## 3 HER2 Growth 179.
## 4 KRAS Signaling 189.
## 5 TP53 Cell cycle 218.
# Create normalization function
normalize_expression <- function(x) {
(x - mean(x)) / sd(x)
}
# Apply to data
gene_data %>%
mutate(across(
.cols = c(control_1, control_2, treated_1, treated_2),
.fns = normalize_expression,
.names = "norm_{.col}"
)) %>%
select(gene_id, starts_with("norm_"))## # A tibble: 5 × 5
## gene_id norm_control_1 norm_control_2 norm_treated_1 norm_treated_2
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BRCA1 -0.640 -0.510 -0.0258 -0.0881
## 2 TP53 0.426 0.255 1.26 1.23
## 3 EGFR -1.07 -1.15 -1.44 -1.48
## 4 KRAS 1.49 1.53 -0.284 -0.161
## 5 HER2 -0.213 -0.128 0.490 0.499
# This works (filter then summarize)
gene_data %>%
filter(chromosome == "17") %>%
summarise(mean_exp = mean(control_1))## # A tibble: 1 × 1
## mean_exp
## <dbl>
## 1 123.
# This fails (summarize then filter)
# gene_data %>%
# summarise(mean_exp = mean(control_1)) %>%
# filter(chromosome == "17") # Error: object 'chromosome' not found# Grouping remains until explicitly removed
grouped_data <- gene_data %>%
group_by(pathway) %>%
summarise(
mean_control = mean(control_1),
.groups = "drop" # Explicitly remove grouping
)# Create data with NA
gene_data_na <- gene_data %>%
mutate(control_1 = ifelse(gene_id == "BRCA1", NA, control_1))
# Handle NA values
gene_data_na %>%
summarise(
mean_with_na = mean(control_1),
mean_without_na = mean(control_1, na.rm = TRUE)
)## # A tibble: 1 × 2
## mean_with_na mean_without_na
## <dbl> <dbl>
## 1 NA 138.
After mastering these advanced concepts, you can explore: - Data visualization with ggplot2
ggplot(gene_data_long, aes(x = gene_id, y = expression, fill = sample)) + # Map variables to plot geom_bar(stat = “identity”, position = “dodge”) + # Create grouped bars theme_minimal() + # Use clean theme labs(title = “Gene Expression by Sample”, # Add informative labels x = “Gene”, y = “Expression Level”) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate gene names
Sometimes we need to work with complex identifiers or combine/split information:
# Create complex identifiers (common in bioinformatics)
gene_data_info <- gene_data %>%
mutate(gene_info = paste(gene_id, chromosome, sep = "_chr")) # Create combined ID
# Separate the gene_info column into components
# Useful when working with complex identifiers
gene_data_split <- gene_data_info %>%
separate(gene_info, # Column to split
into = c("gene", "chr_num"), # New column names
sep = "_chr") # Split at "_chr"
print("Data with separated columns:")## [1] "Data with separated columns:"
## # A tibble: 5 × 9
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway gene
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair BRCA1
## 2 TP53 150 140 300 280 17 Cell cycle TP53
## 3 EGFR 80 85 90 95 7 Growth EGFR
## 4 KRAS 200 190 180 185 12 Signaling KRAS
## 5 HER2 120 125 240 230 17 Growth HER2
## # ℹ 1 more variable: chr_num <chr>
# Unite columns back together
# Useful for creating standardized identifiers
gene_data_united <- gene_data_split %>%
unite("gene_chr", # Name of new column
gene, chr_num, # Columns to combine
sep = "_chr") # Separator to use
print("\nData with united columns:")## [1] "\nData with united columns:"
## # A tibble: 5 × 8
## gene_id control_1 control_2 treated_1 treated_2 chromosome pathway gene_chr
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 BRCA1 100 110 200 190 17 DNA repair BRCA1_c…
## 2 TP53 150 140 300 280 17 Cell cycle TP53_ch…
## 3 EGFR 80 85 90 95 7 Growth EGFR_ch…
## 4 KRAS 200 190 180 185 12 Signaling KRAS_ch…
## 5 HER2 120 125 240 230 17 Growth HER2_ch…
Try these operations with the gene expression data:
# 1. Calculate log2 fold change for each gene
gene_data %>%
mutate(
mean_control = (control_1 + control_2) / 2, # Average control
mean_treated = (treated_1 + treated_2) / 2, # Average treated
log2_fold_change = log2(mean_treated / mean_control) # Log2 fold change
)
# 2. Find genes with absolute log2 fold change > 1
# This indicates a 2-fold or greater change
gene_data %>%
mutate(
mean_control = (control_1 + control_2) / 2,
mean_treated = (treated_1 + treated_2) / 2,
log2_fold_change = log2(mean_treated / mean_control)
) %>%
filter(abs(log2_fold_change) > 1) # |log2 FC| > 1 means >2-fold changeAnalyze expression patterns by pathway:
# 1. Calculate average expression and standard deviation by pathway
gene_data %>%
group_by(pathway) %>%
summarise(
mean_control = mean(control_1), # Average control expression
sd_control = sd(control_1), # Variability in control
mean_treated = mean(treated_1), # Average treated expression
sd_treated = sd(treated_1), # Variability in treated
n_genes = n() # Number of genes in pathway
)
# 2. Find pathways with the highest average fold change
gene_data %>%
group_by(pathway) %>%
summarise(
mean_fc = mean(treated_1 / control_1), # Average fold change
max_fc = max(treated_1 / control_1) # Maximum fold change
) %>%
arrange(desc(mean_fc)) # Sort by average fold changePractice reshaping data for different analyses:
# 1. Convert to long format and calculate statistics
gene_data %>%
pivot_longer(
cols = c(control_1, control_2, treated_1, treated_2),
names_to = c("condition", "replicate"),
names_pattern = "(.+)_(.+)", # Split names into condition and replicate
values_to = "expression"
) %>%
group_by(condition) %>%
summarise(
mean_expr = mean(expression), # Average by condition
sd_expr = sd(expression) # Variability by condition
)
# 2. Create a correlation matrix between samples
gene_data %>%
select(control_1, control_2, treated_1, treated_2) %>% # Select expression columns
cor() # Calculate correlationsAfter mastering these basics, explore:
across() for applying functions to multiple
columnsrowwise() for row-wise operationsRemember: The key to effective data manipulation is choosing the right tools for your specific needs and ensuring your code is clear and reproducible.