Vectors and Matrices in R
Introduction
In R, vectors and matrices are fundamental data structures that allow us to work with collections of data. They are especially important in bioinformatics and RNA-seq analysis, where we often need to handle large sets of gene expression values or sample measurements.
Vectors
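A vector is a one-dimensional data structure whose elements all share a single type (numeric, character, or logical). If you mix types in c(), R coerces every element to the most general type rather than raising an error.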
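The single-type rule is easiest to see by deliberately mixing types. The snippet below is a minimal sketch; the variable names `mixed` and `flags_and_counts` are made up for illustration and do not appear elsewhere in this tutorial.

```r
# Mixing types in c() coerces every element to the most general type
mixed <- c(156.7, "BRCA1", TRUE)
class(mixed)              # "character"
mixed                     # "156.7" "BRCA1" "TRUE"

# Logical values become 1/0 when combined with numbers
flags_and_counts <- c(TRUE, FALSE, 10)
class(flags_and_counts)   # "numeric"
flags_and_counts          # 1 0 10
```

This is why a stray text entry in a column of counts can quietly turn the whole vector into character strings; checking class() after building a vector catches that early.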
Creating Vectors
There are several ways to create vectors in R:
# Create vectors for different types of gene expression data
gene_expression <- c(156.7, 238.9, 184.3, 145.6) # Raw expression values
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS") # Gene identifiers
is_significant <- c(TRUE, FALSE, TRUE, TRUE) # Significance flags
# Create sequences for different purposes
sample_numbers <- 1:10 # Create sequence from 1 to 10
time_points <- seq(0, 48, by = 6) # Create sequence from 0 to 48 in steps of 6 hours
# RNA-seq data manipulation example
raw_counts <- c(1200, 1500, 800, 2000) # Raw read counts from sequencing
scaling_factor <- 0.5 # Normalization factor
normalized_counts <- raw_counts * scaling_factor # Apply normalization
# Compare expression between conditions
control_expression <- c(100, 150, 80, 200) # Expression values in control samples
treated_expression <- c(150, 180, 90, 250) # Expression values in treated samples
expression_difference <- treated_expression - control_expression # Calculate absolute difference
fold_change <- treated_expression / control_expression # Calculate fold change
# Calculate basic statistics for a vector
length(gene_names) # Count number of genes
mean(raw_counts) # Calculate average read count
median(raw_counts) # Find middle value
max(raw_counts) # Find highest count
min(raw_counts) # Find lowest count
sum(raw_counts) # Total read count
# Sort values in different ways
sorted_counts <- sort(raw_counts) # Sort counts in ascending order
sorted_counts_desc <- sort(raw_counts, decreasing = TRUE) # Sort in descending order
# Access vector elements by position
first_gene <- gene_names[1] # Get first gene name
selected_genes <- gene_names[c(1, 3)] # Get first and third gene names
# Filter elements using logical conditions
high_expression <- raw_counts > 1000 # Create logical vector for high expression
high_expressed_genes <- gene_names[high_expression] # Get names of highly expressed genes
# Use which() to find positions meeting criteria
significant_indices <- which(is_significant) # Get indices of significant genes
significant_genes <- gene_names[significant_indices] # Get names of significant genes
# Create a matrix from expression data
expression_matrix <- matrix(
c(100, 200, 150, 300, # Gene1 across Samples 1-4
120, 180, 160, 280, # Gene2 across Samples 1-4
90, 220, 170, 320), # Gene3 across Samples 1-4
nrow = 3, # Number of rows (genes)
ncol = 4, # Number of columns (samples)
byrow = TRUE # Fill matrix by rows
)
# Add row and column names to matrix
rownames(expression_matrix) <- c("Gene1", "Gene2", "Gene3") # Label genes
colnames(expression_matrix) <- c("Sample1", "Sample2", "Sample3", "Sample4") # Label samples
# Perform matrix operations
transposed_matrix <- t(expression_matrix) # Transpose matrix (swap rows and columns)
crossprod_matrix <- t(expression_matrix) %*% expression_matrix # Matrix multiplication: 4 x 4 sample cross-product (not a correlation)
scaled_matrix <- expression_matrix * 2 # Multiply all values by 2
log_matrix <- log2(expression_matrix) # Convert to log2 scale
# Access matrix elements
value <- expression_matrix[1, 2] # Get value for Gene1, Sample2
gene1_expression <- expression_matrix[1, ] # Get all values for Gene1
sample1_values <- expression_matrix[, 1] # Get all values for Sample1
subset_matrix <- expression_matrix[1:2, c(1,3)] # Get values for first two genes in samples 1 and 3
# Create RNA-seq expression matrix with replicates
expression_data <- matrix(
c(
1200, 1300, 1250, 1800, 1900, 1850, # Expression values for Gene1
800, 750, 780, 1200, 1180, 1220, # Expression values for Gene2
2000, 2100, 2050, 2080, 2150, 2090, # Expression values for Gene3
300, 320, 310, 900, 920, 880, # Expression values for Gene4
1500, 1450, 1480, 1600, 1580, 1620 # Expression values for Gene5
),
nrow = 5, # 5 genes
ncol = 6, # 6 samples (3 control, 3 treated)
byrow = TRUE # Fill matrix by rows
)
# Label rows and columns
rownames(expression_data) <- c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5") # Gene names
colnames(expression_data) <- c("Ctrl1", "Ctrl2", "Ctrl3", "Treat1", "Treat2", "Treat3") # Sample names
# Calculate mean expression for each condition
control_means <- rowMeans(expression_data[, 1:3]) # Mean of control replicates
treated_means <- rowMeans(expression_data[, 4:6]) # Mean of treated replicates
# Calculate fold changes
fold_changes <- treated_means / control_means # Ratio of treated to control means
# Identify differentially expressed genes
is_differential <- abs(log2(fold_changes)) > log2(1.5) # Genes changed more than 1.5-fold in either direction
differential_genes <- rownames(expression_data)[is_differential] # Names of differential genes
# Practice exercises
exp_condition1 <- c(1200, 800, 2000, 300, 1500) # Expression values for condition 1
exp_condition2 <- c(1800, 1200, 2080, 900, 1600) # Expression values for condition 2
# Calculate statistics
mean_exp1 <- mean(exp_condition1) # Average expression in condition 1
mean_exp2 <- mean(exp_condition2) # Average expression in condition 2
# Find highly expressed genes
high_exp1 <- exp_condition1 > 1000 # Logical vector for high expression in condition 1
high_exp2 <- exp_condition2 > 1000 # Logical vector for high expression in condition 2
# Calculate fold changes between conditions
fold_changes <- exp_condition2 / exp_condition1 # Ratio of condition 2 to condition 1
# Extract specific genes from matrix
gene1_expression <- expression_data["Gene1", ] # Expression profile of Gene1
# Calculate mean expression per sample
sample_means <- colMeans(expression_data) # Average expression in each sample
# Find highest expressed gene in each sample
max_expression <- apply(expression_data, 2, which.max) # Index of highest expressed gene per sample
highest_genes <- rownames(expression_data)[max_expression] # Names of highest expressed genes
# Create and transform sample matrix
sample_matrix <- matrix(rnorm(20, mean=100, sd=20), # Generate random expression data
nrow=4, # 4 genes
ncol=5) # 5 samples
rownames(sample_matrix) <- paste0("Gene", 1:4) # Label genes
colnames(sample_matrix) <- paste0("Sample", 1:5) # Label samples
# Transform data
log_data <- log2(sample_matrix) # Convert to log2 scale
scaled_data <- t(scale(t(log_data))) # Z-score each gene (row): center to mean 0, scale to sd 1
# Results:
# Gene Expression Values: 156.7, 238.9, 184.3, 145.6
# Gene Names: "BRCA1", "TP53", "EGFR", "KRAS"
# Time Points: 0, 6, 12, 18, 24, 30, 36, 42, 48
Vector Operations
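Arithmetic on vectors in R is element-wise: the operation is applied to each pair of corresponding elements, and a single number is reused for every element of the vector.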
# RNA-seq example: normalizing expression values
raw_counts <- c(1200, 1500, 800, 2000)
scaling_factor <- 0.5
normalized_counts <- raw_counts * scaling_factor
# Subtracting vectors (element-wise)
control_expression <- c(100, 150, 80, 200)
treated_expression <- c(150, 180, 90, 250)
expression_difference <- treated_expression - control_expression
# Calculate fold change
fold_change <- treated_expression / control_expression
# Results:
# Normalized counts: 600, 750, 400, 1000
# Expression difference: 50, 30, 10, 50
# Fold change: 1.5, 1.2, 1.125, 1.25
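Multiplying a vector by a single number, as in the normalization step, works because R recycles the shorter operand. The following sketch illustrates that rule; the vector `counts` and the multipliers are invented for the example.

```r
counts <- c(1200, 1500, 800, 2000)

# A length-1 value is recycled across every element
per_thousand <- counts / 1000       # 1.2 1.5 0.8 2.0

# A length-2 vector is recycled as c(2, 10, 2, 10)
alternating <- counts * c(2, 10)    # 2400 15000 1600 20000

# Lengths that do not divide evenly still compute, but R issues a warning;
# suppressWarnings() is used here only to keep the example quiet
uneven <- suppressWarnings(counts[1:3] * c(2, 10))   # 2400 15000 1600
```

Recycling is convenient for scalars but can hide mistakes when two vectors were meant to be the same length, so comparing length() on both operands first is a cheap safeguard.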
Vector Functions
R provides many useful functions for working with vectors:
# Length of a vector
length(gene_names) # Returns 4
# Summary statistics
mean(raw_counts) # 1375
median(raw_counts) # 1350
max(raw_counts) # 2000
min(raw_counts) # 800
sum(raw_counts) # 5500
# Sort values
sorted_counts <- sort(raw_counts)
sorted_counts_desc <- sort(raw_counts, decreasing = TRUE)
# Results:
# Sorted counts (ascending): 800, 1200, 1500, 2000
# Sorted counts (descending): 2000, 1500, 1200, 800
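sort() returns the sorted values but loses track of which gene each value belongs to. One way to keep the two vectors aligned is order(), which returns positions instead of values. The sketch below repeats the vectors from earlier in this section so that it runs on its own; `ranking`, `genes_by_count`, and `counts_by_rank` are illustrative names.

```r
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")
raw_counts <- c(1200, 1500, 800, 2000)

# order() gives the positions that would sort the vector
ranking <- order(raw_counts, decreasing = TRUE)   # 4 2 1 3

# Use those positions to reorder both vectors consistently
genes_by_count <- gene_names[ranking]   # "KRAS" "TP53" "BRCA1" "EGFR"
counts_by_rank <- raw_counts[ranking]   # 2000 1500 1200 800
```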
Accessing Vector Elements
Elements in vectors can be accessed using indices or logical conditions:
# By position (index)
first_gene <- gene_names[1] # Gets "BRCA1"
selected_genes <- gene_names[c(1, 3)] # Gets "BRCA1" and "EGFR"
# By logical condition
high_expression <- raw_counts > 1000
high_expressed_genes <- gene_names[high_expression]
# Using which() function
significant_indices <- which(is_significant)
significant_genes <- gene_names[significant_indices]
# Results:
# First gene: "BRCA1"
# Selected genes: "BRCA1", "EGFR"
# Highly expressed genes: "BRCA1", "TP53", "KRAS"
# Significant genes: "BRCA1", "EGFR", "KRAS"
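Logical conditions can also be combined before indexing. The sketch below reuses the vectors defined above (repeated here so it is self-contained) and shows the element-wise operators &, |, and !, plus negative indexing; the result names are made up for illustration.

```r
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")
raw_counts <- c(1200, 1500, 800, 2000)
is_significant <- c(TRUE, FALSE, TRUE, TRUE)

# Both conditions must hold (element-wise AND)
high_and_significant <- gene_names[raw_counts > 1000 & is_significant]  # "BRCA1" "KRAS"

# Either condition may hold (element-wise OR); here every gene qualifies
high_or_significant <- gene_names[raw_counts > 1000 | is_significant]

# Negate a logical vector with !
not_significant <- gene_names[!is_significant]   # "TP53"

# A negative index drops that position instead of selecting it
without_first <- gene_names[-1]                  # "TP53" "EGFR" "KRAS"
```

Note that the single-character & and | work element by element, which is what indexing needs; the doubled forms && and || are meant for single TRUE/FALSE values in if() conditions.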
Matrices
Matrices are two-dimensional arrays whose elements, like those of a vector, all share one type. They are a natural fit for expression data, where by convention each row is a gene and each column is a sample.
Creating Matrices
# Create a matrix from a vector of values, filled row by row (one row per gene)
expression_matrix <- matrix(
c(100, 200, 150, 300,
120, 180, 160, 280,
90, 220, 170, 320),
nrow = 3,
ncol = 4,
byrow = TRUE
)
# Add row and column names
rownames(expression_matrix) <- c("Gene1", "Gene2", "Gene3")
colnames(expression_matrix) <- c("Sample1", "Sample2", "Sample3", "Sample4")
# Results:
#       Sample1 Sample2 Sample3 Sample4
# Gene1     100     200     150     300
# Gene2     120     180     160     280
# Gene3      90     220     170     320
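The byrow = TRUE argument matters more than it looks, because matrix() fills column by column by default. The sketch below builds the same values both ways so the difference is visible; `values`, `by_row`, `by_col`, and `bound` are illustrative names, and rbind() is shown as one alternative when each gene already lives in its own vector.

```r
values <- c(100, 200, 150, 300,
            120, 180, 160, 280,
             90, 220, 170, 320)

by_row <- matrix(values, nrow = 3, ncol = 4, byrow = TRUE)
by_col <- matrix(values, nrow = 3, ncol = 4)   # default: byrow = FALSE

by_row[1, ]   # 100 200 150 300  (first four values stay together)
by_col[1, ]   # 100 300 160 220  (values were laid down column by column)
dim(by_row)   # 3 4

# Alternative: stack one named vector per gene with rbind()
bound <- rbind(
  Gene1 = c(100, 200, 150, 300),
  Gene2 = c(120, 180, 160, 280),
  Gene3 = c(90, 220, 170, 320)
)
```

Printing the matrix, or at least checking dim() and the first row, right after creating it is a quick way to confirm the values landed where you intended.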
Matrix Operations
# Transpose a matrix
transposed_matrix <- t(expression_matrix)
# Matrix multiplication: sample-by-sample cross-product (not a correlation)
crossprod_matrix <- t(expression_matrix) %*% expression_matrix
# Element-wise operations
scaled_matrix <- expression_matrix * 2
log_matrix <- log2(expression_matrix)
# Results: transposed_matrix is 4 x 3 (samples x genes); crossprod_matrix is 4 x 4 (samples x samples)
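The product t(X) %*% X is a cross-product of the sample columns, which is related to but not the same as a correlation. If a correlation between samples is what you want, base R's cor() computes it directly. The sketch below rebuilds the example matrix so it runs on its own; `sample_cor`, `gene_cor`, and `cross` are illustrative names.

```r
expression_matrix <- matrix(
  c(100, 200, 150, 300,
    120, 180, 160, 280,
     90, 220, 170, 320),
  nrow = 3, ncol = 4, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "Gene3"),
                  c("Sample1", "Sample2", "Sample3", "Sample4"))
)

# cor() correlates columns, so this is a 4 x 4 sample-by-sample matrix
sample_cor <- cor(expression_matrix)

# Transpose first to correlate genes instead (3 x 3)
gene_cor <- cor(t(expression_matrix))

# crossprod(X) is a shorthand for t(X) %*% X
cross <- crossprod(expression_matrix)
```

Unlike the raw cross-product, every entry of a correlation matrix lies between -1 and 1 and the diagonal is exactly 1, which makes samples with different overall expression levels comparable.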
Accessing Matrix Elements
# Access individual elements and subsets of the matrix
value <- expression_matrix[1, 2] # Expression of Gene1 in Sample2
gene1_expression <- expression_matrix[1, ] # All samples for Gene1
sample1_values <- expression_matrix[, 1] # All genes in Sample1
subset_matrix <- expression_matrix[1:2, c(1,3)] # Expression for 2 genes in 2 samples
# Display the results; cat() interprets "\n" as a line break, whereas print() would show it literally
cat("Value at position [1,2]:\n") # Single expression value
print(value)
cat("\nExpression values for Gene1:\n") # Expression profile of one gene
print(gene1_expression)
cat("\nValues for Sample1:\n") # Expression profile of one sample
print(sample1_values)
cat("\nSubset Matrix:\n") # Selected genes and samples
print(subset_matrix)
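Two details of matrix indexing are worth a short illustration: selecting a single row or column returns a plain vector unless you pass drop = FALSE, and rows and columns can be addressed by name once they are labeled. The sketch below rebuilds the example matrix so it is self-contained; the result names are made up for the example.

```r
expression_matrix <- matrix(
  c(100, 200, 150, 300,
    120, 180, 160, 280,
     90, 220, 170, 320),
  nrow = 3, ncol = 4, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "Gene3"),
                  c("Sample1", "Sample2", "Sample3", "Sample4"))
)

# A single row comes back as a named vector, not a matrix
one_gene <- expression_matrix[1, ]
is.matrix(one_gene)          # FALSE

# drop = FALSE keeps the result as a 1 x 4 matrix
one_gene_matrix <- expression_matrix[1, , drop = FALSE]
dim(one_gene_matrix)         # 1 4

# Index by row and column names instead of positions
by_name <- expression_matrix["Gene2", "Sample3"]   # 160
```

Keeping the matrix shape with drop = FALSE helps when the result is passed on to functions such as rowMeans() that expect two dimensions.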
Practice Exercises
- Create a vector of p-values and find which genes are significant (p < 0.05)
- Calculate the log2 fold change instead of regular fold change
- Find genes that are both:
  - Significantly changed (fold change > 1.5)
  - Highly expressed (mean expression > 1000)
Tips for Working with Vectors and Matrices
- Always check dimensions
  - Use dim() for matrices
  - Use length() for vectors
  - Ensure your data is structured as expected
- Handle missing values
  - Use is.na() to find missing values
  - Consider how to handle them (remove, impute, etc.)
- Choose appropriate transformations
  - Log transformation for skewed data
  - Scaling/normalization when comparing samples
  - Consider the biological meaning of your data
- Document your analysis
  - Add clear comments
  - Use meaningful variable names
  - Keep track of transformations applied
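The first three tips can be tried out on a small matrix. The sketch below uses invented data with one missing value and one zero count; the name `counts` and the pseudocount of 1 added before the log transform are assumptions made for illustration, not a recommendation from this tutorial.

```r
# Hypothetical 3 genes x 2 samples, with one NA and one zero
counts <- matrix(
  c(1200, NA, 800,      # Ctrl1 column
       0, 1500, 300),   # Treat1 column
  nrow = 3, ncol = 2,
  dimnames = list(c("Gene1", "Gene2", "Gene3"), c("Ctrl1", "Treat1"))
)

# Check dimensions and count missing values
dim(counts)               # 3 2
n_missing <- sum(is.na(counts))   # 1

# A single NA makes the row mean NA unless na.rm = TRUE is set
means_with_na <- rowMeans(counts)
means_ignoring_na <- rowMeans(counts, na.rm = TRUE)   # 600 1500 550

# Alternatively, drop every gene that has a missing value
complete_rows <- counts[complete.cases(counts), , drop = FALSE]

# log2(0) is -Inf, so a pseudocount (assumed here to be 1) is added first
log_counts <- log2(counts + 1)
```

Whether to ignore, remove, or impute missing values depends on why they are missing, so the na.rm and complete.cases() variants above are options to weigh rather than defaults.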
Next Steps
After mastering vectors and matrices, you can move on to:
- Working with data frames
- Statistical analysis and hypothesis testing
- Advanced visualization techniques
- Machine learning applications in R