In R, vectors and matrices are fundamental data structures that allow us to work with collections of data. They are especially important in bioinformatics and RNA-seq analysis, where we often need to handle large sets of gene expression values or sample measurements.
Key concepts we’ll cover:
- Vectors: One-dimensional arrays of values
- Matrices: Two-dimensional arrays of values
- Operations and functions for both data types
- Real-world applications in RNA-seq analysis
A vector is a one-dimensional array whose elements all share a single type (numeric, character, or logical). Think of it as a single row or column of data.
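To see what "same type" means in practice, the short sketch below mixes types inside `c()`. The variable names `mixed` and `num_and_logical` are illustrative only and are not used elsewhere in this tutorial.

```r
# Mixing types in c() coerces every element to one common type
mixed <- c(1.5, "BRCA1", TRUE)          # numeric + character + logical
class(mixed)                            # "character": all values became strings

num_and_logical <- c(10, TRUE, FALSE)   # logical values become 1 and 0
class(num_and_logical)                  # "numeric"

# Checking the type before doing arithmetic avoids surprises
is.numeric(mixed)                       # FALSE
is.numeric(num_and_logical)             # TRUE
```

Silent coercion is a common source of bugs when a numeric column picks up a stray text value, so checking `class()` early is usually worthwhile.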
There are several ways to create vectors in R. Here are common methods with bioinformatics examples:
# Create vectors using c() (combine) function for different data types
gene_expression <- c(156.7, 238.9, 184.3, 145.6) # Numeric vector: Expression values in TPM
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS") # Character vector: Gene symbols
is_significant <- c(TRUE, FALSE, TRUE, TRUE) # Logical vector: Significance flags (p < 0.05)
# Create a sequence using the : operator
# Useful for sample numbers or time points
sample_numbers <- 1:10 # Creates sequence from 1 to 10 (e.g., patient IDs)
# Create more complex sequences using seq()
# Perfect for time series experiments
time_points <- seq(0, 48, by = 6) # Creates sequence 0,6,12,...,48 (hours post-treatment)
# Display the created vectors
print("Gene Expression Values:") # TPM values for each gene
## [1] "Gene Expression Values:"
print(gene_expression)
## [1] 156.7 238.9 184.3 145.6
print("Gene Names:")
## [1] "Gene Names:"
print(gene_names)
## [1] "BRCA1" "TP53" "EGFR" "KRAS"
Vectors in R support element-wise operations, making them perfect for data transformations:
# Example: RNA-seq data normalization
raw_counts <- c(1200, 1500, 800, 2000) # Raw read counts from sequencing
scaling_factor <- 0.5 # Library size normalization factor
normalized_counts <- raw_counts * scaling_factor # Scale counts by library size
# Compare control vs treated samples
control_expression <- c(100, 150, 80, 200) # Expression in control condition
treated_expression <- c(150, 180, 90, 250) # Expression after treatment
expression_difference <- treated_expression - control_expression # Absolute change
# Calculate fold change (treated/control)
# A common metric in differential expression analysis
fold_change <- treated_expression / control_expression # Relative change
# Display results of calculations
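print("Normalized counts:") # Library-size normalized values
## [1] "Normalized counts:"
print(normalized_counts)
## [1] 600 750 400 1000
print("Expression difference:")
## [1] "Expression difference:"
print(expression_difference)
## [1] 50 30 10 50
print("Fold change:")
## [1] "Fold change:"
print(fold_change)
## [1] 1.500 1.200 1.125 1.250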
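Fold changes are often reported on a log2 scale, where increases and decreases of the same magnitude are symmetric around zero. The sketch below re-declares the two expression vectors so it runs on its own; `log2_fold_change` is a name introduced here for illustration.

```r
control_expression <- c(100, 150, 80, 200)   # Expression in control condition
treated_expression <- c(150, 180, 90, 250)   # Expression after treatment

# Element-wise division, then element-wise log2
fold_change <- treated_expression / control_expression
log2_fold_change <- log2(fold_change)

# A two-fold increase and a two-fold decrease are symmetric on the log2 scale
log2(c(2, 0.5))                              # 1 and -1

# Recycling: a single value is reused for every element of the longer vector
treated_expression + 1                       # 151 181 91 251
```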
R provides many useful functions for working with vectors. These are essential for data analysis:
## [1] "Length of gene_names:"
## [1] 4
## [1] "Summary statistics for raw_counts:"
## [1] 1375
## [1] 1350
## [1] 2000
## [1] 800
## [1] 5500
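The code that produced the output above is not shown. The values are consistent with the base R summary functions below, although the exact calls and their order in the original are an assumption. The vectors are re-declared so the sketch runs on its own.

```r
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")   # Gene symbols
raw_counts <- c(1200, 1500, 800, 2000)             # Raw read counts

length(gene_names)    # Number of elements: 4
mean(raw_counts)      # Arithmetic mean: 1375
median(raw_counts)    # Middle value: 1350
max(raw_counts)       # Largest value: 2000
min(raw_counts)       # Smallest value: 800
sum(raw_counts)       # Total: 5500

# summary() reports minimum, quartiles, mean, and maximum in one call
summary(raw_counts)
```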
# Sort values (useful for ranking genes)
sorted_counts <- sort(raw_counts) # Sort by expression (ascending)
sorted_counts_desc <- sort(raw_counts, decreasing = TRUE) # Find highest expressed genes
print("Sorted counts (ascending):") # Show expression ranking
## [1] "Sorted counts (ascending):"
print(sorted_counts)
## [1] 800 1200 1500 2000
print("Sorted counts (descending):")
## [1] "Sorted counts (descending):"
print(sorted_counts_desc)
## [1] 2000 1500 1200 800
# Find unique elements (useful for finding unique genes/features)
unique_genes <- unique(gene_names) # Remove duplicate gene names
print("Unique genes:") # Show deduplicated list
## [1] "Unique genes:"
print(unique_genes)
## [1] "BRCA1" "TP53" "EGFR" "KRAS"
Elements in vectors can be accessed using indices or logical conditions. This is crucial for filtering data:
# Access elements by position (1-based indexing in R)
first_gene <- gene_names[1] # Get first gene in list
selected_genes <- gene_names[c(1, 3)] # Get specific genes of interest
# Access elements by condition (logical filtering)
high_expression <- raw_counts > 1000 # Find highly expressed genes
high_expressed_genes <- gene_names[high_expression] # Get names of high expressors
# Use which() to get indices of TRUE values
# Useful for finding significant results
significant_indices <- which(is_significant) # Find significant results
significant_genes <- gene_names[significant_indices] # Get significant gene names
# Display results of different access methods
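print("First gene:") # Single gene access
## [1] "First gene:"
print(first_gene)
## [1] "BRCA1"
print("Selected genes:")
## [1] "Selected genes:"
print(selected_genes)
## [1] "BRCA1" "EGFR"
print("Highly expressed genes:")
## [1] "Highly expressed genes:"
print(high_expressed_genes)
## [1] "BRCA1" "TP53" "KRAS"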
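Besides positions and logical conditions, elements can also be selected by name, by negative index, or by membership with `%in%`. The sketch below re-declares the vectors so it runs on its own; `genes_of_interest` and its contents (including "MYC") are hypothetical.

```r
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")
raw_counts <- c(1200, 1500, 800, 2000)

# Attach names so values can be looked up by gene symbol
names(raw_counts) <- gene_names
raw_counts["TP53"]                  # Named element with value 1500
raw_counts[c("BRCA1", "KRAS")]      # Several genes at once

# A negative index drops elements instead of selecting them
raw_counts[-3]                      # Everything except the third gene (EGFR)

# %in% tests membership, which is handy for matching against a gene list
genes_of_interest <- c("TP53", "MYC")          # hypothetical gene list
gene_names %in% genes_of_interest              # FALSE TRUE FALSE FALSE
gene_names[gene_names %in% genes_of_interest]  # "TP53"
```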
Matrices are two-dimensional arrays that also hold elements of the same type. They are particularly useful for representing expression data, where:
- Rows typically represent genes or features
- Columns typically represent samples or conditions
# Create a matrix from vector data
# This could represent an expression matrix with:
# - 3 genes (rows)
# - 4 samples (columns)
expression_matrix <- matrix(
  c(100, 200, 150, 300,   # Expression values for Gene1
    120, 180, 160, 280,   # Expression values for Gene2
    90, 220, 170, 320),   # Expression values for Gene3
  nrow = 3,               # Number of genes
  ncol = 4,               # Number of samples
  byrow = TRUE            # Fill matrix by rows (each row = one gene)
)
# Add descriptive names to rows (genes) and columns (samples)
rownames(expression_matrix) <- c("Gene1", "Gene2", "Gene3") # Gene IDs
colnames(expression_matrix) <- c("Sample1", "Sample2", "Sample3", "Sample4") # Sample IDs
# Display the created matrix
print("Expression Matrix:") # Show expression data table
## [1] "Expression Matrix:"
print(expression_matrix)
##       Sample1 Sample2 Sample3 Sample4
## Gene1     100     200     150     300
## Gene2     120     180     160     280
## Gene3      90     220     170     320
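The `byrow = TRUE` argument matters because `matrix()` fills column by column by default. The small sketch below uses illustrative values (`1:6`) and names (`by_column`, `by_row`) to show the difference.

```r
values <- 1:6

# Default: values fill the first column, then the second, and so on
by_column <- matrix(values, nrow = 2)
by_column[1, ]        # 1 3 5

# byrow = TRUE: values fill the first row, then the second
by_row <- matrix(values, nrow = 2, byrow = TRUE)
by_row[1, ]           # 1 2 3

# Both have the same shape; only the arrangement of values differs
dim(by_column)        # 2 3
dim(by_row)           # 2 3
```

If expression values are typed gene by gene, as in the matrix above, forgetting `byrow = TRUE` would silently scramble which value belongs to which gene and sample.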
# Demonstrate common matrix transformations used in bioinformatics
transposed_matrix <- t(expression_matrix) # Transpose (useful for certain analyses)
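# Matrix multiplication (used in many statistical methods)
# Note: t(X) %*% X gives the cross-products between samples (sums of products
# over genes), not correlation coefficients; use cor(expression_matrix) for those
correlation_matrix <- t(expression_matrix) %*% expression_matrix # Sample cross-products (despite the name)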
# Element-wise operations (common in data preprocessing)
scaled_matrix <- expression_matrix * 2 # Scale up all values
log_matrix <- log2(expression_matrix) # Log2 transform (standard in RNA-seq)
# Display results of matrix operations
print("Transposed Matrix:") # Genes as columns, samples as rows
## [1] "Transposed Matrix:"
print(transposed_matrix)
##         Gene1 Gene2 Gene3
## Sample1   100   120    90
## Sample2   200   180   220
## Sample3   150   160   170
## Sample4   300   280   320
print("\nCorrelation Matrix:")
## [1] "\nCorrelation Matrix:"
print(correlation_matrix)
##         Sample1 Sample2 Sample3 Sample4
## Sample1   32500   61400   49500   92400
## Sample2   61400  120800   96200  180800
## Sample3   49500   96200   77000  144200
## Sample4   92400  180800  144200  270800
print("\nLog2 Transformed Matrix:")
## [1] "\nLog2 Transformed Matrix:"
print(log_matrix)
##        Sample1  Sample2  Sample3  Sample4
## Gene1 6.643856 7.643856 7.228819 8.228819
## Gene2 6.906891 7.491853 7.321928 8.129283
## Gene3 6.491853 7.781360 7.409391 8.321928
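The matrix product `t(X) %*% X` yields raw cross-products, which grow with the magnitude of the values and are not bounded between -1 and 1. For actual correlation coefficients between samples, base R provides `cor()`. The sketch below rebuilds the same matrix so it runs on its own; `cross_products` and `sample_correlations` are names introduced here.

```r
expression_matrix <- matrix(
  c(100, 200, 150, 300,
    120, 180, 160, 280,
    90, 220, 170, 320),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "Gene3"),
                  c("Sample1", "Sample2", "Sample3", "Sample4"))
)

# Cross-products between samples: sums of products over genes
cross_products <- t(expression_matrix) %*% expression_matrix

# crossprod() computes the same quantity in a single call
cross_products_alt <- crossprod(expression_matrix)

# Pearson correlation between samples (cor() correlates columns)
sample_correlations <- cor(expression_matrix)
diag(sample_correlations)   # Each sample correlates perfectly with itself
```

With only three genes these correlations are not meaningful estimates; the point is only that `cor()` and a cross-product answer different questions.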
# Access individual elements and subsets of the matrix
value <- expression_matrix[1, 2] # Expression of Gene1 in Sample2
gene1_expression <- expression_matrix[1, ] # All samples for Gene1
sample1_values <- expression_matrix[, 1] # All genes in Sample1
subset_matrix <- expression_matrix[1:2, c(1,3)] # Expression for 2 genes in 2 samples
# Display different types of matrix access
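print("Value at position [1,2]:") # Single expression value
## [1] "Value at position [1,2]:"
print(value)
## [1] 200
print("\nExpression values for Gene1:")
## [1] "\nExpression values for Gene1:"
print(gene1_expression)
## Sample1 Sample2 Sample3 Sample4
##     100     200     150     300
print("\nValues for Sample1:")
## [1] "\nValues for Sample1:"
print(sample1_values)
## Gene1 Gene2 Gene3
##   100   120    90
print("\nSubset Matrix:")
## [1] "\nSubset Matrix:"
print(subset_matrix)
##       Sample1 Sample3
## Gene1     100     150
## Gene2     120     160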
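Note that selecting a single row or column returns a plain vector, not a matrix. When later code expects a matrix, `drop = FALSE` keeps the dimensions. The sketch below rebuilds the same matrix so it runs on its own; `gene1_vector` and `gene1_matrix` are illustrative names.

```r
expression_matrix <- matrix(
  c(100, 200, 150, 300,
    120, 180, 160, 280,
    90, 220, 170, 320),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "Gene3"),
                  c("Sample1", "Sample2", "Sample3", "Sample4"))
)

# Single-row selection drops the matrix structure by default
gene1_vector <- expression_matrix[1, ]
is.matrix(gene1_vector)                 # FALSE

# drop = FALSE keeps a 1 x 4 matrix
gene1_matrix <- expression_matrix[1, , drop = FALSE]
dim(gene1_matrix)                       # 1 4

# Rows and columns can also be selected by name
expression_matrix["Gene2", "Sample3"]   # 160
```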
Let’s work through a complete example using RNA-seq data to find differentially expressed genes:
# Create example RNA-seq expression matrix
# This represents counts for:
# - 5 genes (rows)
# - 6 samples (3 control, 3 treated)
expression_data <- matrix(
  c(
    1200, 1300, 1250, 1800, 1900, 1850,   # Gene1: Shows upregulation
    800, 750, 780, 1200, 1180, 1220,      # Gene2: Shows upregulation
    2000, 2100, 2050, 2080, 2150, 2090,   # Gene3: Stable expression
    300, 320, 310, 900, 920, 880,         # Gene4: Strong upregulation
    1500, 1450, 1480, 1600, 1580, 1620    # Gene5: Slight upregulation
  ),
  nrow = 5,       # 5 genes to analyze
  ncol = 6,       # 6 total samples
  byrow = TRUE    # Each row is one gene
)
# Add descriptive names for clarity
rownames(expression_data) <- c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5") # Gene IDs
colnames(expression_data) <- c("Ctrl1", "Ctrl2", "Ctrl3", "Treat1", "Treat2", "Treat3") # Sample IDs
# Calculate mean expression for each condition
control_means <- rowMeans(expression_data[, 1:3]) # Average expression in controls
treated_means <- rowMeans(expression_data[, 4:6]) # Average expression in treated
# Calculate fold changes (treated/control)
# This shows relative change in expression
fold_changes <- treated_means / control_means # Fold change calculation
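# Identify differentially expressed genes
# Here we use a 1.5-fold change threshold in either direction
# (fold changes are ratios and always positive, so we compare on the log2 scale)
is_differential <- abs(log2(fold_changes)) > log2(1.5) # Up- or downregulated by more than 1.5-fold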
differential_genes <- rownames(expression_data)[is_differential] # Get gene names
# Display results of our analysis
print("Expression Data Matrix:") # Raw count data
## [1] "Expression Data Matrix:"
print(expression_data)
##       Ctrl1 Ctrl2 Ctrl3 Treat1 Treat2 Treat3
## Gene1  1200  1300  1250   1800   1900   1850
## Gene2   800   750   780   1200   1180   1220
## Gene3  2000  2100  2050   2080   2150   2090
## Gene4   300   320   310    900    920    880
## Gene5  1500  1450  1480   1600   1580   1620
print("\nControl Means:")
## [1] "\nControl Means:"
print(control_means)
##     Gene1     Gene2     Gene3     Gene4     Gene5
## 1250.0000  776.6667 2050.0000  310.0000 1476.6667
print("\nTreated Means:")
## [1] "\nTreated Means:"
print(treated_means)
##    Gene1    Gene2    Gene3    Gene4    Gene5
## 1850.000 1200.000 2106.667  900.000 1600.000
print("\nFold Changes:")
## [1] "\nFold Changes:"
print(fold_changes)
##    Gene1    Gene2    Gene3    Gene4    Gene5
## 1.480000 1.545064 1.027642 2.903226 1.083521
print("\nDifferentially Expressed Genes:")
## [1] "\nDifferentially Expressed Genes:"
print(differential_genes)
## [1] "Gene2" "Gene4"
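Every gene in this example goes up or stays flat, so it does not show how a fold-change threshold treats downregulation. The sketch below uses hypothetical genes and values (`GeneA`, `GeneB`, `GeneC`) to show that a cutoff on the raw ratio only catches increases, while a cutoff on the absolute log2 fold change catches both directions. A fold-change cutoff alone is also not a statistical test, since it ignores variability between replicates.

```r
# Hypothetical condition means for three genes
control_means <- c(GeneA = 1000, GeneB = 1000, GeneC = 1000)
treated_means <- c(GeneA = 2000, GeneB = 500, GeneC = 1100)

fold_changes <- treated_means / control_means   # 2.0, 0.5, 1.1
log2_fc <- log2(fold_changes)                   # 1, -1, about 0.14

# Threshold on the raw ratio: misses the halved gene (GeneB)
up_only <- names(fold_changes)[fold_changes > 1.5]

# Threshold on |log2 fold change|: catches changes in both directions
both_directions <- names(log2_fc)[abs(log2_fc) > log2(1.5)]

up_only           # "GeneA"
both_directions   # "GeneA" "GeneB"
```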
- dim() for matrices
- length() for vectors
- is.na() to find missing values
After mastering vectors and matrices, you can move on to:
- Working with data frames
- Statistical analysis and hypothesis testing
- Advanced visualization techniques
- Machine learning applications in R
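As a quick illustration of the three checks listed above (`dim()`, `length()`, `is.na()`), the sketch below applies them to a small matrix with missing values. The matrix `counts` and its contents are hypothetical.

```r
# Hypothetical 2 x 3 count matrix with two missing values
counts <- matrix(
  c(10, NA, 30, 40, 50, NA),   # Filled column by column (the default)
  nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("S1", "S2", "S3"))
)

dim(counts)           # Rows and columns: 2 3
length(counts)        # Total number of elements: 6
length(counts[1, ])   # Length of one row taken as a vector: 3

is.na(counts)         # Logical matrix marking the missing entries
sum(is.na(counts))    # Number of missing values: 2

# Many summary functions accept na.rm = TRUE to skip missing values
rowMeans(counts, na.rm = TRUE)   # GeneA 30, GeneB 40
```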