Vectors and Matrices in R

Introduction

In R, vectors and matrices are fundamental data structures that allow us to work with collections of data. They are especially important in bioinformatics and RNA-seq analysis, where we often need to handle large sets of gene expression values or sample measurements.

Key concepts we’ll cover:
- Vectors: One-dimensional arrays of values
- Matrices: Two-dimensional arrays of values
- Operations and functions for both data types
- Real-world applications in RNA-seq analysis

Vectors

A vector is a one-dimensional array whose elements all share a single type (numeric, character, or logical). If you mix types, R silently coerces every element to one common type. Think of a vector as a single row or column of data.

Creating Vectors

There are several ways to create vectors in R. Here are common methods with bioinformatics examples:

# Create vectors using c() (combine) function for different data types
gene_expression <- c(156.7, 238.9, 184.3, 145.6)  # Numeric vector: Expression values in TPM
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")  # Character vector: Gene symbols
is_significant <- c(TRUE, FALSE, TRUE, TRUE)       # Logical vector: Significance flags (p < 0.05)

# Create a sequence using the : operator
# Useful for sample numbers or time points
sample_numbers <- 1:10                # Creates sequence from 1 to 10 (e.g., patient IDs)

# Create more complex sequences using seq()
# Perfect for time series experiments
time_points <- seq(0, 48, by = 6)     # Creates sequence 0,6,12,...,48 (hours post-treatment)

# Display the created vectors
print("Gene Expression Values:")      # TPM values for each gene

## [1] "Gene Expression Values:"

print(gene_expression)

## [1] 156.7 238.9 184.3 145.6

print("Gene Names:")                  # Corresponding gene symbols

## [1] "Gene Names:"

print(gene_names)

## [1] "BRCA1" "TP53"  "EGFR"  "KRAS"
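The single-type rule is easiest to see by mixing types on purpose. The sketch below is illustrative only; the names mixed, flags_as_numbers, n_true, and time_grid are not part of the examples above.

```r
# Mixing types in c() does not raise an error; R coerces to one common type
mixed <- c(156.7, "BRCA1", TRUE)
class(mixed)                                  # "character": every element became a string

# Logical values coerce to numbers (TRUE = 1, FALSE = 0)
flags_as_numbers <- c(TRUE, FALSE, TRUE) + 0  # 1 0 1
n_true <- sum(c(TRUE, FALSE, TRUE))           # counts the TRUE values: 2

# seq() can also fix the number of points instead of the step size
time_grid <- seq(0, 48, length.out = 5)       # 0 12 24 36 48
```

Checking class() after building a vector is a quick way to catch accidental coercion, for example when a numeric column contains a stray text entry.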

Vector Operations

Vectors in R support element-wise operations, making them perfect for data transformations:

# Example: RNA-seq data normalization
raw_counts <- c(1200, 1500, 800, 2000)     # Raw read counts from sequencing
scaling_factor <- 0.5                       # Library size normalization factor
normalized_counts <- raw_counts * scaling_factor  # Scale counts by library size

# Compare control vs treated samples
control_expression <- c(100, 150, 80, 200)      # Expression in control condition
treated_expression <- c(150, 180, 90, 250)      # Expression after treatment
expression_difference <- treated_expression - control_expression  # Absolute change

# Calculate fold change (treated/control)
# A common metric in differential expression analysis
fold_change <- treated_expression / control_expression  # Relative change

# Display results of calculations
print("Normalized counts:")                # Library-size normalized values

## [1] "Normalized counts:"

print(normalized_counts)

## [1]  600  750  400 1000

print("Expression difference:")            # Absolute expression changes

## [1] "Expression difference:"

print(expression_difference)

## [1] 50 30 10 50

print("Fold change:")                      # Relative expression changes

## [1] "Fold change:"

print(fold_change)

## [1] 1.500 1.200 1.125 1.250
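Element-wise operations rely on R's recycling rule: when two vectors differ in length, the shorter one is repeated to match the longer one. The sketch below shows how this can hide a mistake; the vector names are illustrative, and the stopifnot() guard is one possible safeguard rather than a required step.

```r
control <- c(100, 150, 80, 200)

# A length-1 vector is recycled across all elements (this is how `* 0.5` works)
halved <- control * 0.5                 # 50 75 40 100

# A length-2 vector is recycled as 1, 10, 1, 10, with no warning
# because 4 is a multiple of 2
recycled <- control * c(1, 10)          # 100 1500 80 2000

# Guard against accidental recycling before comparing two conditions
treated <- c(150, 180, 90, 250)
stopifnot(length(treated) == length(control))
fold_change <- treated / control        # 1.5 1.2 1.125 1.25
```

R only warns when the longer length is not a multiple of the shorter one, so an explicit length check is the safer habit.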

Vector Functions

R provides many useful functions for working with vectors. These are essential for data analysis:

# Basic vector operations
print("Length of gene_names:")            # Number of genes in our dataset

## [1] "Length of gene_names:"

length(gene_names)                        # Count elements in vector

## [1] 4

# Calculate common statistical measures
print("Summary statistics for raw_counts:")

## [1] "Summary statistics for raw_counts:"

mean(raw_counts)                          # Average expression level

## [1] 1375

median(raw_counts)                        # Middle expression value (robust to outliers)

## [1] 1350

max(raw_counts)                          # Highest expression value

## [1] 2000

min(raw_counts)                          # Lowest expression value

## [1] 800

sum(raw_counts)                          # Total counts (library size)

## [1] 5500

# Sort values (useful for ranking genes)
sorted_counts <- sort(raw_counts)         # Sort by expression (ascending)
sorted_counts_desc <- sort(raw_counts, decreasing = TRUE)  # Find highest expressed genes

print("Sorted counts (ascending):")       # Show expression ranking

## [1] "Sorted counts (ascending):"

print(sorted_counts)

## [1]  800 1200 1500 2000

print("Sorted counts (descending):")      # Show highest to lowest

## [1] "Sorted counts (descending):"

print(sorted_counts_desc)

## [1] 2000 1500 1200  800

# Find unique elements (useful for finding unique genes/features)
unique_genes <- unique(gene_names)        # Remove duplicate gene names
print("Unique genes:")                    # Show deduplicated list

## [1] "Unique genes:"

print(unique_genes)

## [1] "BRCA1" "TP53"  "EGFR"  "KRAS"
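sort() returns the sorted values but loses track of which gene each value belongs to. A minimal sketch of ranking gene names by expression with order() and which.max(), re-creating the two vectors so the block runs on its own; ranking, ranked_genes, ranked_counts, and top_gene are illustrative names.

```r
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")
raw_counts <- c(1200, 1500, 800, 2000)

# order() returns positions rather than values, highest count first here
ranking <- order(raw_counts, decreasing = TRUE)   # 4 2 1 3

# Apply the same positions to both vectors so names and counts stay paired
ranked_genes <- gene_names[ranking]               # "KRAS" "TP53" "BRCA1" "EGFR"
ranked_counts <- raw_counts[ranking]              # 2000 1500 1200 800

# which.max() gives the position of the single largest value
top_gene <- gene_names[which.max(raw_counts)]     # "KRAS"
```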

Accessing Vector Elements

Elements in vectors can be accessed using indices or logical conditions. This is crucial for filtering data:

# Access elements by position (1-based indexing in R)
first_gene <- gene_names[1]               # Get first gene in list
selected_genes <- gene_names[c(1, 3)]     # Get specific genes of interest

# Access elements by condition (logical filtering)
high_expression <- raw_counts > 1000      # Find highly expressed genes
high_expressed_genes <- gene_names[high_expression]  # Get names of high expressors

# Use which() to get indices of TRUE values
# Useful for finding significant results
significant_indices <- which(is_significant)  # Find significant results
significant_genes <- gene_names[significant_indices]  # Get significant gene names

# Display results of different access methods
print("First gene:")                      # Single gene access

## [1] "First gene:"

print(first_gene)

## [1] "BRCA1"

print("Selected genes:")                  # Multiple gene access

## [1] "Selected genes:"

print(selected_genes)

## [1] "BRCA1" "EGFR"

print("Highly expressed genes:")          # Filtered gene list

## [1] "Highly expressed genes:"

print(high_expressed_genes)

## [1] "BRCA1" "TP53"  "KRAS"
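Vectors can also carry names, which allows access by gene symbol instead of position. The sketch below stores the counts as a named vector; this is an alternative layout for illustration, and "MYC" is a hypothetical gene of interest that is deliberately absent from the data.

```r
# Named vector: each count is labelled with its gene symbol
raw_counts <- c(BRCA1 = 1200, TP53 = 1500, EGFR = 800, KRAS = 2000)

tp53_count <- raw_counts["TP53"]              # access by name
without_first <- raw_counts[-1]               # negative index drops an element

# %in% tests membership and is safe when some requested genes are missing
genes_of_interest <- c("TP53", "MYC")
present <- names(raw_counts) %in% genes_of_interest
found <- raw_counts[present]                  # only TP53 is found

# Logical filtering keeps the names attached
high_names <- names(raw_counts)[raw_counts > 1000]   # "BRCA1" "TP53" "KRAS"
```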

Matrices

Matrices are two-dimensional arrays that also hold elements of the same type. They are particularly useful for representing expression data where:
- Rows typically represent genes or features
- Columns typically represent samples or conditions

Creating Matrices

# Create a matrix from vector data
# This could represent an expression matrix with:
# - 3 genes (rows)
# - 4 samples (columns)
expression_matrix <- matrix(
  c(100, 200, 150, 300,    # Expression values for Gene1
    120, 180, 160, 280,    # Expression values for Gene2
    90, 220, 170, 320),    # Expression values for Gene3
  nrow = 3,                # Number of genes
  ncol = 4,                # Number of samples
  byrow = TRUE            # Fill matrix by rows (each row = one gene)
)

# Add descriptive names to rows (genes) and columns (samples)
rownames(expression_matrix) <- c("Gene1", "Gene2", "Gene3")  # Gene IDs
colnames(expression_matrix) <- c("Sample1", "Sample2", "Sample3", "Sample4")  # Sample IDs

# Display the created matrix
print("Expression Matrix:")               # Show expression data table

## [1] "Expression Matrix:"

print(expression_matrix)

##       Sample1 Sample2 Sample3 Sample4
## Gene1     100     200     150     300
## Gene2     120     180     160     280
## Gene3      90     220     170     320
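The byrow argument matters because matrix() fills column by column unless told otherwise. A small sketch of the difference, plus rbind() as another way to stack per-gene vectors; all values and names here are illustrative.

```r
values <- 1:6

by_row <- matrix(values, nrow = 2, byrow = TRUE)  # rows: 1 2 3 / 4 5 6
by_col <- matrix(values, nrow = 2)                # default fill, rows: 1 3 5 / 2 4 6

dim(by_row)                                       # 2 rows, 3 columns

# rbind() stacks named vectors as rows and keeps their names as column names
gene1 <- c(Sample1 = 100, Sample2 = 200)
gene2 <- c(Sample1 = 120, Sample2 = 180)
stacked <- rbind(Gene1 = gene1, Gene2 = gene2)
```

Printing the matrix or calling dim() right after creating it is a cheap way to confirm that genes ended up in rows and samples in columns.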

Matrix Operations

# Demonstrate common matrix transformations used in bioinformatics
transposed_matrix <- t(expression_matrix)  # Transpose (useful for certain analyses)

# Matrix multiplication (used in many statistical methods)
# t(X) %*% X is the sample-by-sample cross-product of raw values, not a correlation;
# cor(expression_matrix) would give actual sample-to-sample correlations
crossproduct_matrix <- t(expression_matrix) %*% expression_matrix  # Sample cross-products

# Element-wise operations (common in data preprocessing)
scaled_matrix <- expression_matrix * 2     # Scale up all values
log_matrix <- log2(expression_matrix)      # Log2 transform (standard in RNA-seq)

# Display results of matrix operations
print("Transposed Matrix:")               # Genes as columns, samples as rows

## [1] "Transposed Matrix:"

print(transposed_matrix)

##         Gene1 Gene2 Gene3
## Sample1   100   120    90
## Sample2   200   180   220
## Sample3   150   160   170
## Sample4   300   280   320

print("Cross-product Matrix:")            # Sample-by-sample cross-products

## [1] "Cross-product Matrix:"

print(crossproduct_matrix)

##         Sample1 Sample2 Sample3 Sample4
## Sample1   32500   61400   49500   92400
## Sample2   61400  120800   96200  180800
## Sample3   49500   96200   77000  144200
## Sample4   92400  180800  144200  270800

print("Log2 Transformed Matrix:")         # Log2-transformed expression values

## [1] "Log2 Transformed Matrix:"

print(log_matrix)

##        Sample1  Sample2  Sample3  Sample4
## Gene1 6.643856 7.643856 7.228819 8.228819
## Gene2 6.906891 7.491853 7.321928 8.129283
## Gene3 6.491853 7.781360 7.409391 8.321928
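Sample-to-sample correlation is normally computed with cor(), which correlates the columns of a matrix, rather than by multiplying the raw matrix with its transpose. A minimal sketch, rebuilding the same 3 x 4 matrix so the block runs on its own; sample_cor, gene_cor, and cross_prod are illustrative names, and three genes are far too few for the correlations to be meaningful.

```r
expression_matrix <- matrix(
  c(100, 200, 150, 300,
    120, 180, 160, 280,
     90, 220, 170, 320),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "Gene3"), paste0("Sample", 1:4))
)

# cor() works column-wise: columns are samples, so this is a 4 x 4 matrix
sample_cor <- cor(expression_matrix)

# Transpose first to correlate genes instead (3 x 3 matrix)
gene_cor <- cor(t(expression_matrix))

# crossprod(x) is a shorthand for t(x) %*% x
cross_prod <- crossprod(expression_matrix)
```

Correlation values always fall between -1 and 1 with 1 on the diagonal, which makes them easy to tell apart from raw cross-products.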

Accessing Matrix Elements

# Access individual elements and subsets of the matrix
value <- expression_matrix[1, 2]           # Expression of Gene1 in Sample2
gene1_expression <- expression_matrix[1, ]  # All samples for Gene1
sample1_values <- expression_matrix[, 1]    # All genes in Sample1
subset_matrix <- expression_matrix[1:2, c(1,3)]  # Expression for 2 genes in 2 samples

# Display different types of matrix access
print("Value at position [1,2]:")          # Single expression value

## [1] "Value at position [1,2]:"

print(value)

## [1] 200

print("Expression values for Gene1:")      # Expression profile of one gene

## [1] "Expression values for Gene1:"

print(gene1_expression)

## Sample1 Sample2 Sample3 Sample4 
##     100     200     150     300

print("Values for Sample1:")               # Expression profile of one sample

## [1] "Values for Sample1:"

print(sample1_values)

## Gene1 Gene2 Gene3 
##   100   120    90

print("Subset Matrix:")                    # Selected genes and samples

## [1] "Subset Matrix:"

print(subset_matrix)

##       Sample1 Sample3
## Gene1     100     150
## Gene2     120     160

RNA-seq Example: Differential Expression Analysis

Let’s work through a simplified example that uses RNA-seq data and a fold-change threshold to flag candidate differentially expressed genes:

# Create example RNA-seq expression matrix
# This represents counts for:
# - 5 genes (rows)
# - 6 samples (3 control, 3 treated)
expression_data <- matrix(
  c(
    1200, 1300, 1250, 1800, 1900, 1850,  # Gene1: Shows upregulation
    800,  750,  780,  1200, 1180, 1220,   # Gene2: Shows upregulation
    2000, 2100, 2050, 2080, 2150, 2090,   # Gene3: Stable expression
    300,  320,  310,  900,  920,  880,    # Gene4: Strong upregulation
    1500, 1450, 1480, 1600, 1580, 1620    # Gene5: Slight upregulation
  ),
  nrow = 5,                               # 5 genes to analyze
  ncol = 6,                               # 6 total samples
  byrow = TRUE                            # Each row is one gene
)

# Add descriptive names for clarity
rownames(expression_data) <- c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5")  # Gene IDs
colnames(expression_data) <- c("Ctrl1", "Ctrl2", "Ctrl3", "Treat1", "Treat2", "Treat3")  # Sample IDs

# Calculate mean expression for each condition
control_means <- rowMeans(expression_data[, 1:3])    # Average expression in controls
treated_means <- rowMeans(expression_data[, 4:6])    # Average expression in treated

# Calculate fold changes (treated/control)
# This shows relative change in expression
fold_changes <- treated_means / control_means        # Fold change calculation

# Identify differentially expressed genes
# Here we use a 1.5-fold change threshold in either direction:
# abs() of the log2 fold change also catches 1.5-fold decreases
is_differential <- abs(log2(fold_changes)) > log2(1.5)  # Flag genes passing the threshold
differential_genes <- rownames(expression_data)[is_differential]  # Get gene names

# Display results of our analysis
print("Expression Data Matrix:")                     # Raw count data

## [1] "Expression Data Matrix:"

print(expression_data)

##       Ctrl1 Ctrl2 Ctrl3 Treat1 Treat2 Treat3
## Gene1  1200  1300  1250   1800   1900   1850
## Gene2   800   750   780   1200   1180   1220
## Gene3  2000  2100  2050   2080   2150   2090
## Gene4   300   320   310    900    920    880
## Gene5  1500  1450  1480   1600   1580   1620

print("Control Means:")                             # Average control expression

## [1] "Control Means:"

print(control_means)

##     Gene1     Gene2     Gene3     Gene4     Gene5 
## 1250.0000  776.6667 2050.0000  310.0000 1476.6667

print("Treated Means:")                             # Average treated expression

## [1] "Treated Means:"

print(treated_means)

##    Gene1    Gene2    Gene3    Gene4    Gene5 
## 1850.000 1200.000 2106.667  900.000 1600.000

print("Fold Changes:")                              # Expression changes

## [1] "Fold Changes:"

print(fold_changes)

##    Gene1    Gene2    Gene3    Gene4    Gene5 
## 1.480000 1.545064 1.027642 2.903226 1.083521

print("Differentially Expressed Genes:")            # Genes passing the fold-change threshold

## [1] "Differentially Expressed Genes:"

print(differential_genes)

## [1] "Gene2" "Gene4"
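The comparison above uses the counts as given. When samples differ in sequencing depth, counts are commonly rescaled by library size before conditions are compared. The sketch below shows one simple option, counts per million (CPM), on a small hypothetical matrix; the values and the names counts, library_sizes, and cpm are illustrative and not part of the example above.

```r
# Hypothetical counts: 2 genes x 3 samples with different sequencing depths
counts <- matrix(
  c(100,  300,  50,
    900, 1700, 450),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB"), c("S1", "S2", "S3"))
)

# Library size = total counts per sample (column sums)
library_sizes <- colSums(counts)                   # 1000 2000 500

# sweep() divides each column by its own library size; then scale to one million
cpm <- sweep(counts, 2, library_sizes, "/") * 1e6

colSums(cpm)                                       # every sample now sums to 1e6
```

After scaling, GeneA has the same CPM in S1 and S3 even though its raw counts differ by a factor of two, which is the effect this kind of normalization is meant to achieve.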

Practice Exercises

- Create a vector of p-values and find which genes are significant (p < 0.05)
- Calculate the log2 fold change instead of the regular fold change
- Find genes that are both:
  - Substantially changed (fold change > 1.5)
  - Highly expressed (mean expression > 1000)

Tips for Working with Vectors and Matrices

- Always check dimensions
  - Use dim() for matrices
  - Use length() for vectors
  - Ensure your data is structured as expected
- Handle missing values
  - Use is.na() to find missing values
  - Consider how to handle them (remove, impute, etc.)
- Choose appropriate transformations
  - Log transformation for skewed data
  - Scaling/normalization when comparing samples
  - Consider the biological meaning of your data
- Document your analysis
  - Add clear comments
  - Use meaningful variable names
  - Keep track of transformations applied
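Several of these tips can be sketched in a few lines. The matrix below is hypothetical and deliberately contains a missing value and a zero; the pseudocount of 1 is a common convention assumed here for illustration, not a rule from this tutorial.

```r
# Hypothetical counts with one NA and one zero (filled column by column)
counts <- matrix(
  c(10, NA, 0, 250, 40, 8),
  nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("S1", "S2", "S3"))
)

# Check dimensions before doing anything else
stopifnot(identical(dim(counts), c(2L, 3L)))

# Find and handle missing values
n_missing <- sum(is.na(counts))                              # 1 missing value
gene_means <- rowMeans(counts, na.rm = TRUE)                 # ignore NAs in the mean
complete <- counts[complete.cases(counts), , drop = FALSE]   # or drop incomplete rows

# log2(0) is -Inf, so add a pseudocount before log-transforming
log_counts <- log2(counts + 1)
```

Whether to ignore, drop, or impute missing values depends on why they are missing, so the choice should be recorded alongside the analysis.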

Next Steps

After mastering vectors and matrices, you can move on to:
- Working with data frames
- Statistical analysis and hypothesis testing
- Advanced visualization techniques
- Machine learning applications in R