Introduction

In R, vectors and matrices are fundamental data structures that allow us to work with collections of data. They are especially important in bioinformatics and RNA-seq analysis, where we often need to handle large sets of gene expression values or sample measurements.

Key concepts we’ll cover:
  • Vectors: one-dimensional arrays of values
  • Matrices: two-dimensional arrays of values
  • Operations and functions for both data types
  • Real-world applications in RNA-seq analysis

Vectors

A vector is a one-dimensional array whose elements all share a single type (numeric, character, or logical). Think of it as a single row or column of data.
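
The single-type rule has a practical consequence worth seeing once: when types are mixed inside c(), R coerces every element to one common type rather than raising an error. The sketch below illustrates this; the variable names are made up for the illustration.

```r
# Mixing types in c() coerces everything to the most general type
mixed <- c(1.5, "TP53", TRUE)
class(mixed)                 # "character"
mixed                        # "1.5" "TP53" "TRUE"

# Logical values become 1/0 when combined with numbers
flags_as_numbers <- c(TRUE, FALSE, 3)
class(flags_as_numbers)      # "numeric"

# The same coercion lets sum() count TRUE values
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true                       # 2
```

If a column of expression values unexpectedly prints with quotes, a stray character element is a likely cause.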

Creating Vectors

There are several ways to create vectors in R. Here are common methods with bioinformatics examples:

# Create vectors using c() (combine) function for different data types
gene_expression <- c(156.7, 238.9, 184.3, 145.6)  # Numeric vector: Expression values in TPM
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")  # Character vector: Gene symbols
is_significant <- c(TRUE, FALSE, TRUE, TRUE)       # Logical vector: Significance flags (p < 0.05)

# Create a sequence using the : operator
# Useful for sample numbers or time points
sample_numbers <- 1:10                # Creates sequence from 1 to 10 (e.g., patient IDs)

# Create more complex sequences using seq()
# Perfect for time series experiments
time_points <- seq(0, 48, by = 6)     # Creates sequence 0,6,12,...,48 (hours post-treatment)

# Display the created vectors
print("Gene Expression Values:")      # TPM values for each gene
## [1] "Gene Expression Values:"
print(gene_expression)
## [1] 156.7 238.9 184.3 145.6
print("Gene Names:")                  # Corresponding gene symbols
## [1] "Gene Names:"
print(gene_names)
## [1] "BRCA1" "TP53"  "EGFR"  "KRAS"
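
Two further ways of building vectors often come up alongside c(), : and seq(). This is a minimal sketch: attaching names to a vector with setNames(), and generating repeated labels with rep(). The names named_expression, condition, and sample_index are illustrative and not part of the original example.

```r
gene_expression <- c(156.7, 238.9, 184.3, 145.6)
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")

# Attach gene symbols as element names (returns a named copy)
named_expression <- setNames(gene_expression, gene_names)
named_expression["TP53"]     # look up a value by gene symbol

# rep() builds repeated labels, e.g. a condition label per sample
condition <- rep(c("control", "treated"), each = 3)

# seq_along() gives one index per element of an existing vector
sample_index <- seq_along(condition)
```

Named vectors keep the label and the value together, which avoids relying on two separate vectors staying in the same order.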

Vector Operations

Vectors in R support element-wise operations, making them perfect for data transformations:

# Example: RNA-seq data normalization
raw_counts <- c(1200, 1500, 800, 2000)     # Raw read counts from sequencing
scaling_factor <- 0.5                       # Library size normalization factor
normalized_counts <- raw_counts * scaling_factor  # Scale counts by library size

# Compare control vs treated samples
control_expression <- c(100, 150, 80, 200)      # Expression in control condition
treated_expression <- c(150, 180, 90, 250)      # Expression after treatment
expression_difference <- treated_expression - control_expression  # Absolute change

# Calculate fold change (treated/control)
# A common metric in differential expression analysis
fold_change <- treated_expression / control_expression  # Relative change

# Display results of calculations
print("Normalized counts:")                # Library-size normalized values
## [1] "Normalized counts:"
print(normalized_counts)
## [1]  600  750  400 1000
print("Expression difference:")            # Absolute expression changes
## [1] "Expression difference:"
print(expression_difference)
## [1] 50 30 10 50
print("Fold change:")                      # Relative expression changes
## [1] "Fold change:"
print(fold_change)
## [1] 1.500 1.200 1.125 1.250
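
Element-wise arithmetic also works when the two vectors differ in length, because R recycles the shorter one. That is what makes multiplying by a single scaling factor work, but it can hide mistakes when the lengths differ by accident. The sketch below uses made-up vectors (counts, factors) to show the behavior.

```r
counts <- c(1200, 1500, 800, 2000)

# A length-2 vector is recycled as 0.5, 2, 0.5, 2
factors <- c(0.5, 2)
recycled <- counts * factors
recycled                     # 600 3000 400 4000

# If the longer length is not a multiple of the shorter one,
# R still returns a result but emits a warning
recycling_warning <- tryCatch(
  counts * c(1, 2, 3),
  warning = function(w) conditionMessage(w)
)

# A cheap guard before comparing paired vectors
control <- c(100, 150, 80, 200)
treated <- c(150, 180, 90, 250)
stopifnot(length(control) == length(treated))
```

Checking lengths before element-wise comparisons is a small habit that catches misaligned gene lists early.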

Vector Functions

R provides many useful functions for working with vectors. These are essential for data analysis:

# Basic vector operations
print("Length of gene_names:")            # Number of genes in our dataset
## [1] "Length of gene_names:"
length(gene_names)                        # Count elements in vector
## [1] 4
# Calculate common statistical measures
print("Summary statistics for raw_counts:")
## [1] "Summary statistics for raw_counts:"
mean(raw_counts)                          # Average expression level
## [1] 1375
median(raw_counts)                        # Middle expression value (robust to outliers)
## [1] 1350
max(raw_counts)                          # Highest expression value
## [1] 2000
min(raw_counts)                          # Lowest expression value
## [1] 800
sum(raw_counts)                          # Total counts (library size)
## [1] 5500
# Sort values (useful for ranking genes)
sorted_counts <- sort(raw_counts)         # Sort by expression (ascending)
sorted_counts_desc <- sort(raw_counts, decreasing = TRUE)  # Find highest expressed genes

print("Sorted counts (ascending):")       # Show expression ranking
## [1] "Sorted counts (ascending):"
print(sorted_counts)
## [1]  800 1200 1500 2000
print("Sorted counts (descending):")      # Show highest to lowest
## [1] "Sorted counts (descending):"
print(sorted_counts_desc)
## [1] 2000 1500 1200  800
# Find unique elements (useful for finding unique genes/features)
unique_genes <- unique(gene_names)        # Remove duplicate gene names
print("Unique genes:")                    # Show deduplicated list
## [1] "Unique genes:"
print(unique_genes)
## [1] "BRCA1" "TP53"  "EGFR"  "KRAS"
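
sort() returns the sorted values but loses track of which gene each value belongs to. A common companion is order(), which returns the positions that would sort the vector, so the same ordering can be applied to the gene names. A short sketch, reusing the values from above:

```r
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")
raw_counts <- c(1200, 1500, 800, 2000)

# Positions of the counts from highest to lowest
ranking <- order(raw_counts, decreasing = TRUE)
ranking                      # 4 2 1 3

# Apply the same ordering to the gene names
ranked_genes <- gene_names[ranking]
ranked_genes                 # "KRAS" "TP53" "BRCA1" "EGFR"

# which.max() gives the position of the single largest value
top_gene <- gene_names[which.max(raw_counts)]
top_gene                     # "KRAS"
```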

Accessing Vector Elements

Elements in vectors can be accessed using indices or logical conditions. This is crucial for filtering data:

# Access elements by position (1-based indexing in R)
first_gene <- gene_names[1]               # Get first gene in list
selected_genes <- gene_names[c(1, 3)]     # Get specific genes of interest

# Access elements by condition (logical filtering)
high_expression <- raw_counts > 1000      # Find highly expressed genes
high_expressed_genes <- gene_names[high_expression]  # Get names of high expressors

# Use which() to get indices of TRUE values
# Useful for finding significant results
significant_indices <- which(is_significant)  # Find significant results
significant_genes <- gene_names[significant_indices]  # Get significant gene names

# Display results of different access methods
print("First gene:")                      # Single gene access
## [1] "First gene:"
print(first_gene)
## [1] "BRCA1"
print("Selected genes:")                  # Multiple gene access
## [1] "Selected genes:"
print(selected_genes)
## [1] "BRCA1" "EGFR"
print("Highly expressed genes:")          # Filtered gene list
## [1] "Highly expressed genes:"
print(high_expressed_genes)
## [1] "BRCA1" "TP53"  "KRAS"
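
A few more indexing patterns follow from the same ideas: negative indices to drop elements, %in% to match against a list of interest, and & to combine conditions. In this sketch the gene panel is a hypothetical list chosen only for illustration.

```r
gene_names <- c("BRCA1", "TP53", "EGFR", "KRAS")
raw_counts <- c(1200, 1500, 800, 2000)
is_significant <- c(TRUE, FALSE, TRUE, TRUE)

# Negative indices exclude positions
without_first <- gene_names[-1]        # "TP53" "EGFR" "KRAS"

# %in% tests membership in another vector (hypothetical panel)
panel <- c("TP53", "KRAS", "MYC")
in_panel <- gene_names %in% panel      # FALSE TRUE FALSE TRUE
panel_counts <- raw_counts[in_panel]   # 1500 2000

# Combine conditions element-wise with & (and) or | (or)
high_and_significant <- gene_names[raw_counts > 1000 & is_significant]
high_and_significant                   # "BRCA1" "KRAS"
```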

Matrices

Matrices are two-dimensional arrays that also hold elements of the same type. They are particularly useful for representing expression data where:
  • Rows typically represent genes or features
  • Columns typically represent samples or conditions

Creating Matrices

# Create a matrix from vector data
# This could represent an expression matrix with:
# - 3 genes (rows)
# - 4 samples (columns)
expression_matrix <- matrix(
  c(100, 200, 150, 300,    # Expression values for Gene1
    120, 180, 160, 280,    # Expression values for Gene2
    90, 220, 170, 320),    # Expression values for Gene3
  nrow = 3,                # Number of genes
  ncol = 4,                # Number of samples
  byrow = TRUE            # Fill matrix by rows (each row = one gene)
)

# Add descriptive names to rows (genes) and columns (samples)
rownames(expression_matrix) <- c("Gene1", "Gene2", "Gene3")  # Gene IDs
colnames(expression_matrix) <- c("Sample1", "Sample2", "Sample3", "Sample4")  # Sample IDs

# Display the created matrix
print("Expression Matrix:")               # Show expression data table
## [1] "Expression Matrix:"
print(expression_matrix)
##       Sample1 Sample2 Sample3 Sample4
## Gene1     100     200     150     300
## Gene2     120     180     160     280
## Gene3      90     220     170     320
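
The byrow = TRUE argument matters because matrix() fills column by column by default. The sketch below shows the difference on a small made-up matrix, and also builds a matrix from per-sample vectors with cbind(), which is an alternative when the data arrive one sample at a time.

```r
# Default fill is column-major: 1,2 go down the first column
col_major <- matrix(1:6, nrow = 2)
col_major[1, 2]              # 3

# byrow = TRUE fills across each row instead
row_major <- matrix(1:6, nrow = 2, byrow = TRUE)
row_major[1, 2]              # 2

# cbind() binds vectors as columns; names become row/column names
sample1 <- c(Gene1 = 100, Gene2 = 120, Gene3 = 90)
sample2 <- c(Gene1 = 200, Gene2 = 180, Gene3 = 220)
by_column <- cbind(Sample1 = sample1, Sample2 = sample2)
dim(by_column)               # 3 2
```

rbind() works the same way for rows, for example when each vector holds one gene.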

Matrix Operations

# Demonstrate common matrix transformations used in bioinformatics
transposed_matrix <- t(expression_matrix)  # Transpose (useful for certain analyses)

# Matrix multiplication (used in many statistical methods)
# t(X) %*% X gives the sample-by-sample cross-product (sums of products across genes).
# Note: this is not a correlation matrix; use cor(expression_matrix) for sample correlations.
sample_crossprod <- t(expression_matrix) %*% expression_matrix  # 4 x 4 cross-product

# Element-wise operations (common in data preprocessing)
scaled_matrix <- expression_matrix * 2     # Scale up all values
log_matrix <- log2(expression_matrix)      # Log2 transform (standard in RNA-seq)

# Display results of matrix operations
print("Transposed Matrix:")               # Genes as columns, samples as rows
## [1] "Transposed Matrix:"
print(transposed_matrix)
##         Gene1 Gene2 Gene3
## Sample1   100   120    90
## Sample2   200   180   220
## Sample3   150   160   170
## Sample4   300   280   320
print("Cross-Product Matrix:")            # Sample-by-sample sums of products
## [1] "Cross-Product Matrix:"
print(sample_crossprod)
##         Sample1 Sample2 Sample3 Sample4
## Sample1   32500   61400   49500   92400
## Sample2   61400  120800   96200  180800
## Sample3   49500   96200   77000  144200
## Sample4   92400  180800  144200  270800
print("Log2 Transformed Matrix:")         # Log2-transformed expression values
## [1] "Log2 Transformed Matrix:"
print(log_matrix)
##        Sample1  Sample2  Sample3  Sample4
## Gene1 6.643856 7.643856 7.228819 8.228819
## Gene2 6.906891 7.491853 7.321928 8.129283
## Gene3 6.491853 7.781360 7.409391 8.321928
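
Three related operations are worth sketching here: per-sample totals with colSums(), dividing each column by its total with sweep(), and Pearson correlation between samples with cor(). The pseudocount of 1 in the log transform is a common convention assumed for this illustration, not something the example data require, since they contain no zeros.

```r
expression_matrix <- matrix(
  c(100, 200, 150, 300,
    120, 180, 160, 280,
    90, 220, 170, 320),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "Gene3"),
                  c("Sample1", "Sample2", "Sample3", "Sample4"))
)

# Total signal per sample (column sums)
library_sizes <- colSums(expression_matrix)   # 310 600 480 900

# Divide every column by its own total; each column then sums to 1
proportions <- sweep(expression_matrix, 2, library_sizes, "/")

# Pearson correlation between samples (columns); a 4 x 4 matrix
sample_cor <- cor(expression_matrix)

# Adding a pseudocount avoids log2(0) = -Inf when counts contain zeros
log_counts <- log2(expression_matrix + 1)
```

With only three genes the correlations are not meaningful in themselves; the point is the shape of the calculation.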

Accessing Matrix Elements

# Access individual elements and subsets of the matrix
value <- expression_matrix[1, 2]           # Expression of Gene1 in Sample2
gene1_expression <- expression_matrix[1, ]  # All samples for Gene1
sample1_values <- expression_matrix[, 1]    # All genes in Sample1
subset_matrix <- expression_matrix[1:2, c(1,3)]  # Expression for 2 genes in 2 samples

# Display different types of matrix access
print("Value at position [1,2]:")          # Single expression value
## [1] "Value at position [1,2]:"
print(value)
## [1] 200
print("Expression values for Gene1:")      # Expression profile of one gene
## [1] "Expression values for Gene1:"
print(gene1_expression)
## Sample1 Sample2 Sample3 Sample4 
##     100     200     150     300
print("Values for Sample1:")               # Expression profile of one sample
## [1] "Values for Sample1:"
print(sample1_values)
## Gene1 Gene2 Gene3 
##   100   120    90
print("Subset Matrix:")                    # Selected genes and samples
## [1] "Subset Matrix:"
print(subset_matrix)
##       Sample1 Sample3
## Gene1     100     150
## Gene2     120     160
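
Because the matrix has row and column names, elements can also be selected by name, and rows can be filtered with a logical condition. One detail to be aware of is that selecting a single row or column returns a plain vector unless drop = FALSE is given. A brief sketch, with a threshold of 186 chosen arbitrarily for the illustration:

```r
expression_matrix <- matrix(
  c(100, 200, 150, 300,
    120, 180, 160, 280,
    90, 220, 170, 320),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "Gene3"),
                  c("Sample1", "Sample2", "Sample3", "Sample4"))
)

# Index by name instead of position
by_name <- expression_matrix["Gene2", "Sample3"]          # 160

# A single row is simplified to a named vector by default
one_row_vector <- expression_matrix[1, ]
is.null(dim(one_row_vector))                              # TRUE

# drop = FALSE keeps the result as a 1 x 4 matrix
one_row_matrix <- expression_matrix[1, , drop = FALSE]
dim(one_row_matrix)                                       # 1 4

# Keep only genes whose mean across samples exceeds the threshold
high_rows <- expression_matrix[rowMeans(expression_matrix) > 186, , drop = FALSE]
rownames(high_rows)                                       # "Gene1" "Gene3"
```

Using drop = FALSE in filtering code avoids surprises when a filter happens to leave exactly one gene.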

RNA-seq Example: Differential Expression Analysis

Let’s work through a complete example using RNA-seq data to find differentially expressed genes:

# Create example RNA-seq expression matrix
# This represents counts for:
# - 5 genes (rows)
# - 6 samples (3 control, 3 treated)
expression_data <- matrix(
  c(
    1200, 1300, 1250, 1800, 1900, 1850,  # Gene1: Shows upregulation
    800,  750,  780,  1200, 1180, 1220,   # Gene2: Shows upregulation
    2000, 2100, 2050, 2080, 2150, 2090,   # Gene3: Stable expression
    300,  320,  310,  900,  920,  880,    # Gene4: Strong upregulation
    1500, 1450, 1480, 1600, 1580, 1620    # Gene5: Slight upregulation
  ),
  nrow = 5,                               # 5 genes to analyze
  ncol = 6,                               # 6 total samples
  byrow = TRUE                            # Each row is one gene
)

# Add descriptive names for clarity
rownames(expression_data) <- c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5")  # Gene IDs
colnames(expression_data) <- c("Ctrl1", "Ctrl2", "Ctrl3", "Treat1", "Treat2", "Treat3")  # Sample IDs

# Calculate mean expression for each condition
control_means <- rowMeans(expression_data[, 1:3])    # Average expression in controls
treated_means <- rowMeans(expression_data[, 4:6])    # Average expression in treated

# Calculate fold changes (treated/control)
# This shows relative change in expression
fold_changes <- treated_means / control_means        # Fold change calculation

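# Identify differentially expressed genes
# Here we use a 1.5-fold change threshold in either direction.
# Fold changes are ratios (always positive), so we compare on the log2 scale
# to treat upregulation (> 1.5) and downregulation (< 1/1.5) symmetrically.
is_differential <- abs(log2(fold_changes)) > log2(1.5)  # Find genes passing the threshold
differential_genes <- rownames(expression_data)[is_differential]  # Get gene names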

# Display results of our analysis
print("Expression Data Matrix:")                     # Raw count data
## [1] "Expression Data Matrix:"
print(expression_data)
##       Ctrl1 Ctrl2 Ctrl3 Treat1 Treat2 Treat3
## Gene1  1200  1300  1250   1800   1900   1850
## Gene2   800   750   780   1200   1180   1220
## Gene3  2000  2100  2050   2080   2150   2090
## Gene4   300   320   310    900    920    880
## Gene5  1500  1450  1480   1600   1580   1620
print("Control Means:")                             # Average control expression
## [1] "Control Means:"
print(control_means)
##     Gene1     Gene2     Gene3     Gene4     Gene5 
## 1250.0000  776.6667 2050.0000  310.0000 1476.6667
print("Treated Means:")                             # Average treated expression
## [1] "Treated Means:"
print(treated_means)
##    Gene1    Gene2    Gene3    Gene4    Gene5 
## 1850.000 1200.000 2106.667  900.000 1600.000
print("Fold Changes:")                              # Expression changes
## [1] "Fold Changes:"
print(fold_changes)
##    Gene1    Gene2    Gene3    Gene4    Gene5 
## 1.480000 1.545064 1.027642 2.903226 1.083521
print("Differentially Expressed Genes:")            # Genes passing the fold-change threshold
## [1] "Differentially Expressed Genes:"
print(differential_genes)
## [1] "Gene2" "Gene4"
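
A fold-change cutoff alone says nothing about variability between replicates. As a rough sketch of how the same matrix could feed a per-gene test, the code below applies a Welch t-test to each row with apply() and adjusts the p-values with the Benjamini-Hochberg method. This is an illustration under simplifying assumptions (three replicates per group, raw values treated as roughly normal), not a substitute for dedicated RNA-seq methods such as those in DESeq2 or edgeR. The names control_cols, treated_cols, and results are introduced here for the example.

```r
expression_data <- matrix(
  c(1200, 1300, 1250, 1800, 1900, 1850,
    800,  750,  780,  1200, 1180, 1220,
    2000, 2100, 2050, 2080, 2150, 2090,
    300,  320,  310,  900,  920,  880,
    1500, 1450, 1480, 1600, 1580, 1620),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Gene", 1:5),
                  c("Ctrl1", "Ctrl2", "Ctrl3", "Treat1", "Treat2", "Treat3"))
)
control_cols <- 1:3
treated_cols <- 4:6

# log2 fold change: positive means higher in treated samples
log2_fc <- log2(rowMeans(expression_data[, treated_cols]) /
                rowMeans(expression_data[, control_cols]))

# Welch two-sample t-test for each gene (row)
p_values <- apply(expression_data, 1, function(gene_counts) {
  t.test(gene_counts[treated_cols], gene_counts[control_cols])$p.value
})

# Adjust for testing several genes at once (Benjamini-Hochberg)
adjusted_p <- p.adjust(p_values, method = "BH")

# Collect the per-gene results in one numeric matrix
results <- cbind(log2_fc, p_values, adjusted_p)
print(round(results, 4))
```

Combining an effect-size threshold on log2_fc with a cutoff on adjusted_p is one common way to shortlist genes.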

Practice Exercises

  1. Create a vector of p-values and find which genes are significant (p < 0.05)
  2. Calculate the log2 fold change instead of regular fold change
  3. Find genes that are both:
    • Changed by more than 1.5-fold (fold change > 1.5)
    • Highly expressed (mean expression > 1000)

Tips for Working with Vectors and Matrices

  1. Always check dimensions
    • Use dim() for matrices
    • Use length() for vectors
    • Ensure your data is structured as expected
  2. Handle missing values
    • Use is.na() to find missing values
    • Consider how to handle them (remove, impute, etc.)
  3. Choose appropriate transformations
    • Log transformation for skewed data
    • Scaling/normalization when comparing samples
    • Consider the biological meaning of your data
  4. Document your analysis
    • Add clear comments
    • Use meaningful variable names
    • Keep track of transformations applied
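
The first two tips can be sketched in a few lines. The small matrix below is invented for the illustration and contains one missing value.

```r
counts <- matrix(
  c(10, NA, 30, 40, 50, 60),
  nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("S1", "S2", "S3"))
)

# Check the shape before doing anything else
dim(counts)                            # 2 3

# Detect and locate missing values
anyNA(counts)                          # TRUE
which(is.na(counts), arr.ind = TRUE)   # row 2 (GeneB), column 1 (S1)

# Most summaries return NA unless told to skip missing values
means_with_na <- rowMeans(counts)                # GeneB is NA
means_no_na <- rowMeans(counts, na.rm = TRUE)    # GeneB is 50

# Alternatively, keep only genes with no missing values
complete_genes <- counts[complete.cases(counts), , drop = FALSE]
rownames(complete_genes)               # "GeneA"
```

Whether to skip, remove, or impute missing values depends on why they are missing, so the choice should be recorded alongside the analysis.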

Next Steps

After mastering vectors and matrices, you can move on to:
  • Working with data frames
  • Statistical analysis and hypothesis testing
  • Advanced visualization techniques
  • Machine learning applications in R