Basic Statistical Analysis

Welcome to Statistics in R!

Have you ever wondered: - Is this medical treatment actually effective? - Are these two groups really different from each other? - How can we be confident in our research findings?

Statistics helps us answer these questions! In this tutorial, we’ll learn: 1. How to understand and work with data distributions 2. How to make decisions using statistical tests 3. How to visualize and interpret our results

Don’t worry if you’re new to statistics - we’ll explain everything step by step!

Part 1: Understanding Data Distributions

What is a Distribution?

Think of a distribution as a way to see the “shape” of our data. For example: - Heights in a population - Test scores in a class - Gene expression levels in different samples

The most common distribution we see in nature is the “normal” or “bell-shaped” distribution.

The Normal Distribution: A Bell-Shaped Curve

Let’s create and visualize a normal distribution. Imagine we’re measuring heights of 1000 people:

# Set a random seed for reproducibility - this ensures we get the same random numbers each time
set.seed(123)  

# Generate synthetic height data:
# - 1000 random numbers from a normal distribution
# - Mean (average) height of 170 cm
# - Standard deviation of 10 cm (spread of the data)
heights <- rnorm(1000, mean = 170, sd = 10)  

# Create a histogram with density scaling
# This shows the distribution of our height data
hist(heights, 
     breaks = 30,                    # Number of bars in the histogram
     main = "Distribution of Heights in Our Population",  # Title
     xlab = "Height (cm)",          # X-axis label
     ylab = "Density",              # Y-axis label
     col = "lightblue",             # Bar color
     border = "white",              # Border color
     prob = TRUE)                   # Use probability scaling for density curves

# Add the theoretical normal curve (red line)
# This shows what a perfect normal distribution would look like
curve(dnorm(x, mean = mean(heights), sd = sd(heights)), 
      col = "red", 
      lwd = 2,                      # Line width
      add = TRUE)                   # Add to existing plot

# Add the empirical density curve (blue dashed line)
# This shows the actual distribution of our data
lines(density(heights), 
      col = "darkblue", 
      lwd = 2,
      lty = 2)                      # Make line dashed

# Add a legend to explain the lines
legend("topright", 
       legend = c("Theoretical Normal", "Empirical Density"),
       col = c("red", "darkblue"),
       lwd = 2,
       lty = c(1, 2))

# Calculate and display summary statistics
cat("\nSummary of Heights:")

## 
## Summary of Heights:

cat("\nAverage Height:", round(mean(heights), 1), "cm")     # Mean height

## 
## Average Height: 170.2 cm

cat("\nStandard Deviation:", round(sd(heights), 1), "cm")   # Spread of heights

## 
## Standard Deviation: 9.9 cm

cat("\nMinimum:", round(min(heights), 1), "cm")             # Shortest height

## 
## Minimum: 141.9 cm

cat("\nMaximum:", round(max(heights), 1), "cm")             # Tallest height

## 
## Maximum: 202.4 cm

What can we learn from this plot? - Most people have heights near the average (170 cm) - Fewer people have very tall or very short heights - The red curve shows the “ideal” bell shape - The blue dashed line shows our actual data follows this shape closely

How Different Factors Affect Our Distribution

Let’s look at how changing our measurements affects the shape. Imagine measuring heights in different populations:

# Generate three different height distributions to compare:
child_heights <- rnorm(1000, mean = 120, sd = 10)    # Children (shorter average)
adult_heights <- rnorm(1000, mean = 170, sd = 10)    # Adults (taller average)
variable_heights <- rnorm(1000, mean = 170, sd = 20) # More variable group (same average, more spread)

# Create a base plot using the density of child heights
plot(density(child_heights), 
     main = "Comparing Height Distributions",
     xlab = "Height (cm)",
     ylab = "Frequency",
     col = "blue",                  # Blue line for children
     lwd = 2,                       # Line width
     xlim = c(80, 220))            # Set x-axis limits

# Add lines for other distributions
lines(density(adult_heights), col = "red", lwd = 2)      # Red line for adults
lines(density(variable_heights), col = "green", lwd = 2) # Green line for variable group

# Add a legend to explain the lines
legend("topright", 
       legend = c("Children (mean=120cm)", 
                 "Adults (mean=170cm)", 
                 "Variable Group (more spread out)"),
       col = c("blue", "red", "green"),
       lwd = 2)

What do we notice? - The blue curve (children) is centered at a lower height - The red curve (adults) is centered at a higher height - The green curve shows what happens when there’s more variability

Part 2: Making Decisions with Data

The Basic Question: Are Two Groups Different?

Imagine we’re testing a new treatment and want to know if it works. We have: - A control group (standard treatment) - A treatment group (new treatment)

Let’s look at some example data:

# Generate example data for our experiment
# Control group: 30 measurements with mean=100 and SD=15
control <- rnorm(30, mean = 100, sd = 15)    

# Treatment group: 30 measurements with mean=115 and SD=15
# Note: We set a higher mean to simulate a treatment effect
treatment <- rnorm(30, mean = 115, sd = 15)   

# Combine data into a single data frame for easier analysis
# This creates a structured format with two columns:
# - value: the actual measurements
# - group: indicates whether each measurement is from control or treatment
experiment_data <- data.frame(
  value = c(control, treatment),                # All measurements combined
  group = rep(c("Control", "Treatment"), each = 30)  # Group labels
)

# Create a box plot to visualize the data distribution
# Box plots show:
# - The median (middle line)
# - The interquartile range (box)
# - The range of typical values (whiskers)
# - Any outliers (individual points)
boxplot(value ~ group, data = experiment_data,
        main = "Comparing Control vs Treatment",
        ylab = "Measurement Value",
        col = c("lightblue", "lightgreen"))  # Different colors for each group

# Add individual data points to show the actual distribution
# This helps us see the raw data alongside the summary statistics
stripchart(value ~ group, data = experiment_data,
           vertical = TRUE,           # Points arranged vertically
           method = "jitter",         # Add random horizontal jitter to prevent overlap
           add = TRUE,                # Add to existing plot
           pch = 19,                  # Use filled circles for points
           col = "darkgray")          # Point color

# Calculate and display summary statistics for each group
cat("\nControl Group:")

## 
## Control Group:

cat("\n  Average:", round(mean(control), 1))        # Mean of control group

## 
##   Average: 99

cat("\n  SD:", round(sd(control), 1))               # Standard deviation of control

## 
##   SD: 19.1

cat("\n\nTreatment Group:")

## 
## 
## Treatment Group:

cat("\n  Average:", round(mean(treatment), 1))      # Mean of treatment group

## 
##   Average: 111.9

cat("\n  SD:", round(sd(treatment), 1))             # Standard deviation of treatment

## 
##   SD: 15.3

What are we seeing? - Each dot represents one measurement - The boxes show where most of the data falls (25th to 75th percentile) - The horizontal line in each box is the median (middle value) - The treatment group seems to have higher values, but is this difference real?

Testing if the Difference is Real: The t-test

To decide if the difference between groups is “real” (statistically significant), we use a t-test:

# Perform an independent samples t-test
# This test compares the means of two groups and tells us if they're significantly different
test_result <- t.test(value ~ group, data = experiment_data)

# Print results in a user-friendly format
cat("\nResults of our Statistical Test:")

## 
## Results of our Statistical Test:

cat("\n--------------------------------")

## 
## --------------------------------

# Calculate and display the actual difference between group means
cat("\nAverage difference between groups:", 
    round(diff(tapply(experiment_data$value, experiment_data$group, mean)), 1))

## 
## Average difference between groups: 12.9

# Interpret the p-value in plain language
cat("\nIs this difference significant?", 
    ifelse(test_result$p.value < 0.05, "YES!", "No"))

## 
## Is this difference significant? YES!

# Show the actual p-value
cat("\n(p-value =", round(test_result$p.value, 4), ")")

## 
## (p-value = 0.0054 )

What does this mean? - The p-value tells us how likely we would see this difference by chance - If p < 0.05, we say the difference is “statistically significant” - A significant result suggests the treatment has a real effect - But remember: statistical significance ≠ practical importance!

Part 3: Common Mistakes to Avoid

Don’t jump to conclusions!
- A “significant” result doesn’t always mean a practically important difference
- Always look at the actual size of the difference
- Consider the context of your research
Check your assumptions
- The t-test assumes our data is roughly normal (bell-shaped)
- We can check this using Q-Q plots (Quantile-Quantile plots)
- Points following the line suggest normal distribution

# Create Q-Q plots to check if our data follows a normal distribution
# Q-Q plots compare our data quantiles to theoretical normal quantiles
# Points should roughly follow the diagonal line if data is normal

# Set up the plotting area to show two plots side by side
par(mfrow = c(1, 2))

# Q-Q plot for Control group
qqnorm(experiment_data$value[experiment_data$group == "Control"],
       main = "Control Group")
qqline(experiment_data$value[experiment_data$group == "Control"])

# Q-Q plot for Treatment group
qqnorm(experiment_data$value[experiment_data$group == "Treatment"],
       main = "Treatment Group")
qqline(experiment_data$value[experiment_data$group == "Treatment"])

# Reset plotting area to single plot
par(mfrow = c(1, 1))

If the points follow the line closely, our data is approximately normal! - Points falling exactly on the line would indicate perfect normality - Small deviations are usually acceptable - Large deviations might suggest we need different statistical methods

Practice Exercises

Try these exercises to test your understanding:

Exercise 1: Generate and Compare Groups

Create your own experiment comparing two groups:

# Generate data for two groups (e.g., different medical treatments)
group1 <- rnorm(25, mean = 50, sd = 10)    # First treatment group
group2 <- rnorm(25, mean = 55, sd = 10)    # Second treatment group

# Create a data frame
my_experiment <- data.frame(
  value = c(group1, group2),
  group = rep(c("Treatment1", "Treatment2"), each = 25)
)

# Create visualizations
# 1. Box plot with individual points
boxplot(value ~ group, data = my_experiment,
        main = "Comparing Treatments",
        col = c("pink", "lightgreen"))
stripchart(value ~ group, data = my_experiment,
           vertical = TRUE, method = "jitter",
           add = TRUE, col = "darkgray")

# 2. Perform t-test
test_result <- t.test(value ~ group, data = my_experiment)
print(test_result)

Exercise 2: Explore Different Distributions

Create and compare different types of distributions:

# Generate data from different distributions
normal_data <- rnorm(1000, mean = 0, sd = 1)       # Normal distribution
skewed_data <- rexp(1000, rate = 1)                # Exponential (skewed) distribution
uniform_data <- runif(1000, min = -2, max = 2)     # Uniform distribution

# Create histograms to compare
par(mfrow = c(1, 3))  # Set up for three plots side by side

# Normal distribution
hist(normal_data, main = "Normal Distribution",
     prob = TRUE, col = "lightblue")
lines(density(normal_data), col = "red", lwd = 2)

# Skewed distribution
hist(skewed_data, main = "Skewed Distribution",
     prob = TRUE, col = "lightgreen")
lines(density(skewed_data), col = "red", lwd = 2)

# Uniform distribution
hist(uniform_data, main = "Uniform Distribution",
     prob = TRUE, col = "lightpink")
lines(density(uniform_data), col = "red", lwd = 2)

par(mfrow = c(1, 1))  # Reset plotting window

Exercise 3: Working with Real Data

Try these analyses with your own data: 1. Load your dataset 2. Create appropriate visualizations 3. Perform statistical tests 4. Interpret the results

Tips for Success

Always Start with Visualization
- Plot your data before running tests
- Look for patterns and potential problems
- Consider different types of plots
Check Your Assumptions
- Normal distribution when needed
- Equal variances between groups
- Independent samples
Interpret Results Carefully
- Consider practical significance
- Look at effect sizes
- Think about your research context

Additional Resources

UCLA Statistical Methods and Data Analytics
- Comprehensive resource for statistical methods
- Detailed R tutorials and examples
- Practice datasets and exercises
R for Data Science - Statistical Analysis
- Modern R programming techniques
- Data visualization best practices
- Statistical analysis workflows
Quick R: Basic Statistics
- Quick reference for common statistical tests
- Code examples and explanations
- Troubleshooting tips

Advanced Topics (Optional)

For those interested in going further:

Advanced Statistical Tests
- ANOVA for multiple groups
- Non-parametric alternatives
- Mixed effects models
Power Analysis
- Sample size calculations
- Effect size estimation
- Study design considerations
Advanced Visualization
- Interactive plots
- Custom themes
- Publication-ready figures

Remember: The best way to learn statistics is through practice. Try these exercises with different datasets and explore how changing parameters affects your results!