Have you ever wondered: - Is this medical treatment actually effective? - Are these two groups really different from each other? - How can we be confident in our research findings?
Statistics helps us answer these questions! In this tutorial, we’ll learn: 1. How to understand and work with data distributions 2. How to make decisions using statistical tests 3. How to visualize and interpret our results
Don’t worry if you’re new to statistics - we’ll explain everything step by step!
Think of a distribution as a way to see the “shape” of our data. For example: - Heights in a population - Test scores in a class - Gene expression levels in different samples
The most common distribution we see in nature is the “normal” or “bell-shaped” distribution.
Let’s create and visualize a normal distribution. Imagine we’re measuring heights of 1000 people:
# Set a random seed for reproducibility - this ensures we get the same random numbers each time
set.seed(123)
# Generate synthetic height data:
# - 1000 random numbers from a normal distribution
# - Mean (average) height of 170 cm
# - Standard deviation of 10 cm (spread of the data)
heights <- rnorm(1000, mean = 170, sd = 10)
# Create a histogram with density scaling
# This shows the distribution of our height data
hist(heights,
breaks = 30, # Number of bars in the histogram
main = "Distribution of Heights in Our Population", # Title
xlab = "Height (cm)", # X-axis label
ylab = "Density", # Y-axis label
col = "lightblue", # Bar color
border = "white", # Border color
prob = TRUE) # Use probability scaling for density curves
# Add the theoretical normal curve (red line)
# This shows what a perfect normal distribution would look like
curve(dnorm(x, mean = mean(heights), sd = sd(heights)),
col = "red",
lwd = 2, # Line width
add = TRUE) # Add to existing plot
# Add the empirical density curve (blue dashed line)
# This shows the actual distribution of our data
lines(density(heights),
col = "darkblue",
lwd = 2,
lty = 2) # Make line dashed
# Add a legend to explain the lines
legend("topright",
legend = c("Theoretical Normal", "Empirical Density"),
col = c("red", "darkblue"),
lwd = 2,
lty = c(1, 2))##
## Summary of Heights:
##
## Average Height: 170.2 cm
##
## Standard Deviation: 9.9 cm
##
## Minimum: 141.9 cm
##
## Maximum: 202.4 cm
What can we learn from this plot? - Most people have heights near the average (170 cm) - Fewer people have very tall or very short heights - The red curve shows the “ideal” bell shape - The blue dashed line shows our actual data follows this shape closely
Let’s look at how changing our measurements affects the shape. Imagine measuring heights in different populations:
# Generate three different height distributions to compare:
child_heights <- rnorm(1000, mean = 120, sd = 10) # Children (shorter average)
adult_heights <- rnorm(1000, mean = 170, sd = 10) # Adults (taller average)
variable_heights <- rnorm(1000, mean = 170, sd = 20) # More variable group (same average, more spread)
# Create a base plot using the density of child heights
plot(density(child_heights),
main = "Comparing Height Distributions",
xlab = "Height (cm)",
ylab = "Frequency",
col = "blue", # Blue line for children
lwd = 2, # Line width
xlim = c(80, 220)) # Set x-axis limits
# Add lines for other distributions
lines(density(adult_heights), col = "red", lwd = 2) # Red line for adults
lines(density(variable_heights), col = "green", lwd = 2) # Green line for variable group
# Add a legend to explain the lines
legend("topright",
legend = c("Children (mean=120cm)",
"Adults (mean=170cm)",
"Variable Group (more spread out)"),
col = c("blue", "red", "green"),
lwd = 2)What do we notice? - The blue curve (children) is centered at a lower height - The red curve (adults) is centered at a higher height - The green curve shows what happens when there’s more variability
Imagine we’re testing a new treatment and want to know if it works. We have: - A control group (standard treatment) - A treatment group (new treatment)
Let’s look at some example data:
# Generate example data for our experiment
# Control group: 30 measurements with mean=100 and SD=15
control <- rnorm(30, mean = 100, sd = 15)
# Treatment group: 30 measurements with mean=115 and SD=15
# Note: We set a higher mean to simulate a treatment effect
treatment <- rnorm(30, mean = 115, sd = 15)
# Combine data into a single data frame for easier analysis
# This creates a structured format with two columns:
# - value: the actual measurements
# - group: indicates whether each measurement is from control or treatment
experiment_data <- data.frame(
value = c(control, treatment), # All measurements combined
group = rep(c("Control", "Treatment"), each = 30) # Group labels
)
# Create a box plot to visualize the data distribution
# Box plots show:
# - The median (middle line)
# - The interquartile range (box)
# - The range of typical values (whiskers)
# - Any outliers (individual points)
boxplot(value ~ group, data = experiment_data,
main = "Comparing Control vs Treatment",
ylab = "Measurement Value",
col = c("lightblue", "lightgreen")) # Different colors for each group
# Add individual data points to show the actual distribution
# This helps us see the raw data alongside the summary statistics
stripchart(value ~ group, data = experiment_data,
vertical = TRUE, # Points arranged vertically
method = "jitter", # Add random horizontal jitter to prevent overlap
add = TRUE, # Add to existing plot
pch = 19, # Use filled circles for points
col = "darkgray") # Point color##
## Control Group:
##
## Average: 99
##
## SD: 19.1
##
##
## Treatment Group:
##
## Average: 111.9
##
## SD: 15.3
What are we seeing? - Each dot represents one measurement - The boxes show where most of the data falls (25th to 75th percentile) - The horizontal line in each box is the median (middle value) - The treatment group seems to have higher values, but is this difference real?
To decide if the difference between groups is “real” (statistically significant), we use a t-test:
# Perform an independent samples t-test
# This test compares the means of two groups and tells us if they're significantly different
test_result <- t.test(value ~ group, data = experiment_data)
# Print results in a user-friendly format
cat("\nResults of our Statistical Test:")##
## Results of our Statistical Test:
##
## --------------------------------
# Calculate and display the actual difference between group means
cat("\nAverage difference between groups:",
round(diff(tapply(experiment_data$value, experiment_data$group, mean)), 1))##
## Average difference between groups: 12.9
# Interpret the p-value in plain language
cat("\nIs this difference significant?",
ifelse(test_result$p.value < 0.05, "YES!", "No"))##
## Is this difference significant? YES!
##
## (p-value = 0.0054 )
What does this mean? - The p-value tells us how likely we would see this difference by chance - If p < 0.05, we say the difference is “statistically significant” - A significant result suggests the treatment has a real effect - But remember: statistical significance ≠ practical importance!
# Create Q-Q plots to check if our data follows a normal distribution
# Q-Q plots compare our data quantiles to theoretical normal quantiles
# Points should roughly follow the diagonal line if data is normal
# Set up the plotting area to show two plots side by side
par(mfrow = c(1, 2))
# Q-Q plot for Control group
qqnorm(experiment_data$value[experiment_data$group == "Control"],
main = "Control Group")
qqline(experiment_data$value[experiment_data$group == "Control"])
# Q-Q plot for Treatment group
qqnorm(experiment_data$value[experiment_data$group == "Treatment"],
main = "Treatment Group")
qqline(experiment_data$value[experiment_data$group == "Treatment"])If the points follow the line closely, our data is approximately normal! - Points falling exactly on the line would indicate perfect normality - Small deviations are usually acceptable - Large deviations might suggest we need different statistical methods
Try these exercises to test your understanding:
Create your own experiment comparing two groups:
# Generate data for two groups (e.g., different medical treatments)
group1 <- rnorm(25, mean = 50, sd = 10) # First treatment group
group2 <- rnorm(25, mean = 55, sd = 10) # Second treatment group
# Create a data frame
my_experiment <- data.frame(
value = c(group1, group2),
group = rep(c("Treatment1", "Treatment2"), each = 25)
)
# Create visualizations
# 1. Box plot with individual points
boxplot(value ~ group, data = my_experiment,
main = "Comparing Treatments",
col = c("pink", "lightgreen"))
stripchart(value ~ group, data = my_experiment,
vertical = TRUE, method = "jitter",
add = TRUE, col = "darkgray")
# 2. Perform t-test
test_result <- t.test(value ~ group, data = my_experiment)
print(test_result)Create and compare different types of distributions:
# Generate data from different distributions
normal_data <- rnorm(1000, mean = 0, sd = 1) # Normal distribution
skewed_data <- rexp(1000, rate = 1) # Exponential (skewed) distribution
uniform_data <- runif(1000, min = -2, max = 2) # Uniform distribution
# Create histograms to compare
par(mfrow = c(1, 3)) # Set up for three plots side by side
# Normal distribution
hist(normal_data, main = "Normal Distribution",
prob = TRUE, col = "lightblue")
lines(density(normal_data), col = "red", lwd = 2)
# Skewed distribution
hist(skewed_data, main = "Skewed Distribution",
prob = TRUE, col = "lightgreen")
lines(density(skewed_data), col = "red", lwd = 2)
# Uniform distribution
hist(uniform_data, main = "Uniform Distribution",
prob = TRUE, col = "lightpink")
lines(density(uniform_data), col = "red", lwd = 2)
par(mfrow = c(1, 1)) # Reset plotting windowTry these analyses with your own data: 1. Load your dataset 2. Create appropriate visualizations 3. Perform statistical tests 4. Interpret the results
For those interested in going further:
Remember: The best way to learn statistics is through practice. Try these exercises with different datasets and explore how changing parameters affects your results!