Basic Plotting and Statistics in R
Welcome to Data Visualization and Statistics in R!
In this 2-hour session, we’ll learn how to:
- Create beautiful and informative plots using ggplot2
- Understand basic statistical concepts
- Perform data analysis using R
- Make data-driven decisions
Required Packages
We’ll be using these R packages:
library(tidyverse) # Includes ggplot2, dplyr, and more
library(datasets) # For built-in datasets
Part 1: Getting Started with R for Statistics
Let’s begin by exploring some built-in datasets in R. We’ll use the famous mtcars dataset:
# Load and examine mtcars dataset
data("mtcars")
head(mtcars)
# Basic summary statistics using dplyr
mtcars %>%
  summarise(
    avg_mpg = mean(mpg),
    med_mpg = median(mpg),
    sd_mpg = sd(mpg)
  )
Part 2: Creating Beautiful Visualizations
Basic Histogram
# Create a histogram using ggplot2
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(fill = "skyblue", color = "white", bins = 30) +
  labs(
    title = "Distribution of Car Mileage",
    x = "Miles Per Gallon",
    y = "Count"
  ) +
  theme_minimal()
Adding Density Curves
# Add density curve to histogram
# after_stat(density) replaces the deprecated ..density.. notation
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)),
                 fill = "skyblue",
                 color = "white",
                 bins = 30) +  # set explicitly; 30 is also the default
  geom_density(color = "red", linewidth = 1) +
  labs(
    title = "Distribution of Car Mileage with Density Curve",
    x = "Miles Per Gallon",
    y = "Density"
  ) +
  theme_minimal()
Part 3: Understanding Data Distributions
What is a Distribution?
A distribution shows us the “shape” of our data. Think of it as a way to see:
- How often different values occur
- What values are typical or unusual
- How spread out the data is
Common examples in research:
- Heights in a population
- Test scores in a class
- Gene expression levels
- Treatment responses
The Normal Distribution
The “normal” or “bell-shaped” distribution is one of the most common in nature, and many statistical tests are built on it. Key features:
- It’s symmetric around the mean
- The mean, median, and mode are all equal
- About 68% of data falls within 1 standard deviation of the mean
- About 95% of data falls within 2 standard deviations
- About 99.7% of data falls within 3 standard deviations
Let’s visualize this with simulated data:
# Generate example height data
set.seed(123) # For reproducible results
heights <- rnorm(1000, mean = 170, sd = 10) # 1000 heights, mean=170cm, SD=10cm
# Create histogram with density curve
# geom_density() draws a kernel density estimate of the simulated heights
ggplot(data.frame(height = heights), aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)),
                 fill = "skyblue",
                 color = "white",
                 bins = 30) +
  geom_density(color = "red", linewidth = 1) +
  labs(
    title = "Distribution of Heights",
    subtitle = "With Kernel Density Estimate",
    x = "Height (cm)",
    y = "Density"
  ) +
  theme_minimal()
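The 68%, 95%, and 99.7% figures listed above can be checked directly. The sketch below compares the theoretical proportions from pnorm() with the proportions observed in the same simulated heights; the variable names (k, theoretical, empirical) are chosen here for illustration only.

```r
# Theoretical share of a normal distribution within 1, 2, and 3 SDs of the mean
k <- 1:3
theoretical <- pnorm(k) - pnorm(-k)

# Observed share in the simulated heights (same settings as the example above)
set.seed(123)
heights <- rnorm(1000, mean = 170, sd = 10)
z <- abs(heights - mean(heights)) / sd(heights)  # distance from the mean in SD units
empirical <- sapply(k, function(i) mean(z <= i))

# Side-by-side comparison
data.frame(
  within_sd   = k,
  theoretical = round(theoretical, 4),
  empirical   = round(empirical, 4)
)
```

The observed proportions will not match the theoretical ones exactly, because 1000 draws are only a sample; they should get closer as the sample size grows.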
Understanding Different Types of Distributions
- Symmetric Distributions
  - Normal (bell-shaped)
  - Uniform (flat)
  - Student’s t (like normal but heavier tails)
- Skewed Distributions
  - Right-skewed (tail extends right)
  - Left-skewed (tail extends left)
  - Common in real-world data such as income and reaction times (both typically right-skewed)
- Other Patterns
  - Bimodal (two peaks)
  - Multimodal (multiple peaks)
  - Bounded (limited range)
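To get a feel for these shapes, it can help to simulate a few of them and compare the mean, median, and skewness. This is a minimal sketch: the choice of generating functions, the seed, and the skewness() helper (not a base R function, defined here) are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility
n <- 5000

samples <- list(
  normal       = rnorm(n),                             # symmetric, bell-shaped
  uniform      = runif(n),                             # symmetric, flat, bounded on [0, 1]
  right_skewed = rexp(n, rate = 1),                    # long right tail
  left_skewed  = -rexp(n, rate = 1),                   # mirror image: long left tail
  bimodal      = c(rnorm(n / 2, -3), rnorm(n / 2, 3))  # two peaks
)

# Sample skewness as the mean cubed z-score (hypothetical helper)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

shape_summary <- data.frame(
  distribution = names(samples),
  mean         = sapply(samples, mean),
  median       = sapply(samples, median),
  skewness     = sapply(samples, skewness),
  row.names    = NULL
)
shape_summary
```

In a right-skewed sample the mean is pulled above the median and the skewness is positive; in a left-skewed sample the opposite holds. Symmetric samples have skewness near zero.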
Part 4: Statistical Testing
The t-test: A Detective’s Tool
Think of a t-test like being a detective. You start with:
- A question: “Are these groups different?”
- A null hypothesis (H₀): “There is no real difference”
- An alternative hypothesis (H₁): “There is a real difference”
Types of t-tests
- One-sample t-test
  - Compare one group to a known value
  - Example: Is a class’s mean score higher than a known reference average?
- Independent two-sample t-test
  - Compare two separate groups
  - Example: Treatment vs Control
  - What we’re using in our examples
- Paired t-test
  - Compare matched pairs of observations
  - Example: Before vs After treatment
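All three variants are available through the same t.test() function; only the arguments change. The sketch below uses simulated data, and every number in it (group sizes, means, the reference value of 70) is made up for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data; all values are invented for this example
scores    <- rnorm(25, mean = 75, sd = 10)       # one class of test scores
control   <- rnorm(30, mean = 100, sd = 15)      # two separate groups
treatment <- rnorm(30, mean = 115, sd = 15)
before    <- rnorm(20, mean = 80, sd = 10)       # same subjects measured twice
after     <- before + rnorm(20, mean = 5, sd = 4)

# One-sample: compare one group to a known value (mu)
one_sample <- t.test(scores, mu = 70)

# Independent two-sample: compare two separate groups (Welch's test by default)
two_sample <- t.test(treatment, control)

# Paired: compare matched observations
paired <- t.test(after, before, paired = TRUE)

# Collect the test names and p-values in one table
data.frame(
  test    = c(one_sample$method, two_sample$method, paired$method),
  p_value = c(one_sample$p.value, two_sample$p.value, paired$p.value)
)
```

A paired t-test is equivalent to a one-sample t-test on the within-pair differences, which is why pairing only makes sense when each observation in one group has a natural partner in the other.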
Performing and Interpreting a t-test
Let’s walk through an example:
# Create two groups to compare
group1 <- rnorm(30, mean = 100, sd = 15) # Control group
group2 <- rnorm(30, mean = 115, sd = 15) # Treatment group
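# Perform t-test (R uses Welch's test by default)
# Treatment is listed first so the confidence interval is for
# treatment minus control, matching the mean difference printed below
t_result <- t.test(group2, group1)
# Print results with explanation
cat("T-test Results:\n")
cat("----------------\n")
cat("Mean difference (treatment - control):", round(mean(group2) - mean(group1), 2), "\n")
cat("t-statistic:", round(t_result$statistic, 2), "\n")
cat("p-value:", format.pval(t_result$p.value, digits = 3), "\n")
cat("95% CI for the difference:", paste(round(t_result$conf.int, 2), collapse = " to "), "\n")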
# Visualize the comparison
data.frame(
  value = c(group1, group2),
  group = rep(c("Control", "Treatment"), each = 30)
) %>%
  ggplot(aes(x = value, fill = group)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Comparing Two Groups",
    subtitle = paste("p-value =", format.pval(t_result$p.value, digits = 3)),
    x = "Value",
    y = "Density"
  ) +
  theme_minimal()
Understanding the Results
- p-value Interpretation
  - The p-value is the probability of seeing a difference at least this large if the null hypothesis were true
  - p < 0.05: by convention, “Statistically significant”
  - p < 0.01: “Highly significant”
  - p < 0.001: “Very highly significant”
- Effect Size
  - How big is the difference?
  - Is it practically meaningful?
  - Look at means and confidence intervals
- Assumptions to Check
  - Normal distribution (or large enough sample)
  - Equal variances (for Student’s t-test; R’s t.test() defaults to Welch’s test, which does not assume this)
  - Independent observations
- Common Mistakes to Avoid
  - Don’t rely only on p-values
  - Consider practical significance
  - Check your assumptions
  - Be careful with multiple tests
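The checklist above can be turned into a few lines of R. The sketch below is one possible approach, not the only one: it uses Cohen's d as the effect size (a measure not named in this session, introduced here as an example), shapiro.test() and var.test() for the assumption checks, and p.adjust() for multiple tests. The cohens_d() helper and the p-values passed to p.adjust() are made up for illustration.

```r
set.seed(123)  # arbitrary seed for reproducibility
control   <- rnorm(30, mean = 100, sd = 15)
treatment <- rnorm(30, mean = 115, sd = 15)

# Effect size: Cohen's d, the mean difference in units of the pooled SD
# (hypothetical helper, not a base R function)
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}
d <- cohens_d(treatment, control)

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normal data)
shapiro_control   <- shapiro.test(control)
shapiro_treatment <- shapiro.test(treatment)

# Equal variances: F test comparing the two group variances
variance_check <- var.test(treatment, control)

# Multiple tests: adjust a set of p-values (made-up values) with Holm's method
raw_p      <- c(0.001, 0.02, 0.04, 0.30)
adjusted_p <- p.adjust(raw_p, method = "holm")

round(d, 2)
c(control = shapiro_control$p.value, treatment = shapiro_treatment$p.value)
variance_check$p.value
adjusted_p
```

As a rough guide, d around 0.2 is often described as small, 0.5 as medium, and 0.8 as large, though what counts as practically meaningful depends on the field. Note that two of the four made-up p-values fall below 0.05 before adjustment but not after it.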
Visualizing Test Results
Good practice combines:
- Statistical test results
- Visual representation
- Clear labeling
- Effect size information
Example combining all elements:
# Create comprehensive visualization
ggplot(
  data.frame(
    value = c(group1, group2),
    group = rep(c("Control", "Treatment"), each = 30)
  ),
  aes(x = group, y = value, fill = group)
) +
  geom_boxplot(alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(
    title = "Comparing Groups",
    subtitle = paste(
      "p =", format.pval(t_result$p.value, digits = 3),
      "| Mean Diff =", round(mean(group2) - mean(group1), 2)
    ),
    x = "Group",
    y = "Value"
  ) +
  theme_minimal()
Part 5: Exploring Relationships
Scatter Plots and Regression Lines
# Create scatter plot with regression lines
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Relationship: Sepal Length vs Petal Length",
    x = "Sepal Length",
    y = "Petal Length"
  ) +
  theme_minimal()
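The lines in the plot above come from linear models fitted separately for each species. To put numbers on what the plot shows, one option is to compute the correlation and fit the models directly with cor() and lm(), as in this sketch (the object names are chosen here for illustration).

```r
# Pearson correlation across all 150 flowers
overall_r <- cor(iris$Sepal.Length, iris$Petal.Length)

# Correlation within each species
by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Petal.Length)
)

# One linear model per species, mirroring the per-species regression lines
fits <- lapply(
  split(iris, iris$Species),
  function(d) lm(Petal.Length ~ Sepal.Length, data = d)
)
slopes <- sapply(fits, function(fit) coef(fit)[["Sepal.Length"]])

round(overall_r, 2)
round(by_species, 2)
round(slopes, 2)
```

The overall correlation and the within-species correlations can differ noticeably, because pooling the species mixes differences between groups with the relationship inside each group. This is one reason to color or split a scatter plot by group before interpreting it.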
Practice Exercises
- Basic Data Exploration
  - Calculate summary statistics for the mtcars dataset
  - Create a visualization showing the distribution of car weights
- Group Comparisons
  - Compare the mileage of cars with different numbers of cylinders
  - Create appropriate visualizations
  - Perform a statistical test
- Relationship Analysis
  - Investigate the relationship between car weight and mileage
  - Create a scatter plot with a regression line
  - Calculate the correlation coefficient
Tips for Success
- Always Start with Data Exploration
  - Look at your data first
  - Calculate basic statistics
  - Create simple visualizations
- Choose the Right Visualization
  - Histograms for distributions
  - Box plots for group comparisons
  - Scatter plots for relationships
- Make Your Plots Clear
  - Add proper labels
  - Use appropriate colors
  - Include titles and subtitles
- Document Your Analysis
  - Keep track of your steps
  - Comment your code
  - Save your plots
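For the last tip, ggplot2 provides ggsave() to write a plot to disk. The sketch below saves the mileage histogram from Part 2; the file name, size, and resolution are arbitrary choices, and the file goes to a temporary directory so the example does not clutter your project folder.

```r
library(ggplot2)

# Store the plot in an object so it can be saved explicitly
mpg_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(fill = "skyblue", color = "white", bins = 30) +
  labs(
    title = "Distribution of Car Mileage",
    x = "Miles Per Gallon",
    y = "Count"
  ) +
  theme_minimal()

# Hypothetical output location; replace with a path inside your project
out_file <- file.path(tempdir(), "mpg_histogram.png")

# Width and height are in inches by default
ggsave(out_file, plot = mpg_hist, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

Passing the plot object explicitly avoids depending on whichever plot happened to be displayed last, which is what ggsave() falls back to when plot is omitted.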
Additional Resources
Cheat Sheet
Common ggplot2 Geoms
- geom_histogram(): For distributions
- geom_boxplot(): For group comparisons
- geom_point(): For scatter plots
- geom_line(): For trends
- geom_smooth(): For regression lines
Basic dplyr Functions
- summarise(): Calculate summary statistics
- group_by(): Group data by variables
- filter(): Subset data
- select(): Choose columns
- arrange(): Sort data
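These five functions are usually chained together with the pipe. As a minimal sketch, the pipeline below uses all of them on mtcars; the 100 hp cutoff and the grouping by number of gears are arbitrary choices made for illustration.

```r
library(dplyr)

gear_summary <- mtcars %>%
  filter(hp > 100) %>%            # keep cars with more than 100 horsepower
  select(mpg, wt, gear) %>%       # keep only the columns needed
  group_by(gear) %>%              # one group per number of gears
  summarise(
    n       = n(),                # cars per group
    avg_mpg = mean(mpg),
    avg_wt  = mean(wt)
  ) %>%
  arrange(desc(avg_mpg))          # highest average mileage first

gear_summary
```

Reading the pipeline top to bottom gives the order of operations: subset rows, pick columns, group, summarise, then sort.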
Statistical Functions
- mean(): Average
- median(): Middle value
- sd(): Standard deviation
- t.test(): Compare two groups
- cor(): Correlation coefficient