This tutorial was created by Aurora Tsai for students/researchers working with education data (e.g., second language acquisition).
ggplot2 is a package for R created by Hadley Wickham. It helps you make beautiful, easy-to-read visualizations of your data. In this tutorial, I cover the following:
Pretend you've conducted an experiment where 20 students took test A and 20 took test B. You recorded how many hours each student studied for the test and calculated their score.
Create a dataframe called "d" and have it include the following:
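The rest of the tutorial refers to three columns in d: condition, stime (hours studied), and score. A minimal sketch of one way to simulate such a data frame is shown below; the seed, ranges, means, and standard deviations are assumptions chosen only to roughly match the axis limits used later, not the original values.

```r
# Illustrative simulation only: all distribution parameters are assumed.
set.seed(42)  # makes the random numbers reproducible

n <- 20  # students per condition

d <- data.frame(
  condition = factor(rep(c("A", "B"), each = n)),   # which test was taken
  stime     = c(runif(n, 0, 8), runif(n, 5, 15)),   # hours studied (assumed ranges)
  score     = c(rnorm(n, 90, 6), rnorm(n, 110, 6))  # test score (assumed means and SDs)
)

str(d)
head(d)
```

Storing condition as a factor keeps the grouping explicit for both base graphics and ggplot.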
You want to visualize how study time influences score AND you want to visualize the differences between conditions A and B. How can we do this with base R graphics? With ggplot? Let's compare.
Here is a plot created with R's base graphics:
plot(d$stime, d$score,
     xlab = "Study Time",
     ylab = "Score",
     main = "Relationship Between Score and Study Time",
     ylim = c(80, 120),
     xlim = c(0, 15),
     lwd = 5, #set thickness of the point outlines
     col = as.factor(d$condition)) #set color based on condition (as.factor() also handles character columns)
Here is one created with ggplot:
library(ggplot2)
scatter <- ggplot(d, aes(x = stime, y = score, color = condition)) +
  geom_point(size = 3) +
  ggtitle("Relationship Between Score and Study Time") +
  ylab("Score") +
  xlab("Study Time") +
  ylim(80, 120) +
  xlim(0, 15)
scatter
ggplot uses aes(), short for "aesthetics," to indicate which variables to map onto the plot. You then add layers using + (component). For example, geom_point() is used for a scatter plot, geom_boxplot() for a box plot, geom_line() for a line plot, and so on. You can also combine multiple components (e.g., geom_line() + geom_point() connects points with lines). You can add layers all at once or incrementally.
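As a small illustration of swapping and stacking layers on one base plot, here is a sketch that uses pressure, a data set that ships with R, purely as a stand-in so it runs without the tutorial's data frame. The object names p, points.only, and lines.points are illustrative.

```r
library(ggplot2)

# Base plot: only the data and the aesthetic mappings, no layers yet
p <- ggplot(pressure, aes(x = temperature, y = pressure))

points.only  <- p + geom_point()                # scatter plot
lines.points <- p + geom_line() + geom_point()  # points connected by lines

# Layers and labels can also be added incrementally to a stored plot
lines.points <- lines.points + ggtitle("Pressure by Temperature")

lines.points
```

Because the mappings live in the base plot, each variant only has to name the layers it wants.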
For example, we might want to add a few more layers to our scatter plot:
scatter <- scatter +
  theme_classic() + #Use a premade theme with a white background
  theme(plot.title = element_text(hjust = 0.5)) + #Center the title
  theme(legend.position = "none") #turn off automatic legend
scatter
The graphs in this example are pretty similar. So then, why use ggplot?
One advantage is that it's easy to make a large variety of visualizations from the same data set, simply by changing which layers you add. In addition, ggplot allows you to add many features to your plot that would be cumbersome with base R graphics (e.g., confidence intervals, residual plots). For a list of other advantages, see Mandy Mejia's blog.
If we want to include confidence intervals in our bar plots or line graphs, we need to calculate them first. To calculate a confidence interval (CI), we first need the CI multiplier, which is the critical value of the t distribution.
The qt() function returns a quantile of the t distribution. Here we want a 95% confidence interval for each condition, so we use 19 degrees of freedom (n - 1, with n = 20 students per condition).
ciMult <- qt(0.95/2 + .5, 20-1)
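To see what the multiplier does, here is a small sketch on a made-up vector of 20 scores (the vector x and the seed are illustrative assumptions, not the tutorial's data). It builds the 95% interval by hand as mean ± multiplier × standard error and compares it with the interval reported by t.test().

```r
set.seed(1)
x <- rnorm(20, mean = 90, sd = 6)       # hypothetical scores for one condition

n    <- length(x)
se   <- sd(x) / sqrt(n)                 # standard error of the mean
mult <- qt(0.95/2 + 0.5, df = n - 1)    # two-sided 95% critical value with n - 1 df

manual  <- c(mean(x) - mult * se, mean(x) + mult * se)        # interval built by hand
builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)  # interval from t.test()

manual
builtin
all.equal(manual, builtin)
```

The two intervals should agree, which is a quick way to check that the degrees of freedom and sample size used in the multiplier match the group being summarised.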
We can then use the summarise() function in dplyr to aggregate the means and confidence intervals into a dataframe. (Your numbers will look slightly different because the dataset is randomly generated.)
library(dplyr)
dsum <- summarise(group_by(d, condition), m = mean(score),
                  sd = sd(score), se = sd/sqrt(n()), ci = se*ciMult) #n() is the number of students in each condition
head(dsum)
condition m sd se ci
(fctr) (dbl) (dbl) (dbl) (dbl)
1 A 90.79179 5.712633 1.277384 2.673595
2 B 110.07836 6.071723 1.357679 2.841654
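Another way to calculate the confidence intervals is to use the summarySE() function in the "Rmisc" package. Note that Rmisc attaches plyr, which masks dplyr functions such as summarise(); if you need the dplyr version afterwards, call it as dplyr::summarise().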
library(Rmisc)
dsum <- summarySE(data=d, measurevar="score", groupvars="condition")
dsum
#   condition  N     score       sd       se       ci
#1         A 20  90.79179 5.712633 1.277384 2.673595
#2         B 20 110.07836 6.071723 1.357679 2.841654
Once you have the confidence interval and mean scores, you can create a basic bar plot using ggplot2.
library(ggplot2)
ggplot(dsum, aes(x=condition, y=score)) +
  #we use 'stat="identity"' so that ggplot knows to use the exact values from the dsum dataframe
  geom_bar(stat="identity") +
  geom_errorbar(aes(ymin=score-ci, ymax=score+ci), width = .1)
Why is this bar plot hard to read? How can we change it to be "prettier"?
With ggplot, you can keep adding on bits of code to adjust how the bar plot looks. Here are some basic add-ons that are useful for SLA visualizations:
pretty.bar <- ggplot(dsum, aes(x=condition, y=score)) +
  geom_bar(aes(fill=condition), #fill the color of the bars based on the condition
           stat="identity", width = .5) + #change the width of the bars
  geom_errorbar(aes(ymin=score-ci,
                    ymax=score+ci), width = .1) + #change the width of the error bars
  ggtitle("Mean Score for Conditions A & B") + #add a title
  ylim(0, 125) + #set the y-axis range
  ylab("Mean Score") + #Y-axis label
  xlab("Condition") + #X-axis label
  theme_bw() + #change the theme to black and white
  scale_fill_brewer(palette="Set1") + #add a color palette
  theme(plot.title = element_text(hjust = 0.5)) #center the title
pretty.bar
Try it
Now that you've seen how to make bar plots using the score data, try visualizing the mean study times for conditions A and B. Make sure you get the CIs first.
Once you've learned how to do bar plots, box plots are pretty straightforward. We simply use geom_boxplot() instead of geom_bar(). We also use the original data from our d dataframe instead of the aggregated means from the dsum dataframe.
ggplot(d, aes(x=condition, y=score)) + geom_boxplot()
Prettify the boxplot by adding a title, x and y-axis titles, colors, and a white background. Hint: the process is very similar to modifying bar plots.
Sometimes we want to use line graphs to visualize our data (e.g., to show changes between groups over time). Let's use the following data set (copy this code into your script):
longstudy <- data.frame(part = 1:50,
                        group = c(rep('control', times = 25), rep('treatment', times = 25)),
                        L1 = sample(c("Eng", "Chi"), 50, replace = TRUE, prob = c(0.5, 0.5)), #one L1 per participant
                        T1 = as.integer(rnorm(50, 50, 10)),
                        T2 = c(as.integer(rnorm(25, 55, 20)), as.integer(rnorm(25, 70, 10))),
                        T3 = c(as.integer(rnorm(25, 65, 10)), as.integer(rnorm(25, 85, 20))),
                        T4 = c(as.integer(rnorm(25, 60, 10)), as.integer(rnorm(25, 75, 10))))
head(longstudy)
part group L1 T1 T2 T3 T4
1 1 control Eng 35 49 67 51
2 2 control Chi 42 54 70 71
3 3 control Chi 50 18 71 46
4 4 control Chi 47 7 69 63
5 5 control Chi 47 16 69 48
6 6 control Chi 58 102 43 40
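ggplot provides the geom_line() function to make a line graph. A first attempt might look like this, but it will not run as written because longstudy has no columns named time or score: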
ggplot(longstudy, aes(x=time, y=score)) + geom_line()
What do we have to do before making a line graph?
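One possible answer, sketched under assumptions: the scores are spread across four columns (T1 to T4), so they first need to be reshaped into long format with one row per participant per test, and then averaged by group and time. The sketch below uses pivot_longer() from the tidyr package, which this tutorial does not otherwise load, and long_means() is a hypothetical helper name. The small demo data frame is a fixed stand-in so the block runs on its own; the exact steps the author intended are not shown in the text.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of the original tutorial): reshape the wide
# T1-T4 columns to long format, then compute the mean score per group and time.
long_means <- function(wide) {
  long <- tidyr::pivot_longer(wide,
                              cols = c("T1", "T2", "T3", "T4"),
                              names_to = "time",     # T1..T4 become values of 'time'
                              values_to = "score")   # the test scores go into 'score'
  grouped <- dplyr::group_by(long, group, time)
  dplyr::summarise(grouped, score = mean(score), .groups = "drop")
}

# Made-up stand-in with the same layout as longstudy
demo <- data.frame(part  = 1:4,
                   group = c("control", "control", "treatment", "treatment"),
                   T1 = c(50, 52, 49, 51),
                   T2 = c(55, 57, 68, 72),
                   T3 = c(64, 66, 84, 86),
                   T4 = c(59, 61, 74, 76))

l.demo <- long_means(demo)
l.demo
```

With the tutorial's data, l <- long_means(longstudy) would produce a data frame with group, time, and score columns. The dplyr:: prefixes are deliberate, since loading Rmisc (and with it plyr) can mask summarise().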
Now we can try making a simple line graph.
ggplot(l, aes(x=time, y=score, group = group)) + geom_line()
On your own, add CIs, a title, axis labels, adjust the y-axis range, and modify the colors of the line graph. You may adjust other elements based on your preference.
If you want to change up your color palettes, you have several options.
RColorBrewer::display.brewer.all() #displays the palettes provided by the RColorBrewer package
#same as
library(RColorBrewer)
display.brewer.all()
To use one of the built-in color palettes, just replace "Set1" in scale_fill_brewer(palette="Set1") from the bar plot with another palette (e.g., "Dark2" or "Paired").
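For lines and points, groups are usually mapped to the color aesthetic rather than fill, so the matching function is scale_color_brewer(). A minimal sketch with made-up group means (toy and line.plot are illustrative names, and the numbers are not results from the tutorial's data):

```r
library(ggplot2)

# Made-up mean scores for two groups at four time points
toy <- data.frame(
  time  = rep(c("T1", "T2", "T3", "T4"), times = 2),
  group = rep(c("control", "treatment"), each = 4),
  score = c(50, 55, 65, 60, 50, 70, 85, 75)
)

line.plot <- ggplot(toy, aes(x = time, y = score, group = group, color = group)) +
  geom_line() +
  geom_point() +
  scale_color_brewer(palette = "Dark2")  # swap in another Brewer palette name here

line.plot
```

The group aesthetic tells geom_line() which points to connect when the x-axis is categorical.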
If you'd like to create your own color palette, you can do this too by creating a vector of colors. The one below uses hex color codes that are colorblind-friendly:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
You can also install the "viridis" package, which has color scales that are pretty and easier to read for people with colorblindness:
install.packages("viridis") #only needed once
library(viridis)
vPalette <- viridis(5) #Creates a vector with 5 colorblind-compatible colors.
vPalette
[1] "#440154FF" "#3B528BFF" "#21908CFF" "#5DC863FF" "#FDE725FF"
To use your custom color palette in a ggplot, add scale_fill_manual(values=palette) to your plot:
color.box <- ggplot(d, aes(x=condition, y=score)) +
  geom_boxplot(aes(fill=condition)) +
  ggtitle("Scores for Conditions A & B") + #a box plot shows the score distribution, not just the mean
  ylab("Score") +
  xlab("Condition") +
  theme_bw() +
  scale_fill_manual(values=vPalette) + #Add a color palette manually
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = "none") +
  geom_point(position = position_jitter(width = 0.2, height = 0)) #overlay the raw scores, jittered horizontally only
color.box
One last neat visualization with color is using a gradient. This works well with scatter plots.
scatter <- ggplot(d, aes(x=stime, y=score, color=score)) +
  geom_point(size = 3) + #change the size of the points
  ggtitle("Relationship Between Score and Study Time") +
  ylab("Score") +
  xlab("Study Time") +
  theme_classic() +
  scale_color_gradient(low = "blue", high = "red") + #Add a color gradient
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = "none")
scatter
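The colorblind-friendly viridis colors can also be used as a continuous gradient. The sketch below relies on scale_color_viridis_c(), which ships with recent versions of ggplot2, and uses the built-in faithful data set only as a stand-in so the example runs on its own; viridis.scatter is an illustrative name.

```r
library(ggplot2)

# 'faithful' ships with R: eruption length and waiting time for Old Faithful
viridis.scatter <- ggplot(faithful, aes(x = eruptions, y = waiting, color = waiting)) +
  geom_point(size = 3) +
  scale_color_viridis_c() +  # continuous viridis gradient instead of blue-to-red
  theme_classic()

viridis.scatter
```

With the tutorial's data, the same scale could replace scale_color_gradient() in the scatter plot above.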
This has only been a brief introduction to the visualizations you can make with ggplot2. There are many other tutorials available on the web that go into depth about ways you can customize your plots, visualize multiple aspects of your data at once, and much more. Once you've learned the basics, it will be easy to explore and learn about these on your own :)