Introduction to ggplot2 (part 2)

Getting bored with the plots you can make using the base R plot? Probably time to spice things up with ggplot!

You can read through this article, or you can watch the tutorial video below (or both!).

Let’s get started. First, load the ggplot2 library, since that’s what we’re here to learn!

library(ggplot2)

We’re going to be looking at the msleep dataset (the same one we looked at in the last tutorial).

sleep <- ggplot2::msleep

We’re interested in exploring the linear relationship between log(brain weights) and total sleep. In light of this, let’s make a new data.frame for those pieces of data.

data <- data.frame(log_bw=log(sleep$brainwt), st=sleep$sleep_total)
summary(data)
##      log_bw             st       
##  Min.   :-8.874   Min.   : 1.90  
##  1st Qu.:-5.845   1st Qu.: 7.85  
##  Median :-4.390   Median :10.10  
##  Mean   :-4.063   Mean   :10.43  
##  3rd Qu.:-2.085   3rd Qu.:13.75  
##  Max.   : 1.743   Max.   :19.90  
##  NA's   :27

As you can see from the summary, there are missing (NA) values in log_bw. Let’s get rid of those using the which() function.

data_clean <- data[which(!is.na(data$log_bw)),]
summary(data_clean)
##      log_bw             st        
##  Min.   :-8.874   Min.   : 2.900  
##  1st Qu.:-5.845   1st Qu.: 7.575  
##  Median :-4.390   Median : 9.950  
##  Mean   :-4.063   Mean   :10.171  
##  3rd Qu.:-2.085   3rd Qu.:12.575  
##  Max.   : 1.743   Max.   :19.900
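
Base R offers a few ways to write this kind of filter. The sketch below uses a small made-up data frame (toy and its values are hypothetical, not the msleep data) to show that which(!is.na(...)), complete.cases(), and na.omit() keep the same rows when only one column has missing values.

```r
# Toy data frame (hypothetical values) with missing entries in log_bw only
toy <- data.frame(log_bw = c(-4.2, NA, -1.5, NA, 0.3),
                  st     = c(12.1, 9.4, 8.0, 14.2, 3.9))

# Approach used in the text: keep rows where log_bw is not NA
keep_which <- toy[which(!is.na(toy$log_bw)), ]

# complete.cases() flags rows that have no NA in any column
keep_cc <- toy[complete.cases(toy), ]

# na.omit() drops incomplete rows and records the dropped rows as an attribute
keep_omit <- na.omit(toy)

nrow(keep_which)  # 3 rows remain
```

Keep in mind that complete.cases() and na.omit() look at every column, so they would drop more rows than the which() version if st also had missing values.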

Now that the summary shows no missing values, let’s move on. Let’s make a quick scatter plot of the data (just like we did in the previous tutorial). The only new thing here is the call to theme_bw(). This is an example of using themes to change the general look of the plot. Play around with other themes or even create your own!

g1 <- ggplot(data=data_clean) +
  geom_point(aes(x = st, y = log_bw)) +
  theme_bw()

You can see that there is a roughly linear trend in the plot, so a linear regression may be appropriate for fitting a “best fit” line. We will use the function lm(), which performs an ordinary least squares regression. In this case we have log_bw as our dependent (response) variable, and st as our independent (explanatory) variable. Note that the “+ 0” indicates that we do not want an intercept term in our regression, so the fitted line is forced through the origin (0,0). This choice was made arbitrarily, and in a real analysis you would need to justify it.

mod1 <- lm(formula = log_bw ~ st + 0, data = data_clean)
s <- summary(mod1)
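
To see what “+ 0” changes, here is a minimal sketch on made-up numbers (toy, fit_int, and fit_noint are illustrative names, not part of the tutorial’s data). It fits the same data with and without an intercept and checks the no-intercept slope against its closed form, sum(x*y) / sum(x^2).

```r
# Toy data (hypothetical): y is roughly 2 * x plus a constant offset of 5
x <- c(1, 2, 3, 4, 5)
y <- c(7.1, 8.9, 11.2, 12.8, 15.0)
toy <- data.frame(x = x, y = y)

fit_int   <- lm(y ~ x, data = toy)      # estimates an intercept and a slope
fit_noint <- lm(y ~ x + 0, data = toy)  # slope only, line passes through (0, 0)

names(coef(fit_int))    # "(Intercept)" "x"
names(coef(fit_noint))  # "x"

# Least squares through the origin has a closed-form slope
slope_by_hand <- sum(x * y) / sum(x^2)
all.equal(unname(coef(fit_noint)), slope_by_hand)
```

When the data really do have an offset, as in this toy example, forcing the line through the origin changes the slope, which is why the choice deserves justification.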

Let’s take the output from the linear regression, and put it into a data.frame so that we can use it easily with ggplot.

data_mod1 <- data.frame(x = mod1$model$st, y_fit = mod1$fitted.values)
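
Taking x from the model frame (mod1$model) rather than from the original data is a handy habit, because under R’s default na.action setting lm() silently drops rows with missing values. Our data_clean has none, but the made-up example below (toy and fit_df are hypothetical names) shows why the model frame and the fitted values always line up.

```r
# Toy data (hypothetical) with two missing responses
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, NA, 6.2, 7.9, NA))
fit <- lm(y ~ x + 0, data = toy)

nrow(toy)                  # 5 rows went in
nrow(fit$model)            # 3 rows were actually used in the fit
length(fit$fitted.values)  # 3 fitted values, one per row used

# Safe to combine: both vectors come from the rows used in the fit
fit_df <- data.frame(x = fit$model$x, y_fit = fit$fitted.values)
```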

Now we can easily add the regression line to our scatter plot!

g2 <- g1 + geom_line(data   = data_mod1, aes(x = x, y = y_fit),
                     colour = "blue",
                     size   = 1)
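
As an alternative, ggplot2 can fit and draw the line itself with geom_smooth(). The sketch below rebuilds the cleaned data so it runs on its own; plot_df and p are illustrative names. Note that the formula is written in terms of the x and y aesthetics, not the column names.

```r
library(ggplot2)

# Rebuild the cleaned data so this block is self-contained
plot_df <- data.frame(log_bw = log(msleep$brainwt), st = msleep$sleep_total)
plot_df <- plot_df[!is.na(plot_df$log_bw), ]

# geom_smooth(method = "lm") fits the regression internally;
# formula = y ~ x + 0 asks for the same no-intercept model as lm() above
p <- ggplot(plot_df, aes(x = st, y = log_bw)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + 0,
              se = FALSE, colour = "blue") +
  theme_bw()
p
```

Building the line by hand, as this tutorial does, takes a few more steps but gives you the fitted values in a data.frame to reuse in later layers.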

Adding standard error lines is very similar, and we’ll make use of the residual standard error already calculated and stored in the model summary (s$sigma). We’ll give geom_point an alpha of 0.5 so that the points fade into the backdrop and bring out the regression line, which we want to be the visual focus of the plot. We will also use linetype = 2 to give the standard error lines a dashed style.

sigma <- s$sigma
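
If you’re curious where that number comes from, the sigma stored in the summary is the residual standard error: the square root of the residual sum of squares divided by the residual degrees of freedom. A minimal check on made-up data (toy, rss, and sigma_by_hand are hypothetical names):

```r
# Toy data (hypothetical values)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(1.2, 1.9, 3.4, 3.8, 5.3, 5.9))
fit <- lm(y ~ x + 0, data = toy)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rss           <- sum(residuals(fit)^2)
sigma_by_hand <- sqrt(rss / df.residual(fit))

all.equal(summary(fit)$sigma, sigma_by_hand)
```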

g3 <- ggplot() +
  geom_point(data    = data_clean, aes(x = st, y = log_bw),
             alpha   = 0.5) +
  geom_line(data     = data_mod1, aes(x = x, y = y_fit),
            size     = 1,
            colour   = "blue") +
  geom_line(data     = data_mod1, aes(x = x, y = y_fit-sigma),
            colour   = "blue",
            linetype = 2) +
  geom_line(data     = data_mod1, aes(x = x, y = y_fit+sigma),
            colour   = "blue",
            linetype = 2) +
  theme_bw()

Next up we’ll look at the outliers! There is some flexibility in defining outliers, so for our purposes we will define an outlier to be any point more than 2 residual standard errors (2 * sigma) away from the fitted line.

data_outs <- mod1$model[which(abs(mod1$model$log_bw-mod1$fitted.values) > 2*sigma),]
data_outs
##       log_bw   st
## 17 -8.873868  9.1
## 62 -2.513306 18.1
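
The same rule can be written with residuals(), which returns observed minus fitted values directly. Here is a small sketch on made-up data with one planted outlier (toy, sig, and is_out are hypothetical names):

```r
# Toy data (hypothetical): y is close to x, except for row 6
toy <- data.frame(x = 1:10,
                  y = c(1.1, 2.0, 2.9, 4.2, 5.0, 12.0, 7.1, 7.9, 9.0, 10.1))
fit <- lm(y ~ x + 0, data = toy)
sig <- summary(fit)$sigma

# Flag rows whose residual is more than 2 residual standard errors from the line
is_out <- abs(residuals(fit)) > 2 * sig
toy[is_out, ]  # only row 6 is flagged
```

One caveat with this simple rule is that an outlier inflates sigma itself, so very extreme points can make the threshold more forgiving.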

You can see that we have two outliers. Let’s highlight them on our plot in bright red.

g4 <- g3 + geom_point(data   = data_outs, aes(x = st, y = log_bw),
                      colour = "red",
                      size   = 2.25)

Let’s shade in the region between the standard error lines, purely for the joy of shading areas of a plot. We’ll do this using geom_ribbon, and while we’re at it we’ll add prettier labels and a plot title.

g4 + geom_ribbon(data  = data_mod1, aes(x   = x,
                                       ymin = y_fit-sigma,
                                       ymax = y_fit+sigma),
                 alpha = 0.25,
                 fill  = "lightblue") +
  ggtitle("Linear Regression: log(Brain Weight) ~ Total Sleep") +
  xlab("Total Sleep (hours)") +
  ylab("log(Brain Weight in kg)")
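
If you prefer to set the title and axis labels in one place, labs() does the same job as ggtitle(), xlab(), and ylab() together. A minimal sketch on made-up points (plot_df and p are hypothetical names):

```r
library(ggplot2)

# Toy points (hypothetical values), just to have something to label
plot_df <- data.frame(st = c(5, 10, 15), log_bw = c(-2, -4, -6))

# labs() sets the title and both axis labels in a single call
p <- ggplot(plot_df, aes(x = st, y = log_bw)) +
  geom_point() +
  labs(title = "Linear Regression: log(Brain Weight) ~ Total Sleep",
       x     = "Total Sleep (hours)",
       y     = "log(Brain Weight)") +
  theme_bw()
p
```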

That concludes this tutorial lesson on ggplot2. Now go make some beautiful graphs! Feel free to tweet them @mrparker9090.