3. Getting started with visualization
In this worksheet we will introduce the foundations of creating visualizations using the ggplot2 package. ggplot2 is by far the most widely used package for visualization in R. The package is built on Leland Wilkinson’s work developing a grammar of graphics, which abstracts elements of visualizations into a set of layers.1 For more resources on ggplot2 and visualization, see Resources: Visualization with ggplot2.
This worksheet introduces ggplot2 and the grammar of graphics by recreating the visualization of the Palmer penguins data set found on the front page of this section. Let’s start by refamiliarizing ourselves with the penguins data set that we explored in Working with data frames in R.
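As a quick refresher, the data can be inspected along these lines. This is a minimal sketch that assumes the palmerpenguins package is installed; str() is just one of several ways to get an overview of a data frame.

```r
library(palmerpenguins)

# Overview of the data: one row per penguin, with column names and types
str(penguins)

# The three columns used in the plot below
head(penguins[, c("species", "flipper_length_mm", "body_mass_g")])
```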
And let’s take a look at the plot, which follows one of the visualizations from the palmerpenguins package documentation.2
Now we can work on reproducing this plot one step at a time. By doing so we will introduce the foundations of ggplot2. While ggplot2 is a very complex package, the basic structures are highly repeatable.
1 Data
Every ggplot2 visualization begins with the ggplot() function and data. Let’s see what we get when we start with just these two things.
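A minimal sketch of this first step, assuming ggplot2 and palmerpenguins are installed. The object name p is only a placeholder used for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Only the data is supplied: no aesthetics and no geoms yet
p <- ggplot(data = penguins)
p
```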
2 Mapping aesthetics
The answer…not much. We get a gray box. We told ggplot() to plot our data but said nothing about how to go about doing it. A good place to start is by defining our x and y axes. To do this, we map variables in our data to aesthetic properties of the plot. Mappings are supplied through the mapping argument, which is always defined with the aes() function. aes() can take a number of aesthetic mappings, the most significant of which are x and y. Fill in the x and y from the above plot.
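One way to fill in the mapping is sketched here, assuming the example plot puts flipper length on the x axis and body mass on the y axis. The column names flipper_length_mm and body_mass_g come from the penguins data; p is a placeholder name.

```r
library(ggplot2)
library(palmerpenguins)

# Map two columns of the data to the x and y aesthetics
p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
p
```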
3 Geoms
Now we are getting somewhere. Which elements of the plot were created by mapping the x and y aesthetics?
This is a good start, but now we need a geometric object to represent our data. In the example above we have a scatterplot, which is created with geom_point(). This function adds a layer of points, and we can add this layer by literally adding it using the plus sign. ggplot2 has a wide range of geoms that all begin with geom_.
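Adding the layer of points might look like the following sketch. It assumes the same x and y mapping as before, and p is a placeholder name.

```r
library(ggplot2)
library(palmerpenguins)

# The plus sign adds a layer of points to the plot
p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
p
```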
4 Mapping more aesthetics
We now have quite a respectable plot. We have points for the flipper length and body mass of all of the penguins in the data, aside from two penguins with missing values for flipper length or body mass, as the warning message tells us. However, the plot we are looking to replicate shows information about the species of penguin, which is not available in our current plot. We can add it by mapping the species variable in our data to another aesthetic, in this case color. Since people perceive color in different ways, it is always a good idea to add a second way to distinguish the values of a variable, such as shape. Because we are mapping data to aesthetics, we need to use the aes() function again.
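A sketch of mapping species to both color and shape. Placing this second aes() call inside geom_point() is one reasonable choice rather than the only option; p is a placeholder name.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  # species drives both the color and the shape of each point
  geom_point(mapping = aes(color = species, shape = species))
p
```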
By mapping an aesthetic beyond x and y, ggplot2 automatically creates a legend for us: red circles are Adelie, green triangles are Chinstrap, and blue squares are Gentoo. The legend is even given a title with the name of the column used to create it from penguins.
5 Setting aesthetics
But the points do not quite look the same as in our example plot. In our version the points are smaller and some of them overlap each other. Both of these issues make it difficult to perceive the distribution of the points. We can adjust these aesthetic features not by mapping them to the penguins data but by setting them. You set an aesthetic feature by giving it a constant value outside the aes() function. We can set size to 3, and the transparency, or alpha, to 0.8. We can also tell geom_point() to drop missing values silently, without the warning message, by using na.rm = TRUE.
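The mapped and set aesthetics can be combined in a single geom_point() call, as in this sketch. The values 3 and 0.8 are the ones named in the text; p is a placeholder name.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(
    mapping = aes(color = species, shape = species), # mapped to data
    size = 3,     # set: constant point size
    alpha = 0.8,  # set: slight transparency
    na.rm = TRUE  # drop missing values without a warning
  )
p
```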
Play around with these values. What happens if you change size and alpha? What if you set color to a specific color such as "blue" within the aes() function?3 Try setting color to blue outside the aes() function. Look at the documentation for geom_point() to see other possible aesthetics.
Setting color = "blue" within the aes() function will quickly show you the significance of understanding the difference between mapping and setting. You map aesthetics to columns in the data within aes(); you set aesthetics to constants outside the aes() function.
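The difference can be seen side by side in a sketch like this one. The object names mapped and set_blue are hypothetical and used only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

# Mapping: "blue" is treated as data with a single value, so ggplot2 picks
# a color from its default scale and draws a legend for it
mapped <- base + geom_point(mapping = aes(color = "blue"), na.rm = TRUE)

# Setting: every point is drawn in blue and no legend is created
set_blue <- base + geom_point(color = "blue", na.rm = TRUE)

mapped
set_blue
```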
One final distinction is worth a quick mention here. We could map color and shape to species within the ggplot() function alongside x and y, in which case the mappings apply to every layer of the plot. Set aesthetics, however, must be placed within the specific geom they apply to.
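A sketch of this alternative arrangement, with all four mappings in ggplot() and the constant values set in the geom. It should draw the same plot as the version that maps color and shape inside geom_point(); p is a placeholder name.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  # Mappings given here are inherited by every layer
  mapping = aes(
    x = flipper_length_mm, y = body_mass_g,
    color = species, shape = species
  )
) +
  # Constants are set in the geom they apply to
  geom_point(size = 3, alpha = 0.8, na.rm = TRUE)
p
```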
6 Scales
When we mapped color and shape to species, ggplot2 used its default scales to choose which colors and shapes to use. A scale controls how data values are translated into the values of an aesthetic: the particular set of colors, shapes, sizes, and so on. Each aesthetic has its own scale. By default, ggplot2 applies scale_color_discrete() for the colors, but you can adjust this using functions that begin with scale_color_. Try it with scale_color_brewer(palette = "Pastel1"). See the Color Brewer website for other color palettes.
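Swapping in the Color Brewer palette could look like this sketch, which builds on the plot from the previous steps. The name p is a placeholder.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(
    mapping = aes(color = species, shape = species),
    size = 3, alpha = 0.8, na.rm = TRUE
  ) +
  # Replace the default color scale with a Color Brewer palette
  scale_color_brewer(palette = "Pastel1")
p
```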
There are many different color scales available within ggplot2 and there are dozens of packages that provide more color scales. Check out the paletteer package for a list of color palettes and a consistent way to use them. However, our example plot created its own scale by manually choosing colors using scale_color_manual() and setting values to a character vector of color names: "darkorange", "purple", and "cyan4". Let’s do that now.
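A sketch of the manual scale using the three color names given above. With an unnamed vector, the colors are matched to the species in the order of the factor levels (Adelie, Chinstrap, Gentoo); p is a placeholder name.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(
    mapping = aes(color = species, shape = species),
    size = 3, alpha = 0.8, na.rm = TRUE
  ) +
  # One color per species, chosen by hand
  scale_color_manual(values = c("darkorange", "purple", "cyan4"))
p
```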
7 Guides or labels
Now we have the graphical representation of our data set, looking just like our example plot. However, we are not quite done. We still need to make some adjustments to make the plot ready for presentation, namely changing the labels for the x and y axes, the title of the legend, and adding a title for the plot. These aspects do not change the data in the plot, but they help communicate the meaning of the plot. These aspects of a plot are known as guides. The most convenient way to change labels is with the labs() function. Notice again the use of + to iteratively add another layer to the plot.
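The labels can be added as in the sketch below. The label text is illustrative and is an assumption; substitute the wording from the example plot. Giving color and shape the same title keeps them in a single legend.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(
    mapping = aes(color = species, shape = species),
    size = 3, alpha = 0.8, na.rm = TRUE
  ) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  # Illustrative label text (an assumption, not taken from the worksheet)
  labs(
    title = "Penguin size, Palmer Station LTER",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Penguin species",
    shape = "Penguin species"
  )
p
```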
What happens if you set color and shape to different values?
8 Themes
The final touch will be to adjust the theme of the plot. Themes affect the overall visual defaults of a plot, such as the background color, the grid lines, the font and text size, and the position of the plot elements, among many other aspects. The theme() function has dozens of arguments that can be tweaked, but ggplot2 also comes with a handful of complete themes that all begin with theme_. Type _ after theme to see what options there are and try out different themes.
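Putting every step together with a complete theme gives something like the following sketch. theme_minimal() is only one of the built-in themes and is chosen here as an example; the label text is again illustrative, and p is a placeholder name.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,                                          # data
  mapping = aes(x = flipper_length_mm, y = body_mass_g)     # aesthetics
) +
  geom_point(                                               # geom
    mapping = aes(color = species, shape = species),
    size = 3, alpha = 0.8, na.rm = TRUE
  ) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) + # scale
  labs(                                                     # labels
    title = "Penguin size, Palmer Station LTER",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Penguin species",
    shape = "Penguin species"
  ) +
  theme_minimal()                                           # theme
p
```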
We now have a full, nice-looking plot. ggplot2 can do much more. It can make all sorts of plots, including maps and network visualizations. But no matter how complex a plot you want to make, it is all built on the foundations of the grammar of graphics we have demonstrated here. Plots consist of data mapped to aesthetics, which are represented by a geom and a set of scales. The look and feel of the plot can then be altered through the use of guides (labels) and themes.
Footnotes
1. Leland Wilkinson, The Grammar of Graphics, Second Edition (Springer-Verlag, 2005), https://doi.org/10.1007/0-387-28695-0; Hadley Wickham, “A Layered Grammar of Graphics,” Journal of Computational and Graphical Statistics 19, no. 1 (2010): 3–28, https://doi.org/10.1198/jcgs.2009.07098.
2. Allison Horst, Alison Hill, Kristen Gorman, palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. (2020), https://allisonhorst.github.io/palmerpenguins/. See Wickham et al., R for Data Science, Chapter 1: Data Visualization for a walkthrough of creating a scatterplot with the palmerpenguins data.
3. Note that R has a whole set of named colors. You can see all 657 named colors in R with colors().