14  Geoms

Published: July 26, 2023

Modified: January 8, 2024

14.1 Resources

library(ggplot2)
library(dplyr)
library(palmerpenguins)

# Data
penguins <- penguins |> 
  filter(!is.na(flipper_length_mm),
         !is.na(sex)) |> 
  # Add factor data that has more levels
  mutate(species_sex = as.factor(
    paste(species, sex)))

14.2 Introduction

The layers that draw the data in a plot are typically added with a geom, whose behavior can be altered by a stat (statistical transformation) and a position adjustment where necessary. Each geom requires certain aesthetics, particularly positional aesthetics. Other aesthetics can be added to map variables to the plot or to alter the visual representation of the geoms.

  • geom: Geometric object drawn on a plot to represent data.
  • stat: The statistical transformations applied to the data.
  • position: Adjustments to position of geoms to resolve overlapping.

There is usually a one-to-one relationship between geoms and stats: each geom has a default stat that performs its statistical transformation. The relationship between geom_bar() and stat_count(), discussed in Section 14.3.1, is a particularly simple and clear example. Geoms, stats, and positions all have a functional form (geom_*(), stat_*(), and position_*()) as well as a string form without the functional prefix (geom = "point", stat = "identity", position = "stack"). The relationship between these three aspects of a geometric layer is discussed at some length in Section 14.3. Otherwise, the relationship between geoms and stats is noted throughout. An overview of the possible position adjustments is in Section 14.9.
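
As a minimal sketch of the functional and string forms (the object names are only for illustration), the same bar layer can be built from the geom side or from the stat side. layer_data() returns the data ggplot2 computes for a layer, which makes the comparison explicit:

```r
library(ggplot2)

# Refer to the palmerpenguins data explicitly to avoid any name clash
penguins <- palmerpenguins::penguins

# Geom-first, with the stat and position given as strings
p_geom <- ggplot(penguins, aes(x = species)) +
  geom_bar(stat = "count", position = "stack")

# Stat-first, with the position given in its functional form
p_stat <- ggplot(penguins, aes(x = species)) +
  stat_count(geom = "bar", position = position_stack())

# Both layers compute the same counts
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```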

The documentation for each geom provides the aesthetics that can be used with it. See Chapter 16 for an overview of aesthetics.

14.3 Bar plots

14.3.1 Geoms and stats

geom_bar() is a good place to start with geoms because it is easy to show the relationship between geoms, stats, and position.

geom_bar()

By default geom_bar() performs a statistical transformation by counting the number of observations in the variable provided in the single positional aesthetic: x for upright bar plot, y for horizontal bar plot. Therefore, you can create a bar plot with either geom_bar() or stat_count(geom = "bar"):

penguins |> 
  ggplot(aes(x = species)) + 
  geom_bar()
penguins |> 
  ggplot(aes(x = species)) + 
  stat_count(geom = "bar")

stat_identity()

Instead of using the default count stat, you can use stat_identity() which gets the height of the bars directly from the data. stat_identity() requires both x and y positional aesthetics because it needs to know which variable to use for the height of the bar.

penguins |> 
  count(species) |> 
  ggplot(aes(x = species,
             y = n)) + 
  geom_bar(stat = "identity")
penguins |> 
  count(species) |> 
  ggplot(aes(x = species,
             y = n)) + 
  stat_identity(geom = "bar")

geom_col()

geom_col() provides a shortcut for creating a bar plot using stat_identity(). The height of the bar is scaled to the y aesthetic, which does not need to be a count.

penguins |> 
  count(species) |> 
  ggplot(aes(x = species, y = n)) + 
  geom_col()
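
To illustrate that the y aesthetic does not need to be a count, here is a small sketch that plots mean body mass per species instead. The mean_mass data frame and its mean_mass_g column are names made up for this example:

```r
library(ggplot2)
library(dplyr)

# One row per species with the mean body mass (illustrative summary)
mean_mass <- palmerpenguins::penguins |>
  filter(!is.na(body_mass_g)) |>
  summarise(mean_mass_g = mean(body_mass_g), .by = species)

p_mean <- mean_mass |>
  ggplot(aes(x = species, y = mean_mass_g)) +
  geom_col()

# The bar heights are taken directly from the data;
# print(p_mean) draws the plot
layer_data(p_mean)$y
```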

14.3.2 Aesthetics

Add color to the geom with either color or fill aesthetics. As with other polygon geoms, fill provides the color for the area of the geom and color affects the outline of the polygon.

# Fill
penguins |> 
  ggplot(aes(x = species)) + 
  geom_bar(aes(fill = species))
# Color
penguins |> 
  ggplot(aes(x = species)) + 
  geom_bar(aes(color = species))

14.3.3 Position

Each geom has a position argument that takes either a string such as "identity", "stack", "dodge", "dodge2", or "fill", or the corresponding position_*() function, which allows more freedom in tweaking aspects of the position. See Section 14.9 for an overview.

The default position for geom_bar() is position_stack(). This can be seen by mapping color or fill to a non-positional variable. With position_stack() the groups are stacked on top of one another.

penguins |> 
  ggplot(aes(x = species, fill = sex)) + 
  geom_bar(color = "black")

position_identity() is the default for most geoms, but it does not work well with bar plots. Compare the default position_stack() to position_identity() in which each group starts from 0. Note the difference in the limit of the y-axis.

# Default stack position
penguins |> 
  ggplot(aes(x = island, 
             color = species)) + 
  geom_bar(fill = NA)
# Position identity
penguins |> 
  ggplot(aes(x = island, 
             color = species)) + 
  geom_bar(fill = NA,
           position = "identity")
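
The difference in the y-axis limits can be checked numerically. The sketch below, which assumes the same filtering as in Section 14.1, compares the computed bar extents under the two positions:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(flipper_length_mm), !is.na(sex))

p_base <- ggplot(penguins, aes(x = island, color = species))

stacked <- layer_data(p_base + geom_bar(fill = NA))
overlaid <- layer_data(p_base +
  geom_bar(fill = NA, position = "identity"))

# Stacked bars reach the total count for the largest island
max(stacked$ymax)
# With position = "identity" every bar starts at 0 ...
range(overlaid$ymin)
# ... so the tallest bar is only the largest single group
max(overlaid$ymax)
```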

A more useful positional adjustment with bar plots is to "dodge" the groups, placing them alongside each other instead of stacking them on top of each other. position_dodge2() adds padding between the bars within a group. Note that using the position_*() function makes it possible to better control the behavior, such as maintaining the same width for all bars.

penguins |> 
  ggplot(aes(x = island,
             fill = species)) + 
  geom_bar(position = "dodge")
penguins |> 
  ggplot(aes(x = island,
             fill = species)) + 
  geom_bar(position = position_dodge2(
    preserve = "single"))

Finally, position_fill() standardizes the height of the bars to create a ratio or relative frequency plot.

penguins |> 
  ggplot(aes(x = island, fill = species)) + 
  geom_bar(position = "fill")

14.4 Continuous distributions

Like geom_bar(), geoms that visualize continuous distributions count the frequency of data and therefore need only one positional aesthetic. However, because the data is continuous, it needs to be placed into bins before being counted.

14.4.1 Histograms and frequency polygons

geom_histogram() and geom_freqpoly() use stat_bin() to count frequency. With these geoms the look of the plot depends heavily on the width of the bins, which can be set with either bins (the number of bins) or binwidth (the width of each bin in the units of the x variable).

# binwidth
penguins |> 
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
# bins
penguins |> 
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins = 50)
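
A short sketch of how the two arguments relate: binwidth fixes the width and lets the number of bins follow from the range of the data, while bins fixes the number and derives the width. The variable names are illustrative only:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(body_mass_g))

p_base <- ggplot(penguins, aes(x = body_mass_g))

by_width <- layer_data(p_base + geom_histogram(binwidth = 200))
by_bins <- layer_data(p_base + geom_histogram(bins = 50))

# binwidth = 200: every bin is 200 g wide
unique(round(by_width$xmax - by_width$xmin))
nrow(by_width)

# bins = 50: the width is derived from the range of the data
nrow(by_bins)
unique(round(by_bins$xmax - by_bins$xmin, 1))

# Either way, each observation is counted exactly once
c(sum(by_width$count), sum(by_bins$count), nrow(penguins))
```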

geom_freqpoly() displays the same counts with lines instead of bars.

# binwidth
penguins |> 
  ggplot(aes(x = body_mass_g)) +
  geom_freqpoly(binwidth = 200)
# bins
penguins |> 
  ggplot(aes(x = body_mass_g)) +
  geom_freqpoly(bins = 50)

Add color or fill to group the frequency plots.

# Histogram grouped with fill
penguins |> 
  ggplot(aes(x = body_mass_g,
             fill = species)) +
  geom_histogram(binwidth = 200)
# Frequency polygon grouped with color
penguins |> 
  ggplot(aes(x = body_mass_g,
             color = species)) +
  geom_freqpoly(binwidth = 200)

14.4.2 Density estimates

geom_density() provides another way to visualize continuous data, smoothing out the frequency plots into a kernel density estimate. The area under each density is standardized to one, so you lose information about the relative size of each group. Note that the y-axis shows density, not count.

penguins |> 
  ggplot(aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.5) + 
  scale_y_continuous(labels = scales::label_number())
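
If the relative size of the groups matters, one option is to map y to the computed count variable instead of density. This is a sketch rather than a recommendation; count is the density multiplied by the number of observations in the group, so the area under each curve then equals the group size:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(body_mass_g))

p_count <- penguins |>
  ggplot(aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(aes(y = after_stat(count)), alpha = 0.5)

# Recover the group sizes from the computed variables:
# count / density is constant within a group
layer_data(p_count) |>
  summarise(n_obs = max(count) / max(density), .by = group)
```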

The ggridges package makes it easy to create what are essentially faceted density plots.

library(ggridges)
penguins |> 
  ggplot(aes(x = body_mass_g, y = species, color = species, fill = species)) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE)
#> Picking joint bandwidth of 153

14.4.3 Dot plots

Another way to represent binned continuous data is with a dot plot in which dots, each representing one observation, are stacked. geom_dotplot() has two different methods for binning: "dotdensity" and "histodot". "histodot" uses fixed-width bins, whereas "dotdensity", the default, positions the bins based on the data and treats binwidth as the maximum bin width.

penguins |> 
  ggplot(aes(x = body_mass_g)) +
  geom_dotplot(binwidth = 100)

The labels used on the y-axis are not meaningful with geom_dotplot(). You can either hide the y-axis or manually scale it.
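
A minimal sketch of hiding the y-axis, assuming the filtered penguins data from Section 14.1. Setting both the name and the breaks of the y scale to NULL removes the title, ticks, and labels:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(flipper_length_mm), !is.na(sex))

p_dots <- penguins |>
  ggplot(aes(x = body_mass_g)) +
  geom_dotplot(binwidth = 100) +
  # No axis title and no breaks: the y-axis is effectively hidden
  scale_y_continuous(name = NULL, breaks = NULL)

# print(p_dots) draws the plot
```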

geom_dotplot() stacks groups differently from other geoms. Stacking is turned on with the stackgroups argument, which works together with the binning method and with binpositions, which determines whether bin positions are calculated "bygroup" or for "all" the data.

Without stacking you can see that each group has its own binning position by default.

penguins |> 
  ggplot(aes(x = body_mass_g, fill = species)) +
  geom_dotplot(binwidth = 100,
               alpha = 0.7)

There are two slightly different ways to stack the dots with either method = "histodot" or binpositions = "all":

# method = "histodot"
penguins |> 
  ggplot(aes(x = body_mass_g,
             fill = species)) +
  geom_dotplot(binwidth = 100,
               stackgroups = TRUE,
               method = "histodot")
# binpositions = "all"
penguins |> 
  ggplot(aes(x = body_mass_g,
             fill = species)) +
  geom_dotplot(binwidth = 100,
               stackgroups = TRUE,
               binpositions = "all")

Mapping an x and y aesthetic makes it possible to create a beeswarm plot with occurrences stacking from the center and bins on the y-axis.

penguins |> 
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_dotplot(binwidth = 100,
               binaxis = "y",
               stackdir = "center",
               binpositions = "all")

14.5 Statistical distributions

There are a variety of geoms intended to show the statistical distribution of a variable in the data.

14.5.1 geom_boxplot()

A box plot displays the distribution of a continuous variable, showing the median, the 25th and 75th percentiles, whiskers extending to the farthest non-outlier points, and outliers.

penguins |> 
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_boxplot()

The orientation of the plot follows the discrete axis.

penguins |> 
  ggplot(aes(x = body_mass_g, y = species)) +
  geom_boxplot()

Box plots are automatically dodged when any aesthetic is a factor. By default, geom_boxplot() uses position_dodge2() to add space between the boxes.

penguins |> 
  ggplot(aes(x = species, y = body_mass_g, fill = sex)) +
  geom_boxplot()

geom_jitter() is particularly useful with geom_boxplot() to show the actual points. geom_jitter() is a shortcut for geom_point(position = "jitter"). Control the width and height of the jitter with the corresponding arguments. geom_jitter() also has a position argument in case you want to make further changes. For instance, with a dodged box plot you need to use position = position_jitterdodge(). Finally, when adding the points with geom_jitter(), the outlier points from geom_boxplot() should be removed with outlier.shape = NA.

penguins |> 
  ggplot(aes(x = species, y = body_mass_g, color = sex)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_jitter(alpha = 0.8,
              position = position_jitterdodge())

It is possible to use a boxplot with a continuous variable by binning the data with one of the helper functions: cut_width(), cut_interval(), or cut_number().

penguins |> 
  ggplot(aes(x = bill_depth_mm, y = body_mass_g)) + 
  geom_boxplot(aes(group = cut_width(bill_depth_mm, 1)))
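
The three helpers differ in what they hold constant. The following sketch applies each to bill depth so the resulting groups can be compared; the choice of n = 5 is arbitrary:

```r
library(ggplot2)

# Bill depth without missing values
depth <- palmerpenguins::penguins$bill_depth_mm
depth <- depth[!is.na(depth)]

# cut_width(): bins of a fixed width (here 1 mm)
nlevels(cut_width(depth, width = 1))

# cut_interval(): n bins that each cover an equal range
table(cut_interval(depth, n = 5))

# cut_number(): n bins with approximately equal numbers of observations
table(cut_number(depth, n = 5))
```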

14.5.2 geom_violin()

geom_violin() behaves very similarly to geom_boxplot(), but it shows a density estimate of the distribution and, by default, does not show statistical quantiles.

penguins |> 
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_violin()

To show the quantiles use the draw_quantiles argument with a vector of quantiles to draw.

penguins |> 
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))

Like box plots, geom_violin() automatically dodges when any aesthetic is a factor.

penguins |> 
  ggplot(aes(x = species, y = body_mass_g, fill = sex)) +
  geom_violin()

And geom_violin() works well with geom_jitter().

penguins |> 
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_violin() + 
  geom_jitter(aes(color = sex),
    width = 0.25, height = 0,
    alpha = 0.6)

14.5.3 geom_linerange() and geom_pointrange()

Another way to show statistical aspects of data is with either a line range or a point range, which draws a vertical line from ymin to ymax. There need to be variables in the data that hold the ymin and ymax values for each group. This is done by grouping the data and then mutating:

penguins_range <- penguins |> 
  group_by(species) |> 
  mutate(lower = min(body_mass_g),
         upper = max(body_mass_g))

geom_pointrange() is the same as geom_linerange() but it adds the points along the line.

penguins_range |> 
  ggplot(aes(x = species,
             y = body_mass_g)) + 
  geom_linerange(aes(ymin = lower,
                     ymax = upper))
penguins_range |> 
  ggplot(aes(x = species,
             y = body_mass_g)) + 
  geom_pointrange(aes(ymin = lower,
                      ymax = upper))

In this case, it is probably more useful to put points at specific statistical quantiles by calculating these and using geom_point().

penguins |> 
  group_by(species) |> 
  mutate(lower = min(body_mass_g),
         upper = max(body_mass_g),
         med = median(body_mass_g)) |> 
  ggplot(aes(x = species,
             y = body_mass_g)) + 
  geom_linerange(aes(ymin = lower,
                     ymax = upper)) + 
  geom_point(aes(x = species, y = med))

Or, you can recreate this plot more easily with stat_summary():

penguins |> 
  ggplot(aes(x = species,
             y = body_mass_g)) + 
  stat_summary(fun = median, fun.min = min, fun.max = max)
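
To confirm that stat_summary() computes the same values as the manual approach, the sketch below compares the layer data with a dplyr summary. The manual data frame is a name used only in this example:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(body_mass_g))

# Manual summary, one row per species in factor-level order
manual <- penguins |>
  group_by(species) |>
  summarise(lower = min(body_mass_g),
            med = median(body_mass_g),
            upper = max(body_mass_g))

p_summary <- penguins |>
  ggplot(aes(x = species, y = body_mass_g)) +
  stat_summary(fun = median, fun.min = min, fun.max = max)

# Values computed by the stat for the point range
computed <- layer_data(p_summary)[, c("ymin", "y", "ymax")]

all.equal(as.numeric(computed$y), as.numeric(manual$med))
```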

One useful point to note is that you can, of course, place a numeric value in either ymin or ymax such as beginning the line at 0.

penguins_range |> 
  ggplot(aes(x = species,
             y = body_mass_g)) + 
  geom_linerange(aes(ymin = 0,
                     ymax = upper)) + 
  geom_point(aes(y = upper), size = 4)

14.6 Scatter plots

Scatter plots are best used to display the relationship between two continuous variables.

14.6.1 geom_point()

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g,
             color = species_sex)) + 
  geom_point()

A complementary geom to go along with geom_point() is geom_rug(), which creates a compact visualization of observations along the x- and y-axis of a plot. The documentation notes that it is best used with smaller data sets.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g,
             color = species_sex)) + 
  geom_point() + 
  geom_rug()

The above plot shows one of the potential issues with geom_rug() in that many observations occur at the same x or y value. You can see that this is particularly true for flipper length where the scale of the variable is much smaller. You can choose where to place geom_rug() and thus which axes to measure with the sides argument that takes a string containing any of "trbl", for top, right, bottom, and left.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g,
             color = species_sex)) + 
  geom_point() + 
  geom_rug(sides = "l")

14.6.2 Overplotting

The easiest way to deal with overplotting, that is, having multiple points drawn in the same or nearly the same place, is to use transparency. The more overplotting there is, the lower the alpha value can be.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_point(alpha = 0.5)

The use of geom_jitter() has already been shown in Section 14.5.1 for spreading out points along a discrete variable, but it can also be helpful if there are multiple points at specific coordinates.

penguins |> 
  ggplot(aes(x = species_sex,
             y = body_mass_g)) + 
  geom_point()
penguins |> 
  ggplot(aes(x = species_sex,
             y = body_mass_g)) + 
  geom_jitter(width = 0.2, height = 0)

ggplot2 also includes a number of geoms to deal with overplotting. The most specific is geom_count(), which counts the number of observations at each location, then maps the count to point area. geom_count() is a shortcut for geom_point(stat = "sum"). This can be used with two continuous variables, but the documentation specifically notes that it is most useful with discrete data.

# Continuous x continuous
penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_count(alpha = 0.8)
# Discrete x continuous
penguins |> 
  ggplot(aes(x = species_sex,
             y = body_mass_g)) + 
  geom_count(alpha = 0.8)

Adding an aesthetic that maps to a factor creates groups for the “sum” statistical transformation. Compare the scales with and without color.

penguins |> 
  ggplot(aes(x = species,
             y = body_mass_g)
         ) + 
  geom_count(alpha = 0.8)
penguins |> 
  ggplot(aes(x = species,
             y = body_mass_g,
             color = sex)) + 
  geom_count(alpha = 0.8)
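
The effect of grouping on the "sum" stat can be inspected through its computed variable n. In this sketch, assuming the filtered data from Section 14.1, the total is unchanged but the largest count at any one location shrinks or stays the same once the data is split by sex:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(flipper_length_mm), !is.na(sex))

p_all <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_count(alpha = 0.8)
p_sex <- ggplot(penguins, aes(x = species, y = body_mass_g,
                              color = sex)) +
  geom_count(alpha = 0.8)

n_all <- layer_data(p_all)$n
n_sex <- layer_data(p_sex)$n

# Every observation is still counted once
c(sum(n_all), sum(n_sex))
# But counts are now computed within each group
c(max(n_all), max(n_sex))
```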

14.6.3 Heat maps

Another way to deal with overplotting is to create heat maps with either geom_bin2d() or geom_hex(). geom_bin2d() creates rectangular bins; geom_hex() creates hexagonal bins. Like other binning geoms such as geom_histogram(), see Section 14.4.1, the heat map geoms have arguments for number of bins and binwidth. One difference is that binwidth takes a numeric vector of length 2 for vertical and horizontal size of the bins.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_bin2d()

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_hex()
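
As a sketch of the length-2 binwidth, the example below uses bins of 5 mm by 250 g. The values are arbitrary, and the vector is assumed here to be read as x then y, which is worth confirming against the geom_bin2d() documentation for your ggplot2 version:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g))

p_bins <- penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  # Assumed order: width along x (mm), then along y (g)
  geom_bin2d(binwidth = c(5, 250))

# Each tile carries the number of observations that fall inside it
tiles <- layer_data(p_bins)
sum(tiles$count) == nrow(penguins)
```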

Both geoms map a continuous color scale to counts within the bin. Thus, using color or fill to map another aesthetic is not really useful here except for showing presence and absence.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g,
             fill = species)) + 
  geom_hex(alpha = 0.7)

14.6.4 Contours

Another way to deal with overplotting and to visualize density along two continuous variables is with contour plots that are a 2D version of geom_density(), see Section 14.4.2.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_density2d_filled() + 
  geom_density2d() + 
  geom_point(alpha = 0.4)

If you map an aesthetic to a categorical variable, you will get a set of contours for each group.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_density2d(aes(color = species))

Using geom_density2d_filled() with multiple groups does not work well, and so it seems best to use facets. You can change the way that the contour is created across the facets with contour_var choosing one of "density", "ndensity", or "count".

  • "density" uses the same scale across the facets.
  • "ndensity" keeps the peak intensity stable across the facets.
  • "count" scales by the number of observations.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_density2d_filled() + 
  facet_wrap(vars(species)) + 
  scale_x_continuous(breaks = c(180, 200, 220))
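
The plot above uses the default contour_var. A sketch of switching to "ndensity", so that each facet is scaled to its own peak, could look like this:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g))

p_ndensity <- penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  # Scale the density within each facet to a maximum of 1
  geom_density2d_filled(contour_var = "ndensity") +
  facet_wrap(vars(species)) +
  scale_x_continuous(breaks = c(180, 200, 220))

# One set of filled contours is computed per facet
contours <- layer_data(p_ndensity)
length(unique(contours$PANEL))
```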

You can also use stat_density2d() to use the density2d statistical transformation with a different geom such as raster to create density tiles instead of contours.

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  stat_density2d(
    geom = "raster",
    aes(fill = after_stat(density)),
    contour = FALSE) + 
  scale_fill_viridis_c()

14.7 Lines

14.7.1 Line plots

ggplot2 has three main geoms to draw lines to connect observations:

  • geom_line(): Connect observations in order of the variable on the x axis.
  • geom_path(): Connects the observations in the order in which they appear in the data.
  • geom_step(): Creates a stairstep plot, changes in y are at 90 degree angles, highlighting exactly when changes occur.

Line data: mean body mass per year

penguins_line <- penguins |> 
  summarise(body_mass_g = mean(body_mass_g), n = n(),
            .by = c(species_sex, year))

The group aesthetic determines which cases are connected together. Thus, when you want to draw multiple lines in a plot, you need to use group or another aesthetic such as color or linetype to create the groups, see Section 16.8. The difference between geom_line() and geom_path() can be shown by not including a group aesthetic. When you see a plot like the first one, you know that you are missing a group aesthetic.

penguins_line |> 
  ggplot(aes(x = year,
             y = body_mass_g)) + 
  geom_line()
penguins_line |> 
  ggplot(aes(x = year,
             y = body_mass_g)) + 
  geom_path()
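
The ordering difference is easiest to see with a tiny data frame whose rows are deliberately out of order. The toy data below is made up for this illustration:

```r
library(ggplot2)

# Rows are not sorted by x
toy <- data.frame(x = c(3, 1, 2), y = c(30, 10, 20))

line_data <- layer_data(ggplot(toy, aes(x = x, y = y)) + geom_line())
path_data <- layer_data(ggplot(toy, aes(x = x, y = y)) + geom_path())

# geom_line() connects points in order of x
as.numeric(line_data$x)
# geom_path() keeps the order of the rows in the data
as.numeric(path_data$x)
```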

Mapping a color or linetype aesthetic creates the correct grouping.

penguins_line |> 
  ggplot(aes(
    x = year,
    y = body_mass_g,
    color = species_sex)) +
  geom_line(linewidth = 1.5)
penguins_line |> 
  ggplot(aes(
    x = year,
    y = body_mass_g,
    linetype = species_sex)) +
  geom_line(linewidth = 1.5)

geom_step() is good for highlighting exactly when changes occur, for instance with financial data where credits and debits are made on specific days, not over a continuous period or at a constant pace. It is thus good for more discrete forms of time data. Think of the difference between accounting data and stock market prices taken at regular intervals of days or even hours. For the penguin data, using geom_step() highlights that the data was collected at three specific points (at least according to the data we have) and does not represent continuous data over time. The changes can be further highlighted by adding points.

penguins_line |> 
  ggplot(aes(x = year,
             y = body_mass_g,
             color = species_sex)) +
  geom_step() + 
  geom_point()

14.7.2 Line segments and curves

geom_segment() and geom_curve() provide means to draw straight and curved lines with data. These geoms are often used as annotations to a plot and are similar to using the “segment” and “curve” geoms in annotate(), see Section 15.5.2. They differ in that they need data from a data frame. The geoms use the positional aesthetics of x, y, xend, and yend. geom_curve() has arguments for curvature and angle to specify the curve. See grid::curveGrob() for more control over the details of the curve.

# Data frame for where to begin and end segment and curve
df <- data.frame(x1 = 184, x2 = 209, y1 = 4650, y2 = 5500)

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_point() + 
  geom_curve(data = df,
             aes(x = x1, y = y1,
                 xend = x2, yend = y2,
                 color = "curve"),
             curvature = -0.5) +
  geom_segment(data = df,
              aes(x = x1, y = y1,
                  xend = x2, yend = y2,
                  color = "segment")) + 
  labs(color = "Geom type")

14.7.3 Reference lines

Reference line geoms draw horizontal (geom_hline()), vertical (geom_vline()), or angled (geom_abline()) lines across the plot panel. These geoms are drawn using geom_line(), so they support the same aesthetics: alpha, color, linetype, and linewidth. The arguments for the geoms are very simple, and the placement values (xintercept, yintercept, slope, and intercept) are usually supplied directly rather than mapped from the data. If you want the lines to vary across facets, you need to construct a data frame and map its columns to these aesthetics within aes().

penguins |> 
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) + 
  geom_vline(xintercept = median(penguins$flipper_length_mm),
             color = "orchid",
             linewidth = 1.5) + 
  geom_hline(yintercept = median(penguins$body_mass_g),
             color = "tomato",
             linewidth = 1.5) + 
  geom_abline(intercept = 0, slope = 20,
              color = "slateblue",
              linewidth = 1.5) + 
  geom_point(alpha = 0.5)
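
For reference lines that vary across facets, a sketch along these lines computes one median per species and maps it to yintercept. The medians data frame and its med_mass column are illustrative names:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g))

# One reference value per facet
medians <- penguins |>
  summarise(med_mass = median(body_mass_g), .by = species)

p_facets <- penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.5) +
  # The species column in medians matches each line to its facet
  geom_hline(data = medians,
             aes(yintercept = med_mass),
             color = "tomato", linewidth = 1.5) +
  facet_wrap(vars(species))

# The second layer holds one line per panel
layer_data(p_facets, 2)[, c("yintercept", "PANEL")]
```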

14.8 Ribbons and areas

geom_ribbon() is similar to geom_line(), but it creates a polygon between a ymin and a ymax. geom_area() is a special case of geom_ribbon(), where the ymin is fixed to 0 and y is used instead of ymax.
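
The special-case relationship can be sketched with a small made-up data frame. To keep the comparison direct, the area layer here uses stat = "identity" and position = "identity" rather than its defaults:

```r
library(ggplot2)

# Toy data for illustration only
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))

area <- layer_data(
  ggplot(toy, aes(x = x, y = y)) +
    geom_area(stat = "identity", position = "identity")
)
ribbon <- layer_data(
  ggplot(toy, aes(x = x)) +
    geom_ribbon(aes(ymin = 0, ymax = y))
)

# geom_area() fills in ymin = 0 and uses y as ymax
unique(area$ymin)
all.equal(as.numeric(area$ymax), as.numeric(ribbon$ymax))
```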

A simple example is to subtract from and add to the y variable to create a buffer around geom_line(). You could calculate quantiles to do this, though using geom_smooth() would be better.

penguins_line |> 
  ggplot(aes(x = year,
             y = body_mass_g,
             group = species_sex)) +
  geom_ribbon(aes(ymin = body_mass_g - 50,
                  ymax = body_mass_g + 50),
              fill = "gray80") + 
  geom_line(aes(color = species_sex))

By default, geom_area() uses position = "stack" with a special stat (stat_align()) that aligns the values so that they can be stacked when groups overlap on the x-axis. Thus, with the penguins_line data the areas stack on top of each other: the top of the stack is the sum of the mean body masses, and the height of each area shows the share contributed by each species_sex group. Note the very different scale on the y-axis.

penguins_line |> 
  ggplot(aes(x = year,
             y = body_mass_g,
             fill = species_sex)) + 
  geom_area()

Using position = "identity" places one polygon over another.

penguins_line |> 
  ggplot(aes(x = year,
             y = body_mass_g,
             fill = species_sex)) + 
  geom_area(position = "identity", alpha = 0.2) + 
  geom_line(aes(color = species_sex))

14.9 Positions

All geoms have a position adjustment argument that resolves overlapping geoms. Override the default by using the position argument in the geom_*() or stat_*() function. The argument can be either a position_*() function or the corresponding string without the position_ prefix. See in particular Section 14.3.3 for examples.
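
As a brief sketch of the two forms, the string and the function with default arguments give the same placement, while the function form also exposes arguments such as preserve. The data is filtered as in Section 14.1:

```r
library(ggplot2)
library(dplyr)

penguins <- palmerpenguins::penguins |>
  filter(!is.na(flipper_length_mm), !is.na(sex))

p_base <- ggplot(penguins, aes(x = island, fill = species))

as_string <- layer_data(p_base + geom_bar(position = "dodge"))
as_function <- layer_data(p_base + geom_bar(position = position_dodge()))

# Same bar placement from either form
all.equal(as.numeric(as_string$xmin), as.numeric(as_function$xmin))

# The function form allows further control, here equal bar widths
single <- layer_data(p_base +
  geom_bar(position = position_dodge(preserve = "single")))
unique(round(as.numeric(single$xmax - single$xmin), 6))
```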