8 Aesthetic Mappings
In the previous chapters, you learned that aesthetic mappings in ggplot2 are applied through the use of the aes() function. Most geoms, such as geom_point(), geom_path(), and geom_line(), require a mapping from data values to x-coordinates and y-coordinates. This chapter provides an overview of the most important other types of aesthetic mappings in data visualization:
- Color (Section 8.1)
- Size (Section 8.2)
- Shape and line type (Section 8.3)
- Statistical weights attributed to individual data points (Section 8.4)
- Assignment of data points to groups (Section 8.5)
Afterward, Section 8.6 contrasts aesthetic mappings with geom-specific arguments, which can be used to set visual properties for all data points irrespective of their values. Finally, Section 8.7 demonstrates how to specify aesthetic mappings for all or for individual geoms in a plot.
8.1 Color
Color perception is an integral part of how we experience the world around us. Nature uses colors to communicate attractiveness and warnings. Similarly, astute use of colors in data visualization can help to convey messages to readers and viewers.
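R recognizes 657 colors by name, listed by the colors() function. The first six names are:
head(colors())
[1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
[5] "antiquewhite2" "antiquewhite3"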
Several websites, such as Kyle W. Brown’s “Colors in R”, display the color names alongside the color.
Although it is enjoyable to mix and match one’s own combination of colors, the result is usually haphazard. A more principled approach is based on the use of color models, such as hue-chroma-luminance (HCL), where the name-giving properties are defined as follows (Fairchild, 2013, pp. 88–90):
The hue-chroma-luminance (HCL) color model provides a principled approach to quantifying subjective color perception.
- Hue: The hue is the dominant pure color of an object (after removing any mixture with white, gray, or black), such as red, green, or blue. Hues can be arranged in a circle and measured as an angle with red at 0°, green at 120°, and blue at 240° as shown in Figure 8.1.
- Chroma: Chroma is the absence of white, gray, or black from a color. The higher the chroma, the stronger the perception of vividness. The maximum chroma of a color depends on hue and luminance, as indicated in Figure 8.1.
- Luminance: Luminance is the wavelength-weighted power emitted per unit area of light travelling in a given direction, expressed as a percentage of the intensity of a perfectly white light source.
Although this chapter will not dwell on the parameterization of the HCL model, we will refer to the terms hue, chroma, and luminance when discussing color palettes.
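To make the three terms more tangible, the following minimal sketch uses base R's hcl() function to construct colors from hue, chroma, and luminance values. The specific chroma and luminance values are arbitrary choices for illustration and are not taken from this chapter:

```r
# Three colors that differ only in hue; chroma and luminance are held fixed.
# The values c = 50 and l = 70 are arbitrary illustration values.
hues <- c(0, 120, 240)  # red, green, and blue on the hue circle
same_chroma_luminance <- hcl(h = hues, c = 50, l = 70)
same_chroma_luminance

# One hue with decreasing luminance: a progression from light to dark
light_to_dark <- hcl(h = 240, c = 50, l = c(90, 60, 30))
light_to_dark
```

Both calls return hexadecimal color strings that can be passed to any R function accepting colors.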

The appropriate use of color in data visualization is a complex topic. This section will focus only on the suitability of statistical variables for aesthetic mapping to color. Section 8.1.1 will discuss the choice of colors for categorical data, while Section 8.1.2 will address under which conditions quantitative variables can be meaningfully mapped to colors.
8.1.1 Colors for Categorical Data
While all categorical variables can, in principle, be mapped to colors, suitable palettes differ between non-ordinal and ordinal data. These two cases will be discussed in Section 8.1.1.1 and Section 8.1.1.2, respectively.
8.1.1.1 Unordered Categories
If categories are unordered, the colors used to represent them should be distinct from one another. When an unordered categorical variable belongs to the factor class, ggplot2 selects a distinct color for each level and chooses evenly spaced colors with medium luminance. When provided with a character vector, ggplot2 internally converts the input to a factor.
Color is a suitable visual property for the representation of any unordered categorical variable. In that case, select colors with comparable chroma and luminance but clearly distinct hues.
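The evenly spaced default colors can be inspected without drawing a plot. The sketch below assumes that the scales package is installed (it is a dependency of ggplot2) and uses its hue_pal() function, which generates the palette behind ggplot2's default discrete color scale:

```r
library(scales)

# hue_pal() returns a function that generates n evenly spaced hues
# with fixed chroma and luminance
default_3 <- hue_pal()(3)  # e.g., for the three iris species
default_6 <- hue_pal()(6)  # e.g., for six marital-status categories
default_3
default_6
```

Comparing the two vectors shows that the spacing between hues shrinks as the number of categories grows, which is why distinct hues become harder to tell apart for variables with many levels.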
For instance, the following code draws frequency polygons for the iris data set, with the three categories in the Species column mapped to distinct line colors:
ggplot(iris, aes(Sepal.Width, color = Species)) +
  geom_freqpoly(bins = 20) +
  geom_rug(
    aes(y = 0),
    position = position_jitter(height = 0),
    alpha = 0.5,
    sides = "b"
  ) +
  labs(
    x = "Sepal Width (cm)",
    y = "Count",
    title = "Distinct Hues Used for Line Colors",
    caption = "Source: E. Anderson (1935)"
  )
Similarly, the six different categories of marital status in the General Social Survey are plotted as six evenly spaced colors in the following example:
ggplot(gss_cat, aes(y = partyid, fill = marital)) +
  geom_bar(color = "gray") +
  labs(
    x = "Count",
    y = "Response",
    fill = "Marital status",
    title = "Distinct Hues Used for Fill Colors",
    caption = 'Source: R package "forcats"'
  )
Although ggplot2’s default palette is adequate for exploratory data analysis, the ColorBrewer project provides better choices for production-quality plots. ColorBrewer palettes and their implementation in ggplot2 will be discussed in Section 9.3.
8.1.1.2 Ordered Categories
While colors for unordered categories should have distinct hues, palettes for ordinal data should exhibit an intuitive progression from low to high. By default, ggplot2 uses a palette progressing in hue from blue to yellow and from low to high luminance, keeping the chroma consistently high. An example can be seen in the following plot, visualizing the cut quality of diamonds. This plot illustrates that most diamonds have a medium clarity of SI1 (Slightly Included to the 1st degree). Diamonds of the highest clarity IF (Internally Flawless) in the data set tend to have better cuts than those of lowest clarity I1 (inclusions visible to the naked eye):
When visualizing ordinal data, use a palette that is gradually changing in luminance. Choose a high chroma at least at one of the color scale’s end points.
ggplot(diamonds, aes(y = clarity, fill = cut)) +
  geom_bar(color = "gray") +
  labs(
    x = "Count",
    y = "Clarity",
    fill = "Cut",
    title = "Diamond Quality",
    caption = 'Source: R package "ggplot2"'
  )
Note that ggplot2 chooses this palette because cut is encoded as an ordered factor:
is.ordered(diamonds$cut)
[1] TRUE
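If an ordinal variable is stored as a character vector or as a plain factor, it can be converted with base R's factor() function. The following sketch uses a hypothetical vector of responses, which is not part of any data set in this chapter:

```r
# Hypothetical ordinal responses stored as plain text
responses <- c("low", "high", "medium", "low", "high")

# ordered = TRUE records that low < medium < high
rating <- factor(
  responses,
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

is.ordered(rating)
rating < "high"  # comparisons respect the declared order
```

The order of the levels argument determines the progression, so it should be spelled out explicitly rather than left to alphabetical sorting.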
Not all ordinal categorical variables can be meaningfully represented by color. In particular, color is unsuitable for the representation of a cumulative property of data values, such as their count. The reason is that color perception is independent of the magnitude of the colored object. For example, a red circle does not appear more red when it is larger. Instead, color should be mapped only to variables representing an intrinsic property of each individual data point, such as the cut quality of diamonds in the previous example.
Only use color for ordinal data if each value represents an intrinsic property of each individual data point.
8.1.2 Colors for Quantitative Data
Similar to ordinal data, colors can be used for some but not all quantitative variables. To be suitable for aesthetic mapping, the data must be intensive. That is, the quantities can be reasonably expected to be independent of the number of data points. For other quantitative data, symbol size or line width might be used instead, as discussed in Section 8.2.
A quantitative variable should only be mapped to color if the data are intensive. That is, the expectation value is independent of system size.
To understand the distinction between intensive and non-intensive variables, let us consider the columns of the gss_cat tibble. The income of an individual participant in the General Social Survey is an example of an intensive variable because it does not depend on the number of participants. Similarly, the mean income of all Republican party members is intensive because, assuming that survey participants form a representative sample of the U.S. population, the expectation value of the sample mean is independent of the number of Republican survey participants. Furthermore, the percentage of participants earning less than $1,000 is intensive. In contrast, the number of Republicans and the total income in the participant pool are non-intensive variables because their expectation values increase with the number of participants.
Colors share an important feature with intensive variables: an ensemble of multiple, identically colored objects has the same color as the individual objects. For instance, one green point is just as green as a thousand green points. Therefore, colors can be more intuitively linked to intensive rather than non-intensive quantities. In many applications, non-intensive variables can be converted into intensive variables by normalization (e.g., dividing one additive variable by another).
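The effect of normalization can be checked with a small example in base R. The responses below are hypothetical and serve only to illustrate the scaling behavior:

```r
# Hypothetical party affiliations of six survey participants
party <- c("Rep", "Dem", "Dem", "Ind", "Rep", "Dem")

# A survey twice as large with the same composition
party_doubled <- rep(party, times = 2)

# Counts are non-intensive: they double with the sample size
counts <- table(party)
counts_doubled <- table(party_doubled)

# Proportions are intensive: dividing by the sample size removes the dependence
props <- prop.table(counts)
props_doubled <- prop.table(counts_doubled)

counts
counts_doubled
props
props_doubled
```

The two count tables differ by a factor of two, whereas the two proportion tables are identical.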
For example, let us visualize the relationship between the age of participants in the General Social Survey and the number of hours per day they spent watching television, stored in the tvhours column of gss_cat. As a first step, you may count the number of observations for each combination of age and tvhours:
tv <-
  gss_cat |>
  drop_na(age, tvhours) |>
  count(age, tvhours, name = "count")
tv
# A tibble: 868 × 3
age tvhours count
<int> <int> <int>
1 18 0 7
2 18 1 12
3 18 2 7
4 18 3 10
5 18 4 4
6 18 5 4
7 18 7 1
8 18 13 1
9 19 0 18
10 19 1 32
# ℹ 858 more rows
Your first instinct might be to use color to represent the count variable, with age and tvhours mapped to spatial coordinates. However, count is a non-intensive variable because a survey with twice as many participants would be expected to result in twice as many counts for each group. Therefore, color is unsuitable to represent count, and area should be used instead; for example, you might apply the geom_count() function, which you encountered in a previous chapter, directly to gss_cat instead of tv.
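A minimal sketch of the geom_count() alternative might look as follows. The transparency setting and the labels are illustrative choices rather than recommendations from this chapter:

```r
library(ggplot2)
library(forcats)  # provides the gss_cat tibble

# geom_count() counts the observations at each (tvhours, age) combination
# itself, so the raw survey data are passed instead of the tv summary
p_count <- ggplot(gss_cat, aes(x = tvhours, y = age)) +
  geom_count(alpha = 0.5, na.rm = TRUE) +  # na.rm drops missing values silently
  labs(
    x = "Hours",
    y = "Age (years)",
    size = "Count",
    title = "Hours Spent Watching TV by Age",
    caption = 'Source: R package "forcats"'
  )
p_count
```

Here, the area of each point grows with the number of observations, which matches the non-intensive nature of counts.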
However, a slight change of the represented data renders color a meaningful visual property for representing two-dimensional distributions. For instance, an intensive variable can be generated by dividing two variables that both scale in proportion to the number of data points. Hence, you can divide count by the number of participants in each age group to obtain an intensive variable, namely the percentage of each age group that watches television for a given number of hours:
tv <- mutate(
  tv,
  cohort_size = sum(count),
  pct = (count / cohort_size) * 100,
  .by = age
)
tv
# A tibble: 868 × 5
age tvhours count cohort_size pct
<int> <int> <int> <int> <dbl>
1 18 0 7 46 15.2
2 18 1 12 46 26.1
3 18 2 7 46 15.2
4 18 3 10 46 21.7
5 18 4 4 46 8.70
6 18 5 4 46 8.70
7 18 7 1 46 2.17
8 18 13 1 46 2.17
9 19 0 18 135 13.3
10 19 1 32 135 23.7
# ℹ 858 more rows
Now, the fill aesthetic can be used for the pct column:
ggplot(tv, aes(x = tvhours, y = age, fill = pct)) +
  geom_tile(color = "gray") +
  labs(
    x = "Hours",
    y = "Age (years)",
    fill = "% of age group",
    title = "Hours Spent Watching TV by Age Group",
    caption = 'Source: R package "forcats"'
  )
This plot indicates that the number of hours spent watching television is distributed similarly across all age groups. By default, ggplot2 uses a continuous palette with a gradient in luminance, but keeping hue and chroma nearly constant, for representing quantitative data. While the chosen palette is not ideal, it is sufficient for exploratory data analysis. You will be equipped with better choices when we discuss the ColorBrewer palettes in Section 9.3.
8.1.3 Section Summary: Color
This section has pointed out good practices for using color in data visualization. Color can be used for categorical and quantitative data with several caveats.
If the data are categorical, the type of palette depends on whether the data are unordered or ordered. In the unordered case, each hue should be clearly distinct. If the data are ordinal, color should only be used to represent intrinsic properties of individual observations rather than properties obtained by counting or adding individual values. Furthermore, for ordinal data, a progression of discrete colors, such as a gradient in luminance, should be used.
If the data are quantitative, color should be used only if the variable is intensive. That is, the quantity should not depend on the number of data points.
Although ggplot2’s default palettes are adequate for exploratory data analysis, they are not optimized for user-friendliness. In Section 9.3, you will learn how to change the palettes using ggplot2’s scale_*() functions.
8.2 Size
The size of a plotted object is an effective visual property for mapping quantitative statistical variables that are extensive; that is, the quantity associated with each plotted unit can be regarded as the sum of nonnegative quantities associated with its subunits. Examples of such variables include population or gross domestic product (GDP) by country, which are the sums of the population or GDP of the country’s administrative units, respectively. In contrast, normalized variables, such as population density (expressed as people per square kilometer) or per-capita GDP, are intensive and are better represented using color, as discussed in Section 8.1.2, rather than using size or line width.
Size (e.g., of points and text) and line width can be mapped meaningfully only to extensive variables. That is, the data must be the sum of nonnegative quantities associated with subunits.
This section demonstrates how to map a column of a data frame to the size of point symbols and text (Section 8.2.1), as well as the width of lines (Section 8.2.2). However, please note that ggplot2 typically uses unsuitable default scales for size and line width. You will learn how to correct this issue in Section 9.4.
8.2.1 Sizes of Point Symbols and Text
To map an extensive variable to the sizes of point symbols and text, consider the following country-level data stored in the gapminder tibble from the gapminder package:
data(gapminder, package = "gapminder")
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
Let us create a scatter plot, in which GDP per capita and life expectancy are represented by the x-coordinate and y-coordinate, respectively. Sizes of circular point symbols and country labels will correspond to the population of each country. Please note that population is an extensive variable and, consequently, suitable for being mapped to the size aesthetic. In contrast, continent is an unordered categorical variable and can, thus, be mapped to color, but it should not determine the size of the points. If you attempt to do so, ggplot2 will issue a warning.
Please note that the warning issued by the following code is not caused by the choice of aesthetic mappings. Instead, it stems from a lack of space for the country labels. Let us ignore this problem for now; it could be fixed by abbreviating or suppressing some labels:
ggplot(
  filter(gapminder, year == 2007),
  aes(
    gdpPercap,
    lifeExp,
    color = continent,
    size = pop,
    label = country
  )
) +
  geom_point(alpha = 0.5) +
  ggrepel::geom_text_repel() +
  labs(
    x = "GDP per capita (US$)",
    y = "Life expectancy (years)",
    color = "Continent",
    size = "Population",
    title = "Longevity and Wealth in 2007 by Country",
    caption = "Source: Gapminder Foundation"
  )
Warning: ggrepel: 56 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
8.2.2 Line Width
Similar to the size aesthetic, linewidth can be employed as an aesthetic to represent an extensive variable. For instance, the following code chunk replicates a feature from a famous map by Minard (Rendgen, 2018) depicting Napoleon’s Russian campaign, where the line width indicates the number of surviving soldiers in Napoleon’s army. Additionally, color is used to indicate whether the troops were advancing (A) or retreating (R). For simplicity, longitude and latitude are mapped to x-coordinate and y-coordinate, respectively, which is not ideal; instead, a map projection should be used, as will be discussed in Section 14.3:
data("Minard.troops", package = "HistData")
army_1 <- filter(Minard.troops, group == 1) # Only trace the main army
head(army_1)
long lat survivors direction group
1 24.0 54.9 340000 A 1
2 24.5 55.0 340000 A 1
3 25.5 54.5 340000 A 1
4 26.0 54.7 320000 A 1
5 27.0 54.8 300000 A 1
6 28.0 54.9 280000 A 1
ggplot(army_1, aes(long, lat, linewidth = survivors, color = direction)) +
  geom_path(lineend = "round") +
  labs(
    x = "Longitude (degree)",
    y = "Latitude (degree)",
    color = "Direction",
    linewidth = "Survivors",
    title = "Main French Army During 1812 Russian Campaign"
  )
Tufte (1983) praised Minard’s map as “the best statistical graphic ever drawn.” Despite some inaccuracies in the historical data, Minard’s use of line width is indeed exemplary. Line width can also be effectively used in other scenarios, such as representing traffic volume on a road map or depicting trading volume in time series plots of stock market prices. Please note that, similar to size, linewidth should not be applied to categorical variables, and ggplot2 will issue a warning if you attempt to do so.
8.2.3 Section Summary: Size
In this section, you learned how to map extensive variables to the size of point symbols and text, as well as the width of lines. However, ggplot2’s default scales for size and line width are typically unsuitable except for exploratory data analysis. You will learn in Section 9.4 how to scale area and line width correctly in proportion to the numbers they represent. It is also worth noting that not all positive variables increasing with system size are extensive. Instead, they may increase with the square or square root of the system size or in an even more complex manner. In such cases, it is advisable to normalize the variable, for example, by dividing it by the expectation value of a statistical model for an object with comparable attributes. The resulting normalized variable is intensive; hence, color can be employed as an aesthetic.
When dealing with quantitative variables that are neither intensive nor extensive, consider normalizing them by dividing them by their expected value. The result is an intensive variable, suitable for representation using color.
8.3 Shape and Line Type
Data points can be represented by symbols of different shapes (e.g., circles, squares, and crosses) and connected by lines of different types (e.g., solid, dashed, and dotted). Shape and line type are discrete and, hence, unsuitable as a representation of quantitative data, unlike color, size, and line width, which can change continuously. Additionally, neither shape nor line type can be meaningfully sorted and, consequently, should not be used for ordinal data either.1 However, as demonstrated in this chapter, shape and line type are useful for representing unordered categorical data, either alone or in combination with color.
Shape and line type can be mapped meaningfully only to unordered categorical variables.
8.3.1 Shapes of Point Symbols
For the use of shape as an aesthetic, consider the crown_rad data frame from the likelihood package. It contains fictitious data for diameter at breast height (DBH) and crown radii for three tree species:
data(crown_rad, package = "likelihood")
glimpse(crown_rad)
Rows: 99
Columns: 3
$ DBH <dbl> 1.2, 29.9, 28.9, 6.5, 15.4, 34.1, 7.0, 1.0, 36.4, 18.0, 11.0, …
$ Species <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Radius <dbl> 0.70, 3.78, 2.92, 1.10, 1.90, 3.57, 1.58, 0.20, 3.70, 2.70, 2.…
Trying to map Species to shape throws an error because, while Species is a categorical variable, it is implemented as an integer:
gg_crown_rad <- function(data, mapping) {
  library(ggplot2)
  ggplot(data, mapping) +
    geom_point() +
    labs(
      x = "Diameter at breast height (cm)",
      y = "Crown radius (m)",
      shape = "Species",
      title = "Measurements of Three Tree Species",
      caption = 'Source: R package "likelihood"'
    )
}
gg_crown_rad(crown_rad, aes(DBH, Radius, shape = Species))
Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `scale_f()`:
! A continuous variable cannot be mapped to the shape aesthetic.
ℹ Choose a different aesthetic or use `scale_shape_binned()`.
To fix this issue, Species should be converted to a factor:
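The conversion can be sketched as follows, assuming that the dplyr, ggplot2, and likelihood packages are installed. The name crown_rad_fct is a hypothetical choice for the converted data frame:

```r
library(dplyr)
library(ggplot2)

data(crown_rad, package = "likelihood")

# Store Species as a factor so that ggplot2 treats it as categorical
crown_rad_fct <- mutate(crown_rad, Species = factor(Species))

p_shape <- ggplot(crown_rad_fct, aes(DBH, Radius, shape = Species)) +
  geom_point() +
  labs(
    x = "Diameter at breast height (cm)",
    y = "Crown radius (m)",
    shape = "Species",
    title = "Measurements of Three Tree Species",
    caption = 'Source: R package "likelihood"'
  )
p_shape
```

With Species stored as a factor, each level is assigned its own point symbol and the error no longer occurs.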
Although it is possible, in principle, to extract information about the species from the shape of each point, detecting overall differences between the species solely based on the symbol shapes is challenging. Therefore, when shape represents a categorical variable, it is often beneficial for readability to map the categories additionally to color. Although mapping one variable redundantly to two aesthetics should usually be avoided, shape and color generally work well together. Differences in color are easier to spot, but differences in shape can assist colorblind readers and remain visible when the plot is printed in black and white:
Mapping the same categorical variable to color and shape can often improve readability.
gg_crown_rad(
  mutate(crown_rad, Species = factor(Species)),
  aes(DBH, Radius, color = Species, shape = Species)
)
This plot demonstrates that ggplot2 combines the information about color and shape into a single legend rather than two separate legends, one for each aesthetic mapping, if the same categorical variable is mapped to both.
8.3.2 Line Type
Line type refers to the visual pattern of a line, such as being solid, dashed, or dotted. Similar to the shape aesthetic for point symbols, the line-type aesthetic can be employed to distinguish between unordered categories, but it should not be used for ordinal or quantitative data. In the following plot, which displays the percentage of Christians in the General Social Survey identifying as Protestant, Catholic, or followers of another faith over time, different line types are used to represent the faiths of the participants:
gss_christian <-
  gss_cat |>
  filter(
    relig %in% c(
      "Inter-nondenominational",
      "Christian",
      "Orthodox-christian",
      "Catholic",
      "Protestant"
    )
  ) |>
  mutate(relig = fct_other(relig, keep = c("Protestant", "Catholic"))) |>
  count(year, relig, name = "count") |>
  mutate(
    pct = (count / sum(count)) * 100,
    .by = year
  )
gg_christian_aes <- function(...) {
  library(ggplot2)
  ggplot(
    gss_christian,
    aes(year, .data$pct, linetype = .data$relig, ...)
  ) +
    geom_point() +
    geom_line() +
    labs(
      x = "Year",
      y = "% of Christians",
      linetype = "Responses",
      title = "Christian Faiths in the GSS",
      caption = 'R package "forcats"'
    )
}
gg_christian_aes()
Similar to the shape aesthetic, incorporating color as a redundant aesthetic in conjunction with the line type can enhance readability. Additionally, directly labeling the lines facilitates matching the lines with the corresponding categories, rendering the legend unnecessary. Consequently, the following code chunk removes the legend using the guides() function, as will be explained in Section 10.3:
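One possible way to implement these ideas is sketched below. It is not necessarily the approach used for the original figure: the data preparation is repeated so that the chunk runs on its own, and the label placement (the last survey year of each line, stored in the hypothetical line_labels tibble) is an assumption made for illustration:

```r
library(dplyr)
library(forcats)
library(ggplot2)

christian <- c(
  "Inter-nondenominational", "Christian", "Orthodox-christian",
  "Catholic", "Protestant"
)

# Same summary as in the previous code chunk
gss_christian <-
  gss_cat |>
  filter(relig %in% christian) |>
  mutate(relig = fct_other(relig, keep = c("Protestant", "Catholic"))) |>
  count(year, relig, name = "count") |>
  mutate(pct = (count / sum(count)) * 100, .by = year)

# Assumed label positions: the last survey year of each faith
line_labels <- slice_max(gss_christian, year, by = relig)

p_labeled <-
  ggplot(gss_christian, aes(year, pct, color = relig, linetype = relig)) +
  geom_point() +
  geom_line() +
  # Direct labels to the right of each line's end point
  geom_text(aes(label = relig), data = line_labels, hjust = 0, nudge_x = 0.3) +
  # Extra room on the right so that the labels are not clipped
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.25))) +
  # Direct labels make the legends redundant
  guides(color = "none", linetype = "none") +
  labs(
    x = "Year",
    y = "% of Christians",
    title = "Christian Faiths in the GSS",
    caption = 'R package "forcats"'
  )
p_labeled
```

Packages such as ggrepel offer more flexible label placement if the line ends lie close together.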
As for shape, it is also advisable to map line type additionally to color for improved readability.
8.3.3 Section Summary: Shape and Line Type
This section demonstrated how point shape and line type can be used for mapping categorical data. While theoretically redundant, mapping the same categorical variable also to color can improve readability in practice. In this case, ggplot2 creates a combined legend for all non-spatial aesthetic mappings used for the same variable. If space permits it, direct labeling of points or lines may eliminate the need for a legend altogether.
8.4 Weighted Data
So far, all aesthetic mappings have corresponded to visual properties of individual geometric elements, such as position, color, and shape. In contrast, the weight aesthetic is a special case that affects the plot only through the statistical model rather than immediately visible properties. In this context, weight is a measure for the relative importance of each observation for the model output.
The most common use case for the weight aesthetic is curve fitting using the geom_smooth() function. For instance, in the following plot, each point represents a country, and two LOESS curves are used for modeling the dependence of life expectancy on GDP per capita. The blue curve is fitted by calling geom_smooth() without a weight aesthetic; as a result, the model weighs all countries equally. In contrast, the black curve weighs countries by their population. Populous countries, such as China, India, and the United States, therefore exert more influence on the black curve, such that it passes nearer to these countries than the unweighted blue curve.
The weight aesthetic assigns relative importance to individual observations when calculating summary statistics. It is often used when calling geom_smooth() to fit a trend curve.
ggplot(filter(gapminder::gapminder, year == 2007), aes(gdpPercap, lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  ggrepel::geom_text_repel(
    aes(
      # Label only the five most populous countries
      label = if_else(min_rank(-pop) <= 5, country, ""),
      color = continent,
      size = pop
    ),
    max.overlaps = Inf
  ) +
  labs(
    x = "GDP per capita (US$)",
    y = "Life expectancy (years)",
    color = "Continent",
    size = "Population",
    title = "Longevity and Wealth in 2007 by Country",
    caption = "Source: Gapminder Foundation"
  ) +
  # Unweighted fit: every country counts equally (default blue curve)
  geom_smooth(method = "loess") +
  # Weighted fit: countries count in proportion to their population
  geom_smooth(aes(weight = pop), color = "black", method = "loess")
It depends on the application whether weighting is sensible. For instance, if the objective is to predict the life expectancy of a random country not included in the data, an unweighted approach is appropriate. In contrast, if the objective is to model life expectancy of a random person on the planet, a weighted fit makes more sense.
Statistical variables for weighting should be extensive. That is, if a unit represented by one data point is replaced with multiple data points representing its sub-units, the sum of the weights should remain the same. In the example above, a country’s population is extensive because splitting a country into smaller regions (e.g., administrative divisions) does not change the total population, although the population in each region is smaller.
Weighting is suitable only for extensive variables.
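The consequence of this requirement can be verified with a weighted mean, the simplest weighted summary statistic. The two countries and their values below are hypothetical:

```r
# Hypothetical countries: life expectancy (years) and population (millions)
life_exp <- c(A = 70, B = 80)
pop <- c(A = 10, B = 30)

# Split country B into two regions of 15 million people each,
# both with the same life expectancy as the whole country
life_exp_split <- c(A = 70, B1 = 80, B2 = 80)
pop_split <- c(A = 10, B1 = 15, B2 = 15)

# The population-weighted mean is unaffected by the split
weighted_whole <- weighted.mean(life_exp, w = pop)              # 77.5
weighted_split <- weighted.mean(life_exp_split, w = pop_split)  # 77.5

# The unweighted mean shifts because B now contributes two data points
unweighted_whole <- mean(life_exp)        # 75
unweighted_split <- mean(life_exp_split)  # approximately 76.67
```

Because the weights of the two regions add up to the weight of the original country, the weighted result does not depend on how finely the units are subdivided.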
8.5 Grouped Data
Data are automatically grouped when a categorical variable is mapped using the aes() function. For instance, in the following code, the first call to gg_iris_aes() creates a single frequency polygon. However, when color = Species is added in the subsequent call, this assignment is passed to the aes() function and, consequently, the data are split into separate groups such that three different frequency polygons (one for each species) are generated:
The aesthetic mapping of any categorical data splits the data into groups, causing the display of separate statistical summaries for each group.
gg_iris_aes <- function(...) {
  library(ggplot2)
  ggplot(iris, aes(...)) +
    geom_freqpoly(bins = 20) +
    labs(
      x = "Sepal Length (cm)",
      y = "Count",
      caption = "Source: E. Anderson (1935)"
    )
}
gg_iris_aes(Sepal.Length) +
  labs(title = "Ungrouped")
gg_iris_aes(Sepal.Length, color = Species) +
  labs(title = "Grouped by Color Aesthetic")
The code chunk below illustrates that the linetype aesthetic can achieve the same grouping effect as the color aesthetic. However, there are instances where grouping is desired without changing visible aesthetics like color or line type. In these cases, the group aesthetic is particularly useful. Unlike color or linetype, the group aesthetic does not create a visible distinction between the groups; consequently, no legend is generated:
The group aesthetic allows grouping data without changing the group’s visual properties.
gg_iris_aes(Sepal.Length, linetype = Species) +
  labs(title = "Grouped by Line Type Aesthetic")
gg_iris_aes(Sepal.Length, group = Species) +
  labs(title = "Grouped by Group Aesthetic")
In this example, the lack of any visible difference between the frequency polygons produces a confusing plot. As a result, you might question the utility of the group aesthetic, especially when alternatives like color or linetype can achieve a more interpretable result. Let us consider another example to demonstrate the group aesthetic’s usefulness in specific situations.
R’s built-in data frame ChickWeight represents a longitudinal study of the impact of different diets on chick body weights. The data frame contains information about the body weights of 50 different chicks, measured every second day from birth until the 20th day, with an additional measurement on day 21. Chicks were divided into four groups, and each group was fed a different protein diet to study the diets’ effects on growth:
head(ChickWeight)
Grouped Data: weight ~ Time | Chick
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
When plotting the data using geom_line(), the outcome is significantly influenced by whether ggplot() knows which observations belong to the same chick. In the following output, the left plot, which does not group the data, displays a sawtooth pattern because all points are connected in order of age, regardless of the chick they belong to. In contrast, the right plot, which includes aes(group = Chick), presents a separate line for each chick, offering a clearer view of individual growth patterns. Such visualizations, known as “spaghetti plots” due to their resemblance to tangled strands of pasta, can be effective for visualizing individual trajectories while still communicating the overall trend. Note that, in the code below, weight refers to a specific column in ChickWeight, not to be confused with ggplot2’s weight aesthetic:
gg_chick_aes <- function(...) {
  library(ggplot2)
  ggplot(ChickWeight, aes(...)) +
    geom_line(alpha = 0.2) +
    geom_point(alpha = 0.2) +
    labs(
      caption = "Source: R data set 'ChickWeight'",
      x = "Time (days)",
      y = "Weight (g)"
    )
}
gg_chick_aes(Time, weight) + labs(title = "Ungrouped")
gg_chick_aes(Time, weight, group = Chick) + labs(title = "Grouped by Chick")
In general, applying a group aesthetic to geom_line(), geom_path(), or geom_smooth() generates a separate polyline (i.e., a sequence of connected line segments) for each group. Opting for group instead of visible aesthetics, such as color or linetype, is advantageous when a plot’s legend would become overcrowded with group labels.
Applying group to line or path geoms creates separate lines for each group.
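The effect of the group aesthetic can also be inspected numerically. The following sketch uses ggplot2's layer_data() function, which returns the data that a layer actually draws, including the computed group column. The helper n_groups() is a hypothetical name introduced only for this illustration:

```r
library(ggplot2)

p_ungrouped <- ggplot(ChickWeight, aes(Time, weight)) +
  geom_line()
p_grouped <- ggplot(ChickWeight, aes(Time, weight, group = Chick)) +
  geom_line()

# Hypothetical helper: number of distinct groups in the first layer
n_groups <- function(plot) {
  length(unique(layer_data(plot)$group))
}

n_groups(p_ungrouped)  # all observations form a single group
n_groups(p_grouped)    # one group per chick
```

Each distinct value in the group column corresponds to one polyline in the rendered plot.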
8.6 Aesthetic Mappings versus Geom-Specific Arguments
Many geoms can accept equally named arguments either inside or outside the aes() function. If the argument is inside aes(), it is treated as an aesthetic mapping, determining the appearance of each constituent part of the geom based on its corresponding data value. However, if the argument is outside aes(), it serves as a geom-specific argument, affecting the appearance of all constituent parts in the same manner, independently of the data.
If a visual property is an argument of aes(), its appearance is determined by the data value. If, instead, it is an argument of the geom, its appearance is uniform across all data values.
For instance, consider a box plot of sepal length for each species in the iris data set. The default fill color of the boxes is white, as the following example shows:
sepal_labs <- function(...) {
  ggplot2::labs(
    x = "Sepal Length (cm)",
    ...,
    caption = "Source: E. Anderson (1935)"
  )
}
ggplot(iris, aes(Sepal.Length, Species)) +
  geom_boxplot() +
  sepal_labs(title = "Default Fill Color")
If you want to change the fill color of all boxes to gray, use the fill = "gray"
argument inside geom_boxplot()
without wrapping it inside aes()
:
ggplot(iris, aes(Sepal.Length, Species)) +
geom_boxplot(fill = "gray") +
sepal_labs(title = "Gray Fill Color")
In contrast, when passing the aes(fill = Species)
argument to geom_boxplot()
, each box will be filled with a different color for each species, and a fill legend is automatically generated:
ggplot(iris, aes(Sepal.Length, Species)) +
  geom_boxplot(aes(fill = Species)) +
  sepal_labs(title = "Mapping Fill Color to Species")
In this plot, the labels on the y-axis already effectively function as a fill legend. Hence, it would be advisable to reduce clutter by removing the added fill legend:
ggplot(iris, aes(Sepal.Length, Species)) +
  geom_boxplot(aes(fill = Species)) +
  sepal_labs(title = "Fill Legend Removed") +
  guides(fill = "none") # Remove fill legend
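As a possible alternative to guides(), the legend can also be suppressed in the layer itself through the show.legend argument that geoms accept. The sketch below assumes this layer-level approach is acceptable for the plot at hand; the name p_no_legend is illustrative only:

```r
library(ggplot2)

# show.legend = FALSE keeps this particular layer out of all legends,
# whereas guides(fill = "none") drops the fill legend for the whole plot.
p_no_legend <- ggplot(iris, aes(Sepal.Length, Species)) +
  geom_boxplot(aes(fill = Species), show.legend = FALSE) +
  labs(
    x = "Sepal Length (cm)",
    title = "Fill Legend Suppressed in the Layer"
  )
p_no_legend
```

With a single layer, both approaches should give the same picture. They differ once several layers contribute to the same legend.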
The principle showcased here applies to other geoms and arguments, such as color for the color of the exterior border, alpha for transparency, and size for the size of points and text. If the argument is wrapped inside aes(), it is treated as an aesthetic mapping from data values to visual properties. Otherwise, the argument determines the appearance of all geometric objects created by the geom.
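The same distinction can be sketched for points. The example below assumes the built-in mtcars data set, which is not part of this section; the object names are illustrative. It also shows a common pitfall: a constant placed inside aes() is treated as a data value rather than as a color name.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(wt, mpg))

# Setting: constant values outside aes() apply to every point.
p_set <- base + geom_point(color = "blue", alpha = 0.6, size = 3)

# Mapping: inside aes(), color varies with the number of cylinders.
p_map <- base + geom_point(aes(color = factor(cyl)), size = 3)

# Pitfall: inside aes(), "blue" is a data value (a single category),
# so the point color is taken from the default discrete color scale.
p_pitfall <- base + geom_point(aes(color = "blue"))

unique(layer_data(p_set)$colour)          # the constant that was set
length(unique(layer_data(p_map)$colour))  # one color per cylinder count
unique(layer_data(p_pitfall)$colour)      # a scale color, not "blue"
```

The pitfall plot also gains a legend with the single entry "blue", which is usually a sign that a constant was placed inside aes() by mistake.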
8.7 Aesthetic Mappings for All and for Individual Geoms
In most of the previous examples, the aes() function was used as an argument inside the ggplot() function. However, it is also possible to specify the mappings inside any geom_*() function. To demonstrate the difference, consider the following example:
dfr <- tribble(
  ~x, ~y, ~class, ~word,
  0, -1, "i", "1. This",
  2, 2, "i", "2. That",
  1, -2, "ii", "3. Other",
  3, 0, "ii", "4. Same",
  1, 1, "iii", "5. Different"
)
Let us create a scatter plot with points and text labels, mapping class to color. If the color = class argument appears in the ggplot() function, all subsequent geoms will map class to color. Consequently, both points and text are colored in this example:
Aesthetic mappings specified in the ggplot() function apply to all geoms in the plot.
ggplot(dfr, aes(x, y, color = class, label = word)) +
  geom_point() +
  geom_text(nudge_y = 0.25) +
  xlim(-1, 4) +
  ylim(-2.5, 2.5)
However, if color = class is moved to an aesthetic mapping inside the geom_point() function, only the points will be colored but not the text. Conversely, if color = class is moved to geom_text(), only the text will be colored but not the points:
Aesthetic mappings specified in the geom_*() functions apply only to the respective geoms.
ggplot(dfr, aes(x, y, label = word)) +
  geom_point(aes(color = class)) + # Color mapping in geom_point()
  geom_text(nudge_y = 0.25) +
  xlim(-1, 4) +
  ylim(-2.5, 2.5)

ggplot(dfr, aes(x, y, label = word)) +
  geom_point() +
  geom_text(
    aes(color = class), # Color mapping in geom_text()
    nudge_y = 0.25
  ) +
  xlim(-1, 4) +
  ylim(-2.5, 2.5)
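One way to see where a mapping lives is to inspect the plot object. The sketch below uses a small hypothetical data frame, pts, in place of dfr so that it runs on its own. It relies on the mapping and layers components of a ggplot object, which are implementation details rather than a stable interface, so treat it as a diagnostic aid only:

```r
library(ggplot2)

# Hypothetical stand-in for the dfr data used in the text.
pts <- data.frame(
  x = c(0, 2, 1),
  y = c(-1, 2, 1),
  class = c("i", "i", "ii")
)

p_plot_level <- ggplot(pts, aes(x, y, color = class)) + geom_point()
p_geom_level <- ggplot(pts, aes(x, y)) + geom_point(aes(color = class))

# aes() stores the color aesthetic under the name "colour".
names(p_plot_level$mapping)              # plot level: includes "colour"
names(p_plot_level$layers[[1]]$mapping)  # layer level: no "colour"
names(p_geom_level$mapping)              # plot level: no "colour"
names(p_geom_level$layers[[1]]$mapping)  # layer level: includes "colour"
```

Plot-level mappings are inherited by every layer unless a layer overrides them, whereas layer-level mappings stay with the layer in which they are declared.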
Although the changes in the previous example appear subtle, moving an aesthetic mapping from the plot level to the geom level can substantially change a plot. For example, consider the following two scatter plots of Anderson’s iris data in which the points are jittered, color is mapped to species, and geom_smooth() is called. In the first plot, color = Species is specified in geom_jitter() but not in geom_smooth(), and the result is a single trend curve. In the second plot, color is mapped to Species inside ggplot(); thus, this mapping applies to both geom_jitter() and geom_smooth(), causing ggplot2 to draw three differently colored trend curves, one for each species:
iris_labs <- function(iris_title) {
  ggplot2::labs(
    x = "Sepal Width (cm)",
    y = "Petal Width (cm)",
    title = iris_title,
    caption = "Source: E. Anderson (1935)"
  )
}

ggplot(iris, aes(Sepal.Width, Petal.Width)) +
  geom_jitter(
    aes(color = Species), # Color mapping in geom_jitter()
    alpha = 0.5,
    size = 0.5
  ) +
  geom_smooth() +
  iris_labs("Color Mapped Only to Jitter Geom")

ggplot(
  iris,
  aes(Sepal.Width, Petal.Width, color = Species) # Color mapping in ggplot()
) +
  geom_jitter(alpha = 0.5, size = 0.5) +
  geom_smooth() +
  iris_labs("Color Mapped to All Geoms")
These two plots exemplify a general rule that applies to geoms that summarize data, such as the trend curves computed by geom_smooth(). If an aesthetic mapping to a categorical variable is in effect for such a geom, regardless of whether it is specified inside ggplot() or inside the geom_*() function itself, the geom produces a separate summary for each level of the categorical variable. If separate summaries are not wanted, the mapping should be moved from the ggplot() function to the non-summarizing geoms, such as geom_jitter().
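If the plot-level mapping must stay in ggplot(), for example because several other layers rely on it, a layer can also opt out of an inherited aesthetic by setting it to NULL inside its own aes(). The following is a minimal sketch of that alternative. The method = "lm" and formula arguments are assumptions made here to keep the example quiet and deterministic; the plots in the text use the default smoother.

```r
library(ggplot2)

p_single_trend <- ggplot(
  iris,
  aes(Sepal.Width, Petal.Width, color = Species)
) +
  geom_jitter(alpha = 0.5, size = 0.5) +
  # color = NULL removes the inherited color mapping for this layer only,
  # so the smoother no longer splits the data by species.
  geom_smooth(aes(color = NULL), method = "lm", formula = y ~ x)

# Layer 2 is the smoother; a single group id means a single trend curve.
n_trends <- length(unique(layer_data(p_single_trend, 2)$group))
n_trends
```

The points remain colored by species, while the smoother summarizes all observations together.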
8.8 Conclusion
This chapter explored various types of aesthetic mappings in ggplot2. You learned that categorical data can be represented using color, shape, and line type. If the categories are unordered and a color aesthetic is applied, hues should be clearly distinct, whereas, for ordinal data, there should be a natural progression in the colors from one level to the next (e.g., from light to dark). When visualizing quantitative data, it is important to determine whether the variable is intensive, extensive, or neither. Intensive variables should be mapped to color, whereas extensive variables should be mapped to size or line width. For all other quantitative variables, it is best to normalize them. Normalization can be achieved by taking the ratio of the observed value to the expected value of a statistical model, using comparable attributes as model input.
Unlike most other aesthetics, weight and group are not directly mapped to a visual property. Instead, weight determines the importance of data points when plotting statistical summaries, such as trend curves. The group aesthetic assigns observations to groups, which is useful when you want geom_line() to connect data belonging to the same group without drawing lines between points in different groups.
In addition to introducing different types of aesthetic mappings, this chapter explained the difference between aesthetic mappings and function arguments passed to geoms outside the aes() function. It also discussed the effect of aesthetic mappings passed to the ggplot() function versus those passed to individual geoms.
Exceptions to this rule exist. For example, if all symbols used in a plot are regular polygons, then they can be sorted by the number of sides. Another example is the star-shaped symbols in sunflower plots (see Section 6.6.1), which can be sorted by the number of arms.