-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathggplot2-notes.Rnw
376 lines (308 loc) · 15.5 KB
/
ggplot2-notes.Rnw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
% Sean Anderson, 2012-14, sean "at" seananderson.ca
\documentclass[11pt]{article}
\usepackage{geometry}
\geometry{letterpaper}
\usepackage{graphicx}
\textheight 21.0cm
\usepackage{url}
\usepackage{fancyvrb}
\fvset{fontsize=\normalsize}
\usepackage{booktabs}
\usepackage{titling}
\newcommand{\subtitle}[1]{%
\posttitle{%
\par\end{center}
\begin{center}\large#1\end{center}}%
}
\let\proglang=\textsf
\let\pkg=\textbf
\newcommand{\ggtwo}{\pkg{ggplot2}}
\newcommand{\ggplot}{\texttt{ggplot()}}
\newcommand{\qplot}{\texttt{qplot()}}
\title{An introduction to \ggtwo}
\subtitle{FISH 544: Beautiful Graphics in \proglang{R}}
\author{Sean C. Anderson\\ \texttt{[email protected]}}
\setlength\parskip{0.1in}
\setlength\parindent{0in}
\begin{document}
<<set-knitr-options, echo=FALSE>>=
library(knitr)
opts_chunk$set(fig.align='center', fig.pos="htpb", cache=TRUE, echo=TRUE,
message=FALSE, autodep=TRUE, fig.path='figure/', fig.width=5, par=TRUE)
opts_chunk$set(warning=FALSE, message=FALSE, tidy=FALSE, refresh=TRUE)
opts_chunk$set(dev = 'pdf')
thm <- knit_theme$get("solarized-light.css")
knit_theme$set(thm)
@
\maketitle
\section{\ggtwo\ for rapid data exploration}
\ggtwo\ is an \proglang{R} package by Hadley Wickham that implements Wilkinson's
Grammar of Graphics.\footnote{Wilkinson, L. (2005). \textit{The Grammar of
Graphics}. Springer, 2\textsuperscript{2nd} edition.} The emphasis of \ggtwo\ is
on rapid exploration of data, and especially high-dimensional data. Think of
base graphic functions as drawing with data\footnote{Examples of base graphic
functions are \texttt{plot()}, \texttt{points()}, and \texttt{lines()}.}. With
base graphics, you have complete control over every pixel in a plot but it can
take a lot of time and code to produce a plot.
Although \ggtwo\ can be fully customized, it reaches a point of diminishing
returns. I tend to use \ggtwo\ and base graphics for what they excel at: \ggtwo\
for rapid data exploration and base graphics for polished and fully-customized
plots for publication (Figure~\ref{fig:gg-time}).
\begin{figure}[htb]
\centering
\includegraphics[height=2.15in]{figure/gg-dimensions.pdf}
\includegraphics[height=2.15in]{figure/gg-vs-base-pub.pdf}
\caption{Creation time vs.\ data dimensions and customization level for base
graphics (blue) and \ggtwo\ (red). (Left panel) It's remarkably easy to plot
high-dimensional data in \ggtwo\ with, for example, colours, sizes, shapes,
and panels. (Right panel) \ggtwo\ excels at rapid visual exploration of
data, but has some limitations in how it can be customized. Base graphics
are fully customizable but can take longer to set up. I try and exploit the
grey shaded areas: I use \ggtwo\ for data exploration and once I've decided
on a small number of key plots, I'll use base graphics to make
fully-customized plots if needed.}
\label{fig:gg-time}
\end{figure}
Good graphical displays of data require rapid iteration and lots of exploration.
If it takes you hours to code a plot in base graphics, you're unlikely to throw
it out and explore other ways of visualizing the data, and you're unlikely to
explore all the dimensions of the data.
\section{\qplot\ vs.\ \ggplot}
There are two main plotting functions in \ggtwo: \qplot\ and \ggplot. \qplot\ is
short for ``quick plot'' and is made to mimic the format of \texttt{plot()} from
base \proglang{R}. \qplot\ requires less syntax for many common tasks, but has
limitations --- it's essentially a wrapper for \ggplot. The \ggplot\ function
itself isn't complicated and will work in all cases. I prefer to work with just
the \ggplot\ syntax and will focus on it here. I find it easier to master one
function that can do everything.
\section{Basics of the grammar}
Let's look at some illustrative \ggtwo\ code:
<<echo=TRUE, fig.height=3.5, fig.width=7, out.width="4.5in">>=
library(ggplot2)
theme_set(theme_bw()) # use the black and white theme throughout
# fake data:
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
group2 = gl(2, 8))
head(d)
ggplot(data = d) + geom_point(aes(x, y, colour = group1)) +
facet_grid(~group2)
@
The basic format in this example is:
\begin{enumerate}
\item \texttt{ggplot()}: start a \ggplot\ object and specify the data
\item \texttt{geom\_point()}: we want a scatter plot; this is called a
``geom''
\item \texttt{aes()}: specifies the ``aesthetic'' elements; a legend is
automatically created
\item \texttt{facet\_grid()}: specifies the ``faceting'' or panel layout
\end{enumerate}
There are also statistics, scales, and annotation options, among others. At a
minimum, you must specify the data, some aesthetics, and a geom. I will
elaborate on these below. Yes, \ggtwo\ combines elements with \texttt{+}
symbols! This may seem non-standard, although it has the advantage of allowing
\ggtwo\ plots to be proper \proglang{R} objects, which can modified, inspected,
and re-used (I provide some examples at the end).
\clearpage
\section{Geoms}
\texttt{geom} refers to a geometric object. It determines the ``shape'' of the
plot elements. Some common geoms:
\begin{tabular}{llll}
\toprule
\texttt{geom} & description \\
\midrule
\texttt{geom\_point()} & Points\\
\texttt{geom\_line()} & Lines\\
\texttt{geom\_ribbon()} & Ribbons, y range with continuous x values\\
\texttt{geom\_polygon()} & Polygon, a filled path\\
\texttt{geom\_pointrange()} & Vertical line with a point in the middle\\
\texttt{geom\_linerange()} & An interval represented by a vertical line\\
\texttt{geom\_path()} & Connect observations in original order\\
\texttt{geom\_histogram()} & Histograms \\
\texttt{geom\_text()} & Text annotations\\
\texttt{geom\_violin()} & Violin plot (another name for a beanplot)\\
\texttt{geom\_map()} & Polygons from a map\\
\bottomrule
\end{tabular}
\section{Aesthetics}
Aesthetics refer to the attributes of the data you want to display. They map the
data to an attribute (such as the size or shape of a symbol) and generate an
appropriate legend. Aesthetics are specified with the \texttt{aes()} function.
As an example, the aesthetics available for \texttt{geom\_point()} are:
\texttt{x}, \texttt{y}, \texttt{alpha}, \texttt{colour}, \texttt{fill},
\texttt{shape}, and \texttt{size}.\footnote{Note that \ggtwo\ tries to
accommodate the user who's never ``suffered'' through base graphics before by
using intuitive arguments like \texttt{colour}, \texttt{size}, and
\texttt{linetype}, but \ggtwo\ will also accept arguments such as \texttt{col},
\texttt{cex}, and \texttt{lty}.} Read the help files to see the aesthetic
options for the geom you're using. They're generally self explanatory.
Aesthetics can be specified within the data function or within a geom. If
they're specified within the data function then they apply to all geoms you
specify.
Note the important difference between specifying characteristics like colour and
shape inside or outside the \texttt{aes()} function: those inside the
\texttt{aes()} function are assigned the colour or shape automatically based on
the data. If characteristics like colour or shape are defined outside the
\texttt{aes()} function, then the characteristic is not mapped to data. Here's
an example:
\clearpage
<<echo=TRUE, fig.height=3, fig.width=5, out.width="3.5in">>=
ggplot(mpg, aes(cty, hwy)) + geom_point(aes(colour = class))
@
<<echo=TRUE, fig.height=3, fig.width=4, out.width="2.75in">>=
ggplot(mpg, aes(cty, hwy)) + geom_point(colour = "red")
@
\section{Small multiples}
In \ggtwo\ parlance, small multiples are referred to as ``facets''. There are
two kinds: \texttt{facet\_wrap()} and \texttt{facet\_grid()}.
\texttt{facet\_wrap()} plots the panels in the order of the factor levels. When
it gets to the end of a column it wraps to the next column. You can specify the
number of columns and rows with \texttt{nrow} and \texttt{ncol}.
\texttt{facet\_grid()} lays out the panels in a grid with an explicit x and y
position. By default all x and y axes will be shared among panels. However, you
could, for example, allow the y axes to vary with \texttt{facet\_wrap(scales =
"free\_y")} or allow all axes to vary with \texttt{facet\_wrap(scales =
"free")}.
\clearpage
<<echo=TRUE, fig.height=4.5, fig.width=5, out.width="3.5in">>=
ggplot(mpg, aes(cty, hwy)) + geom_point() + facet_wrap(~class)
@
<<echo=TRUE, fig.height=3, fig.width=8, out.width="5in">>=
ggplot(mpg, aes(cty, hwy)) + geom_point() + facet_grid(year~class)
@
\section{Themes}
A useful theme built into \ggtwo\ is \texttt{theme\_bw()}. You'll notice that I
set it as the default in this document back when I first loaded \pkg{ggplot2}.
You can specify it for a specific plot like this:
\clearpage
<<echo=TRUE, fig.height=3.5, fig.width=4, out.width="2.75in">>=
dsamp <- diamonds[sample(nrow(diamonds), 1000), ]
ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_bw()
@
Alternatively, the following plot shows the default theme. The grey is designed
to match the typical darkness of a page filled with text to keep the plot from
drawing too much attention.
<<echo=TRUE, fig.height=3.5, fig.width=4, out.width="2.75in">>=
ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_grey()
@
A powerful aspect of \ggtwo\ is that you can write your own themes. See the
\pkg{ggthemes} package for some examples.
An Edward Tufte-like theme:
<<echo=TRUE, fig.height=2.5, fig.width=3.0, out.width="3.00in">>=
library("ggthemes")
ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_rangeframe() +
theme_tufte()
@
Just what you wanted:
<<echo=TRUE, fig.height=2.5, fig.width=5, out.width="3.70in">>=
ggplot(dsamp, aes(carat, price, colour = cut)) + geom_point() +
theme_excel() + scale_colour_excel()
@
Here's an example custom theme:
<<custom-theme, fig.height=3.5, fig.height=4, out.width="2.75in">>=
theme_example <- function (base_size = 12, base_family = "") {
theme_gray(base_size = base_size, base_family = base_family) %+replace%
theme(axis.text = element_text(colour = "grey50"),
axis.title.x = element_text(colour = "grey50"),
axis.title.y = element_text(colour = "grey50", angle = 90),
panel.background = element_rect(fill = "white"),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major = element_blank(),
plot.background = element_blank()
)
}
ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_example()
@
You can customize just about every aspect of a \ggtwo\ plot. We won't get into
that today, but see the additional help resources at the end of this document.
\section{\ggtwo's dirty little secret}
\begin{enumerate}
\item \ggtwo\ is easy to learn, but\ldots
\item You need to be a data-manipulation ninja to use it effectively.
\end{enumerate}
With base graphics, you can work with almost any data structure you'd like,
providing you can write the code to work with it. For example, you data could be
in a list or a data frame. Or maybe it's scattered across multiple objects.
\ggtwo, on the other hand, requires you to think carefully about the data
structure and then write one (possibly long) line of code. So, with base
plotting you'll tend to spend lots of time writing plotting code. With \ggtwo\
you'll tend to spend time thinking beforehand, and then quickly write the code.
\ggplot\ works with ``long'' format data or ``tidy data'' with each aesthetic or
facet variable in its own column\footnote{See Hadley Wickham's pre-print on tidy
data here: \url{http://vita.had.co.nz/papers/tidy-data.pdf}}. So, for example,
if we wanted a panel or colour for each level of a variable called
\texttt{fishstock} then we'd need a column named \texttt{fishstock} with all the
different values of \texttt{fishstock} in that column.
With the \pkg{reshape2} and \pkg{plyr} packages, combined with \texttt{merge()}
(or \texttt{plyr::join()}), you can get almost any dataset into shape for
\ggtwo\ in a few lines. Sometimes this will take some serious thought along with
pen and paper.
\section{Random tips}
\begin{description}
\item[Jittering and statistics] \hfill \\
\ggtwo\ has lots of built-in niceties like automatic position jittering
and the addition of basic statistical models to your plots. Have a look
through the help resources listed at the end of this document.
\item[Axis labels] \hfill \\
\texttt{xlab("Your x-axis label")}
\item[Suppressing the legend] \hfill \\
\texttt{theme(legend.position = "none")}
\item[Exploiting the object-oriented nature of \ggtwo] \hfill \\
Save the basic plot to an object and then experiment with different
aesthetics, geoms, and theme adjustments.
<<echo=TRUE, fig.keep='none'>>=
# Build the basic object:
p <- ggplot(d) + geom_point(aes(x, y, colour = group1))
p + facet_grid(~group1) # try one way
p + facet_grid(~group2) # try another way
@
\item[Horizontal error bars]\hfill \\
Say you want to make a coefficient plot with the coefficients down the
y-axis. You can either build the plot piece by piece with points and
segments, or you can use \texttt{geom\_pointrange()} and then rotate the axes
with \texttt{+ coord\_flip()}.
\item[Axis limits and zooming] \hfill \\
\ggtwo\ has two ways of adjusting the axis limits: \texttt{+ xlim(2, 5)}
will ``cut'' the data at \texttt{2} and \texttt{5} and plot the data.
\texttt{coord\_cartesian(xlim = c(2, 5))} will zoom in on the plot while
retaining the original data. This will, for example, affect colour scales.
\item[Displaying and saving \ggtwo\ plots] \hfill \\
If \ggtwo\ plots are generated in a function, they will be need to be
wrapped in \texttt{print()} to display. There is a convenience function
\texttt{ggsave("filename.pdf")}, which will save the last plot to a PDF
and guess at reasonable dimensions. You can, of course specify which plot
to save and the dimensions of the PDF. You can also use all the same
methods of saving \ggtwo\ plots that you can use for base graphics.
\item[Combining multiple \ggtwo\ panels] \hfill \\
\ggtwo\ makes it easy to create multipanel figures with
\texttt{facet\_grid()} and \texttt{facet\_wrap()}. But, sometimes you need
to combine different kinds of plots in one figure or plots that require
different data. One easy way to do this is with the \texttt{grid.arrange()}
function from the \pkg{gridExtra} package. For example:
<<grid.arrange, fig.width=4.2, fig.height=5.2, out.width="2.75in">>=
df <- data.frame(x = 1:10, y = rnorm(100))
p1 <- ggplot(df, aes(x, y)) + geom_point()
p2 <- ggplot(df, aes(y)) + geom_histogram()
gridExtra::grid.arrange(p1, p2)
@
\item[\pkg{ggvis}]\hfill \\
\pkg{ggvis} is a partial successor to \ggtwo\ and is under rapid development:
\url{https://github.com/rstudio/ggvis}. It focuses on implementing a similar
grammar of graphics for web-based and possibly interactive output. Take a look at the vignettes ---
start with \texttt{vignette("ggvis-basics")} --- to see some examples.
You'll need to install the \pkg{devtools} \proglang{R} package first.
\end{description}
\clearpage
\section{Help}
The best (and nearly essential) help source is the website, which is largely
based off the package help files. However, the website also shows the example
plots and is easier to navigate:\\ \url{http://docs.ggplot2.org/}\\ Don't be
afraid to keep it open whenever you're using \ggtwo.
There's also an active \ggtwo\ discussion group:\\
\url{http://groups.google.com/group/ggplot2}
\ggtwo\ is heavily featured on stackoverflow:\\
\url{http://stackoverflow.com/questions/tagged/ggplot2}
Hadley wrote a book on \ggtwo: \url{http://ggplot2.org/book/}. It's quite
thorough but doesn't feature some of the newer additions to the package.
\end{document}