-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path04-functional-programming.qmd
572 lines (461 loc) · 18.2 KB
/
04-functional-programming.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
---
aliases:
- functional-programming.html
---
# Functional Programming
```{r}
#| include: false
source("./_common.R")
```
> ... which is all about functions, bringing the whole tidyverse together and
exploring advanced dplyr data wrangling techniques.
{{< youtube B2RRRpjX1QU >}}
## Todays goal
My goal today is to bring together everything
we learned so far and solidify
our understanding of wrangling data in the tidyverse.
If all goes according to plan,
we will then have more mental capacity
freed up for the statistics starting next week.
And our understanding of data will hopefully
enable us to experiment and play with statistical
concepts without getting stuck too much
on data wrangling.
This also means that today's lecture might
be the most challenging so far, because
everything learned up until now will -- in
one way or another -- be relevant.
But first, we load the libraries for today as usual.
Note, that I am cheating a bit here by loading
the `gapminder` package as well.
Even though we will be reading in actual `gapminder` dataset
again from files today, having access to
the vector of country colors is nice to have.
```{r}
library(tidyverse)
library(gapminder)
```
In the lecture I also quickly go over some
of the most important [resources](Resources) so far.
This mostly concerns the pure R resources,
I will cover the resources for statistics and some of the maths
involved in the next couple of lectures.
As you might be able to tell,
mental models are one of my favorite topics.
We are starting today with a powerful mental model:
**iteration**.
## Iteration
Iteration is the basic idea of doing one thing multiple times.
This is an area where computers shine,
so in this chapter we will learn to fully utilize the power at our fingertips.
As an example, we will be reading in multiple similar files.
Remember the `gapminder` dataset?
Well, we are working with it again,
but this time, our collaborator sent us one csv-file for each continent.
You can find them in the `data/04/` folder.
We already know how to read in *one* csv-file:
```{r}
#| eval: false
read_csv("./data/04/Africa.csv")
```
### The Imperative Programming Approach
The first solution to our problem is not my favorite one,
but I want to show it anyway for the sake of completeness.
In general, in Functional Programming, we **tell the computer what we want**, while in Imperative Programming, we **tell the computer what steps to do**.
So here, we tell R what steps to perform.
First, we get a vector of file-path's with
the `fs` package, which stands for _file system_.
`dir_ls` means _directory list_, so we get the
contents of a directory.
We then create a list to store the dataframes that we are
going to read in.
We already define the length of the list because
making a data structure longer is not R's strong suit when
it doesn't know how much space to reserve for it.
We then iterate over the numbers from 1 to the length
of our `paths`.
At each iteration we get the `i`s path, read it and store
it in our results list at position `i`.
Finally we bind the list into one dataframe:
```{r}
paths <- fs::dir_ls("./data/04/")
result <- vector(mode = "list", length = length(paths))
for (i in 1:length(paths)) {
result[[i]] <- read_csv(paths[i])
}
bind_rows(result)
```
In doing this we have lost the information about the `Continent`,
which was in the file name,
but before dwelling on this for too long, let's leave the more convoluted and manual way behind and explore a,
I dare say, more elegant approach.
### The Functional Programming Approach
> "Of course someone has to write for-loops.
It doesn't have to be you."
> --- Jenny Bryan
<aside>
<a href="https://purrr.tidyverse.org/">
![](images/purrr.png){width=200}
</a>
</aside>
We have a function (`read_csv`) that **takes** a file path and **returns**
(spits out) the data.
In the Functional Programming style,
the next idea is to now have a function, that takes two things:
vector (atomic or list) and a function.
And it feeds the individual elements of the vector to the function,
one after another.
In mathematics, the relation between a set of inputs and a set of outputs is called a **map**,
which is where the name of the following family of functions comes from.
In the tidyverse, these functional programming concepts live in the `purrr` package.
Iterating **Explicitly** with `map`s:
First, we create the vector of things that we want to iterate over, the things
that will be fed into our function one after the other:
```{r}
paths <- fs::dir_ls("./data/04/")
```
Then we map the `read_csv` function over our vector
and bind the resulting list of dataframes into one dataframe:
```{r}
#| eval: false
result <- map(paths, read_csv)
bind_rows(result)
```
The operation of mapping over a vector and combining the resulting
list into one dataframe is actually so common that
there is a variant of `map` that does this step automatically:
```{r}
#| eval: false
map_df(paths, read_csv, .id = "continent")
```
This distills everything our initial for-loop did into just one line of code.
Using the `.id` argument we can save the name of the file path
to a column in our dataset.
This allows us to extract the continent from it:
```{r}
gapminder <- map_df(paths, read_csv, .id = "continent") %>%
mutate(continent = str_extract(continent, "(?<=/)\\w+(?=\\.csv)"))
head(gapminder)
```
My way of extracting the continent from the file path seems magical at first,
and I still refer to the cheat sheet of the `stringr` package a lot
when I am having to deal with text:
<aside>
<a href="https://stringr.tidyverse.org/">
![](images/stringr.png){width=200}
</a>
</aside>
Iterating **implicitly** with vectorized functions:
We had our first encounter with iteration in a very implicit form.
When we use
R's basic math operators, the computer is iterating behind the scenes.
Take this expression:
```{r}
1:3 + 1:3
```
This operation is vectorized.
Without us having to tell R to do so, R will add the first element of the first vector to the first element of the second vector and so forth.
Notice, how it looks like the operation happens all at the same time.
But in reality, this is not what happens.
The computer is just really fast at adding numbers, one after the other.
The mathematical operations in R call another programming language that does the
actual addition.
This other programming language is closer to the way computers *think*,
making it less fun to write for us humans,
but also faster because the instructions are easier to translate into actions for our computer processor.
Remember, we only have to build our own iteration (e.g. with a `map` function),
when we find a task that we want to apply to multiple things,
and that is not already vectorized.
And as it turns out, the `fs` and `readr` packages play very
well together, because `readr` can also just take a vector
of file paths and does the combining automatically!
```{r}
read_csv(paths, id = "continent")
```
Whenever you encounter a new problem, ask yourself these questions:
- Is there a function that already does that?
- Is it already vectorized?
- If not, is there a function that solves my problem for one instance?
- Can I map it over many things?
The `purrr` package contains various variants of the `map` function.
- `map` itself will always return a list.
- `map_chr` always returns an atomic character (=text) vector.
- `map_dbl` always returns numbers (dbl = double precision).
- `map_lgl` always returns logical (yes or no, TRUE / FALSE) vectors.
- `map_dfr` always returns a dataframe.
```{r}
# map_
```
The for-loop-version had a lot more code, especially *boilerplate*, code that is just there to make the construct work and doesn't convey our intentions with the
code.
Furthermore, the loop focuses the object that is iterated over (the file
paths), while the `map`-version focuses on what is happening (the function, `read_csv`).
But the loop still works.
If you can't think of a way to solve a problem with a `map` function, it is absolutely OK to use for-loops.
## "If you copy and paste the same code more than three times, write a function."
Writing our own functions can be very helpful for making our
code more readable.
It allows us to separate certain steps of your analysis
from the rest, look at them in isolation to test and
validate them, and also allows us to give them reasonable names.
It also allows us to re-use function across projects!
Let's imagine we have some experiment where the read-out
of the machine is always in a certain format and needs some
cleaning up.
If we turn this into a function, e.g. like this:
```{r}
read_experiment_data <- function(day) {
day <- str_pad(day, pad = 0, width = 2)
paths <- fs::dir_ls(paste0("./data/",day,"/"))
gapminder <- map_df(paths, read_csv, .id = "continent") %>%
mutate(continent = str_extract(continent, "(?<=/)\\w+(?=\\.csv)"))
gapminder
}
```
We can put this function in a file (in my case `R/my_functions`)
and `source` it, for example in multiple analysis Rmarkdown documents.
The `source` function is nothing special, it just runs the R
code in a file.
And when we define functions in this file, those functions
are then available to us:
```{r}
#| eval: false
source("R/my_functions.R")
read_experiment_data(4)
```
Note: This example function only make sense to use with
`4` as the input, as the data for the other days
is of course different, but I hope you get the gist.
I like to store my regular R files (as opposed to Rmd files)
in a folder of my project called `R`.
This makes it already look like an R package,
in case I decide later on that the functions
could be helpful for others as well
or I want to share them more easily
with colleagues.
You can read more about creating your
own R packages [here](http://r-pkgs.had.co.nz/)
[@wickhamPackagesOrganizeTest2015].
## Implicit iteration with `dplyr`: build many models
`dplyr`s idea of grouping allows us to express many
ideas that are implicitly also iterations.
Let us start by looking at just one country at first:
```{r}
algeria <- gapminder %>%
filter(country == "Algeria")
```
We can plot the life expectancy over time with ggplot
and add a **linear model** to our plot with
`geom_smooth` using `method = "lm"`.
```{r}
algeria %>%
ggplot(aes(year, lifeExp)) +
geom_smooth(method = "lm") +
geom_point()
```
However, while `geom_smooth` allows us to easily add smoothing
lines or linear trends to our plot, it does not
give us any information about the actual model.
In order to do that we need to fit it ourselves
with the `lm` function:
```{r}
model <- lm(lifeExp ~ year, data = algeria)
```
The `~` symbol defines a _formula_, you can read it as:
"lifeExp depending on year".
The `data` argument tells R where to look for the variables
`lifeExp` and `year`, namely in our `algeria` tibble.
```{r}
summary(model)
```
That is a lot of information about our model!
The `broom` package provides some functions for a cleaner
and more specialized output:
```{r}
broom::tidy(model)
```
Every 1 year the life expectancy in Algeria went up by about half a year.
Of course, this is only valid in the limited linear regime of
our datapoints.
We can't extrapolate this indefinitely.
After all, the intercept tells us that there would be a negative
life expectancy at year 0.
But given the data that we have, how good does a line fit here?
```{r}
broom::glance(model)
```
R^2^ takes on values between 0 and 1, with 1 being a perfectly
straight line connecting all the points and 0 for the points
being all over the place.
So 0.985 is a pretty good fit!
Using the dollar syntax, we get pull one column from
a tibble, so we can write a function that, given a model,
returns the R^2^
```{r}
get_r_squared <- function(model) {
broom::glance(model)$r.squared
}
```
And now come the `dplyr` magic!
If we group by `country` (and `continent` for good measuere
just so that we don't loose that column when summarizing),
we can calculate a linear model for every country!
Note two things: Firstly, we don't use the `data` argument
of `lm` here because the tidyverse functions already know
where to look for `lifeExp` and `year` and they do so
**respecting the groups**.
So within each group (i.e. for each country), `lifeExp` will only contain the life expectancy for that country,
not the life expectancy for all countries.
Secondly, we wrap this part in `list` because we have
to create a list column here (models don't fit into an atomic vector).
Otherwise, `dplyr` will complain.
In the second step we calculate the `rsqured` value
for each model by mapping the function we created above
over all models.
We use the `_dbl` variant here because we want the values
as an atomic vector of numbers, not as a list.
```{r}
all_models <- gapminder %>%
group_by(country, continent) %>%
summarise(
model = list(lm(lifeExp ~ year)),
rsquared = map_dbl(model, get_r_squared)
)
head(all_models)
```
Finally, we can add our calculated R^2^ values as a column to
the original gapminder dataset so that we can use it
in the following visualization.
```{r}
gapminder <- gapminder %>%
left_join(select(all_models, -model))
```
Here, we highlight the irregularities (less linear countries)
by making the transparency (= alpha value) depend
on the R^2^ value.
```{r}
gapminder %>%
ggplot(aes(year, lifeExp, color = country, alpha = 1/rsquared)) +
geom_line() +
guides(color = "none", alpha = "none") +
scale_color_manual(values = country_colors) +
facet_wrap(~continent, scales = "free") +
theme_minimal()
```
We can highlight one continent by filtering and use the
`ggrepel` package to allow for more flexible labels.
Furthermore check out the source document of this lecture
and the video of the lecture to find out how you can
add a figure caption and control the width and height
of the plot via `knitr` chunk options.
I also showcase a new way of writing chunk options
in the latest version of the `knitr` package that is especially
suited for longer captions.
```{r, fig.width=12, fig.height=7}
#| fig.cap: Figure caption
gapminder %>%
filter(continent == "Africa") %>%
ggplot(aes(year, lifeExp, color = country, group = country)) +
geom_line(color = "black", alpha = 0.3) +
geom_line(data = filter(gapminder, rsquared <= 0.4), size = 1.1) +
ggrepel::geom_text_repel(aes(label = country),
data = filter(gapminder, rsquared <= 0.4,
year == max(year)),
nudge_x = 20,
direction = "y"
) +
guides(color = "none", alpha = "none") +
scale_color_manual(values = country_colors) +
facet_wrap(~continent, scales = "free") +
labs(x = "Year", y = "Life Expectancy at Birth") +
theme_minimal() +
scale_x_continuous(expand = expansion(mult = c(0, 0.2)))
```
The downward slope of our highlighted countries
starting in the 1990s is a result of the ravaging
AIDS pandemic.
The prominent dips in two of the curves, orange for
Rwanda and Cambodia in gray, are the direct
consequences of genocides.
These dire realities can in no way be summarized
in just a couple of colorful lines.
I am also in no way qualified to lecture
on these topics.
A good friend of mine, Timothy Williams,
however is a researcher and teacher in the field of conflict and violence
with a focus on genocides.
He did field work in Cambodia and Rwanda
and his book "The Complexity of Evil. Perpetration and Genocide" was [published here](https://www.rutgersuniversitypress.org/the-complexity-of-evil/9781978814295) on December 18 2020 [@williamsComplexityEvil].
## Exercises
I want to get you playing around with data,
so keep in mind that the solutions for this exercise
are not set in stone.
There is often more than one viable way of graphing
the same dataset and we will use the
Office Hour to talk about the advantages
and disadvantages of approaches that you
came up with.
### Roman emperors
The first exercise uses a dataset about roman emperors
from the tidytuesday project
([link](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-08-13)).
You can import it with:
```{r}
#| eval: false
emperors <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-13/emperors.csv")
```
There is a slight error in the data because some of the dates are actually in BC time.
In order to fix this we will be using the [lubridate](https://lubridate.tidyverse.org/) package,
which is installed with the tidyverse, but not automatically loaded.
For your convenience here is a function that you can use to fix the dataset:
```{r}
#| eval: false
library(tidyverse)
library(lubridate)
fix_emperors <- function(data) {
data %>%
mutate(
birth = case_when(
index %in% c(1, 2, 4, 6) ~ update(birth, year = -year(birth)),
TRUE ~ birth
),
reign_start = case_when(
index == 1 ~ update(reign_start, year = -year(reign_start)),
TRUE ~ reign_start
)
)
}
```
Here are the questions to answer.
Decide for yourself, if a particular question is best
answered using a visualization, a table, a simple sentence
or a combination of the three.
- What was the most popular way to rise to power?
- I what are the most common causes of death among roman
emperors? What (or who) killed them?
- Which dynasty was the most successful?
- Firstly, how often did each dynasty reign?
- Secondly, how long where the reigns?
- Which dynasty would you rather be a part of,
if your goal is to live the longest?
### Dairy Products in the US
Another dataset
([link](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-29#milk_products_facts))
concerns dairy product consumption per person in the US across a number of years.
Load it with:
```{r}
#| eval: false
dairy <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/milk_products_facts.csv")
```
- All masses are given in lbs (pounds),
can you convert them to kg?
- Which products lost their customer base over time,
which ones won? Which products have the greatest absolute
change in production when estimated with a straight line?
Above all, have some fun! If you make interesting
findings along the way, go ahead and produce plots
to highlight it.
## Resources
- [purrr documentation](https://purrr.tidyverse.org/)
- [stringr documentation](https://stringr.tidyverse.org/)
- [dplyr documentation](https://dplyr.tidyverse.org/)