---
aliases:
- data-wrangling.html
---
# Data Wrangling
```{r}
#| include: false
source("./_common.R")
```
> ... in which we explore the typical data analysis workflow with the tidyverse, wrangle different kinds of data and learn about factors.
{{< youtube oCYn2pHfizg >}}
> **Note**:\
> I try to be vocal about what the code does in plain English while I type it, and learning the "translations" of symbols and keywords can help you, too. After a while, programming can feel a lot more like having a conversation with your digital assistant or a helpful friend. The boundary between human languages and computer languages is more blurry than you might think.
Right after the setup chunk, where we specify common code execution
options like showing the code (`echo`) or hiding messages and warnings,
the first thing we usually do at the top of a new analysis is load the
packages we are going to use.
If I later find that I need more packages, I come back here and add
them to the list, so that people reading my code can see straight away
what they have to install in order to run the code themselves.
```{r loading-packages}
library(tidyverse)
```
## A Data Analysis Workflow
We are getting close to importing our very first dataset from a file into R.
Generally, this is the first thing that needs to happen with any data analysis and we will cover it today.
The data I provided is already pretty tidy so we will start with that and build some visualizations.
The *communicate* part is also covered, because we are working in Rmarkdown after all,
which is designed to communicate our findings.
Next week we will also have a look at some less tidy data,
but not before having defined what "tidy data" is.
![Figure from @wickhamDataScienceImport2017.](images/workflow-wickham.png)
## Reading Data with `readr`
<aside>
<a href="https://readr.tidyverse.org/"> ![The readr logo](images/readr.png) </a>
</aside>
The package responsible for loading data in the tidyverse is called `readr`, so we start by loading the whole tidyverse.
Note that we could also load just the `readr` package with `library(readr)`, but we need the rest of the tidyverse later on anyway.
There is also the option to not load a package at all and instead use a single function from it by prefixing the function with the package name and two colons (`::`), like so:
`readr::read_csv("...")`.
Without further ado, let's download the data for today.
In fact, there are multiple ways to go about this.
We could download the whole course repository from GitHub
<https://github.com/jmbuhr/dataintro>
using the "<> Code" download button:
![](images/download_repo.png)
The data is all in a folder called `data` and organized into sub-folders
with the lecture number.
So everything you need for today can be found in folder `02`.
On GitHub, we can also download individual files,
but for plain text files you need to remember one extra step.
If you have already clicked on the file on GitHub and can
see its content, it is tempting to copy and paste the link
from the browser bar and use R's `download.file` function.
However, this is just the link to the part of the website
where GitHub shows the file, not a link to the actual file.
We can get the correct link by clicking on the **Raw** button:
![](images/github-raw.png)
Then we can use this to download the file:
```{r}
#| eval: false
download.file("https://raw.githubusercontent.com/jmbuhr/dataintro/main/data/02/gapminder.csv", "example-download.csv")
```
If you look into the source for this lecture you will find that
I set the chunk option `eval: false`, meaning the code will not be run.
I don't want to download the file again every time I make a change to
the course script.
A common error I see people having with `download.file` is trying to download
a file into e.g. a folder called `data` without first creating said folder.
If you get one of those `No such file or directory` errors, this is the most
likely cause.
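In that case, a minimal sketch like the following (with a hypothetical target path inside a `data` folder) creates the folder first and then downloads into it:

```{r}
#| eval: false
# create the target folder if it does not exist yet,
# then download the file into it
dir.create("data", showWarnings = FALSE)
download.file(
  "https://raw.githubusercontent.com/jmbuhr/dataintro/main/data/02/gapminder.csv",
  "data/gapminder.csv"
)
```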
Here is a quick overview of what we will do with the data next:

- read the gapminder data from a csv file
- see that `read_csv` also works with a URL
- keep a local copy with `write_csv` and friends
- inspect the data with `View` or a `ctrl+click`
With our data downloaded, we can make use of RStudio's autocompletion
to complete the file-path to our data inside of quotation marks.
We can trigger it explicitly with `ctrl+space` or `tab`.
```{r}
gapminder <- read_csv("data/02/gapminder.csv")
```
`readr` will also tell you the data types it guessed for the columns.
Let's inspect our dataset:
```{r}
gapminder
```
The gapminder dataset [@bryanGapminderDataGapminder2017] is an excerpt
from the [gapminder project](https://www.gapminder.org/) and contains the life expectancy at birth for 142 countries at 5 year intervals between 1952 and 2007.
It also contains the population and the Gross Domestic Product (GDP) per Inhabitant.
We will build a visualization later on.
A cool trick for when you have your data in a variable
is the `View` function.
The same effect can be reached by `ctrl+click`ing on it
or by using the button next to it in the environment panel.
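For example, the following is not evaluated here because it opens an interactive window, but in RStudio it shows the data in a spreadsheet-like viewer:

```{r}
#| eval: false
# open gapminder in RStudio's interactive data viewer
View(gapminder)
```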
It should be noted that the `read_csv` function can also
read data from links and will download it automatically.
However, in order to have the data nice and safe,
we might want to save it somewhere, just in case
(links can change, especially when it is someone else's link).
```{r}
#| eval: false
our_data <- read_csv("https://raw.githubusercontent.com/jmbuhr/dataintro/main/data/02/gapminder.csv")
write_csv(our_data, "our-data.csv")
```
## A Project-based Workflow
Last week we simply went ahead and created a script file and an Rmarkdown file in some folder on our computer.
But how does R know where the script is?
How does it know where to look when we tell it to read in a file or save a plot?
The main folder where R starts is called the **working directory**.
To find out what our current working directory is, we execute the function `getwd()`, short for "get working directory":
```{r}
getwd()
```
RStudio projects set this working directory automatically, which is very convenient.
It makes it easier for us to share code and projects, by simply copying the whole
folder.
But there is one prerequisite we have to follow:
our file paths need to be **relative**, not absolute.
An absolute file path starts at the root of our operating system,
so on Windows you will see something like `C:\\Users\Jannik\Documents\...`
while on macOS and Linux it starts with something like `/home/jannik/Documents/...`.
For example, I could read the same gapminder dataset by:
```{r}
#| eval: false
read_csv("/home/jannik/Documents/projects/teaching/dataintro/data/02/gapminder.csv")
```
But this is a terrible idea!
If I ever move my analysis folder, this file path will no longer
be correct, and if someone else tries to run my code
they will most certainly not be called Jannik or have the exact same
directory structure, so the code will not work for them either.
In order for our work to be portable, robust and shareable,
we need our file paths to be relative to the root
of our project (which is set by the RStudio project).
```{r}
#| eval: false
read_csv("./data/02/gapminder.csv")
```
Here, `./` refers to the current working directory,
which is set by the RStudio project.
It can also be omitted (e.g. `data/02/...`),
but the path can't start with `/` because this would
mark it as an absolute path.
There is also a function to set the working directory
ourselves (it is called `setwd`),
but I ask you to never use it,
because in order to use it, you would have to specify
the working directory using an absolute path,
rendering the script useless for anyone who is not you.
Use RStudio projects instead.
There is one thing I didn't tell you about Rmarkdown documents, yet.
Their working directory is always the folder they are in,
even if they are in some subdirectory of a project.
In a way this also means that you don't necessarily need a project to work with Rmarkdown,
but having one anyway makes it easier to keep track of your files and
have a consistent structure.
### Common Hurdles when Importing Data
So, importing the gapminder csv went smoothly.
But this will not always be the case.
We will now look at common hurdles when importing data.
The function we just used was called `read_csv`, because it reads a file format that consists of **comma separated values**.
Look at the raw file in a text editor (not Word) like Notepad or RStudio to see why.
But the file extension `.csv` can sometimes be lying...
Because in German the comma is used to separate decimal numbers
(vs. the dot in English),
a lot of software will output a different type of csv-file when configured in German.
It will still be called csv,
but it is actually separated by semicolons!
We have a special function for this:
```{r}
#| eval: false
read_csv2("data/02/gapminder_csv2.csv")
```
When looking through the autocompletion options that pop up while you are typing the function name, you might have noticed the similar functions `read.csv` and `read.csv2`.
These are the functions that come with R, without any packages like the tidyverse.
You can of course use those as well,
but the tidyverse functions provide a more consistent experience and
have fewer surprising quirks.
I am teaching the tidyverse first because it allows you to do more while
having to learn fewer edge cases.
If we look at yet another file `gapminder_tsv.txt`,
we notice that the file extension doesn't tell us much about the format, only that it is text (as opposed to a binary format only computers can read).
If we look into the file:
```{r}
read_lines("data/02/gapminder_tsv.txt", n_max = 3)
```
We notice that the values are separated by "\t", a special sequence that stands for the tab character. The `read_tsv` function will do the job.
I am not showing the output here because it is just
the gapminder dataset once again.
```{r}
#| eval: false
read_tsv("data/02/gapminder_tsv.txt")
```
If the separator (also called delimiter) is even more obscure,
we can use the general function `read_delim`.
Say a co-worker misunderstood us and thought tsv stands for "tilde separated values";
we can still read their file.
```{r}
#| eval: false
read_delim("data/02/obscure_file.tsv", "~")
```
There are more ways in which raw data can be messy or hard to read depending on the machine but I can't show all of them.
One common thing you will encounter though is measurement machines writing some additional information in the first couple of lines before the actual data
(like the time of the measurement).
In this example:
```{r}
read_lines("data/02/gapminder_messier.csv", n_max = 5)
```
The first 2 lines are not part of the data.
Reading the file normally as a csv would produce something weird:
because the first line does not contain any commas, `read_csv` will assume that the file contains only one column and also report a bunch of **parsing failures**.
Parsing is the act of turning data represented as raw text into a useful format,
like a table of numbers.
We can fix this by telling R to skip the first 2 lines entirely:
```{r}
read_csv("data/02/gapminder_messier.csv", skip = 2, n_max = 3)
```
I was using the `n_max` argument of the functions above to save space in this lecture script.
We can also read excel files it using a function from the `readxl` package.
This package is automatically installed with the tidyverse,
but it is not loaded along with the other packages via `library(tidyverse)`.
We can either load it with `library(readxl)` or
refer to a single function from the package without loading the whole thing
using double colons (`::`) like so:
```{r}
readxl::read_xlsx("data/02/gapminder.xlsx")
```
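As a side note (a hypothetical variation, assuming the data sits on the first sheet of the workbook), `read_xlsx` can also pick a specific sheet by name or position:

```{r}
#| eval: false
# pick a sheet by position (or by name); sheet 1 is the default anyway
readxl::read_xlsx("data/02/gapminder.xlsx", sheet = 1)
```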
Remember that when we read in the gapminder dataset for the first
time today, we saved it in a variable called `gapminder`,
so we are going to use this going forward.
## Wrangling Data with `dplyr`
<aside>
<a href="https://dplyr.tidyverse.org/"> ![](images/dplyr.png) </a>
</aside>
There are a number of ways in which we can manipulate data.
Of course I mean manipulate in its original sense, not the malicious one.
This is sometimes referred to as **data wrangling** and within the tidyverse,
this is a job for the `dplyr` package (short for data pliers, the tool you see in the logo).
dplyr provides functions for various operations on our data.
These functions are sometimes also called **dplyr verbs**.
All of them take a `tibble` or `data.frame` as input (plus additional parameters) and always return a `tibble`.
But enough talk, let's go wrangling!
![Let's go data wrangling! Artwork by Allison Horst](images/dplyr_wrangling.png)
### select
The first verb we introduce is used to select columns.
And hence, it is called `select`.
The first argument is always the data, followed by an arbitrary number of column names.
We can recognize functions that take an arbitrary number of additional arguments by the `...` in the autocompletion and help page.
```{r}
select(gapminder, country, year, gdpPercap)
```
It might be confusing that we don't need quotation marks around the column names like we do
in other languages or even other parts of R.
This concept is known as **quasiquotation** or **data masking**.
It is quite unique to R, but it allows functions to know about the content of the data that is passed to them and use it as the environment in which they do their computations and search for variable names.
So while the variable `country` doesn't exist in the global environment,
it does exist as a column of the gapminder tibble.
> `dplyr` functions always look in the data first when they search for names.
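Here is a minimal sketch of what that means in practice (nothing new is assumed beyond the `gapminder` tibble we already have; not evaluated here):

```{r}
#| eval: false
# `country` is not an object in the global environment ...
exists("country")   # should be FALSE, unless you created one yourself
# ... yet select() finds it, because it looks inside the data first
select(gapminder, country)
```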
The help page for `select` tells us more about the different ways in which we can select columns.
Here are a couple of examples without the output,
run them in your R session to confirm that they do what you think they do
(but do have a look at the help pages yourselves, they are quite well written).
```{r}
#| eval: false
select(gapminder, year:pop)
select(gapminder, starts_with("co"))
select(gapminder, where(is.numeric))
select(gapminder, where(is.character))
select(gapminder, c(1, 3, 4))
```
### filter
After selecting columns it is only natural to ask how to select rows.
This is achieved with the function `filter`.
![Filter data. Artwork by Allison Horst](images/dplyr_filter.jpg)
Here, we select all rows, where the year is greater than 2000
and the country is New Zealand.
```{r}
#| eval: false
filter(gapminder, year > 2000, country == "New Zealand")
```
Because text comparisons are case sensitive, we would have missed
New Zealand had we written it with lowercase letters.
In order to make sure we find the correct country,
it can be helpful to simply convert all country names
to lower case, and in fact we can use functions on our columns
straight inside of any dplyr verb.
Functions that deal with text (strings or character in R's language)
in the tidyverse start with `str_`, so they are easy to find
with autocompletion.
```{r}
#| eval: false
filter(gapminder, year > 2000, str_to_lower(country) == "new zealand")
```
Instead of combining conditions with `,` (which works the same as
`&` here), we can also use `|` meaning **or**.
Here, we get all rows where the country is New Zealand or
the country is Afghanistan.
```{r}
#| eval: false
filter(gapminder, country == "New Zealand" | country == "Afghanistan")
```
This particular comparison can be written more succinctly,
by asking (for every row), is the particular country `%in%` this vector?
```{r}
#| eval: false
filter(gapminder, country %in% c("New Zealand", "Afghanistan"))
```
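As a small aside (a sketch reusing the same columns), the condition can be negated with `!` to keep every country except these two:

```{r}
#| eval: false
# `!` negates the condition: all rows where the country is NOT one of the two
filter(gapminder, !(country %in% c("New Zealand", "Afghanistan")))
```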
### mutate
We are back at manipulating columns, this time by creating new ones or changing old ones.
The `dplyr` verb that does that is called `mutate`.
For example, we might want to calculate the total GDP from the GDP per Capita and the population:
```{r}
mutate(gapminder, gdp = pop * gdpPercap)
```
Notice that none of the functions changed the original variable `gapminder`.
They only take an input and return an output,
which makes it easier to reason about our code and later chain pieces of code together.
How do you change it then?
Use the Force! ... ahem, I mean, the assignment operator (`<-`).
```{r}
gapminder <- mutate(gapminder, gdp = pop * gdpPercap)
```
Here, the power of dplyr shines.
It knows that `pop` and `gdpPercap` are columns of the tibble and
that `gdp` refers to the new name of the freshly created column.
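As a small sketch (the column name `gdp_billions` is just made up for illustration), a single `mutate` call can create several columns, and later columns can build on the ones defined right before them:

```{r}
#| eval: false
# later columns can use columns created earlier in the same mutate() call
mutate(
  gapminder,
  gdp = pop * gdpPercap,
  gdp_billions = gdp / 1e9
)
```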
### Interlude: Behind the magic, handling data with base-R
This section is meant to show you what happens behind the scenes.
It is not strictly necessary to understand all the details of it in order to work effectively with the tidyverse, but it helps especially when things don't go as planned.
Let's create a tibble to play with:
```{r}
test_tibble <- tibble(
x = 1:5,
y = x ^ 2,
z = c("hello", "world", "test", "four", "five")
)
test_tibble
```
Instead of the tidyverse functions, we can also use
what is called **subsetting**, getting a subset of our data structure,
with square brackets:
```{r}
test_tibble[c(1, 3)]
```
This selected the first and third column.
This also works for lone vectors:
```{r}
evens <- seq(0, 10, 2)
evens[c(1, 3)]
```
If we want to select columns by their names without the
tidyverse, we have to pass these names as a character vector
(hence the quotation marks).
```{r}
test_tibble[c("x", "z")]
```
If we have two things in the square brackets, separated by a comma,
the first refers to the rows and the second refers to the columns.
e.g. this would be "the first row and the columns from 1 to 2":
```{r}
test_tibble[1, 1:2]
```
Internally, tibbles / dataframes are lists of columns.
Lists have more ways of accessing their elements.
The `$` symbol gets us an element from the list:
```{r}
test_tibble$x
```
If we want to use numbers (= indices) to get a single
element from a list (or a column from a tibble),
we have to use double square brackets:
```{r}
test_tibble[[1]]
```
The reason for this: single square brackets give us a subset
of the list, which is still packed up in a list.
If we want to unpack it and work with the
content of just one element, `[[` does that for us.
The `pull` function from the tidyverse works like `$`.
```{r}
pull(test_tibble, x)
```
Subsetting not only works for looking at things,
it also allows us to replace the part we are subsetting:
```{r}
x <- 1:10
x[1] <- 42
x
```
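The same idea works for a single element of a tibble column, for example (just a throwaway modification of our `test_tibble`, not evaluated here):

```{r}
#| eval: false
# replace the second element of column x
test_tibble$x[2] <- 100
test_tibble
```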
> Note:\
> The base-R and the tidyverse way are not mutually exclusive.
Sometimes you can mix and match.
### The pipe `%>%`
The tidyverse functions are easier to compose (i.e. chain together).
To facilitate this, we introduce another operator, a bit like `+` for numbers or the `+` to add ggplot components, but specially for functions.
The pipe, which you can either type or insert in RStudio with **Ctrl+Shift+M**,
takes its left side and passes it as the first argument to the function on the right side.
Why is this useful?
Imagine our data processing involves a bunch of steps,
so we save the output to intermediate variables.
```{r}
subset_gapminder <- select(gapminder, country, year, pop)
filtered_gapminder <- filter(subset_gapminder, year > 2000)
final_gapminder <- mutate(filtered_gapminder, pop_thousands = pop / 1000)
final_gapminder
```
However, we don't really need those intermediate variables
and they just clutter our code.
The pipe allows us to express our data processing as a series of
steps:
```{r}
final_gapminder <- gapminder %>%
select(country, year, pop) %>%
filter(year > 2000) %>%
mutate(pop_thousands = pop / 1000)
final_gapminder
```
You can read the pipe in your head as "and then" or "take ... pass it into ...".
And because all main `tidyverse` functions take data as their first argument,
we can chain them together fluently.
Additionally, it enables autocompletion of column names inside of the function
that gets the data.
Next to the tidyverse pipe `%>%`, you might also see `|>` at some point.
The latter is a pipe that was introduced to base R (in version 4.1) because this whole
piping idea got so popular that it became part of the core language.
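Just to show the equivalence (a sketch, not evaluated here, and it requires R 4.1 or newer), the chain from above reads almost the same with the base-R pipe:

```{r}
#| eval: false
# the same pipeline written with the base-R pipe |>
gapminder |>
  select(country, year, pop) |>
  filter(year > 2000) |>
  mutate(pop_thousands = pop / 1000)
```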
### arrange
A simple thing you might want from a table is to sort it based on some column. This is what `arrange` does:
```{r}
gapminder %>%
arrange(year)
```
The helper function `desc` marks a column to be arranged in descending order.
We can arrange by multiple columns, where the first will be most important.
```{r}
gapminder %>%
arrange(desc(year), pop)
```
### summarise
To condense one or multiple columns into summary values, we use `summarise`.
Like with mutate, we can calculate multiple things in one step.
```{r}
gapminder %>%
summarise(
max_year = max(year),
max_pop = max(pop),
mean_life_expectancy = mean(lifeExp)
)
```
But condensing whole columns into one value, flattening the tibble in the style of Super Mario jumping on mushrooms, is often not what we need.
We would rather know the summaries within certain groups.
For example, the maximal GDP **per country**. This is what `group_by` is for.
### group_by
`group_by` is considered an adverb, because it doesn't change the data itself but it changes how subsequent functions handle the data. For example, if a tibble has groups, all summaries are calculated within these groups:
```{r}
gapminder %>%
group_by(year) %>%
summarise(
lifeExp = mean(lifeExp)
)
```
`summarise` removes one level of grouping.
If the data was grouped by multiple features, this means that some groups remain.
We can make sure that the data is no longer grouped with `ungroup`.
```{r}
gapminder %>%
group_by(year, continent) %>%
summarise(
lifeExp = mean(lifeExp)
) %>%
ungroup()
```
Groups also work within `mutate` and `filter`.
For example, we can get all rows where the GDP per capita was at its highest within each country:
```{r}
gapminder %>%
group_by(country) %>%
filter(gdpPercap == max(gdpPercap))
```
```{r}
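# within each year, divide each country's population by the yearly total,
# giving its share of the total population across the countries in the data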
gapminder %>%
group_by(year) %>%
mutate(pop = pop / sum(pop))
```
### Other verbs
We can rename columns with `rename`:
```{r}
gapminder %>%
rename(population = pop)
```
Sometimes you want to refer to the size of the current group inside of `mutate` or `summarise`. The function to do just that is called `n()`. For example, I wonder how many rows of data we have per year.
```{r}
gapminder %>%
group_by(year) %>%
mutate(n = n())
```
A shortcut for `group_by` and `summarise` with `n()` is the `count` function:
```{r}
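# count the rows per year-country combination,
# then count how often each of those counts occurs,
# which is a quick check that every combination appears exactly once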
gapminder %>%
count(year, country) %>%
count(n)
```
In general, you might find after solving a particular problem in a couple of steps that there is a more elegant solution. Do not be discouraged by that! It simply means that there is always more to learn, but the tools you already know by now will get you a very long way and set you on the right track.
I think we learned enough dplyr verbs for now. We can treat ourselves to a little ggplot visualization.
## Visualization and our first encounter with `factor`s
<aside>
<a href="https://forcats.tidyverse.org/"> ![](images/forcats.png) </a>
</aside>
```{r}
gapminder %>%
ggplot(aes(year, lifeExp, group = country)) +
geom_line() +
facet_wrap(~continent)
```
The `facet_wrap` function slices our plot into these subplots, a style of plot sometimes referred to as *small multiples*. At this point you might wonder: "How do I control the order of these facets?" The answer is: with a `factor`!
Any time we have a vector that can be thought of as representing discrete categories (ordered or unordered), we can express this by turning the vector into a factor with the `factor` function. This enables R's functions to handle them appropriately. Let's create a little example. We start out with a character vector.
```{r}
animals <- c("cat", "dog", "bear", "shark")
animals <- factor(animals)
animals
```
Note the new information R gives us, the **Levels**, which are all the possible values we can put into the factor. They are automatically ordered alphabetically on creation. We can also pass a vector of levels explicitly on creation:
```{r}
animals <- c("cat", "dog", "bear", "shark")
animals <- factor(animals, levels = c("cat", "dog"), ordered = TRUE)
animals
```
A factor can only contain elements that are in the levels, so because I omitted `bear` and `shark` from the levels, they are turned into `NA`. The tidyverse contains the `forcats` package to help with factors. Most functions from this package start with `fct_`.
For example, the `fct_relevel` function,
which keeps all levels but lets us change the order:
```{r}
animals <- c("cat", "dog", "bear", "shark")
animals <- factor(animals)
fct_relevel(animals, c("shark", "dog"))
```
Using this in action, we get:
```{r}
gapminder %>%
mutate(continent = fct_relevel(continent, "Oceania")) %>%
ggplot(aes(year, lifeExp, group = country)) +
geom_line(alpha = 0.3) +
facet_wrap(~ continent)
```
Note: `fct_relevel` might be a somewhat contrived example here.
More often you will need its cousin `fct_reorder`
to reorder a factor by the values of some other column.
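Here is a minimal sketch of that idea (not evaluated here; it simply reorders the continent facets by their median life expectancy instead of alphabetically):

```{r}
#| eval: false
# reorder the continent factor by the median lifeExp of each continent
gapminder %>%
  mutate(continent = fct_reorder(continent, lifeExp, .fun = median)) %>%
  ggplot(aes(year, lifeExp, group = country)) +
  geom_line(alpha = 0.3) +
  facet_wrap(~continent)
```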
Let's make this plot a bit prettier by adding color!
The `gapminder` package that provided this dataset also included a nice color palette. I included it as a `.csv` file in the `data/` folder so that we can practice importing data once more. But you could also take the shortcut of getting it straight from the package (`gapminder::country_colors`). Here, we are using the `head` function to look at the first couple of rows of the tibble and to look at the first couple of elements of the named vector from the package.
```{r}
country_colors <- read_csv("data/02/country_colors.csv")
color <- country_colors$color
names(color) <- country_colors$country
head(color)
```
Having a named vector means that we can access individual elements
by their names, and ggplot can use those names
to match up for example colors with countries
when we pass it to a `scale_` function.
```{r}
x <- c(first = 1, second = 3, hello = 5)
x["first"]
```
To the final plot we also add `guides(color = "none")`,
because if we were to show a guide (for discrete colors typically a legend),
it would fill up the whole plot.
```{r}
gapminder %>%
mutate(continent = fct_relevel(continent, c("Oceania"))) %>%
ggplot(aes(year, lifeExp, color = country)) +
geom_line() +
facet_wrap(~continent) +
guides(color = "none") +
scale_color_manual(values = color)
```
## Exercises
- Drink a cup of coffee or tea, relax, because you
just worked through quite a long video.
- Familiarize yourself with the folders on your computer.
Make sure you understand where your directories and files live.
- Download the data for today in one of the ways taught.
You can refer to the script anytime.
- The file `./data/02/exercise1.txt` is in an unfamiliar format.
- Find out how it is structured and read it in with `readr`.
- Create a scatterplot of the x and y column with `ggplot`.
- Look at the help page for `geom_point`.
What is the difference between `geom_point(aes(color = <something>))` and `geom_point(color = <something>)`?
A relevant hint is in the section about the `...`-argument.
- Make the plot pretty by coloring the points,
keeping in mind the above distinction.
- Read in the `gapminder` dataset with `readr`.
- Using a combination of `dplyr` verbs and / or
visualizations with `ggplot`,
answer the following questions:
- Which continent had the highest life expectancy on average
in the most recent year? There are two options here.
First, calculate a simple mean for the countries
in each continent. Then, remember that the countries
have different population sizes, so we really need
a _weighted mean_ using R's function `weighted.mean()`.
- Is there a relationship between the GDP per capita and
the life expectancy? A visualization might be helpful.
- How did the population of the countries change over time?
Make the plot more informative by adding color,
facets and labels (with `geom_text`). Can you find out,
how to add the country name label only to the last year?
Hint: Have a look at the `data` argument that all `geom_`-functions
have.
## Resources
Don't miss the dedicated [Resources](resources) page.
### Package Documentation
- [The tidyverse website](https://www.tidyverse.org/)
- [The readr package website with cheatsheet](https://readr.tidyverse.org/)
- [The dplyr package website with cheatsheet](https://dplyr.tidyverse.org/)
### Getting Help
- [How to find help](https://www.tidyverse.org/help/#reprex)
- [R4DS online learning community](https://www.rfordatasci.com/)