-
Notifications
You must be signed in to change notification settings - Fork 47
/
Copy path01-wrangling.Rmd
573 lines (370 loc) · 16.7 KB
/
01-wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
---
title: "Data wrangling "
author: ""
date: ""
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#knitr::opts_knit$set(root.dir = "")
```
In this lesson we'll:
1. see how to save .csv data in the form of an .RData object
2. introduce data transformations using the **dplyr** package, using the five key dplyr verbs:
+ filter
+ arrange
+ mutate
+ summarise
+ group_by
3. introduce the pipe operator `%>%`
4. see a few nice things you can do by combining dplyr verbs (grouped filters and mutates, for example)
5. use the dplyr verbs to build a small movie rating dataset that we'll use in the next lesson on recommender systems.
6. introduce various *join* operations that can be used to combine information across multiple tables (relational data)
### Sources and references
* http://r4ds.had.co.nz/transform.html
* http://r4ds.had.co.nz/relational-data.html
## Get the MovieLens data and save as .RData
[MovieLens](https://grouplens.org/datasets/movielens/) is a great resource for data on movie ratings. The full dataset has ratings on 40 000 movies by 260 000 users, some 24 million ratings in all. We'll use a smaller dataset with ratings of 9 000 movies by 700 users (100 000 ratings in all).
Download the file "ml-latest-small.zip" from https://grouplens.org/datasets/movielens/ and unzip it to the *data* directory in your main project folder (make a folder called *data* if you haven't already). You should see four csv files: `links.csv`, `movies.csv`, `ratings.csv`, and `tags.csv`. You can also do this programmatically in R:
```{r}
dir.create("data")
download.file(
url = "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip",
destfile = "data/ml-latest-small.zip"
)
unzip("data/ml-latest-small.zip", exdir = "data")
```
First let's save the data we downloaded as an .RData object. .RData objects are smaller than csv, plus we can save all four csvs in a single .RData object that we can call with a single call to `load` the dataset later on.
```{r}
# read in the csv files
movies <- read.csv("data/ml-latest-small/movies.csv") # movie info: title and genre
ratings <- read.csv("data/ml-latest-small/ratings.csv") # user ratings for each movie
tags <- read.csv("data/ml-latest-small/tags.csv") # additional user reviews ("tag")
links <- read.csv("data/ml-latest-small/links.csv") # lookup for imdb movie IDs
# save as .RData
save(links, movies, ratings, tags, file = "data/movielens-small.RData")
# check that it's worked
rm(list = ls())
load("data/movielens-small.RData")
```
You'll only need to do the above part once, so once you've got the data saved as .RData, start running the notebook from here.
## Loading the tidyverse
Load the **tidyverse** collection of packages, which loads the following packages: **ggplot2**, **tibble**, **tidyr**, **readr**, **purrr**, and **dplyr**.
```{r}
library(tidyverse)
```
Load the MovieLens data
```{r}
load("data/movielens-small.RData")
```
Tibbles are a special kind of dataframe that work well with tidyverse packages ("in the tidyverse").
```{r}
# convert ratings to a "tibble"
ratings <- as.tibble(ratings)
```
A nice feature of tibbles is that if you display them in the console (by typing `ratings`, for example) only the first few rows and columns are shown.
```{r}
ratings
```
Explore some of the variables:
```{r}
str(ratings)
```
```{r}
glimpse(ratings)
```
```{r}
glimpse(movies)
```
We'll look at database joins in more detail, but for now, this just adds movie title to the `ratings` data by pulling that information from `movies`.
```{r}
ratings <- left_join(ratings, movies)
```
## Filtering rows with `filter()`
Here we illustrate the use of `filter()` by extracting user 1's observations from the *ratings* data frame.
```{r}
u1 <- filter(ratings, userId == 1)
u1
```
Next we extract the observations for user 1 that received a rating greater than 3. Multiple filter conditions are created with `&` (and) and `|` (or).
```{r}
filter(ratings, userId == 1 & rating > 3)
```
Here's another way of writing the same condition as above:
```{r}
filter(ratings, userId == 1, rating > 3)
```
The `%in%` command is often useful when using dplyr verbs:
```{r}
filter(ratings, userId == 1, rating %in% c(1,4))
```
## Introducing the pipe
The pipe operator `%>%` is a very useful way of chaining together multiple operations. A typical format is something like:
*data* `%>%` *operation 1* `%>%` *operation 2*
You read the code from left to right: Start with *data*, apply some operation (operation 1) to it, get a result, and then apply another operation (operation 2) to that result, to generate another result (the final result, in this example). A useful way to think of the pipe is as similar to "then".
The main goal of the pipe is to make code easier, by focusing on the transformations rather than on what is being transformed. Usually this is the case, but it's also possible to get carried away and end up with a huge whack of piped statements. Deciding when to break a block up is an art best learned by experience.
```{r}
# filtering with the pipe
ratings %>% filter(userId == 1)
```
The main usefulness of the pipe is when combining multiple operations:
```{r}
# first filter on userId then on rating
u1_likes <- ratings %>% filter(userId == 1) %>% filter(rating > 3)
u1_likes
```
```{r}
# another way of doing the same thing
ratings %>% filter(userId == 1 & rating > 3)
```
## Arranging rows with `arrange()`
Ordering user 1's "liked" movies in descending order of rating (note the use of `desc`):
```{r}
arrange(u1_likes, desc(rating))
```
Subsequent arguments to `arrange()` can be used to arrange by multiple columns. Here we first order user 1's liked movies by rating (in descending order) and then by timestamp (in ascending order)
```{r}
arrange(u1_likes, desc(rating), timestamp)
```
We can also use the pipe to do the same thing:
```{r}
u1_likes %>% arrange(desc(rating))
```
Finally, here's an example of combining filter and arrange operations with the pipe:
```{r}
ratings %>% filter(userId == 1 & rating > 3) %>% arrange(desc(rating))
```
## Selecting columns with `select()`
Select is a bit like `filter()` for columns. The syntax is straightforward, the first argument gives the dataframe, and then you list the variables you want to select!
```{r}
select(u1_likes, title, rating)
```
To exclude variables just put a minus sign in front of them:
```{r}
select(u1_likes, -userId, -timestamp)
```
You can also use `select()` to reorder variables. A useful function here is `everything()`.
```{r}
# original order
u1_likes
# reorder so title is first
select(u1_likes, title, everything())
```
## Adding new variables with `mutate()`
Mutating operations add a new column to a dataframe. Here's a trivial example to get started:
```{r}
mutate(u1_likes, this_is = "stupid")
```
A more useful use of mutate is to construct a new variable based on existing variables. This is the way that `mutate` is almost always used.
```{r}
mutate(u1, like = ifelse(rating > 3, 1, 0))
```
We can also use the pipe for mutating operations. Hopefully you're getting used to the pipe by now, so let's embed a mutating operation within a larger pipe than we've used before.
```{r}
ratings %>%
mutate(like = ifelse(rating > 3, 1, 0)) %>%
filter(userId == 1) %>%
select(like, everything())
```
## Aggregating over rows with `summarise()`
The `summarise()` verb (or `summarize()` will also work) summarises the rows in a data frame in some way. When applied to the whole data frame, it will collapse it to a single row. For example, here we take user 1's data, and calculate their average rating and the number of movies they have given a rating higher than 3 to:
```{r}
summarise(u1, mean = mean(rating), likes = sum(rating > 3))
```
You need to watch out for NAs when using `summarise()`. If one exists, operations like `mean()` will return NA. You can exclude NAs from calculations using `na.rm = TRUE`:
```{r}
# introduce an NA
u1$rating[1] <- NA
# see what happens
summarise(u1, mean = mean(rating), likes = sum(rating > 3))
```
```{r}
# with na.rm = TRUE
summarise(u1, mean = mean(rating, na.rm = TRUE), likes = sum(rating > 3, na.rm = TRUE))
```
`summarise()` is most useful when combined with `group_by()`, which imposes a grouping structure on a data frame. After applying `group_by()`, subsequent dplyr verbs will be applied to individual groups, basically repeating the code for each group. That means that `summarise()` will calculate a summary for each group:
```{r}
# tell dplyr to group ratings by userId
ratings_by_user <- group_by(ratings, userId)
# apply summarize() to see how many movies each user has rated
ratings_by_user %>% summarize(count = n()) %>% head()
```
```{r}
# get sorted counts (plus some presentation stuff)
ratings %>%
group_by(userId) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head(10) %>% # take first ten rows
t() # transpose
```
```{r}
# or with the pipe (last time)
ratings %>% group_by(userId) %>% summarize(count = n()) %>% head(10)
```
You can also pass your own summary functions to the pipe:
```{r}
# my own function, computes the 60% quantile of a vector
compute_60q <- function(x){quantile(x, probs = 0.60)}
# use it in a grouped summary
ratings %>% group_by(userId) %>% summarize(count = n(), q60 = compute_60q(rating)) %>% head(10)
```
## Other uses of `grouped_by()`: grouped filters and grouped mutates
While you'll probably use `group_by()` most often with `summarise()`, it can also be useful when used in conjunction with `filter()` and `mutate()`. Grouped filters perform the filtering within each group. Below we use it to extract each user's favourite movie (or movies, if there's a tie).
```{r}
# example of a grouped filter
ratings %>% group_by(userId) %>% filter(rank(desc(rating)) < 2)
```
Here we use a grouped mutate to standardise each user's ratings so that they have a mean of zero (for each user, which guarantees the overall mean rating is also zero).
```{r}
# example of a grouped mutate
ratings %>%
group_by(userId) %>%
mutate(centered_rating = rating - mean(rating)) %>%
select(-movieId, -timestamp, -genres)
```
## Putting it all together: extracting a sample set of reviews
In this section we'll take what we've learned and do something useful: build a 15x20 matrix containing the reviews made on 20 movies by 15 users. We'll use this matrix in one of the next lessons to build a recommendation system.
First, we select the 15 users we want to use. I've chosen to use 15 users with moderately frequent viewing habits (remember there are 700 users and 9000 movies), mainly to make sure there are some (but not too many) empty ratings.
```{r}
users_frq <- ratings %>% group_by(userId) %>% summarize(count = n()) %>% arrange(desc(count))
my_users <- users_frq$userId[101:115]
```
Next, we select the 20 movies we want to use:
```{r}
movies_frq <- ratings %>% group_by(movieId) %>% summarize(count = n()) %>% arrange(desc(count))
my_movies <- movies_frq$movieId[101:120]
```
Now we make a dataset with only those 15 users and 20 movies:
```{r}
ratings_red <- ratings %>% filter(userId %in% my_users, movieId %in% my_movies)
# and check there are 15 users and 20 movies in the reduced dataset
n_users <- length(unique(ratings_red$userId))
n_movies <- length(unique(ratings_red$movieId))
paste("number of users is", n_users)
paste("number of movies is", n_movies)
```
Let's see what the 20 movies are:
```{r}
movies %>% filter(movieId %in% my_movies) %>% select(title)
```
However, note all the movie titles are still being kept:
```{r}
head(levels(ratings_red$title), 10)
```
This actually isn't what we want, so let's drop the ones we won't use.
```{r}
ratings_red <- droplevels(ratings_red)
levels(ratings_red$title)
```
We now want to reshape the data frame into a 15x20 matrix i.e.from "long" format to "wide" format. We can do this using the `spread()` verb.
```{r}
ratings_red %>% spread(key = movieId, value = rating)
```
The preceding line *doesn't* work: as you can see we land up with more than one row per user. But it is useful as an illustration of `spread()`. Question: why doesn't it work?
Here's the corrected version:
```{r}
ratings_red %>% select(userId, title, rating) %>% spread(key = title, value = rating)
```
Finally, since we just want to know who has seen what, we replace all NAs with 0 and all other ratings with 1:
```{r}
viewed_movies <- ratings_red %>%
complete(userId, title) %>%
mutate(seen = ifelse(is.na(rating), 0, 1)) %>%
select(userId, title, seen) %>%
spread(key = title, value = seen)
```
We could have got this more simply with a call to `table()`, which creates a two-way frequency table.
```{r}
table(ratings_red$userId, ratings_red$title)
```
Finally, we save our output for later use!
```{r}
dir.create("output")
save(ratings_red, viewed_movies, file = "output/recommender.RData")
```
## Combining data frames with *joins*
We'll often need to combine the information contained in two or more tables. To do this, we need various kinds of database *joins*. This section describes the basic join operations that we need to combine data frames. The examples are taken from [Chapter 13](http://r4ds.had.co.nz/relational-data.html) of R4DS, which also contains a lot more general information on relational data.
First, we make some very simple data tables to show how the joins work:
```{r}
# make some example data
x <- tribble(
~key, ~xvalue,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~yvalue,
1, "y1",
2, "y2",
4, "y3"
)
x
y
```
### Mutating joins: `inner_join`, `left_join`, `right_join`, `full_join`
The first set of joins we look at are called mutating joins. These first match observations in two tables in some way, and then combine variables from the two tables.
There are four types of mutating joins: inner joins, left joins, right joins, and full joins.
An **inner join** keeps observations that appear in *both* tables.
```{r}
inner_join(x,y)
```
```{r}
inner_join(y,x)
```
The other three joints are all **outer joins**: they keep observations that appear in *at least one* of the tables.
A **left join** keeps all observations from the left table (first argument).
```{r}
left_join(x,y)
```
```{r}
left_join(y,x)
```
A **right join** keeps all observations from the right table (second argument).
```{r}
# note this is the same as left_join(y,x)
right_join(x,y)
```
A **full join** keeps all observations from the left and right table.
```{r}
full_join(x,y)
```
We can now re-examine the join we used to add movie titles to the ratings data frame earlier:
```{r}
# reload the MovieLens data
load("data/movielens-small.RData")
ratings <- as.tibble(ratings)
movies <- as.tibble(movies)
```
Note that the same *movieId* can appear multiple times in the *ratings* data frame:
```{r}
ratings %>% arrange(movieId) # note duplicate movieIds
```
But each *movieId* only appears once in the *movies* data frame:
```{r}
movies %>% arrange(movieId) # note unique movieIds
```
In this case a left join by the *movieId* key copies across the movie title information (as well as any other information in the *movies* data frame):
```{r}
left_join(ratings, movies, by = "movieId") %>% select(title, everything())
```
### Filtering joins: `semi_join`, `anti_join`
The last two joins we look at are **filtering joins**. These match observations in two tables, but do not add variables. There are two types of filtering joins: semi-joins and anti-joins.
A **semi join** keeps all observations from the first table that appear in the second (note variables are from the first table only):
```{r}
semi_join(x,y)
```
```{r}
semi_join(y,x)
```
while an **anti join** *drops* all observations from the first table that appear in the second table (note variables are from the first table only).
```{r}
anti_join(x,y)
```
```{r}
anti_join(y,x)
```
## Exercises
1. Write a function called **my_range** that computes the difference between the 90% and 10% quantiles of a vector, and another function called **my_cov** that computes the coefficient of variation of a vector (defined as the standard deviation divided by mean). Use these functions to work out how "variable"" each movie's ratings are, and hence what the most variable (in the sense of love-them-or-hate-them) movies are. Do the same for users i.e. find out which users give the most variable ratings, in terms of your two variability functions.
2. Do the exercises in [Chapter 5](http://r4ds.had.co.nz/transform.html) (data transformation using the **dplyr** verbs) and [Chapter 13](http://r4ds.had.co.nz/relational-data.html) (on database joins) of R4DS. There are exercises at the end of each major subsection. Do as many of these exercises as you need to feel comfortable with the material - I suggest doing at least the first two of each set of exercises.