01-wrangling.Rmd

---
title: "Data wrangling "
author: ""
date: ""
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#knitr::opts_knit$set(root.dir = "")
```

In this lesson we'll:

1. see how to save .csv data in the form of an .RData object
2. introduce data transformations using the **dplyr** package, using the five key dplyr verbs: 

  + filter
  + arrange
  + mutate
  + summarise
  + group_by
  
3. introduce the pipe operator `%>%`
4. see a few nice things you can do by combining dplyr verbs (grouped filters and mutates, for example)
5. use the dplyr verbs to build a small movie rating dataset that we'll use in the next lesson on recommender systems.
6. introduce various *join* operations that can be used to combine information across multiple tables (relational data)

### Sources and references

 * http://r4ds.had.co.nz/transform.html
 * http://r4ds.had.co.nz/relational-data.html

## Get the MovieLens data and save as .RData

[MovieLens](https://grouplens.org/datasets/movielens/) is a great resource for data on movie ratings. The full dataset has ratings on 40 000 movies by 260 000 users, some 24 million ratings in all. We'll use a smaller dataset with ratings of 9 000 movies by 700 users (100 000 ratings in all). 

Download the file "ml-latest-small.zip" from https://grouplens.org/datasets/movielens/ and unzip it to the *data* directory in your main project folder (make a folder called *data* if you haven't already). You should see four csv files: `links.csv`, `movies.csv`, `ratings.csv`, and `tags.csv`. You can also do this programmatically in R:

```{r}
dir.create("data")
download.file(
  url = "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip", 
  destfile = "data/ml-latest-small.zip"
)
unzip("data/ml-latest-small.zip", exdir = "data")
```

First let's save the data we downloaded as an .RData object. .RData objects are smaller than csv, plus we can save all four csvs in a single .RData object that we can call with a single call to `load` the dataset later on.

```{r}
# read in the csv files
movies <- read.csv("data/ml-latest-small/movies.csv")  # movie info: title and genre
ratings <- read.csv("data/ml-latest-small/ratings.csv") # user ratings for each movie
tags <- read.csv("data/ml-latest-small/tags.csv") # additional user reviews ("tag")
links <- read.csv("data/ml-latest-small/links.csv") # lookup for imdb movie IDs

# save as .RData
save(links, movies, ratings, tags, file = "data/movielens-small.RData")

# check that it's worked
rm(list = ls())
load("data/movielens-small.RData")
```

You'll only need to do the above part once, so once you've got the data saved as .RData, start running the notebook from here.

## Loading the tidyverse

Load the **tidyverse** collection of packages, which loads the following packages: **ggplot2**, **tibble**, **tidyr**, **readr**, **purrr**, and **dplyr**.

```{r}
library(tidyverse)
```

Load the MovieLens data

```{r}
load("data/movielens-small.RData")
```

Tibbles are a special kind of dataframe that work well with tidyverse packages ("in the tidyverse"). 


```{r}
# convert ratings to a "tibble"
ratings <- as.tibble(ratings)
```

A nice feature of tibbles is that if you display them in the console (by typing `ratings`, for example) only the first few rows and columns are shown. 


```{r}
ratings
```

Explore some of the variables:


```{r}
str(ratings)
```


```{r}
glimpse(ratings)
```


```{r}
glimpse(movies)
```

We'll look at database joins in more detail, but for now, this just adds movie title to the `ratings` data by pulling that information from `movies`.


```{r}
ratings <- left_join(ratings, movies)
```

## Filtering rows with `filter()`

Here we illustrate the use of `filter()` by extracting user 1's observations from the *ratings* data frame.


```{r}
u1 <- filter(ratings, userId == 1)
u1
```

Next we extract the observations for user 1 that received a rating greater than 3. Multiple filter conditions are created with `&` (and) and `|` (or).


```{r}
filter(ratings, userId == 1 & rating > 3)
```

Here's another way of writing the same condition as above:


```{r}
filter(ratings, userId == 1, rating > 3)
```

The `%in%` command is often useful when using dplyr verbs:


```{r}
filter(ratings, userId == 1, rating %in% c(1,4))
```

## Introducing the pipe

The pipe operator `%>%` is a very useful way of chaining together multiple operations. A typical format is something like:

*data* `%>%` *operation 1* `%>%` *operation 2* 

You read the code from left to right: Start with *data*, apply some operation (operation 1) to it, get a result, and then apply another operation (operation 2) to that result, to generate another result (the final result, in this example). A useful way to think of the pipe is as similar to "then".

The main goal of the pipe is to make code easier, by focusing on the transformations rather than on what is being transformed. Usually this is the case, but it's also possible to get carried away and end up with a huge whack of piped statements. Deciding when to break a block up is an art best learned by experience. 


```{r}
# filtering with the pipe
ratings %>% filter(userId == 1)
```

The main usefulness of the pipe is when combining multiple operations:


```{r}
# first filter on userId then on rating
u1_likes <- ratings %>% filter(userId == 1) %>% filter(rating > 3)
u1_likes
```


```{r}
# another way of doing the same thing
ratings %>% filter(userId == 1 & rating > 3)
```

## Arranging rows with `arrange()`

Ordering user 1's "liked" movies in descending order of rating (note the use of `desc`):

```{r}
arrange(u1_likes, desc(rating))
```

Subsequent arguments to `arrange()` can be used to arrange by multiple columns. Here we first order user 1's liked movies by rating (in descending order) and then by timestamp (in ascending order)


```{r}
arrange(u1_likes, desc(rating), timestamp)
```

We can also use the pipe to do the same thing:


```{r}
u1_likes %>% arrange(desc(rating))
```

Finally, here's an example of combining filter and arrange operations with the pipe:


```{r}
ratings %>% filter(userId == 1 & rating > 3) %>% arrange(desc(rating))
```

## Selecting columns with `select()`

Select is a bit like `filter()` for columns. The syntax is straightforward, the first argument gives the dataframe, and then you list the variables you want to select!


```{r}
select(u1_likes, title, rating)
```

To exclude variables just put a minus sign in front of them:


```{r}
select(u1_likes, -userId, -timestamp)
```

You can also use `select()` to reorder variables. A useful function here is `everything()`.


```{r}
# original order
u1_likes

# reorder so title is first
select(u1_likes, title, everything())
```

## Adding new variables with `mutate()`

Mutating operations add a new column to a dataframe. Here's a trivial example to get started:


```{r}
mutate(u1_likes, this_is = "stupid")  
```

A more useful use of mutate is to construct a new variable based on existing variables. This is the way that `mutate` is almost always used.


```{r}
mutate(u1, like = ifelse(rating > 3, 1, 0))  
```

We can also use the pipe for mutating operations. Hopefully you're getting used to the pipe by now, so let's embed a mutating operation within a larger pipe than we've used before. 


```{r}
ratings %>% 
  mutate(like = ifelse(rating > 3, 1, 0)) %>% 
  filter(userId == 1) %>% 
  select(like, everything()) 
```

## Aggregating over rows with `summarise()`

The `summarise()` verb (or `summarize()` will also work) summarises the rows in a data frame in some way. When applied to the whole data frame, it will collapse it to a single row. For example, here we take user 1's data, and calculate their average rating and the number of movies they have given a rating higher than 3 to:


```{r}
summarise(u1, mean = mean(rating), likes = sum(rating > 3))
```

You need to watch out for NAs when using `summarise()`. If one exists, operations like `mean()` will return NA. You can exclude NAs from calculations using `na.rm = TRUE`:


```{r}
# introduce an NA
u1$rating[1] <- NA

# see what happens
summarise(u1, mean = mean(rating), likes = sum(rating > 3))
```


```{r}
# with na.rm = TRUE
summarise(u1, mean = mean(rating, na.rm = TRUE), likes = sum(rating > 3, na.rm = TRUE))
```

`summarise()` is most useful when combined with `group_by()`, which imposes a grouping structure on a data frame. After applying `group_by()`, subsequent dplyr verbs will be applied to individual groups, basically repeating the code for each group. That means that `summarise()` will calculate a summary for each group:


```{r}
# tell dplyr to group ratings by userId
ratings_by_user <- group_by(ratings, userId)

# apply summarize() to see how many movies each user has rated
ratings_by_user %>% summarize(count = n()) %>% head()
```


```{r}
# get sorted counts (plus some presentation stuff)
ratings %>% 
group_by(userId) %>% 
summarize(count = n()) %>% 
arrange(desc(count)) %>% 
head(10) %>%     # take first ten rows
t()  # transpose 
```


```{r}
# or with the pipe (last time)
ratings %>% group_by(userId) %>% summarize(count = n()) %>% head(10)
```

You can also pass your own summary functions to the pipe:

```{r}
# my own function, computes the 60% quantile of a vector
compute_60q <- function(x){quantile(x, probs = 0.60)}
# use it in a grouped summary
ratings %>% group_by(userId) %>% summarize(count = n(), q60 = compute_60q(rating)) %>% head(10)
```

## Other uses of `grouped_by()`: grouped filters and grouped mutates

While you'll probably use `group_by()` most often with `summarise()`, it can also be useful when used in conjunction with `filter()` and `mutate()`. Grouped filters perform the filtering within each group. Below we use it to extract each user's favourite movie (or movies, if there's a tie).


```{r}
# example of a grouped filter
ratings %>% group_by(userId) %>% filter(rank(desc(rating)) < 2)
```

Here we use a grouped mutate to standardise each user's ratings so that they have a mean of zero (for each user, which guarantees the overall mean rating is also zero).


```{r}
# example of a grouped mutate
ratings %>% 
  group_by(userId) %>%
  mutate(centered_rating = rating - mean(rating)) %>% 
  select(-movieId, -timestamp, -genres)
```

## Putting it all together: extracting a sample set of reviews

In this section we'll take what we've learned and do something useful: build a 15x20 matrix containing the reviews made on 20 movies by 15 users. We'll use this matrix in one of the next lessons to build a recommendation system.

First, we select the 15 users we want to use. I've chosen to use 15 users with moderately frequent viewing habits (remember there are 700 users and 9000 movies), mainly to make sure there are some (but not too many) empty ratings.


```{r}
users_frq <- ratings %>% group_by(userId) %>% summarize(count = n()) %>% arrange(desc(count))
my_users <- users_frq$userId[101:115]
```

Next, we select the 20 movies we want to use:

```{r}
movies_frq <- ratings %>% group_by(movieId) %>% summarize(count = n()) %>% arrange(desc(count))
my_movies <- movies_frq$movieId[101:120]
```

Now we make a dataset with only those 15 users and 20 movies:


```{r}
ratings_red <- ratings %>% filter(userId %in% my_users, movieId %in% my_movies) 
# and check there are 15 users and 20 movies in the reduced dataset
n_users <- length(unique(ratings_red$userId))
n_movies <- length(unique(ratings_red$movieId))
paste("number of users is", n_users)
paste("number of movies is", n_movies)
```

Let's see what the 20 movies are:

```{r}
movies %>% filter(movieId %in% my_movies) %>% select(title)
```

However, note all the movie titles are still being kept:

```{r}
head(levels(ratings_red$title), 10)
```

This actually isn't what we want, so let's drop the ones we won't use.

```{r}
ratings_red <- droplevels(ratings_red)
levels(ratings_red$title)
```

We now want to reshape the data frame into a 15x20 matrix i.e.from "long" format to "wide" format. We can do this using the `spread()` verb. 

```{r}
ratings_red %>% spread(key = movieId, value = rating)
```

The preceding line *doesn't* work: as you can see we land up with more than one row per user. But it is useful as an illustration of `spread()`. Question: why doesn't it work?

Here's the corrected version:

```{r}
ratings_red %>% select(userId, title, rating) %>% spread(key = title, value = rating)
```

Finally, since we just want to know who has seen what, we replace all NAs with 0 and all other ratings with 1:


```{r}
viewed_movies <- ratings_red %>% 
  complete(userId, title) %>% 
  mutate(seen = ifelse(is.na(rating), 0, 1)) %>% 
  select(userId, title, seen) %>% 
  spread(key = title, value = seen)
```

We could have got this more simply with a call to `table()`, which creates a two-way frequency table.


```{r}
table(ratings_red$userId, ratings_red$title)
```

Finally, we save our output for later use!


```{r}
dir.create("output")
save(ratings_red, viewed_movies, file = "output/recommender.RData")
```

## Combining data frames with *joins*

We'll often need to combine the information contained in two or more tables. To do this, we need various kinds of database *joins*. This section describes the basic join operations that we need to combine data frames. The examples are taken from [Chapter 13](http://r4ds.had.co.nz/relational-data.html) of R4DS, which also contains a lot more general information on relational data.

First, we make some very simple data tables to show how the joins work:


```{r}
# make some example data
x <- tribble(
  ~key, ~xvalue,
  1, "x1",
  2, "x2",
  3, "x3"
)

y <- tribble(
  ~key, ~yvalue,
  1, "y1",
  2, "y2",
  4, "y3"
)

x 
y
```

### Mutating joins: `inner_join`, `left_join`, `right_join`, `full_join`

The first set of joins we look at are called mutating joins. These first match observations in two tables in some way, and then combine variables from the two tables.

There are four types of mutating joins: inner joins, left joins, right joins, and full joins. 

An **inner join** keeps observations that appear in *both* tables. 


```{r}
inner_join(x,y)
```


```{r}
inner_join(y,x)
```

The other three joints are all **outer joins**: they keep observations that appear in *at least one* of the tables.

A **left join** keeps all observations from the left table (first argument).


```{r}
left_join(x,y)
```


```{r}
left_join(y,x)
```

A **right join** keeps all observations from the right table (second argument).

```{r}
# note this is the same as left_join(y,x)
right_join(x,y)
```

A **full join** keeps all observations from the left and right table.

```{r}
full_join(x,y)
```

We can now re-examine the join we used to add movie titles to the ratings data frame earlier:


```{r}
# reload the MovieLens data
load("data/movielens-small.RData")
ratings <- as.tibble(ratings)
movies <- as.tibble(movies)
```

Note that the same *movieId* can appear multiple times in the *ratings* data frame:


```{r}
ratings %>% arrange(movieId) # note duplicate movieIds
```

But each *movieId* only appears once in the *movies* data frame:

```{r}
movies %>% arrange(movieId) # note unique movieIds
```

In this case a left join by the *movieId* key copies across the movie title information (as well as any other information in the *movies* data frame):


```{r}
left_join(ratings, movies, by = "movieId") %>% select(title, everything())
```

### Filtering joins: `semi_join`, `anti_join`

The last two joins we look at are **filtering joins**. These match observations in two tables, but do not add variables. There are two types of filtering joins: semi-joins and anti-joins.

A **semi join** keeps all observations from the first table that appear in the second (note variables are from the first table only):


```{r}
semi_join(x,y)
```


```{r}
semi_join(y,x)
```

while an **anti join** *drops* all observations from the first table that appear in the second table (note variables are from the first table only).


```{r}
anti_join(x,y)
```


```{r}
anti_join(y,x)
```

## Exercises

1. Write a function called **my_range** that computes the difference between the 90% and 10% quantiles of a vector, and another function called **my_cov** that computes the coefficient of variation of a vector (defined as the standard deviation divided by mean). Use these functions to work out how "variable"" each movie's ratings are, and hence what the most variable (in the sense of love-them-or-hate-them) movies are. Do the same for users i.e. find out which users give the most variable ratings, in terms of your two variability functions.

2. Do the exercises in [Chapter 5](http://r4ds.had.co.nz/transform.html) (data transformation using the **dplyr** verbs) and [Chapter 13](http://r4ds.had.co.nz/relational-data.html) (on database joins) of R4DS. There are exercises at the end of each major subsection. Do as many of these exercises as you need to feel comfortable with the material - I suggest doing at least the first two of each set of exercises.