```{r, spark-caching, include = FALSE}
eval_caching <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_caching <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```
# Spark data caching
```{r, eval = eval_caching, include = FALSE}
library(sparklyr)
library(dplyr)
library(readr)
library(purrr)
```
## Map data
*See the mechanics of how Spark is able to use files as a data source*
1. Examine the contents of the **/usr/share/class/files** folder (one way to do this from R is sketched below)
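```{r, eval = eval_caching}
# A minimal sketch: list the folder's contents from R
dir("/usr/share/class/files")
```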
2. Load the `sparklyr` library
```{r, eval = eval_caching}
library(sparklyr)
```
3. Use `spark_connect()` to create a new local Spark session
```{r, eval = eval_caching}
sc <- spark_connect(master = "local")
```
4. Load the `readr` and `purrr` libraries
```{r, eval = eval_caching}
library(readr)
library(purrr)
```
5. Read the top 5 rows of the **transactions_1** CSV file
```{r, eval = eval_caching}
top_rows <- read_csv("/usr/share/class/files/transactions_1.csv", n_max = 5)
```
6. Create a named list based on the column names, with "character" as each item's value, so that Spark initially reads every column as character. Name the variable `file_columns`
```{r, eval = eval_caching}
file_columns <- top_rows %>%
rename_all(tolower) %>%
map(function(x) "character")
```
7. Preview the contents of the `file_columns` variable
```{r, eval = eval_caching}
head(file_columns)
```
8. Use `spark_read_csv()` to "map" the file's structure and location to the Spark context. Assign the result to the `spark_lineitems` variable
```{r, eval = eval_caching}
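# A possible solution; the table name "transactions" comes from step 9, and
# pointing `path` at the folder (rather than a single file) is assumed here so
# that every transactions file is mapped at once. `memory = FALSE` maps the
# files without loading the data into Spark memory yet.
spark_lineitems <- spark_read_csv(
  sc,
  name = "transactions",
  path = "/usr/share/class/files",
  columns = file_columns,
  infer_schema = FALSE,
  memory = FALSE
)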
```
9. In the Connections pane, click the table icon next to the `transactions` table
10. Verify that the new variable pointer works by using `tally()`
```{r, eval = eval_caching}
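# Counting the rows forces Spark to scan the mapped files
spark_lineitems %>% tally()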
```
## Caching data
*Learn how to cache a subset of the data in Spark*
1. Create a subset of the mapped table object (`spark_lineitems`): summarize by **date**, creating a total price and a count of items sold.
```{r, eval = eval_caching}
daily_orders <- spark_lineitems %>%
  # `date` and `price` column names are assumed; price was read as character
  mutate(price = as.double(price)) %>%
  group_by(date) %>%
  summarise(total_sales = sum(price, na.rm = TRUE), no_items = n())
```
2. Use `compute()` to extract the data into Spark memory
```{r, eval = eval_caching}
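# Runs the aggregation and caches the results in Spark memory as "orders"
orders <- compute(daily_orders, "orders")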
```
3. Confirm new variable pointer works
```{r, eval = eval_caching}
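# Any operation against `orders` should now run from the cached data
orders %>% head()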
```
4. Go to the Spark UI (one way to open it from R is sketched after this list)
5. Click the **Storage** tab
6. Notice that "orders" is now cached in Spark memory
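Rather than navigating to the Spark UI by URL, `sparklyr` provides `spark_web()` to open it directly from R:
```{r, eval = FALSE}
# Opens the Spark web UI for the current connection in a browser
spark_web(sc)
```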