```{r, spark-caching, include = FALSE}
eval_caching <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_caching <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```
# Spark data caching
```{r, eval = eval_caching, include = FALSE}
library(sparklyr)
library(dplyr)
library(readr)
library(purrr)
```
## Map data
*See the mechanics of how Spark is able to use files as a data source*
1. Examine the contents of the **/usr/share/class/files** folder (one way to do this from R is sketched below)
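```{r, eval = eval_caching}
# A minimal sketch: list the folder's contents from R
dir("/usr/share/class/files")
```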
2. Load the `sparklyr` library
```{r, eval = eval_caching}
library(sparklyr)
```
3. Use `spark_connect()` to create a new local Spark session
```{r, eval = eval_caching}
sc <- spark_connect(master = "local")
```
4. Load the `readr` and `purrr` libraries
```{r, eval = eval_caching}
library(readr)
library(purrr)
```
5. Read the top 5 rows of the **transactions_1** CSV file
```{r, eval = eval_caching}
top_rows <- read_csv("/usr/share/class/files/transactions_1.csv", n_max = 5)
```
6. Create a named list based on the column names, with "character" as each item's value, so that Spark initially reads every column as character. Name the variable `file_columns`
```{r, eval = eval_caching}
file_columns <- top_rows %>%
rename_all(tolower) %>%
map(function(x) "character")
```
7. Preview the contents of the `file_columns` variable
```{r, eval = eval_caching}
head(file_columns)
```
8. Use `spark_read_csv()` to "map" the file's structure and location to the Spark context. Assign the result to the `spark_lineitems` variable
```{r, eval = eval_caching}
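# A possible solution; the table name "transactions" comes from step 9, and
# pointing `path` at the folder (rather than a single file) is assumed here so
# that every transactions file is mapped at once. `memory = FALSE` maps the
# files without loading the data into Spark memory yet.
spark_lineitems <- spark_read_csv(
  sc,
  name = "transactions",
  path = "/usr/share/class/files",
  columns = file_columns,
  infer_schema = FALSE,
  memory = FALSE
)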
```
9. In the Connections pane, click the table icon next to the `transactions` table
10. Verify that the new variable pointer works by using `tally()`
```{r, eval = eval_caching}
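# Counting the rows forces Spark to scan the mapped files
spark_lineitems %>% tally()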
```
## Caching data
*Learn how to cache a subset of the data in Spark*
1. Create a subset of the mapped table object (`spark_lineitems`): summarize by **date**, creating a total price and a count of items sold.
```{r, eval = eval_caching}
daily_orders <- spark_lineitems %>%
  # `date` and `price` column names are assumed; price was read as character
  mutate(price = as.double(price)) %>%
  group_by(date) %>%
  summarise(total_sales = sum(price, na.rm = TRUE), no_items = n())
```
2. Use `compute()` to extract the data into Spark memory
```{r, eval = eval_caching}
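# Runs the aggregation and caches the results in Spark memory as "orders"
orders <- compute(daily_orders, "orders")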
```
3. Confirm new variable pointer works
```{r, eval = eval_caching}
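# Any operation against `orders` should now run from the cached data
orders %>% head()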
```
4. Go to the Spark UI (one way to open it from R is sketched after this list)
5. Click the **Storage** tab
6. Notice that "orders" is now cached in Spark memory
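Rather than navigating to the Spark UI by URL, `sparklyr` provides `spark_web()` to open it directly from R:
```{r, eval = FALSE}
# Opens the Spark web UI for the current connection in a browser
spark_web(sc)
```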