```{r, intro-to-sparklyr, include = FALSE}
eval_sparklyr <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_sparklyr <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```
# Intro to `sparklyr`
```{r, eval = eval_sparklyr, include = FALSE}
library(dplyr)
library(sparklyr)
```
## New Spark session
*Learn to open a new Spark session*
1. Load the `sparklyr` library
```{r, eval = eval_sparklyr}
library(sparklyr)
```
2. Use `spark_connect()` to create a new local Spark session
```{r, eval = eval_sparklyr}
sc <- spark_connect(master = "local")
```
3. Click on the `Spark` button to view the current Spark session's UI
4. Click on the `Log` button to see the message history
## Data transfer
*Practice uploading data to Spark*
1. Load the `dplyr` library
```{r, eval = eval_sparklyr}
library(dplyr)
```
2. Copy the `mtcars` dataset into the session
```{r, eval = eval_sparklyr}
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars")
```
3. In the **Connections** pane, expand the `my_mtcars` table
4. Go to the Spark UI and note the new jobs
5. In the UI, click the **Storage** button and note the new table
6. Click on the **In-memory table my_mtcars** link
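`copy_to()` returns a reference to the remote Spark table, not a local data frame, so the uploaded data can also be inspected from R. A minimal sketch (assumes the `sc` connection and `spark_mtcars` from the steps above):

```{r, eval = eval_sparklyr}
# List the tables currently registered in the Spark session
src_tbls(sc)

# Preview the remote table; rows are pulled from Spark only when printed
head(spark_mtcars)
```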
## Spark and `dplyr`
*See how Spark handles `dplyr` commands*
1. Run the following code snippet
```{r, eval = eval_sparklyr}
spark_mtcars %>%
group_by(am) %>%
summarise(mpg_mean = mean(mpg, na.rm = TRUE))
```
2. Go to the Spark UI and click the **SQL** button
3. Click on the top item inside the **Completed Queries** table
4. At the bottom of the diagram, expand **Details**
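The `dplyr` verbs are not run in R; they are translated into Spark SQL, which is what appears under the **SQL** tab. The generated query can be inspected directly with `show_query()` (a sketch reusing the pipeline above):

```{r, eval = eval_sparklyr}
# Print the SQL that sparklyr sends to Spark for this pipeline
spark_mtcars %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>%
  show_query()
```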
## Feature transformers
*Introduction to how Spark Feature Transformers can be called from R*
1. Use `ft_binarizer()` to create a new column, called `over_20`, that indicates if that row's `mpg` value is over or under 20MPG
```{r, eval = eval_sparklyr}
# One possible solution: flag rows with mpg above a threshold of 20
spark_mtcars %>%
  ft_binarizer("mpg", "over_20", threshold = 20)
```
2. Pipe the code into `count()` to see how the data splits between the two values
```{r, eval = eval_sparklyr}
# One possible solution: count rows on each side of the 20 MPG split
spark_mtcars %>%
  ft_binarizer("mpg", "over_20", threshold = 20) %>%
  count(over_20)
```
3. Start a new code chunk. This time use `ft_quantile_discretizer()` to create a new column called `mpg_quantile`
```{r, eval = eval_sparklyr}
# One possible solution: bucket mpg into quantiles
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile")
```
4. Add the `num_buckets` argument to `ft_quantile_discretizer()`, set its value to 5
```{r, eval = eval_sparklyr}
# One possible solution: split mpg into five quantile buckets
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5)
```
5. Pipe the code into `count()` to see how the data splits between the quantiles
```{r, eval = eval_sparklyr}
# One possible solution: count rows per quantile bucket
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5) %>%
  count(mpg_quantile)
```
## Models
*Introduce Spark ML models by running a couple of them in R*
1. Use `ml_kmeans()` to run a model based on the following formula: `wt ~ mpg`. Assign the results to a variable called `k_mtcars`
```{r, eval = eval_sparklyr}
# Fit a k-means model of wt ~ mpg on the Spark table
k_mtcars <- ml_kmeans(spark_mtcars, wt ~ mpg)
```
2. Use `k_mtcars$summary` to view the results of the model. Pull the cluster sizes by using `...$cluster_sizes()`
```{r, eval = eval_sparklyr}
k_mtcars$summary$cluster_sizes()
```
3. Start a new code chunk. This time use `ml_linear_regression()` to produce a Linear Regression model of the same formula used in the previous model. Assign the results to a variable called `lr_mtcars`
```{r, eval = eval_sparklyr}
# Fit a linear regression using the same formula as the k-means model
lr_mtcars <- ml_linear_regression(spark_mtcars, wt ~ mpg)
```
4. Use `summary()` to view the results of the model
```{r, eval = eval_sparklyr}
summary(lr_mtcars)
```
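Once the exercises are done, the local Spark session can be shut down; a minimal sketch:

```{r, eval = eval_sparklyr}
# Close the connection and stop the local Spark session
spark_disconnect(sc)
```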