```{r, intro-to-sparklyr, include = FALSE}
eval_sparklyr <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_sparklyr <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```
# Intro to `sparklyr`
```{r, eval = eval_sparklyr, include = FALSE}
library(dplyr)
library(sparklyr)
```
## New Spark session
*Learn to open a new Spark session*
1. Load the `sparklyr` library
```{r, eval = eval_sparklyr}
library(sparklyr)
```
2. Use `spark_connect()` to create a new local Spark session
```{r, eval = eval_sparklyr}
sc <- spark_connect(master = "local")
```
3. Click on the `Spark` button to view the current Spark session's UI
4. Click on the `Log` button to see the message history
## Data transfer
*Practice uploading data to Spark*
1. Load the `dplyr` library
```{r, eval = eval_sparklyr}
library(dplyr)
```
2. Copy the `mtcars` dataset into the session
```{r, eval = eval_sparklyr}
spark_mtcars <- copy_to(sc, mtcars, "my_mtcars")
```
3. In the **Connections** pane, expand the `my_mtcars` table
4. Go to the Spark UI and note the new jobs
5. In the UI, click the **Storage** button and note the new table
6. Click on the **In-memory table my_mtcars** link
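`copy_to()` returns a reference to the remote Spark table, not a local data frame, so the uploaded data can also be inspected from R. A minimal sketch (assumes the `sc` connection and `spark_mtcars` from the steps above):

```{r, eval = eval_sparklyr}
# List the tables currently registered in the Spark session
src_tbls(sc)

# Preview the remote table; rows are pulled from Spark only when printed
head(spark_mtcars)
```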
## Spark and `dplyr`
*See how Spark handles `dplyr` commands*
1. Run the following code snippet
```{r, eval = eval_sparklyr}
spark_mtcars %>%
group_by(am) %>%
summarise(mpg_mean = mean(mpg, na.rm = TRUE))
```
2. Go to the Spark UI and click the **SQL** button
3. Click on the top item inside the **Completed Queries** table
4. At the bottom of the diagram, expand **Details**
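The `dplyr` verbs are not run in R; they are translated into Spark SQL, which is what appears under the **SQL** tab. The generated query can be inspected directly with `show_query()` (a sketch reusing the pipeline above):

```{r, eval = eval_sparklyr}
# Print the SQL that sparklyr sends to Spark for this pipeline
spark_mtcars %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>%
  show_query()
```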
## Feature transformers
*Introduction to how Spark Feature Transformers can be called from R*
1. Use `ft_binarizer()` to create a new column, called `over_20`, that indicates if that row's `mpg` value is over or under 20MPG
```{r, eval = eval_sparklyr}
# One possible solution: flag rows with mpg above a threshold of 20
spark_mtcars %>%
  ft_binarizer("mpg", "over_20", threshold = 20)
```
2. Pipe the code into `count()` to see how the data splits between the two values
```{r, eval = eval_sparklyr}
# One possible solution: count rows on each side of the 20 MPG split
spark_mtcars %>%
  ft_binarizer("mpg", "over_20", threshold = 20) %>%
  count(over_20)
```
3. Start a new code chunk. This time use `ft_quantile_discretizer()` to create a new column called `mpg_quantile`
```{r, eval = eval_sparklyr}
# One possible solution: bucket mpg into quantiles
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile")
```
4. Add the `num_buckets` argument to `ft_quantile_discretizer()`, set its value to 5
```{r, eval = eval_sparklyr}
# One possible solution: split mpg into five quantile buckets
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5)
```
5. Pipe the code into `count()` to see how the data splits between the quantiles
```{r, eval = eval_sparklyr}
# One possible solution: count rows per quantile bucket
spark_mtcars %>%
  ft_quantile_discretizer("mpg", "mpg_quantile", num_buckets = 5) %>%
  count(mpg_quantile)
```
## Models
*Introduce Spark ML models by running a couple of them in R*
1. Use `ml_kmeans()` to run a model based on the following formula: `wt ~ mpg`. Assign the results to a variable called `k_mtcars`
```{r, eval = eval_sparklyr}
# Fit a k-means model of wt ~ mpg on the Spark table
k_mtcars <- ml_kmeans(spark_mtcars, wt ~ mpg)
```
2. Use `k_mtcars$summary` to view the results of the model. Pull the cluster sizes by using `...$cluster_sizes()`
```{r, eval = eval_sparklyr}
k_mtcars$summary$cluster_sizes()
```
3. Start a new code chunk. This time use `ml_linear_regression()` to produce a Linear Regression model of the same formula used in the previous model. Assign the results to a variable called `lr_mtcars`
```{r, eval = eval_sparklyr}
# Fit a linear regression using the same formula as the k-means model
lr_mtcars <- ml_linear_regression(spark_mtcars, wt ~ mpg)
```
4. Use `summary()` to view the results of the model
```{r, eval = eval_sparklyr}
summary(lr_mtcars)
```
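Once the exercises are done, the local Spark session can be shut down; a minimal sketch:

```{r, eval = eval_sparklyr}
# Close the connection and stop the local Spark session
spark_disconnect(sc)
```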