---
title: "Chapter 5: Summaries and Aggregation"
author: "Government Analysis Function and ONS Data Science Campus"
engine: knitr
execute:
echo: true
eval: false
freeze: auto # re-render only when source changes
---
> To switch between light and dark modes, use the toggle in the top left
# Learning Objectives
* Describe numeric and categorical data using statistics such as:
* Mean
* Median
* Standard Deviation
* Variance
* Aggregate data
# Packages and Data
We will use the titanic dataset and clean its column names with janitor.
```{r message = FALSE, warning = FALSE, include = TRUE}
# Load in packages
library(tidyverse)
library(janitor)
# Prepare the dataset
titanic_data <- read_csv("Data/titanic.csv",
na = c("*", ".", "", "NULL"))
titanic_data <- clean_names(titanic_data)
glimpse(titanic_data)
```
# Descriptive Statistics
There are two broad types of data in our tibble:
* Numerical data (i.e. integers and doubles)
* Text data (i.e. character strings)
## Numerical Data
The **summarise()** verb allows us to compute one or more summary values from the variables we specify.
There are many summary statistics functions we can use with summarise():
* **mean()** - The average (arithmetic mean) data value in the given column.
* **median()** - The middle value of an ordered vector of values.
* If the vector has an odd number of values, it is the exact middle value.
* If the vector has an even number of values, there is no single middle value, so the median is the average of the two middle values (e.g. if there are 80 values, the median is the average of the 40th and 41st values).
* **var()** - The variance measures the degree of spread of each point from the mean. It is the square of the standard deviation.
* **sd()** - The standard deviation measures the spread of values around the mean. It is the square root of the variance.
* **quantile()** - Produces sample quantiles corresponding to the given proportions (between 0 and 1). We can compute upper and lower quartiles with this.
* **min()** - The smallest value in the given column.
* **max()** - The maximum value in the given column.
* **n()** - Counts the number of entries (rows).
* **n_distinct()** - Counts the number of unique entries for a given variable.
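As a quick aside, we can verify the relationship between var() and sd() on a small toy vector (the values below are made up purely for illustration):
```{r}
# A toy vector to illustrate that sd() is the square root of var()
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
var(x)        # sample variance
sqrt(var(x))  # matches sd(x)
sd(x)
```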
### **Examples**{-}
Let's compute the mean fare paid by passengers on the titanic.
The **summarise()** verb takes the following form:
> **summarise(agg_col_name = summary_function(col))** - notice that we can name the output column as well!
```{r}
# Calculating the mean fare paid
titanic_data |>
summarise(mean(fare)) # Compute mean of fare column
```
>**This highlights something very important: any operation applied to a column containing one or more missing values produces NA.**
This is because we are combining a value that is Not Available (it is unknown) with actual numbers, and the result of such a calculation is itself unknown (think of dividing by 0: it is simply not defined).
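A quick base R sketch of this propagation (using toy values):
```{r}
# NA propagates through arithmetic and summary functions
1 + NA             # NA
mean(c(1, 2, NA))  # NA - a single missing value makes the result unknown
```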
By default, this trips up most of the summary functions, but there is an optional parameter we can set, "**na.rm = TRUE**".
This omits any NA values from the calculation and produces a more meaningful result, albeit one based on a subset of the observed data.
```{r}
# Calculating the mean fare paid by passengers, removing NAs
titanic_data |>
summarise(mean(fare, na.rm = TRUE))
```
To assign this a meaningful column name, we'd pass a keyword argument before the summary function used:
>**agg_col_name = summary_function(column, na.rm = TRUE)**
```{r}
# Calculating the mean fare paid and naming it
titanic_data |>
summarise(mean_fare = mean(fare, na.rm = TRUE))
```
If we didn't care about the tibble output summarise generates, we could simply compute using base R syntax:
```{r}
# Calculating the mean fare with base R
mean(titanic_data$fare, na.rm = TRUE)
```
As we saw with mutate() and other verbs, we can compute multiple statistics within **summarise()**, separating them with commas.
```{r}
# Calculating multiple summary statistics with summarise()
titanic_data |>
summarise(median_fare = median(fare, na.rm = TRUE),
mean_fare = mean(fare, na.rm = TRUE),
sd_fare = sd(fare, na.rm = TRUE),
median_age = median(age_of_passenger, na.rm = TRUE),
mean_age = mean(age_of_passenger, na.rm = TRUE),
sd_age = sd(age_of_passenger, na.rm = TRUE))
```
### Quantiles
We can also calculate quantiles, such as the upper and lower quartile within summarise().
To do this, we must specify a value between 0 and 1, which denotes the nth percentile/quantile. For example:
* 0.1 denotes the 10th percentile - the value below which 10% of the data fall.
* 0.25 denotes the lower quartile - the value below which 25% of the data fall.
* 0.75 denotes the upper quartile - the value below which 75% of the data fall.
### **Examples**{-}
The quantile function is structured as:
> **quantile(col, probs, na.rm = TRUE)** where probs is the quantile you want (a value between 0 and 1, or a vector of them).
```{r}
# Computing the lower quartile
titanic_data |>
summarise(fare_lower = quantile(fare, probs = 0.25, na.rm = TRUE))
```
```{r}
# Computing the upper quartile
titanic_data |>
summarise(fare_upper = quantile(fare, probs = 0.75, na.rm = TRUE))
```
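If we want several quantiles at once, quantile() accepts a vector of probabilities. A minimal base R sketch computing all three quartiles in one call:
```{r}
# Computing the quartiles together with base R
quantile(titanic_data$fare,
         probs = c(0.25, 0.5, 0.75),
         na.rm = TRUE)
```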
## Categorical Data
Simple statistics on qualitative data include:
* **distinct()** - Displays the unique values in a given column.
* **count()** - Displays the number of occurrences of each unique entry in a given column.
These (of course) will not work well on columns that are almost entirely unique, such as a passenger's name!
### **Examples**{-}
Let's start with **count()**, which we can sort with the extra parameter **sort = TRUE** for a descending order output.
```{r}
# Calculating frequency of each unique passenger name
titanic_data |>
count(name_of_passenger, sort = TRUE)
```
Compare this to a categorical variable with only a few unique categories, like embarked.
```{r}
# Calculating the number of passengers embarked from each port
titanic_data |>
count(embarked, sort = TRUE)
```
Let's see distinct() in action:
```{r}
# Select only unique/distinct rows from the embarked column
titanic_data |>
distinct(embarked)
```
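Since count() returns an ordinary tibble, we can keep piping. As a sketch (not from the original material), we could add the proportion of passengers per port alongside the raw counts:
```{r}
# Adding proportions alongside the counts
titanic_data |>
  count(embarked) |>
  mutate(prop = n / sum(n))
```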
## Exercise
::: {.panel-tabset}
### **Exercise**{-}
1. Use summarise() to return the ages of the oldest and youngest passengers on the titanic.
2. Use a summary function for categorical data to determine the number of men and women in the dataset.
### **Show Answer**{-}
1.
```{r}
# Calculating the oldest and youngest
titanic_data |>
summarise(youngest_age = min(age_of_passenger, na.rm = TRUE),
oldest_age = max(age_of_passenger, na.rm = TRUE))
```
2.
```{r}
# Calculating the count of men and women on the titanic
titanic_data |>
count(sex_of_passenger)
```
We see that there were almost twice as many men as there were women on the titanic.
:::
# Aggregation
Aggregation means grouping data together by a particular grouping variable and producing a summary of one or more columns for that grouping variable.
For example, we might want to see the average fare paid **by** sex or **by** port embarked from.
The **group_by()** function from dplyr is the bread and butter of this section.
## Group By
Group by performs the aggregation necessary so that we can use summarise() from before to obtain numeric summaries by group.
It is formed on a principle known as **split - summarise - combine**
* Split - The tibble is divided into a set of smaller tibbles based on the grouping variable. For example, a grouping variable with 3 unique categories produces 3 smaller tibbles.
* Summarise - An aggregate statistic is applied to each of the groups (such as the mean of their fare paid) and a single row for each group is output.
* Combine - The individual outputs from aggregation are combined into a new tibble.
### Example
Let's look at this in code by finding the mean fare for each passenger class.
The sequence of code will go as follows:
> **group_by(col) |> summarise(agg_col = summary_function(col))**
Firstly, let's check how many classes there are:
```{r}
# How many pclasses are there?
titanic_data |>
distinct(pclass)
```
So we should expect three values, one mean fare for each of the passenger classes.
```{r}
# Calculate mean fare by passenger class
titanic_data |>
group_by(pclass) |>
summarise(mean_fare = mean(fare, na.rm = TRUE))
```
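Because summarise() also returns a regular tibble, we can continue the pipeline. A small sketch sorting the groups by their mean fare:
```{r}
# Sorting the grouped summary by mean fare
titanic_data |>
  group_by(pclass) |>
  summarise(mean_fare = mean(fare, na.rm = TRUE)) |>
  arrange(desc(mean_fare))
```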
### The similarity of count() and group_by()
An interesting point to note is that we can use the n() summary function to count the number of observations in each group after a group_by().
```{r}
# Group by and count the number of unique groups
titanic_data |>
group_by(pclass) |>
summarise(n = n())
```
This looks similar to the output of another function we used!
```{r}
# Replicate the above result with a different function
titanic_data |>
count(pclass)
```
This is because count() performs exactly the same computation in the background as the combination of group_by() and summarise() with n()!
A great shortcut if a count is all you need.
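As a further convenience, count() lets us name its output column via the name argument rather than accepting the default n. A quick sketch:
```{r}
# Naming the count column directly
titanic_data |>
  count(pclass, sort = TRUE, name = "passengers")
```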
## Multiple Aggregation
We can also aggregate by more than one variable for more complex groupings. For example, we may require the mean fare paid by first class passengers leaving from Southampton.
We would need to group by both passenger class and embarked to achieve this.
```{r}
# Group by pclass and embarked and find mean fare
titanic_data |>
group_by(pclass, embarked) |>
summarise(mean_fare = mean(fare, na.rm = TRUE))
```
We see that since there are 3 passenger classes and 4 embarked entries (including the null value), there are 3 * 4 = 12 possible combinations, of which 10 appear here (there are no second or third class passengers with a missing embarked value).
The order of the grouping variables does not affect which groups we get; it only changes how the output is arranged, as the groups are displayed in the order we grouped them:
```{r}
# Group by embarked and pclass and find mean fare
titanic_data |>
group_by(embarked, pclass) |>
summarise(mean_fare = mean(fare, na.rm = TRUE))
```
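One thing to be aware of: in recent versions of dplyr, summarising a tibble grouped by multiple variables returns a result that is still grouped by all but the last variable (and prints a message saying so). If we want an ungrouped tibble back, we can pass .groups = "drop", as in this sketch:
```{r}
# Returning an ungrouped tibble after a multi-variable summary
titanic_data |>
  group_by(embarked, pclass) |>
  summarise(mean_fare = mean(fare, na.rm = TRUE),
            .groups = "drop")
```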
### Exercise
::: {.panel-tabset}
### **Exercise**{-}
1. What is the average fare paid by men and women?
2. What is the median fare paid by men and women in each passenger class?
### **Show Answer**{-}
1.
```{r}
# Calculating the average fare by sex
titanic_data |>
group_by(sex_of_passenger) |>
summarise(mean_fare = mean(fare,
na.rm = TRUE))
```
2.
```{r}
# Calculating the median fare by sex and passenger class
titanic_data |>
group_by(sex_of_passenger, pclass) |>
summarise(median_fare = median(fare,
na.rm = TRUE))
```
:::
## The impact of NAs
Ideally, we don't want the null values forming a category of their own and cluttering up our summary tables.
When investigating data, we either:
* Deal with these at read in.
* Meticulously clean them with imputation methods (fill them in).
* Drop them entirely (be very careful with this!).
**tidyr** comes to the rescue when we are dealing with NAs as it has some handy functions to work with them:
* drop_na() removes all rows with missing values present.
* replace_na() fills in the missing values with whatever we specify.
Let's see a few small examples of this in practice.
### Filling missing values
We should always attempt to understand why values are missing before dealing with them.
Domain expertise comes into play when deciding whether it makes sense for a value to be missing; think of temperature:
> **A missing temperature reading is not the same as a temperature reading of 0 degrees celsius.**
As such, it wouldn't make sense to fill the NAs with 0, as we would create **bias** by over-inflating the number of 0-degree readings.
A better approach in cases such as this is to fill with an **average**, preferably one not impacted by outliers, such as the median.
### **Example**{-}
Let's consider first which numeric variables have missing values. We can check this by using the **is.na()** function.
This returns a vector of TRUEs and FALSEs, one for each value, indicating whether it is missing or not.
We can then sum() these up to find the total number of missing values in that column.
```{r}
# Compute number of missing values in age_of_passenger
titanic_data$age_of_passenger |>
is.na() |>
sum()
```
Repeating this for each column becomes laborious quickly, so there are some options for us:
* Use **colSums()** to sum the outputs of a function across columns.
* Create a **user-defined function** or **anonymous function** to apply across the columns of the dataset using functional programming in purrr.
Whilst the latter is recommended, it is beyond the scope of this chapter and features in later, more advanced courses.
Let's try colSums():
```{r}
# Use colSums() to find the missing values
titanic_data |> is.na() |>
colSums()
```
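For completeness, here is a tidyverse sketch of the same check using across(), a gentle first taste of the functional style mentioned above:
```{r}
# Counting NAs in every column with across()
titanic_data |>
  summarise(across(everything(), ~ sum(is.na(.x))))
```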
Now that we have understood the number of missing values, let's see some imputation.
### **Filling a numeric column with an average**{-}
We know that age_of_passenger has 267 missing values, but filling these in with 0 makes no sense and will skew the data.
A better approach would be to impute with an average, which keeps the imputed values close to the centre of the observed data.
However, we must also be careful which statistic we use, as some are more robust than others:
* The mean is heavily influenced by outliers (as it includes them in the calculation).
* The median is a more robust measure, as it depends only on the middle of the ordered data and so is not influenced by outliers.
#### **Example**{-}
The replace_na() function takes two arguments:
* The data itself
* replace - What to fill the NAs with
Using this within a pipeline requires us to also use **mutate()**, as we are modifying an existing column, as opposed to using base R's **$** to refer to it.
Let's impute the missing ages with the median:
```{r}
# Impute missing age values with median
titanic_data <- titanic_data |> # overwrite to save changes
mutate(age_of_passenger = replace_na(age_of_passenger,
replace = median(age_of_passenger,
na.rm = TRUE)))
```
Let's check this worked:
```{r}
# Check imputation
titanic_data |> is.na() |>
colSums()
```
### **Filling a categorical column with an average**{-}
With categorical columns, the most robust method we can use to impute is the most frequently occurring value, or the **mode**.
There is no built-in function for this in base R, so computing it directly would require an outside package. However, we can observe the mode from the output of count().
#### **Example**{-}
We saw that embarked had 2 missing values, let's fill them with the mode.
```{r}
# Find the mode of the embarked column
titanic_data |>
count(embarked)
```
We see that "S" (Southampton) was the most common port of embarkation among passengers.
Let's fill this in with replace_na():
```{r}
# Fill embarked in with the mode
titanic_data <- titanic_data |>
mutate(embarked = replace_na(embarked, replace = "S"))
```
Let's check again:
```{r}
# Check that embarked impute worked
titanic_data |> is.na() |>
colSums()
```
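As an aside, rather than reading the mode off the count() table and hard-coding "S", we could extract it programmatically. A sketch using dplyr's slice_max():
```{r}
# Extracting the most frequent category programmatically
titanic_data |>
  count(embarked) |>
  slice_max(n, n = 1) |>
  pull(embarked)
```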
Imputation is a very deep topic and goes beyond the scope of the course, but we recommend researching this for other examples.
### Dropping Missing Values
We must be very careful about doing this: as we previously noted, there may be a pattern to how or why values are missing, in which case they can often be treated instead.
We risk losing information and distorting our dataset by removing rows, but if analysis determines that there is no sensible way to impute the missing values, dropping them might be the only option.
### **Example**{-}
We use the drop_na() function to drop the remaining null values.
* In our example, cabin is roughly 75% missing, and with a variety of categories of all different forms, imputation would be a nightmare.
* As such, we would elect to exclude this column with **select()**.
* Similarly, there is only one missing fare value, which is more efficient to drop than to fill with an average.
```{r}
# Drop the null values in fare
titanic_data <- titanic_data |>
select(-cabin) |>
drop_na(fare)
```
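Note that with no columns specified, drop_na() removes any row containing a missing value in any column, so it pays to be explicit. A quick sketch of the contrast:
```{r}
# With no arguments, drop_na() drops rows with ANY missing value
titanic_data |> drop_na()

# Restricting it to one column only drops rows where fare is NA
titanic_data |> drop_na(fare)
```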
# Summary
Well done for working your way through this whistle-stop tour of aggregation and summary statistics!
This is the final compulsory chapter of the course, so your next steps are up to you. Some recommendations we would make are:
* Complete the additional Chapter 6, a case study that tests the skills you've learnt in this course on a new dataset.
* Complete the additional Chapter 7, which covers more advanced concepts, such as:
* Pivoting data
* Binning numeric columns to categorical columns.
* Functional programming tools with purrr.
* Extra list functionality.
* Complete any of the R courses that this course is a pre-requisite for on the [Learning Hub](https://learninghub.ons.gov.uk/local/catalogue/index.php), such as:
* [Data Visualization in R](https://learninghub.ons.gov.uk/course/view.php?id=553)
* [Statistics in R](https://learninghub.ons.gov.uk/course/view.php?id=560)
* [Modular Programming in Python and R](https://learninghub.ons.gov.uk/course/view.php?id=574)