---
output:
  md_document:
    variant: markdown_github
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
options(width = 110)
```
> Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
>
> -- *"[For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight](http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html)" - The New York Times, 2014*
# janitor <img src="tools/readme/logo_small.png" align="right" />
***********************
[Travis CI build status](https://travis-ci.org/sfirke/janitor)
[Codecov test coverage](https://codecov.io/github/sfirke/janitor?branch=master)
[CRAN package page](https://cran.r-project.org/package=janitor)


**janitor** has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
The main janitor functions:
* perfectly format data.frame column names;
* generate and format quick one- and two-variable tabulations (i.e., frequency tables and crosstabs); and
* isolate partially-duplicate records.
The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.
janitor is a [#tidyverse](https://github.com/hadley/tidyverse/blob/master/vignettes/manifesto.Rmd)-oriented package. Specifically, it plays nicely with the `%>%` pipe and is optimized for cleaning data brought in with the [readr](https://github.com/hadley/readr) and [readxl](https://github.com/hadley/readxl) packages.
### Installation
You can install:
* the latest released version from CRAN with
```R
install.packages("janitor")
```
* the latest development version from GitHub with
```R
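# Install devtools if it is missing or older than 1.6
if (!requireNamespace("devtools", quietly = TRUE) || packageVersion("devtools") < "1.6") {
  install.packages("devtools")
}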
devtools::install_github("sfirke/janitor")
```
#### November 2017: Please try out the dev version!
The current development version is significantly ahead of version 0.3.0 on CRAN (which was released in May 2017). It has a greatly enhanced `tabyl()` function (see the [tabyls vignette](https://github.com/sfirke/janitor/blob/master/vignettes/tabyls.md)) and improvements to `clean_names()`. Because `clean_names()` now handles variable names differently, it may produce different column names than before and break old code. More info is in the [NEWS](https://github.com/sfirke/janitor/blob/master/NEWS.md) file; the very quick fix is to supply the argument `case = "old_janitor"`.
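As a minimal sketch of that quick fix, assuming a janitor version whose `clean_names()` accepts the `case` argument, the call below asks for the pre-change naming rules. The `legacy` data.frame is a made-up example, not part of the package.

```R
library(janitor)

# Hypothetical data.frame with a column name that needs cleaning
legacy <- data.frame(`Hire Date` = 39690, check.names = FALSE)

# Request the older naming behavior explicitly
legacy_clean <- clean_names(legacy, case = "old_janitor")
names(legacy_clean)
```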
## Using janitor
Below are quick examples of how janitor tools are commonly used. A full description of each function can be found in janitor's [catalog of functions](https://github.com/sfirke/janitor/blob/master/vignettes/introduction.md).
### Cleaning dirty data
Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file [dirty_data.xlsx](https://github.com/sfirke/janitor/blob/master/dirty_data.xlsx):

Dirtiness includes:
* Dreadful column names
* Rows and columns containing Excel formatting but no data
* Dates stored as numbers
* Values spread inconsistently over the "Certification" columns
Here's that data after being read into R:
```{r, warning = FALSE, message = FALSE}
library(pacman) # for loading packages
p_load(readxl, janitor, dplyr)
roster_raw <- read_excel("dirty_data.xlsx") # available at http://github.com/sfirke/janitor
glimpse(roster_raw)
```
Excel formatting led to an untitled empty column and 5 empty rows at the bottom of the table (only 12 records have any actual data). Bad column names are preserved.
Clean it with janitor functions:
```{r}
roster <- roster_raw %>%
  clean_names() %>%
  remove_empty_rows() %>%
  remove_empty_cols() %>%
  mutate(hire_date = excel_numeric_to_date(hire_date),
         cert = coalesce(certification, certification_1)) %>% # from dplyr
  select(-certification, -certification_1) # drop unwanted columns
roster
```
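Two steps in that pipeline may be easier to follow in isolation. The sketch below uses made-up values (not taken from `dirty_data.xlsx`) to show `excel_numeric_to_date()` converting an Excel serial number to a `Date`, and dplyr's `coalesce()` taking the first non-missing value across two vectors.

```R
library(janitor)
library(dplyr)

# Excel stores dates as day counts; serial number 40000 is 6 July 2009
hire_date <- excel_numeric_to_date(40000)
hire_date
#> [1] "2009-07-06"

# Hypothetical certification columns with values spread across both
cert_a <- c("Physical ed", NA, NA)
cert_b <- c(NA, "Theater", NA)

# coalesce() keeps the first non-NA value at each position
cert <- coalesce(cert_a, cert_b)
```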
The core janitor cleaning function is `clean_names()` - call it whenever you load data into R.
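A small illustration, using a hypothetical `messy` data.frame rather than the roster file:

```R
library(janitor)

# Column names with spaces and capital letters, as often found in spreadsheets
messy <- data.frame(`First Name` = "Ann", `Last Name` = "Lee", check.names = FALSE)

cleaned <- clean_names(messy)
names(cleaned)
#> [1] "first_name" "last_name"
```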
### Examining dirty data
#### Finding duplicates
Use `get_dupes()` to identify and examine duplicate records during data cleaning. Let's see if any teachers are listed more than once:
```{r}
roster %>% get_dupes(first_name, last_name)
```
Yes, some teachers appear twice. We ought to address this before counting employees.
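For readers without the roster file, here is a self-contained sketch of the same check on a made-up `staff` data.frame. It assumes only that `get_dupes()` returns the duplicated rows along with a `dupe_count` column.

```R
library(janitor)
library(dplyr)

# Hypothetical staff list in which one person is listed twice
staff <- data.frame(
  first_name = c("Ann", "Bob", "Ann"),
  last_name  = c("Lee", "Ray", "Lee"),
  subject    = c("Math", "Art", "Science"),
  stringsAsFactors = FALSE
)

# Rows that share a first_name and last_name with at least one other row
dupes <- staff %>% get_dupes(first_name, last_name)
dupes
```

Only the two "Ann Lee" rows should come back, which makes it easy to see that they differ in `subject`.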
#### Tabulating tools
A variable (or combinations of two or three variables) can be tabulated with `tabyl()`. The resulting data.frame can be tweaked and formatted
with the suite of `adorn_` functions for quick analysis and printing of pretty results in a report. `adorn_` functions can be helpful with non-tabyls, too.
`tabyl` can be called two ways:
* On a vector, when tabulating a single variable - e.g., `tabyl(roster$subject)`
* On a data.frame, specifying 1, 2, or 3 variable names to tabulate: `roster %>% tabyl(subject, employee_status)`.
    * Here the data.frame is passed in with the `%>%` pipe; this allows for dplyr commands earlier in the pipeline.
##### tabyl()
Like `table()`, but pipe-able, data.frame-based, and fully featured.
One variable:
```{r}
roster %>%
  tabyl(subject)
```
Two variables:
```{r}
roster %>%
  filter(hire_date > as.Date("1950-01-01")) %>%
  tabyl(employee_status, full_time)
```
Three variables:
```{r}
roster %>%
  tabyl(full_time, subject, employee_status)
```
##### Adorning tabyls
The suite of `adorn_` functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:
```{r}
roster %>%
  tabyl(employee_status, full_time) %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  adorn_ns()
```
Pipe that right into `knitr::kable()` in your R Markdown report!
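A minimal sketch of that hand-off, using a made-up `staff` data.frame and assuming the knitr package is installed:

```R
library(janitor)
library(dplyr)
library(knitr)

# Hypothetical data standing in for the roster
staff <- data.frame(
  employee_status = c("Teacher", "Teacher", "Coach"),
  full_time       = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Build the crosstab, add a totals row, then render it as a Markdown table
report <- staff %>%
  tabyl(employee_status, full_time) %>%
  adorn_totals("row") %>%
  kable()
report
```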
These modular adornments can be layered to reduce R's deficit against Excel and SPSS when it comes to quick, informative counts.
## Contact me
You are welcome to:
* submit suggestions and report bugs: https://github.com/sfirke/janitor/issues
* send a pull request: https://github.com/sfirke/janitor/
* let me know what you think on Twitter @samfirke
* compose a friendly e-mail to: <img src = "http://samfirke.com/wp-content/uploads/2016/07/email_address_whitespace_top.png" alt = "samuel.firke AT gmail" width = "210"/>