---
title: "Data Crunching"
---
```{r options_communes, include=FALSE}
source("options_communes.R")
```
<div class="note">
**Key Take Away** :
</div>
## Challenges with Household Survey analysis
Household surveys often result in datasets with over 300 variables to process and explore. Deadlines for extracting insights from the dataset are often tight, and manual processing is slow and feasible only for a limited part of the dataset. Because of these challenges, many potential insights are never discovered.
To address this, analysis support needs to move from a reactive mode to a proactive one.
![](images/unknowquestions.png)
Data Crunching is about automating processes:
* It is the upstream step of data interpretation: the __preparation of a dataset__ so that it can be processed, sorted and structured before algorithms and program sequences are run on it;
* It can __save a lot of time__, as the processes do not need to be performed manually (unlike data munging and data wrangling, which refer to manual processing of data);
* It can be __iterative__ when the output of the crunching process includes new data or reveals errors: the program sequences may be repeated until the desired result is achieved;
* It enables better [__analysis reproducibility__](http://edouard-legoupil.github.io/humanitaRian-data-science/slides/) (minimizing point & click interventions), which facilitates peer review.
## Data Crunching Automation
Using the right combination of packages from the R statistical language, it is possible to integrate all necessary data analysis steps into **scripts**:
* Data management (clean, recode, merge, reshape)
* Data analysis (tests, regression, multivariate analysis, etc.)
* Data visualisation (plots, maps, graphs, etc.)
* Writing up results (report and presentation generation)
![](images/analysis.png)
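As a minimal sketch of what such a script can look like, the example below chains the four steps on a toy data frame using base R only. The data, the variable names (`hh_size`, `income`, `size_class`) and the output file name are hypothetical and serve only as illustration.

```r
# Toy survey extract (hypothetical values, for illustration only)
survey <- data.frame(
  hh_id   = 1:6,
  region  = c("north", "north", "south", "south", "south", "north"),
  hh_size = c(4, 6, 3, 5, 7, 2),
  income  = c(120, 200, 90, 150, 210, 80)
)

# 1. Data management: recode a numeric variable into classes
survey$size_class <- ifelse(survey$hh_size >= 5, "large", "small")

# 2. Data analysis: simple linear regression
model <- lm(income ~ hh_size, data = survey)

# 3. Data visualisation: write the chart to a file rather than the screen
chart_file <- file.path(tempdir(), "income_by_size.pdf")
pdf(chart_file)
plot(survey$hh_size, survey$income,
     xlab = "Household size", ylab = "Income")
abline(model)
invisible(dev.off())

# 4. Writing up: build the table a report would quote
summary_table <- aggregate(income ~ size_class, data = survey, FUN = mean)
print(summary_table)
```

Because every step lives in the script, rerunning it on a refreshed dataset regenerates the model, the chart and the table without any manual intervention.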
## Using KoboloadeR to facilitate reproducibility
Before data visualization and interpretation, several steps are required:
### Import raw data
In a structured survey with numbered questions, the __flat file__ data structure has a column for each question and a row for each respondent. A more complex structure arises when the data are __hierarchical__. A common type of hierarchy is one where a series of questions is repeated, say, for each child in the household, and is combined with a household questionnaire and perhaps with data collected at community level. For analysis, we can create a rectangular flat file at the ‘child level’ by repeating the relevant household information in a separate row for each child. Similarly, we can summarise the information for the children in a household to create a ‘household level’ analysis file.

For a hierarchical dataset, [ODK Briefcase](https://opendatakit.org/downloads/download-category/briefcase/) is required to export the data and to configure the key used to join the frames together.
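The two flat files described above can be sketched in base R as follows. The frames, the join key `hh_id` and the values are hypothetical; a real export would use whatever key was configured at export time.

```r
# Household-level frame: one row per household (hypothetical data)
household <- data.frame(
  hh_id  = c("h1", "h2", "h3"),
  region = c("north", "south", "south"),
  stringsAsFactors = FALSE
)

# Child-level frame: one row per child, linked by the household key
child <- data.frame(
  hh_id = c("h1", "h1", "h2", "h3", "h3", "h3"),
  age   = c(2, 7, 4, 1, 9, 12),
  stringsAsFactors = FALSE
)

# 'Child level' flat file: household information repeated on each child row
child_level <- merge(child, household, by = "hh_id", all.x = TRUE)

# 'Household level' file: children summarised, then attached to the household
child_summary <- aggregate(age ~ hh_id, data = child, FUN = length)
names(child_summary)[2] <- "n_children"
household_level <- merge(household, child_summary, by = "hh_id", all.x = TRUE)
```

`all.x = TRUE` keeps every row of the left-hand frame even when the key has no match, which makes orphan records visible as `NA` instead of silently dropping them.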
### Recode & Relabel
* Reuse the [xlsform](http://xlsform.org) file already used to encode the questionnaire (saved as .xls, not .xlsx) to generate a data dictionary
* Extend the xlsform by adding extra columns (chapter, disaggregation, correlation, etc.)
* Where needed, revise the label wording so that labels are concise when they appear on the output charts
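The sketch below illustrates the idea with two small data frames standing in for the `survey` and `choices` sheets of an xlsform. The `type`, `name`, `label` and `list_name` columns follow the xlsform convention; the `chapter` column, the question names and the labels are hypothetical, and this is not the code KoboloadeR itself runs.

```r
# Stand-ins for the "survey" and "choices" sheets of an xlsform
survey_sheet <- data.frame(
  type    = c("integer", "select_one water_src"),
  name    = c("hh_size", "water"),
  label   = c("Household size", "Main source of drinking water"),
  chapter = c("Demographics", "WASH"),  # hypothetical extra column
  stringsAsFactors = FALSE
)
choices_sheet <- data.frame(
  list_name = c("water_src", "water_src"),
  name      = c("tap", "well"),
  label     = c("Piped water", "Protected well"),
  stringsAsFactors = FALSE
)

# Data dictionary: one row per variable, with the choice list it refers to
dictionary <- survey_sheet[, c("name", "label", "chapter")]
dictionary$list_name <- ifelse(
  grepl("^select_one ", survey_sheet$type),
  sub("^select_one ", "", survey_sheet$type),
  NA
)

# The raw export holds codes; relabel them from the choices sheet
raw <- data.frame(
  hh_size = c(4, 6, 3),
  water   = c("tap", "well", "tap"),
  stringsAsFactors = FALSE
)
water_choices <- choices_sheet[choices_sheet$list_name == "water_src", ]
raw$water <- factor(raw$water,
                    levels = water_choices$name,
                    labels = water_choices$label)
```

Keeping labels in the form rather than in the analysis script means a wording change is made once and flows through to every chart.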
### Clean records
* The cleaning log is built up through the iterations of the crunching
* The log is stored as a worksheet and acts as a documented data audit trail
* Each log action is classified as either “update” or “delete”
* The log is re-applied to the raw data on every run
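One way such a log can be re-applied on every run is sketched below. The log layout (`id`, `variable`, `action`, `new_value`) and the helper `apply_cleaning_log()` are assumptions made for this example, not the format KoboloadeR expects.

```r
# Raw data as exported (hypothetical records, two of them with obvious errors)
raw <- data.frame(
  id     = c("r1", "r2", "r3", "r4"),
  age    = c(34, 340, 27, 51),
  region = c("north", "south", "sotuh", "north"),
  stringsAsFactors = FALSE
)

# Cleaning log: one row per correction, kept as the audit trail
cleaning_log <- data.frame(
  id        = c("r2", "r3", "r4"),
  variable  = c("age", "region", NA),
  action    = c("update", "update", "delete"),
  new_value = c("34", "south", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply all updates first, then drop deleted records
apply_cleaning_log <- function(data, log) {
  updates <- log[log$action == "update", ]
  for (i in seq_len(nrow(updates))) {
    row    <- data$id == updates$id[i]
    column <- updates$variable[i]
    value  <- updates$new_value[i]
    # The log stores values as text; convert back for numeric columns
    if (is.numeric(data[[column]])) value <- as.numeric(value)
    data[row, column] <- value
  }
  deletes <- log$id[log$action == "delete"]
  data[!data$id %in% deletes, , drop = FALSE]
}

clean <- apply_cleaning_log(raw, cleaning_log)
```

The raw file is never edited: the cleaned frame is always rebuilt from raw data plus log, so every correction stays traceable.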
### Reweight dataset according to sampling strategy
* Associate a weight with each record
* The weight is defined by the sampling script (it can be based on cluster sampling or on respondent-driven sampling)
* Post-stratification can be used to re-compute corrected weights when the sample has low coverage
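As a minimal sketch of post-stratification, the example below gives each record a weight equal to the population of its stratum divided by the number of sampled records in that stratum. The strata, the population totals and the variable `has_water` are hypothetical.

```r
# Sample with uneven coverage: rural households are under-represented
sample_data <- data.frame(
  stratum   = c("urban", "urban", "urban", "rural"),
  has_water = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Known population totals per stratum (hypothetical figures)
population <- data.frame(
  stratum = c("urban", "rural"),
  pop     = c(600, 400),
  stringsAsFactors = FALSE
)

# Number of sampled records per stratum, aligned with the population table
n_sampled <- table(sample_data$stratum)
population$n <- as.integer(n_sampled[population$stratum])

# Post-stratification weight: population represented by each sampled record
population$weight <- population$pop / population$n
sample_data$weight <- population$weight[
  match(sample_data$stratum, population$stratum)
]

unweighted <- mean(sample_data$has_water)
weighted   <- weighted.mean(sample_data$has_water, sample_data$weight)
```

Here the weights sum to the total population, and the weighted estimate is lower than the unweighted one because the single rural record now counts for its full share of the population.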
### Build new indicators from existing variables
Indicators are summary measures. They often provide a baseline from which to weigh up the finer points. It is important not to create unnecessary confusion: an indicator should synthesise information and serve as a reasonable measure of some issue or concept. The concept should have an agreed name so that users can discuss it meaningfully.

* Create new indicators from existing ones: each indicator needs to be defined in a worksheet with its type, name, label, chapter, correlation, aggregation, formula and frame
* Indicator formulas are written in an R-ready syntax, which allows for complex notation:
* You may need `dcast` to calculate an indicator based on values from a nested data frame. `dcast` works like a pivot table, using the unique ID used for the join.
* You may need a conditional (`if` / `ifelse`) when a calculation could involve a division by zero

Once calculated, indicators are appended to the relevant data frame, and their definitions are appended to the data dictionary.
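The sketch below shows one way formulas stored as text can be evaluated against a data frame. The indicator sheet layout, the indicator names and the data are hypothetical. To stay self-contained it uses base R `aggregate()` as a stand-in for `dcast` when summarising the nested frame.

```r
# Household frame and nested member frame (hypothetical data)
household <- data.frame(
  hh_id     = c("h1", "h2", "h3"),
  n_rooms   = c(2, 0, 3),
  n_members = c(4, 5, 6),
  stringsAsFactors = FALSE
)
member <- data.frame(
  hh_id = c("h1", "h1", "h2", "h3", "h3", "h3"),
  age   = c(3, 30, 40, 2, 4, 35),
  stringsAsFactors = FALSE
)

# Summarise the nested frame by the join key, then attach it to the household
under5 <- aggregate(age ~ hh_id, data = member,
                    FUN = function(a) sum(a < 5))
names(under5)[2] <- "n_under5"
household <- merge(household, under5, by = "hh_id", all.x = TRUE)

# Indicator definitions, as they could be read from a worksheet
indicator_sheet <- data.frame(
  name    = c("crowding", "share_under5"),
  label   = c("Persons per room", "Share of members under 5"),
  # ifelse() guards against a zero denominator
  formula = c("ifelse(n_rooms > 0, n_members / n_rooms, NA)",
              "n_under5 / n_members"),
  stringsAsFactors = FALSE
)

# Evaluate each formula inside the frame and append the result as a column
for (i in seq_len(nrow(indicator_sheet))) {
  household[[indicator_sheet$name[i]]] <-
    eval(parse(text = indicator_sheet$formula[i]), envir = household)
}

# Append the indicator definitions to a (toy) data dictionary
dictionary <- data.frame(name = "n_rooms", label = "Number of rooms",
                         stringsAsFactors = FALSE)
dictionary <- rbind(dictionary, indicator_sheet[, c("name", "label")])
```

Evaluating text with `eval(parse())` runs arbitrary code, so this approach is only appropriate when the indicator sheet comes from a trusted source.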
## KoboloadeR
KoboloadeR is an R package (i.e. a set of functions) that can be plugged into a configuration file in order to separate “input”, “processing” and “output”.
* The “output” is an Rmd (R Markdown) file that produces Word, PDF or HTML reports
![](images/rm-cheatsheet1.png)
* The configuration file includes references to all “input”:
    * Path to the raw data files
    * Path to the form (in xlsform) in order to build a data dictionary
    * Path to the sample weights
    * Path to the data cleaning log
    * Path to the indicator calculation sheet
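
To illustrate the separation between inputs and processing, the sketch below keeps all input paths in a single list and checks them before any processing starts. The entry names, the paths and the helper `check_config()` are hypothetical and do not reflect KoboloadeR's actual configuration format.

```r
# Hypothetical configuration: every input is referenced in one place
config <- list(
  raw_data     = "data/raw/household.csv",
  form         = "data/form.xls",
  weights      = "data/weights.csv",
  cleaning_log = "data/cleaning_log.csv",
  indicators   = "data/indicators.csv"
)

required_inputs <- c("raw_data", "form", "weights",
                     "cleaning_log", "indicators")

# Hypothetical helper: stop on missing entries, report missing files
check_config <- function(config, required = required_inputs) {
  missing_keys <- setdiff(required, names(config))
  if (length(missing_keys) > 0) {
    stop("Missing configuration entries: ",
         paste(missing_keys, collapse = ", "))
  }
  paths <- unlist(config[required])
  # Names of the inputs whose file cannot be found on disk
  names(paths)[!file.exists(paths)]
}

missing_files <- check_config(config)
```

With this layout, pointing the same processing scripts at a new survey only requires editing the configuration, not the code.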
### Collaborative Development
* Open-source package maintained on [GitHub](https://github.com/Edouard-Legoupil/koboloadeR)
* Submit [bug reports or feature requests](https://github.com/Edouard-Legoupil/koboloadeR/issues) as issues on GitHub
* [Fork and submit a pull request](https://github.com/Edouard-Legoupil/koboloadeR/tree/master/R#fork-destination-box) for code review and integration