# Tutorial 2: Adding a more complex dataset
## Overview
This is the second of five tutorials on adding datasets to your `traits.build` database.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and have successfully built a database from the datasets in `traits.build-template`. Instructions are available at [Tutorial: Example compilation](tutorial_compilation.html).\
### Goals
- Learn how to [merge in location data from a standalone spreadsheet](#add_locations).
- Learn how to [add substitutions for categorical trait values](#categorical_substitutions).
- Learn how to [add custom R code](#custom_R_code) to the metadata file.
- Understand the importance of [attributing traits to the correct entity_type](#correct_entity_type).
- Understand the importance of the [dataset being able to pivot](#dataset_pivot) between long and wide formats.
### New functions introduced
- `metadata_add_substitution`
- `metadata_add_substitutions_list`
- `metadata_check_custom_R_code`
------------------------------------------------------------------------
## Adding tutorial_dataset_2
### Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled `tutorial_dataset_2` within the data folder.
- Ensure that this folder exists on your computer.
- Confirm that the file `data.csv` exists within the `tutorial_dataset_2` folder.
- Confirm that there is a folder `raw` nested within the `tutorial_dataset_2` folder that contains two files, `locations.csv` and `notes.txt`.
### Source necessary functions
- You'll need to source both the `traits.build` functions and some ancillary functions from a file in the repository's `R` folder:
```{r, eval=FALSE}
library(traits.build)
source("R/custom_R_code.R")
```
### Use functions to create a metadata.yml file
#### **Create a metadata template**
To create the metadata template, run:
```{r, eval=FALSE}
metadata_create_template("tutorial_dataset_2")
```
As with `tutorial_dataset_1`, this function leads you through a series of menus requiring user input. Ensure you select:
[data format:]{style="color:blue;"} [**wide**]{style="color:red;"}\
[taxon_name column:]{style="color:blue;"} [**name_original**]{style="color:red;"}\
[location_name column:]{style="color:blue;"} [**site_TEXT**]{style="color:red;"}\
[individual_id column:]{style="color:blue;"} [**1: NA**]{style="color:red;"}\
[collection_date column:]{style="color:blue;"} [**1: NA**]{style="color:red;"}\
[Enter collection_date range in format '2007/2009':]{style="color:blue;"} [**1996/1997**]{style="color:red;"}\
[Do all traits need `repeat_measurements_id`'s?]{style="color:blue;"} [**2: No**]{style="color:red;"}\
*Navigate to the dataset's folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.*
------------------------------------------------------------------------
#### **Propagate source information into the metadata.yml file**
This dataset is from a published source and therefore the source information can be added with the function `metadata_add_source_doi`:
```{r, eval=FALSE}
metadata_add_source_doi(dataset_id = "tutorial_dataset_2",
doi = "10.1046/j.1365-2745.2000.00506.x")
```
Confirm that:
1. the authors' names are formatted as `first name last name` or `first initial last name`\
2. the article title is in sentence case\
3. the page numbers are filled in as a range, separated by a double dash\
------------------------------------------------------------------------
#### **Add location details** {#add_locations}
For this dataset, location data is provided as a standalone spreadsheet in the raw data folder: `tutorial_dataset_2/raw/locations.csv`
First read the location data provided into R:
```{r, eval=FALSE}
locations <-
  read_csv("data/tutorial_dataset_2/raw/locations.csv")
```
`traits.build` requires three fields to use a specific syntax:
- `latitude` must be in decimal degrees and the field name (column header) must be `latitude (deg)`
- `longitude` must be in decimal degrees and the field name (column header) must be `longitude (deg)`
- A general site description is documented in the field `description`
`traits.build` does not require that the labels for other location properties align across datasets, but it is best practice to use a controlled vocabulary, so database users can easily search across all datasets for information on a specific climate variable or soil nutrient content. For new databases or new location properties, any naming/labeling convention can be established.
To confirm you are using the correct syntax, check the terms already in use:
```{r, eval=FALSE}
traits.build_database$locations %>%
  distinct(location_property) %>%
  View()
```
Then rename your columns to match those in use:
```{r, eval=FALSE}
locations <-
locations %>%
rename(
`longitude (deg)` = long,
`latitude (deg)` = lat,
`description` = vegetation,
`elevation (m)` = elevation,
`precipitation, MAP (mm)` = MAP,
`soil P, total (mg/kg)` = `soil P`,
`soil N, total (ppm)` = `soil N`,
`geology (parent material)` = `parent material`
)
```
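As a quick, optional sanity check, you can print the column names of the tibble created above to confirm the required syntax is now in place:
```{r, eval=FALSE}
# The output should include `latitude (deg)`, `longitude (deg)` and `description`,
# alongside the other renamed location properties
names(locations)
```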
Now add the location information into the metadata file:
```{r, eval=FALSE}
metadata_add_locations(dataset_id = "tutorial_dataset_2", location_data = locations)
```
Ensure you select:
[location_name:]{style="color:blue;"} [**location**]{style="color:red;"}\
[location_property columns:]{style="color:blue;"} [**1 2 3 4 5 6 7 8**]{style="color:red;"}\
Check the metadata.yml file to ensure the location information has been added as expected. If there is a problem, rerun the necessary code; this will overwrite what is present. You can also manually add additional properties if something is forgotten.
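If you prefer to check programmatically rather than by eye, a minimal sketch (assuming the `yaml` package is installed and the metadata file sits at the standard template path) is:
```{r, eval=FALSE}
# List the location names written into the metadata file;
# these should match the location names in locations.csv
names(yaml::read_yaml("data/tutorial_dataset_2/metadata.yml")$locations)
```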
------------------------------------------------------------------------
#### **Add traits**
To select columns in the `data.csv` file that include trait data, run:
```{r, eval=FALSE}
metadata_add_traits(dataset_id = "tutorial_dataset_2")
```
Select columns [**3 4 5 6**]{style="color:red;"}, as these contain trait data.
------------------------------------------------------------------------
### Manual filling in of metadata
After confirming that the skeletal traits section has been added to the `metadata.yml` file, you must fill in all the `unknown` fields.
For this dataset, you will later use functions to add substitutions and exclude unwanted observations, but it is best to first fill in the information for the contributors, the dataset, and the traits.
The following fields contain the word `unknown` and must be filled in manually:
- the `contributors` section\
- `description`, `basis_of_record`, `life_stage`, `sampling_strategy`, `original_file`, and `notes` under the `dataset` section\
- details for each trait, including `unit_in`, `trait_name`, `entity_type`, `value_type`, `basis_of_value`, `replicates` and `methods`\
#### **Adding contributors**
The file `data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt` indicates the main data_contributor for this study.\
Fill in the remaining contributor information as described in the [`tutorial_dataset_1` tutorial](tutorial_dataset_1.html).\
#### **Dataset fields**
The file `data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt` indicates how to fill in the `unknown` dataset fields for this study.
#### **Trait details**
The file `data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt` indicates how to fill in the `unknown` trait fields for this study, but see below as well.
| column in dataset | trait concept | unit_in | entity_type | value_type | basis_of_value | replicates |
|-------------------|---------------|---------|-------------|------------|----------------|------------|
| TRAIT Growth Form CATEGORICAL EP epiphyte (mistletoe) F fern G grass H herb S shrub T tree V vine | plant_growth_form | .na | species | mode | expert_score | .na |
| TRAIT SLA UNITS mm2/g | leaf_mass_per_area | mm2/g | population | mean | measurement | 5 |
| TRAIT Leaf Size UNITS mm2 | leaf_area | mm2 | population | mean | measurement | 5 |
| TRAIT Leaf Dry Mass UNITS g | leaf_dry_mass | g | population | mean | measurement | 5 |
Some notes:
- The `trait_name` must match a trait concept within the [traits dictionary](https://github.com/traitecoevo/traits.build-template/blob/master/config/traits.yml).
- The second trait in this dataset is documented as `specific leaf area`, the inverse of the trait concept `leaf mass per area`. The unit conversion algorithm inverts values read in as specific leaf area, converting them to leaf mass per area.
- Categorical traits do not have units or replicates, so these fields become `.na`.
- The `traits.build` convention for a categorical trait is `value_type: mode`, indicating the recorded value is the most commonly observed trait value. In some datasets there may be multiple space-delimited values within a single cell in the data.csv file, indicating there are multiple commonly observed categorical trait values.
- For most observations of categorical traits, the `traits.build` convention is that the `basis_of_value` is determined by an expert examining an individual, population or species, and is therefore an `expert_score`.
#### **Additional steps**
Once you are well-versed in adding datasets to a `traits.build` database you will know that there is additional information required in `metadata.yml`.
However, for this tutorial, let's begin by assuming we're finished adding dataset metadata and check for errors:
```{r, eval=FALSE}
dataset_test("tutorial_dataset_2")
```
[*Users, please note: not all of these test failures or messages are currently in place. Adding tests to document all failures is still a work in progress.*]{style="color:red;"}
Three items will fail:
1. There are unknown trait values for `plant_growth_form` [(error doesn't yet exist)]{style="color:red;"}
2. There are values out of range for `leaf_dry_mass` [(error doesn't yet exist)]{style="color:red;"}
3. The dataset cannot pivot between `long` and `wide` formats. [(error doesn't yet exist)]{style="color:red;"}
As indicated in the output messages, there is a \[troubleshooting vignette\](https://github.../vignettes/) to help solve these errors.
For this tutorial however, keep reading...
There are several ways to proceed, but for these errors, it is useful to next build the dataset:
```{r, eval=FALSE}
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
```
If you look at the `excluded_data` table, you'll find any data for this dataset that could not be mapped to known traits or known trait values, or that fell outside the allowable ranges.
```{r, eval=FALSE}
traits.build_database$excluded_data %>%
filter(dataset_id == "tutorial_dataset_2") %>%
View()
```
This code displays a table with 190 rows of excluded data:
- 187 instances of `Unsupported trait value` for the trait `plant_growth_form`
- 3 instances of `Value out of allowable range` for the trait `leaf_dry_mass`
Looking through the output you'll notice that the `Unsupported trait value` error exists because the data.csv file used plant growth form values that are different to those in the trait dictionary.
The values that triggered the `Value out of allowable range` error are all 0's, a disallowed `leaf_dry_mass` value; the trait dictionary specifies that `leaf_dry_mass` can range from `0.01 - 15000.0 mg`.
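A compact way to see these counts at a glance is to tally the excluded rows by trait and error type, using the same `excluded_data` table as above:
```{r, eval=FALSE}
# Summarise excluded rows for this dataset by trait and error
traits.build_database$excluded_data %>%
  filter(dataset_id == "tutorial_dataset_2") %>%
  count(trait_name, error)
```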
##### **Adding trait value substitutions** {#categorical_substitutions}
For categorical traits, only trait values that are indicated in the trait dictionary are recognised. This is an important harmonisation step, as it ensures the same trait concept value is mapped to the same trait value throughout the database.
However, researchers use countless synonyms, abbreviations and syntax to express an identical trait value. `traits.build` converts all input to lowercase, but all other substitutions must be specified in the dataset's `metadata.yml` file.
For this example, individual letters were used to express seven plant growth forms: EP, F, G, H, S, T and V.
Looking at the definition for `plant_growth_form` in the trait dictionary and the helpful column header provided by the contributor, you can deduce that `t` is for `tree`, `s` is for `shrub`, etc.
There are two ways to add substitutions into the `metadata.yml` file.
1. Map in individual trait value substitutions using `metadata_add_substitution`:
```{r, eval=FALSE}
metadata_add_substitution(dataset_id = "tutorial_dataset_2",
trait_name = "plant_growth_form", find = "t", replace = "tree")
```
Look at the `metadata.yml` file and you'll see that a substitution has been added, indicating that `t` values are `tree`.
You would repeat this step for each of the remaining unknown trait values; a sketch is shown below.
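The sketch below assumes the `find` values are the lowercase letters from the column header and uses replacement values consistent with the substitutions table built in option 2; if you use option 2 instead, you can skip these calls.
```{r, eval=FALSE}
# One call per remaining abbreviation (ep is handled separately below,
# as epiphyte is not an allowed plant_growth_form value)
metadata_add_substitution(dataset_id = "tutorial_dataset_2",
  trait_name = "plant_growth_form", find = "s", replace = "shrub")
metadata_add_substitution(dataset_id = "tutorial_dataset_2",
  trait_name = "plant_growth_form", find = "h", replace = "herb")
metadata_add_substitution(dataset_id = "tutorial_dataset_2",
  trait_name = "plant_growth_form", find = "g", replace = "graminoid")
metadata_add_substitution(dataset_id = "tutorial_dataset_2",
  trait_name = "plant_growth_form", find = "f", replace = "fern")
metadata_add_substitution(dataset_id = "tutorial_dataset_2",
  trait_name = "plant_growth_form", find = "v", replace = "climber_herbaceous")
```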
2. Map in a table of substitutions using `metadata_add_substitutions_list`:
- If there are quite a few trait values that require replacements, it is easier to first create a table of the values that need substituting, then add a column of replacements in either R or Excel.
```{r, eval=FALSE}
table <-
traits.build_database$excluded_data %>%
filter(
dataset_id == "tutorial_dataset_2" &
error == "Unsupported trait value"
) %>%
distinct(trait_name, value) %>%
rename(find = value)
```
Next, view your table to check the order of trait values, and check the allowed values and definitions in the trait dictionary to ensure you replace each abbreviation with an accepted value. Note that `epiphyte` is not an allowed value for `plant_growth_form` in the trait dictionary, as AusTraits uses a narrow definition of `plant_growth_form` and separately has a trait `plant_growth_substrate` that includes the trait value `epiphyte`. For now, we'll simply ignore this data.
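One hedged way to cross-check is to look at the `plant_growth_form` values already accepted elsewhere in your built database (if any datasets in your compilation include this trait); the full list of allowed values lives in `config/traits.yml`:
```{r, eval=FALSE}
# Values already in use for plant_growth_form across the database
traits.build_database$traits %>%
  filter(trait_name == "plant_growth_form") %>%
  distinct(value)
```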
Add a column with substitutions, then add the substitutions to the metadata file:
```{r, eval=FALSE}
table <- table %>%
mutate(replace = c("shrub", "tree", "herb", NA,
"graminoid", "fern", "climber_herbaceous"))
## an alternative is
## `mutate(replace = c("shrub", "tree", "herb", "epiphyte", "graminoid",
## "fern", "climber_herbaceous"))`
## which will result in the epiphyte observations remaining in the excluded data table
metadata_add_substitutions_list("tutorial_dataset_2", table)
```
All required substitutions have been added to the `metadata.yml` file.
If you were to rerun `dataset_test("tutorial_dataset_2")` the error referring to `Unsupported trait values` would now have vanished.
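To double-check what was written, you can read the substitutions section back out of the metadata file (a sketch, assuming the `yaml` package is available):
```{r, eval=FALSE}
# Each substitution pairs a trait_name with a find/replace pair
yaml::read_yaml("data/tutorial_dataset_2/metadata.yml")$substitutions
```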
##### **Replacing "placeholder characters" with NA's**
The `Value out of allowable range` error is triggered for numeric traits when the `traits.build` pipeline detects values that fall outside the range specified for the trait in the traits dictionary.
There are three common situations that lead to this error warning:
1. Values are truly out of range, possibly due to human error or because a plant really was not performing as expected (i.e. not photosynthesising).
2. Values appear to be out of range because of a "unit conversion issue" - that is, the dataset curator or the dataset contributor got the units wrong. This is fixed by working out the correct units and adjusting this in the traits section of the `metadata.yml` file.
3. A dataset contributor has used a "dummy symbol" to indicate missing data, such as `0`, `x`, `missing`, etc. Truly missing data should be a blank cell, i.e. `NA`.
This study is likely an example of (3), where `0` is a placeholder symbol. While the 0's can be left in the `excluded_data` table, cluttering the `excluded_data` table with extraneous measurements makes it difficult to scan for true examples of `Value out of allowable range` errors in the future. *(Values that are `NA` are, by default, omitted from the excluded_data table.)*
###### **Adding custom R code** {#custom_R_code}
Instead, you can add R code within the metadata file to replace the placeholder symbol with `NA`.
- Look at `metadata.yml` in Visual Studio Code.
- The second field under the `dataset` section is `custom_R_code: .na`.
- You can write any code you'd like within this section.
- The file `R/custom_R_code.R` that you sourced at the beginning of this tutorial contains customised functions commonly used in custom_R_code.
For this example, replace:
```{r, eval=FALSE}
custom_R_code: .na
```
with:
```{r, eval=FALSE}
custom_R_code: '
data %>%
mutate(
across(c("TRAIT Leaf Dry Mass UNITS g"), ~na_if(.x,0))
)
'
```
This code replaces all 0's in the column with NA's.
You can confirm that the custom R code has made the anticipated change with the function `metadata_check_custom_R_code`. This function reads in the data.csv file, then applies any manipulations from the custom_R_code:
```{r, eval=FALSE}
metadata_check_custom_R_code("tutorial_dataset_2") %>% View()
```
Note:
- use the format with the single quotes; this allows you to manually add line breaks, not otherwise permitted in the metadata.yml format.
- you pipe in `data` to begin with, but do not need to assign your code back to data; that occurs automatically.
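If you'd like a more targeted check, the sketch below (using the column name from the `data.csv` file) counts how many leaf dry mass values are now `NA` after the custom R code runs; this should cover the three zeros noted above, plus any cells that were already blank:
```{r, eval=FALSE}
# Apply the custom R code and count NA leaf dry mass values
metadata_check_custom_R_code("tutorial_dataset_2") %>%
  filter(is.na(`TRAIT Leaf Dry Mass UNITS g`)) %>%
  nrow()
```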
Build the database again, then check the `excluded_data` table to confirm there are no longer any excluded measurements:
```{r, eval=FALSE}
source("build.R")
traits.build_database$excluded_data %>%
filter(dataset_id == "tutorial_dataset_2") %>%
View()
```
##### **Replacing duplicate values with NA's** {#correct_entity_type}
Run the tests again to confirm the errors related to disallowed trait values and values out of range have vanished:
```{r, eval=FALSE}
dataset_test("tutorial_dataset_2")
```
However, there should still be an error indicating that the dataset cannot pivot between `long` and `wide` formats.
The ability to pivot is important for two reasons: {#dataset_pivot}
1. Database users may prefer to display data in wide format to readily compare the values of multiple traits collected on the same individual (or population or species).
2. The `pivot test` groups together the set of variables that should uniquely identify each row of data (dataset_id, trait_name, observation_id, source_id, taxon_name, entity_type, life_stage, basis_of_record, value_type, population_id, individual_id, temporal_id, method_id, entity_context_id, original_name). An inability to pivot indicates either:
 - A variable present in the `data.csv` file that distinguishes between unique observations has not been mapped into the metadata file (most likely a context property, a column with locations, `individual_id` or `source_id`).
 - Duplicate values exist within the `data.csv` file and have been read in multiple times.
The error in this dataset is a common one:\
- The three numeric traits (`leaf_mass_per_area`, `leaf_area` and `leaf_dry_mass`) are all population-level measurements, while `plant_growth_form` is mapped in as having `entity_type: species`, meaning it is considered a species-level measurement. This means that if the same species occurs at multiple sites, its growth form value is read in once for each site. However, because it is designated as `entity_type: species`, the `traits.build` workflow does not connect the value to a location, since the species has the same growth form regardless of location.
- Two options are to:
1. Recategorise `plant_growth_form` as having `entity_type: population`.
2. Only read in a single instance of `plant_growth_form` per species.
Either allows the dataset to pivot, but if the taxon truly displays only a single growth form across all populations it is much better to read in plant growth form once per species. Otherwise the database becomes longer without capturing additional information. This dataset has only a few instances of duplication, but imagine a dataset with 500 rows of data for the same tree species: suddenly 500 instances of that species being a tree are read into the database.
The solution is to modify the existing custom_R_code, adding one of the customised functions from the file `R/custom_R_code.R`, `replace_duplicates_with_NA`:
```{r, eval=FALSE}
custom_R_code: '
data %>%
mutate(
across(c("TRAIT Leaf Dry Mass UNITS g"), ~na_if(.x,0))
) %>%
group_by(name_original) %>%
mutate(
across(c("TRAIT Growth Form CATEGORICAL EP epiphyte (mistletoe) F fern G grass H herb S shrub T tree V vine"),
replace_duplicates_with_NA)
) %>%
ungroup()
'
```
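If you're curious what `replace_duplicates_with_NA` does, a minimal sketch of such a helper is shown below; the actual implementation lives in `R/custom_R_code.R` and may differ in detail.
```{r, eval=FALSE}
# Illustrative only: within each group, keep the first value and blank out repeats
replace_duplicates_with_NA_sketch <- function(x) {
  replace(x, duplicated(x), NA)
}

# e.g. three rows of the same species all scored as "herb"
replace_duplicates_with_NA_sketch(c("herb", "herb", "herb"))
#> [1] "herb" NA     NA
```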
Rerun the tests and everything should now pass:
```{r, eval=FALSE}
dataset_test("tutorial_dataset_2")
```
Then rebuild the database and look at the output in the traits table for one of the taxa that previously had duplicate `plant_growth_form` entries:
```{r, eval=FALSE}
source("build.R")
traits.build_database$traits %>%
filter(dataset_id == "tutorial_dataset_2") %>%
filter(taxon_name == "Actinotus minor") %>% View()
#>   dataset_id         taxon_name      observation_id trait_name         value            unit  entity_type location_id
#>   <chr>              <chr>           <chr>          <chr>              <chr>            <chr> <chr>       <chr>
#> 1 tutorial_dataset_2 Actinotus minor 010            leaf_area          18.8             mm2   population  02
#> 2 tutorial_dataset_2 Actinotus minor 010            leaf_dry_mass      7                mg    population  02
#> 3 tutorial_dataset_2 Actinotus minor 010            leaf_mass_per_area 344.827586206897 g/m2  population  02
#> 4 tutorial_dataset_2 Actinotus minor 011            leaf_area          75.9             mm2   population  03
#> 5 tutorial_dataset_2 Actinotus minor 011            leaf_dry_mass      7                mg    population  03
#> 6 tutorial_dataset_2 Actinotus minor 011            leaf_mass_per_area 89.2857142857143 g/m2  population  03
#> 7 tutorial_dataset_2 Actinotus minor 012            plant_growth_form  herb             NA    species     NA
```
The measurements for the three numeric traits from a single location share a common `observation_id`, as they are all part of an observation of a common entity (a specific population of *Actinotus minor*) at a single location at a single point in time. However, the row with the plant growth form measurement has a separate `observation_id`, reflecting that this is an observation of a different entity (the taxon *Actinotus minor*).
##### **Build dataset report**
As a final step, build a report for the study:
```{r, eval=FALSE}
traits.build_database$build_info$version <- "5.0.0" # a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_2", traits.build_database, overwrite = TRUE)
```