---
editor:
markdown:
wrap: 72
---
# Adding datasets, a lengthy guide
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
results = "asis",
echo = FALSE,
message = FALSE,
warning = FALSE
)
library(traits.build)
```
```{r, echo=FALSE, results='hide', message=FALSE}
## Loads austraits into global name space
austraits <- austraits:::austraits_5.0.0_lite
schema <- get_schema()
definitions <- austraits$definitions
```
This vignette is an exhaustive reference for adding datasets to a traits.build database.
If you are embarking on building a new database using the `traits.build` standard, a better place to get started is the series of 7 [tutorials](tutorial_dataset_1.html).
Then come back to this document for details and unusual dataset circumstances not covered in the tutorials.
Other chapters you may want to read include:
- an [overview of `traits.build`](overview.html),
- the instructions provided to [data contributors](contributing_data.html),
- the [structure of a compiled `traits.build` database](database_structure.html),
- the [structure of the raw data files](file_structure.html), and
- the [overview for adding data](adding_data_brief.html).
## Getting started
The `traits.build` package offers a workflow to build a harmonised trait database from disparate sources, with different data formats and containing varying metadata.
There are two key components required to merge datasets into a database with a common output structure:
1) A workflow to wrangle datasets into a standardised input format, using a combination of `{traits.build}` functions and manual steps.
2) A process to harmonise information across datasets and build them into a single database.
This document details all the steps to format datasets into a pair of standardised input files: a tabular data file and a structured metadata file. It includes examples of code you might use.
To begin, install the traits.build package.
```{r, echo=TRUE, eval=FALSE}
#remotes::install_github("traitecoevo/traits.build", quick=TRUE)
library(traits.build)
```
## Standardised input files required
## Create a dataset folder
Add a new folder within the `data` folder. Its name should be the study's unique `dataset_id`.
The preferred format for `dataset_id` is the surname of the first author of any corresponding publication, followed by the year, as `surname_year`, e.g. `Falster_2005`. Whenever there are multiple studies with the same id, we add a suffix `_2`, `_3`, etc., e.g. `Falster_2005`, `Falster_2005_2`.
`dataset_id` is one of the core identifiers within a `traits.build` database.
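Creating the folder is a single command; a minimal sketch (the `dataset_id` shown is hypothetical):
```{r, eval=FALSE, echo=TRUE}
# Create the dataset folder; its name is the study's dataset_id
dir.create("data/Falster_2005_2", recursive = TRUE)
```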
## Constructing the `data.csv` file {#csv_file}
The trait data for each study (`dataset_id`) must be in a single table, `data.csv`. The `data.csv` file can either be in a wide format (1 column for each trait, with the various `trait names` as the column headers) or long format (a single column for all `trait values` and an additional column for `trait name`).
### Required columns
- `taxon_name`
- `trait_name` (many columns for wide format; 1 column for long format)
- `value` (trait value; for long format only)
- `location_name` (if required)
- `contexts` (if required)
- `collection_date` (if required)
- `individual_id` (if required)
a. For all field studies, ensure there is a column for `location_name`. If all measurements were made at a single location, a `location_name` column can easily be mutated using [custom_R_code](#custom_R) within the metadata.yml file. See sections [adding locations](#adding_locations) and [adding contexts](#adding_contexts) below for more information on compiling location and context data.
b. If available, be sure to include a column with `collection date`. If possible, provide dates in `yyyy-mm-dd` (e.g. 2020-03-05) format or, if the day of the month isn't known, as `yyyy-mm` (e.g. 2020-03). However, any format is allowed and the column can be parsed to the proper yyyy-mm-dd format using `custom_R_code`. If the same `collection date` applies to the entire study it can be added directly into the metadata.yml file.
c. If applicable, ensure there are columns for all context properties, including experimental treatments, specific differences in method, a stratified sampling scheme within a plot, or sampling season. Additional context columns can be added through `custom_R_code` or keyed in where traits are added, but it is best to include a column in the data.csv file whenever possible. The protocol for adding context properties to the metadata file is under [adding contexts](#adding_contexts).
### Data may need to be summarised
Data submitted by a contributor should be in the rawest form possible; always request data with individual measurements over location/species means.
Some datasets include replicate measurements on an individual at a single point in time, such as the leaf area of 5 individual leaves. In AusTraits (the Australian plant trait database) we generally merge such measurements into an `individual mean` in the `data.csv` file, but the raw values are preserved in the contributor's raw data files. Be sure to calculate the number of replicates that contributed to each mean value.
When there is just a single column of trait values to summarise, use:
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/raw_data.csv") %>%
  # group by every categorical variable you want to retain
  dplyr::group_by(individual, `species name`, location, context) %>%
  dplyr::summarise(
    leaf_area_mean = mean(leaf_area),
    leaf_area_replicates = dplyr::n()
  ) %>%
  dplyr::ungroup()
```
*Make sure you `group_by` all categorical variables you want to retain, as only columns that are grouping variables will be kept.*
When you want to take the mean of multiple data columns simultaneously, use:
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/raw_data.csv") %>%
  # group by every categorical variable you want to retain
  dplyr::group_by(individual, `species name`, location, context) %>%
  dplyr::summarise(
    dplyr::across(c(leaf_area, `leaf N`), ~ mean(.x, na.rm = TRUE)),
    dplyr::across(c(growth_form, `photosynthetic pathway`), ~ dplyr::first(.x)),
    replicates = dplyr::n()
  ) %>%
  dplyr::ungroup()
```
`{dplyr}` hints:
- Categorical variables not included as grouping variables will return `NA`.
- Generally use the function `dplyr::first` for categorical variables - it simply retains the trait value in the first row of each group.
- You can identify runs of columns by position or by name. For instance, `across(c(5:25), ~ mean(.x, na.rm = TRUE))` or `across(c(leaf_area:leaf_N), ~ mean(.x, na.rm = TRUE))`.
- Be sure to `ungroup` at the end.
- Before summarising, ensure variables you expect to be numeric are indeed numeric: `utils::str(data)`.
### Merging multiple spreadsheets
If multiple spreadsheets of data are submitted these must be merged together.
- If the spreadsheets include different trait measurements made on the same individual (or location means for the same species), they are best merged using `dplyr::left_join`, specifying all conditions that need to be matched across spreadsheets (e.g. individual, species, location, context). Ensure the column names are identical between spreadsheets or specify columns that need to be matched.
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
readr::read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2

data_1 %>%
  dplyr::left_join(
    data_2,
    by = c("Individual", "Taxon" = "taxon", "Location", "Context")
  )
```
- If the spreadsheets include trait measurements for different individuals (or possibly data at different scales - such as individual level data for some traits and species means for other traits), they are best merged using `dplyr::bind_rows`. Ensure the column names for taxon name, location name, context, individual, and collection date are identical between spreadsheets. If there are data for the same traits in both spreadsheets, make sure those column headers are identical as well.
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
readr::read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2

data_1 %>%
  dplyr::bind_rows(data_2)
```
### Taxon names
Taxon names need to be complete names. If the main data file includes code names, with a key as a separate file, they are best merged now to avoid many individual replacements later.
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/species_key.csv") -> species_key
readr::read_csv("data/dataset_id/raw/data_file.csv") -> data

data %>%
  dplyr::left_join(species_key, by = "code")
```
### Unexpected hangups
- When Excel saves an `.xls` file as a `.csv` file it only preserves the number of significant figures displayed on the screen. This means that if a column has been set to display very few significant figures, or a column is very narrow, data precision is lost.
- If you're reading a file into R where there are lots of blanks at the beginning of a column of numeric data, the defaults for `readr::read_csv` fail to register the column as numeric. This is fixed by adding the argument `guess_max`:
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/raw_data.csv", guess_max = 10000)
```
This makes `readr` scan 10,000 rows of data before guessing whether the column is numeric.
(When `data.csv` files are read in through the `{traits.build}` workflow, `guess_max = 100000` is used.)
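Alternatively, you can declare the column type explicitly rather than rely on type guessing; a minimal sketch (the column name `leaf_area` is hypothetical):
```{r, eval=FALSE, echo=TRUE}
# Declaring the type up front means leading blanks cannot derail type guessing
readr::read_csv(
  "data/dataset_id/raw/raw_data.csv",
  col_types = readr::cols(leaf_area = readr::col_double())
)
```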
## Constructing the `metadata.yml` file {#metadata_file}
As described in detail [here](https://traitecoevo.github.io/traits.build-book/workflow.html), the `metadata.yml` file maps the meanings of the individual columns within the `data.csv` file and documents all additional dataset metadata.
Before beginning, it is a good idea to look at the two example dataset metadata files in the [`traits.build-template` repository](https://github.com/traitecoevo/traits.build-template/tree/master/data), to become familiar with the general structure.
The sections of the `metadata.yml` file are:
- [source](#source)
- [contributors](#contributors)
- [dataset](#metadata_dataset) (includes adding [custom R
code](#custom_R))
- [locations](#adding_locations)
- [contexts](#adding_contexts)
- [traits](#add_traits)
- [substitutions](#add_substitutions)
- [taxonomic_updates](#add_taxonomic_updates)
- [exclude_observations](#exclude_observations)
- [questions](#questions)
This document covers these metadata sections in sequence.
### Use a proper text editor
- Install a proper text editor, such as Visual Studio Code (our favorite), RStudio, TextMate, or Sublime Text. Using Microsoft Word will make a mess of the formatting.
### Source the `{traits.build}` functions
To assist you in constructing the `metadata.yml` file, we have developed functions to help propagate and fill in the different sections of the file.
If you haven't already, run:
```{r, eval=FALSE, echo=TRUE}
library(traits.build)
```
The functions for populating the metadata file all begin with `metadata_`.
A full list is available [here](https://traitecoevo.github.io/traits.build/reference/index.html#creating-metadata-files).
### Creating a template
The first step is to create a blank `metadata.yml` file.
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_create_template("Yang_2028")
```
As each function requires the `dataset_id` as an argument, it can be useful to assign the dataset's id to a variable you can use repeatedly:
```{r, eval=FALSE, echo=TRUE}
current_study <- "Yang_2028"
traits.build::metadata_create_template(current_study)
```
This function cycles through a series of user-input menus, querying about both the data format (long versus wide) and which columns contain which variables (taxon name, location name, individual identifiers, collection date). It then creates a relatively empty metadata file `data/dataset_id/metadata.yml`.
The questions are:
- Is the data long or wide format?
A wide dataset has each variable (i.e. trait) as a separate column. A long dataset has a single column containing all trait values, with a second column specifying the trait name.
- Select column for `taxon_name`
- Select column for `trait_name` (long datasets only)
- Select column for `trait values` (long datasets only)
- Select column for `location_name`
If your `data.csv` file does not yet have a `location_name` column, this information can later be added manually.
- Select column for `individual_id` (a column that links measurements on the same individual)
- Select column for `collection_date`
If your `data.csv` file does not have a `collection_date` column, you will be prompted to *Enter collection_date range in format '2007/2009'*. A fixed value in a `yyyy`, `yyyy-mm` or `yyyy-mm-dd` format is accepted, either as a single value or range of values. This information can be edited later.
- Indicate whether all traits need `repeat_measurements_id`'s
`repeat_measurements_id`'s are only required if the dataset documents response curve data (e.g. an A-ci or light response curve for plants; or a temperature response curve for animal or plant behaviour). They can also be added to individual traits (later). They are intended to capture multiple "sub-measurements" that together comprise a single "trait measurement".
### Adding a source {#source}
The skeletal `metadata.yml` file created by `metadata_create_template` included a section for the primary source with default fields for a journal article.
You can manually enter citation details, but whenever possible, use one of the three functions developed to automatically propagate citation details.
#### **Adding source from a doi**
If you have a `doi` for your study, use the function:
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi")
```
The different elements within the source will automatically be generated.
Double check the information added to ensure:
1. The title is in `sentence case`.
2. The information isn't in `all caps` (sources from a few journals get read in as all caps).
3. Page numbers are present and include `--` between page numbers (for example, `123 -- 134`).
4. If there is a colon (:) or apostrophe (') in a reference, the text for that line must be in quotes (").
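Once these checks are made, a cleanly formatted journal-article source might look like the following (a hypothetical entry, not a real reference):
```
source:
  primary:
    key: Smith_2020
    bibtype: Article
    year: 2020
    author: Jane Smith and Alex Nguyen
    title: Leaf trait variation across a rainfall gradient in eastern Australia
    journal: Australian Journal of Botany
    volume: 68
    pages: 123 -- 134
```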
By default, details are added as the primary source. If multiple sources are linked to a single `dataset_id`, you can specify a source as `secondary`.
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi",
                                      type = "secondary")
```
- Attempting to add a second primary source will overwrite the information already input. Instead, if there is a third source to add, use `type = "secondary_2"`.
- Always check the `key` field, as it can be incorrect for hyphenated last names.
- If the dataset being entered is a compilation of many original sources, you should add all the original sources, specifying, `type = "original_01"`, `type = "original_02"` etc. See [Richards_2008](https://github.com/traitecoevo/austraits.build/blob/master/data/Richards_2008/metadata.yml) for an example of a complex source list.
#### **Adding source from a bibtex file**
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_add_source_bibtex(dataset_id, file = "myref.bib")
```
(These options require the packages [rcrossref](https://github.com/ropensci/rcrossref) and [RefManageR](https://github.com/ropensci/RefManageR/) to be installed.)
#### **Proper formatting of different source types**
Different source types require different fields and formatting:
**Book:**
```
source:
  primary:
    key: Cooper_2013
    bibtype: Book
    year: 2013
    author: Wendy Cooper and William T. Cooper
    title: Australian rainforest fruits
    publisher: CSIRO Publishing
    pages: 272
```
**Online resource:**
```
source:
  primary:
    key: TMAG_2009
    bibtype: Online
    author: '{Tasmanian Herbarium}'
    year: 2009
    title: Flora of Tasmania Online
    publisher: Tasmanian Museum & Art Gallery (Hobart)
    url: http://www.tmag.tas.gov.au/floratasmania
```
**Thesis:**
```
source:
  primary:
    key: Kanowski_2000
    bibtype: Thesis
    year: 1999
    author: John Kanowski
    title: Ecological determinants of the distribution and abundance of the folivorous
      marsupials endemic to the rainforests of the Atherton uplands, north Queensland.
    type: PhD
    institution: James Cook University, Townsville
```
**Unpublished dataset:**
```
source:
  primary:
    key: Ooi_2018
    bibtype: Unpublished
    year: 2018
    author: Mark K. J. Ooi
    title: "Unpublished data: Herbivory survey within Royal National Park, University
      of New South Wales"
```
- Note the title of an unpublished dataset must begin with the words "Unpublished data" and include the data collector's affiliation.
### Adding contributors {#contributors}
The skeletal `metadata.yml` file created by the function `metadata_create_template` includes a template for entering details about data contributors. Edit this manually, duplicating if details for multiple people are required.
- `data_collectors` are people who played a key intellectual role in the study's experimental design and data collection. Most studies have 1-3 `data_collectors` listed. Four fields of information are required for each data collector: `last_name`, `given_name`, `affiliation` and `ORCID` (if available). Nominate a single data collector to be the dataset's point of contact.
- Additional field assistants can be listed under `assistants`.
- The data entry person is listed under `dataset_curators`.
- Email addresses for the `data_collectors` are not included in the `metadata.yml` file, but it is recommended that a database curator maintain a list of email addresses for all data collectors to whom authorship may be extended for a future database data paper. Authorship "rules" will vary across databases, but for AusTraits we extend authorship to all `data_collectors` whom we successfully contact.
For example, in Roderick_2002:
```
contributors:
  data_collectors:
  - last_name: Roderick
    given_name: Michael
    ORCID: 0000-0002-3630-7739
    affiliation: The Australian National University, Australia
    additional_role: contact
  assistants: Michelle Cochrane
  dataset_curators: Elizabeth Wenk
```
### Custom R code {#custom_R}
The goal is always to maintain `data.csv` files that are as similar as possible to the contributed dataset. However, for many studies there are minor changes we want to make to a dataset before the data.csv file is processed by the `{traits.build}` workflow. These may include applying a function to transform a particular column of data, a function to filter data, or a function to replace a contributor's "measurement missing" placeholder symbol with `NA`. In each case it is appropriate to leave the rawer data in `data.csv` and edit the data table as it is read into the `{traits.build}` workflow.
#### **Background**
To allow custom modifications to a particular dataset before the common pipeline of operations gets applied, the workflow permits some customised R code to be run as a first step in the processing pipeline. That pipeline (the function `process_custom_code` called within [`dataset_process`](https://github.com/traitecoevo/traits.build/blob/master/R/process.R)) looks like this:
```{r, eval=FALSE, echo=TRUE}
data <-
  readr::read_csv(filename_data_raw, col_types = cols(), guess_max = 100000,
                  progress = FALSE) %>%
  process_custom_code(metadata[["dataset"]][["custom_R_code"]])()
```
The final step in the pipe shows that the custom code gets applied right after the file is loaded.
#### **Overview of options and syntax**
- A copy of the file of functions the AusTraits team have developed explicitly for use within the custom_R_code field is available at [custom_R_code.R](https://github.com/traitecoevo/traits.build-template/blob/master/R/custom_R_code.R); it should be placed within the `R` folder of your database repository, then sourced (`source("R/custom_R_code.R")`).
- Place a single apostrophe (') at the start and end of your custom R code; this allows you to add line breaks between pipes.
- Begin your custom R code with `data %>%`, then apply whatever fixes are needed.
- Use functions from the packages [dplyr](https://dplyr.tidyverse.org), [tidyr](https://tidyr.tidyverse.org), [stringr](https://stringr.tidyverse.org) (e.g. `mutate`, `rename`, `summarise`, `str_detect`), but avoid other packages.
- Alternatively, use the functions we've created explicitly for pre-processing data, sourced through the file `custom_R_code.R`. You may choose to expand this file within your own database repository.
- Custom R code is not intended for reading in files. Any reading in and merging of multiple files should be done before creating the dataset's `data.csv` file.
- Use pipes to weave together a single statement where possible. If you need to manipulate/subset the data.csv file into multiple data frames and then bind them back together, use semicolons (`;`) at the end of each statement. (A complete `custom_R_code` entry is sketched below.)
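As a sketch, a minimal `custom_R_code` entry in `metadata.yml` might look like the following (the particular fix shown is hypothetical); note the wrapping apostrophes and the leading `data %>%`:
```
custom_R_code: '
  data %>%
    dplyr::mutate(location_name = "Broken Hill")
'
```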
##### Examples of appropriate use of custom R code
1. **Converting times to `NY` strings**
Most sources from herbaria record `flowering_time` and `fruiting_time` as a span of months, while AusTraits codes these variables as a sequence of 12 N's and Y's for the 12 months. A series of functions make this conversion in custom_R_code (see the sketch after this list). These include:
- '`format_flowering_months`' (Create flowering times from start to end pair)
- '`convert_month_range_string_to_binary`' (Converts flowering and fruiting month ranges to 12 element character strings of binary data)
- '`convert_month_range_vec_to_binary`' (Convert vectors of month range to 12 element character strings of binary data)
- '`collapse_multirow_phenology_data_to_binary_vec`' (Converts multi-row phenology data to a 12 digit binary string)
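A sketch of how one of these helpers might be applied (the column name `flowering` and the exact function signature are assumptions; check `custom_R_code.R` for the actual arguments):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    # e.g. "Sep-Nov" becomes "NNNNNNNNYYYN"
    flowering_time = convert_month_range_string_to_binary(flowering)
  )
```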
2. **Splitting ranges into min, max pairs**
Many datasets from herbaria record traits like `leaf_length`, `leaf_width`, `seed_length`, etc. as a range (e.g. `2-8`). The function `separate_range` separates these data into a pair of columns with `minimum` and `maximum` values, which is the preferable way to merge the data into a trait database; see the sketch below.
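If you need the same behaviour without the helper, here is a minimal sketch using `tidyr` directly (column names are hypothetical):
```{r, eval=FALSE, echo=TRUE}
data %>%
  # split "2-8" into leaf_length_min = 2 and leaf_length_max = 8;
  # single values without a dash land in the min column
  tidyr::separate(leaf_length, into = c("leaf_length_min", "leaf_length_max"),
                  sep = "-", fill = "right", convert = TRUE)
```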
3. **Removing duplicate values within a dataset**
Duplicate values within a study need to be filtered out using the custom function `replace_duplicates_with_NA`.
If a species-level trait value has been entered repeatedly on rows containing individual-level trait measurements, you need to filter out the duplicates. For instance, plant growth form is generally a species-level observation, with the same value on every row with individual-level trait measurements. There are also instances where a population-level numeric trait appears repeatedly, such as if nutrient analyses were performed on a bulked sample at each site.
Before applying the function, you must group by the variable(s) that contain the unique values. This might be at the species or population level. For instance, use `group_by(Species, Location)` if there are unique values at the species x location level.
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::group_by(Species) %>%
  dplyr::mutate(
    across(c(`leaf_percentN`, `plant growth form`), replace_duplicates_with_NA)
  ) %>%
  dplyr::ungroup()
```
4. **Removing duplicate values across datasets**
Values that were sourced from a different study need to be filtered out. See [Duplicates between studies](#duplicates_between_studies) below; functions to automate this process are in progress.
5. **Replacing "missing values" with NA's**
If missing data values in a dataset are represented by a symbol, such as `0` or `*`, these need to be converted to NA's:
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    across(c(`height (cm)`, `leaf area (mm2)`), ~ dplyr::na_if(.x, 0))
  )
```
6. **Mapping data from one trait to a second trait, part 1**
If a subset of data in a column are also `values` for a second trait in AusTraits, some data values can be duplicated into a second temporary column. In the example below, some data in the contributor's `fruit_type` column **also** apply to the trait `fruit_fleshiness` in AusTraits:
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    fruit_fleshiness = ifelse(`fruit type` == "pome", "fleshy", NA)
  )
```
The function `move_values_to_new_trait` is being developed to automate this and currently resides in the [`custom_R_code.R`](https://github.com/traitecoevo/austraits.build/blob/master/R/custom_R_code.R) file within the austraits.build repository.
7. **Mapping data from one trait to a second trait, part 2**
If a subset of data in a column are *instead* `values` for a second trait in AusTraits, some data values can be moved to a second column (second trait), also using the function `move_values_to_new_trait`. In the example below, some data in the contributor's `growth_form` column *only* apply to the trait `parasitic` in AusTraits. Note you need to create a blank variable before moving the trait values.
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(parasitic = NA_character_) %>%
  move_values_to_new_trait(
    original_trait = "growth form",
    new_trait = "parasitic",
    original_values = "parasitic",
    values_for_new_trait = "parasitic",
    values_to_keep = "xx"
  ) %>%
  dplyr::mutate(across(c(`growth form`), ~ dplyr::na_if(.x, "xx")))
```
or
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(dispersal_appendage = NA_character_) %>%
  move_values_to_new_trait(
    original_trait = "fruits",
    new_trait = "dispersal_appendage",
    original_values = c("dry & winged", "enclosed in aril"),
    values_for_new_trait = c("wings", "aril"),
    values_to_keep = c("xx", "enclosed")
  ) %>%
  dplyr::mutate(across(c(fruits), ~ dplyr::na_if(.x, "xx")))
```
- Note, the parameter `values_to_keep` doesn't accept `NA`, leading to the clunky coding. This bug is known, but we haven't managed to fix it.
8. **Mutating a new trait from other traits**
If the `data.csv` file includes raw data that you want to manipulate into a `trait`, or the contributor presents the data in a different formulation than AusTraits, you may choose to mutate a new column, containing a new `trait`.
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    root_mass_fraction = `root mass` / (`root mass` + `shoot mass`)
  )
```
9. **Mutating a location name column**
If the dataset has location information, but lacks unique location names (or any location name), you might mutate a `location name` column to map in. (See also [Adding location details](#adding_locations)).
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    location_name = ifelse(location_name == "Mt Field" & habitat == "Montane rainforest",
                           "Mt Field_wet", location_name),
    location_name = ifelse(location_name == "Mt Field" & habitat == "Dry sclerophyll",
                           "Mt Field_dry", location_name)
  )
```
or
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    location_name = dplyr::case_when(
      longitude == 151.233056 ~ "heath",
      longitude == 151.245833 ~ "terrace",
      longitude == 151.2917 ~ "diatreme"
    )
  )

# Note with `dplyr::case_when`,
# any rows that do not match any of the conditions become `NA`'s.
```
or
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    location_name = paste0("lat_", round(latitude, 3), "_long_", round(longitude, 3))
  )
```
10. **Generating `measurement_remarks`**
Sometimes there is a note column with abbreviated information about individual rows of data that is appropriate to map as a context. This could be included in the field `measurement_remarks`:
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    measurement_remarks = paste0("maternal lineage ", Mother)
  )
```
11. **Reformatting dates**
You can reformat `collection_dates` to conform to the `yyyy-mm-dd` format, or add a date column.
Converting from any `mdy` format to `yyyy-mm-dd` (e.g. `Dec 3 2015` to `2015-12-03`):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    Date = Date %>% lubridate::mdy()
  )
```
Converting from any `dmy` format to `yyyy-mm-dd` (e.g. `3-12-2015` to `2015-12-03`):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    Date = Date %>% lubridate::dmy()
  )
```
Converting from a `mmm-yyyy` (string) format to `yyyy-mm` (e.g. `Dec 2015` to `2015-12`):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    Date =
      lubridate::parse_date_time(Date, orders = "my") %>%
      base::format.Date("%Y-%m")
  )
```
Converting from a `mdy` format to `yyyy-mm` (e.g. Excel has reinterpreted the data as full dates `12-01-2015` but the resolution should be "month", `2015-12`):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    Date =
      lubridate::parse_date_time(Date, orders = "mdy") %>%
      base::format.Date("%Y-%m")
  )
```
A particularly complicated example where some dates are presented as `yyyy-mm` and others as `yyyy-mm-dd`:
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    weird_date = ifelse(stringr::str_detect(gathering_date, "^[0-9]{4}"),
                        gathering_date, NA),
    gathering_date = gathering_date %>%
      lubridate::mdy(quiet = TRUE) %>% as.character(),
    gathering_date = dplyr::coalesce(gathering_date, weird_date)
  ) %>%
  dplyr::select(-weird_date)
```
#### **Testing your custom R code**
After you've added the custom R code to the metadata file, check that the output is indeed as intended:
```{r, eval=FALSE, echo=TRUE}
metadata_check_custom_R_code("Blackman_2010")
```
### Fill in `metadata[["dataset"]]` {#metadata_dataset}
The `dataset` section includes fields that are:
1. filled in automatically by the function `metadata_create_template()`
2. mandatory fields that need to be filled in manually for all datasets
3. optional fields that are included and filled in only for a subset of datasets
#### **fields automatically filled in**
- **data_is_long_format** yes/no
- **taxon_name**
- **location_name**
- **collection_date** If this is not read in as a specified column, it needs to be filled in manually as `start date/end date` in yyyy-mm-dd, yyyy-mm, or yyyy format, depending on the relevant resolution. If the collection dates are unknown, write `unknown/publication year`, as in `unknown/2022`.
- **individual_id** Individual_id is one of the fields that can be read in during `metadata_create_template`. However, you may instead mutate your own `individual_id` using `custom_R_code` and add it in manually. For a wide dataset individual_id is required anytime there are multiple rows of data for the same individual and you want to keep these linked. This field should only be included if it is required.
**WARNING** If you have an entry `individual_id: unknown` this assigns all rows of data to an individual named "unknown" and the entire dataset will be assumed to be from a single individual. This is why it is essential to omit this field if there isn't an actual row of data being read in.
**NOTE** For individual-level measurements, each row of data is presumed to be a different individual during dataset processing. Individual_id is only required if there are multiple rows of data (long or wide format) with information for the same individual.
- **repeat_measurements_id** `repeat_measurements_id`'s are sequential integer identifiers assigned to a sequence of measurements on a single trait that together represent a single observation (and are assigned a single `observation_id` by the `traits.build` pipeline). The assumption is that these are measurements that document points on a response curve. The function `metadata_create_template` offers an option to add it to `metadata[["dataset"]]`, but it can alternatively be specified under specific traits, as `repeat_measurements_id: TRUE`. (A sketch of a filled-in `dataset` section follows this list.)
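As a sketch, the automatically propagated part of a filled-in `dataset` section might look like this (all column names are hypothetical):
```
dataset:
  data_is_long_format: no
  taxon_name: Species
  location_name: site
  individual_id: tree_id
  collection_date: 2007/2009
```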
#### **required fields manually filled in**
- **description:** 1-2 sentence description of the study's goals. The abstract of a manuscript usually includes some good sentences/phrases to borrow.
- **basis_of_record:** Basis of record can be coded in as a fixed value for an entire dataset, by trait, by location or read in from a column in the `data.csv` file. If it is being read in from a column, list the column name in the field; otherwise input the fixed value. Allowable values are: `field`, `field_experiment`, `captive_cultivated`, `lab`, `preserved_specimen`, and `literature`. See the [database structure vignette](database_structure.html#basis_of_record) for definitions of these accepted basis_of_record values. If fixed values are specified for both the entire dataset under `metadata[["dataset"]]` and for specific locations/traits under `metadata[["locations"]]` or `metadata[["traits"]]`, the location/trait value overrides that entered under `metadata[["dataset"]]`.
- **life_stage:** Life stage can be coded in as a fixed value for an entire dataset, by trait, by location or read in from a column in the `data.csv` file. If it is being read in from a column, list the column name in the field; otherwise input the fixed value. Allowable values are: `adult`, `sapling`, `seedling`, `juvenile`. See the [database structure vignette](database_structure.html#life_stage) for definitions of these accepted life_stage values. If fixed values are specified for both the entire dataset under `metadata[["dataset"]]` and for specific locations/traits under `metadata[["locations"]]` or `metadata[["traits"]]`, the location/trait value overrides that entered under `metadata[["dataset"]]`.
- **sampling_strategy:** Often a quite long description of the sampling strategy, extracted verbatim from a manuscript whenever possible.
- **original_file:** The name of the file initially submitted to the database curators. It is generally archived in the dataset folder, in a subfolder named `raw`. For AusTraits, datasets are also usually archived in the project's Google Drive folder.
- **notes:** Notes about the study and processing of data, especially if there were complications or if some data were suspected to be duplicates of another study and were filtered out.
#### **optional fields manually filled in**
- **measurement_remarks**: Measurement remarks is a field to capture a miscellaneous notes column. This should be information that is not captured by trait methods (which is fixed to a single value for a trait) or as a `context`. Measurement_remarks can be coded in as a fixed value for an entire dataset, by trait, by location or read in from a column in the `data.csv` file.
- **entity_type** is standardly added to each trait, and is described below under traits, but a fixed value or column can be read in under `metadata[["dataset"]]`.
### Adding location details {#adding_locations}
Location data includes location names, latitude/longitude coordinates, verbal location descriptions, and any additional abiotic/biotic location variables provided by the contributor (or in the accompanying manuscript). For studies with more than a few locations, it is most efficient to create a table of this data that is automatically read into the `metadata.yml` file.
The function `metadata_add_locations` automatically propagates location information from a stand-alone location properties table into `metadata[["locations"]]`:
```{r, eval=FALSE, echo=TRUE}
locations <- read_csv("data/dataset_id/raw/locations.csv")
traits.build::metadata_add_locations(current_study, locations)
```
The function `metadata_add_locations` first prompts the user to identify the column with the location name and then to list all columns that contain location data. This automatically fills in the location component on the metadata file.
Rules for formatting a `locations` table to read in:
1. Location names must be identical (including syntax, case) to those in `data.csv`
2. Column headers for latitude and longitude data must read `latitude (deg)` and `longitude (deg)`
3. Latitude and longitude must be in decimal degrees (e.g. -46.5832). There are many online converters to convert from `degrees,minutes,seconds` format or `UTM`. Or use the following formula: `decimal_degrees = degrees + (minutes/60) + (seconds/3600)` (see the sketch after this list)
4. If there is a column with a general vegetation description (e.g. `rainforest`, `coastal heath`), it should be titled `description`
5. Although location properties are not restricted to a controlled vocabulary, newly added studies should use the same location property syntax as others whenever possible, to allow future discoverability. To generate a list of terms already used under `location_property`:
```{r, eval=FALSE, echo=TRUE}
database$locations %>% dplyr::distinct(location_property)
```
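For rule 3, a minimal sketch of that formula in R (the example values are hypothetical; negate the result for southern latitudes and western longitudes):
```{r, eval=FALSE, echo=TRUE}
# Convert degrees-minutes-seconds to decimal degrees
decimal_degrees <- function(degrees, minutes, seconds) {
  degrees + (minutes / 60) + (seconds / 3600)
}

-decimal_degrees(46, 34, 59.5) # a southern-hemisphere latitude: -46.5832
```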
Some examples of syntax to add `locations` data that exists in different formats.
- When the main data.csv file has columns for a few location properties:
```{r, eval=FALSE, echo=TRUE}
locations <-
  metadata_check_custom_R_code(current_study) %>%
  dplyr::distinct(location_name, latitude, longitude, `veg type`) %>%
  dplyr::rename(dplyr::all_of(c("latitude (deg)" = "latitude",
                                "longitude (deg)" = "longitude",
                                "description" = "veg type")))

traits.build::metadata_add_locations(current_study, locations)
```
- If you want to add or edit the data, it is probably easiest to save the `locations` table as a csv, edit it in Excel, then read it back into R.
- It is possible that you will want to specify `life_stage` or `basis_of_record` at the location level. When required, it is usually easiest to manually add these fields to some or all locations.
### Adding contexts {#adding_contexts}
The dictionary definition of a context is *the situation within which something exists or happens, and that can help explain it*. This is exactly what `context_properties` are in AusTraits, ancillary information that is important to explaining and understanding a trait value.
AusTraits recognises 5 categories of contexts:
- **treatment contexts** Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. Examples include soil nutrient manipulations, growing temperatures, or CO2 enhancement.
- **plot contexts** Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. Examples are a property that is stratified within a "geographic location", such as topographic position. `Plots` are of course `locations` themselves; what is a `location` vs `plot_context` depends on the geographic resolution a dataset collector has applied to their locations.
- **entity contexts** Context property that is information about an organismal entity (individual, population or taxon) that does not comprise a trait-centered observation but might affect the trait values measured on the entity. This might be the entity's sex, caste (for social insects), or host plant (for insects).
- **temporal contexts** Context property that is a feature of a "point in time" that might affect the trait values measured on an individual, population or species-level entity. They generally represent repeat measurements on the same entity across time and may simply be numbered observations or might be explicitly linked to growing season or time of day.
- **method contexts** Context property that records specific information about a measurement method that is modified between measurements. These might be samples from different canopy light environments, different leaf ages, or sapwood samples from different branch diameters.
Context properties are not restricted to a controlled vocabulary. However, newly added studies should use the same context property syntax as others whenever possible, to allow future discoverability. To generate a list of terms already used under `context_property`, use:
```{r, echo=TRUE, eval=FALSE}
database$contexts %>%
  dplyr::distinct(context_property, category)
```
Context properties are most easily read into the `metadata.yml` file with the dedicated function:
```{r, echo=TRUE, eval=FALSE}
traits.build::metadata_add_contexts(dataset_id)
```
The function first displays a list of all data columns (from the data.csv file) and prompts you to select those that are context properties.
1. For each column you are asked to indicate its `category` (those described above).
2. You are shown a list of the unique values present in the data column and asked if these require any substitutions. (y/n)
3. You are asked if descriptions are required for the context property values (y/n)
This function then adds the contexts to the `metadata[["contexts"]]` section.
If you selected both substitutions and descriptions required:
```
- context_property: unknown
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: unknown
    description: unknown
  - find: DEC
    value: unknown
    description: unknown
  - find: FEB
    value: unknown
    description: unknown
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: unknown
    description: unknown
  - find: added CO2
    value: unknown
    description: unknown
```
If you selected just substitutions required:
```
- context_property: unknown
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: unknown
  - find: DEC
    value: unknown
  - find: FEB
    value: unknown
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: unknown
  - find: added CO2
    value: unknown
```
If you selected neither substitutions nor descriptions required:
```
- context_property: unknown
  category: temporal_context
  var_in: month
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
```
- You must then manually fill in the fields designated as `unknown`.
- If there is a value in a column that is not a context property, set its value to `value: .na`.
If there are additional context properties that were designated in the traits section, these will have to be added manually, as this information is not captured in a column that is read in. A final output might be:
```
- context_property: sampling season
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: August
    description: August (late winter)
  - find: DEC
    value: December
    description: December (early summer)
  - find: FEB
    value: February
    description: February (late summer)
- context_property: CO2 treatment
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: 400 ppm
    description: Plants grown at ambient CO2 (400 ppm).
  - find: added CO2
    value: 640 ppm
    description: Plants grown at elevated CO2 (640 ppm); 240 ppm above ambient.
- context_property: measurement temperature
  category: method_context
  var_in: method_context # this field would be included in the relevant traits
  values:
  - value: 20°C # this value would be keyed in through the relevant traits
    description: Measurement made at 20°C
  - value: 25°C
    description: Measurement made at 25°C
```
### Adding traits {#add_traits}
The function `metadata_add_traits()` adds a scaffold for trait metadata to the skeletal `metadata.yml` file.
```{r, eval=FALSE, echo=TRUE}
metadata_add_traits(current_study)
```
You will be asked to indicate which columns include trait data.
This automatically propagates the following metadata fields for each trait selected into `metadata[["traits"]]`. `var_in` is the name of a column in the `data.csv` file (for wide datasets) or a unique trait name in the `trait_name` column (for a long dataset):
```
- var_in: leaf area (mm2)
  unit_in: .na
  trait_name: .na
  entity_type: .na
  value_type: .na
  basis_of_value: .na
  replicates: .na
  methods: .na
```
The trait details then need to be filled in manually.
- **units**: fill in the units associated with the trait values in the submitted dataset - such as mm2 in the example above. If you're uncertain about the syntax/format used for some more complex units, look through the traits definition file (`config/traits.yml`) or the file showing unit conversions (`config/unit_conversions.csv`). For categorical variables, leave this as `.na`.
AusTraits uses the Unified Code for Units of Measure (UCUM) standard for units (https://ucum.org/ucum), but each database using the `traits.build` workflow can select its own choices for unit abbreviations. The UCUM standard follows clear, simple rules, but also has a flexible syntax for documenting, in curly brackets, notes that are recorded as part of the 'unit' for specific traits yet are not formally units. For instance, {count}/mm2 or umol{CO2}/m2/s, where the actual units are 1/mm2 and umol/m2/s. There are a few not-very-intuitive units in UCUM; `a` is `year` (annum).
**Notes**:
- If the units start with a punctuation symbol, the units must be in single, straight quotes, such as: `unit_in: '{count}/mm2'`
- It is best not to start units with a `-` (negative sign). In AusTraits we've adopted the convention of using, for instance, `neg_MPa` instead of `-MPa`
- **trait_name**: This is the name of the appropriate trait concept from the database's trait dictionary (`config/traits.yml`). For currently unsupported traits, leave this as `.na` but fill in the rest of the metadata and flag this study as having a potential new trait concept. Then, in the future, if an appropriate trait concept is added to the `traits.yml` file, the data can be read into the database by simply replacing the `.na` with a trait name. Each database will have its own criteria/rules for adding traits to the trait dictionary, and likely rules that evolve as the trait database grows. In AusTraits, if no appropriate trait concept exists in the trait dictionary, a new trait must be defined within the accompanying AusTraits Plant Dictionary and should only be added if it is clearly a distinct trait concept, can be explicitly defined, and there exists sufficient trait data that the measurements have comparative value.
- **entity_type**: Entity type indicates "what" is being observed for the trait measurements - as in the organismal-level to which the trait measurements apply. As such, `entity_type` can be `individual`, `population`, `species`, `genus`, `family` or `order`. Metapopulation-level measurements are coded as `population` and infraspecific taxon-level measurements are coded as `species`. See the [database structure vignette](database_structure.html#entity_type) for definitions of these accepted `entity_type` values.
**Note**:
- `entity_type` is about the "organismal-level" to which the trait measurement refers; this is separate from the taxonomic resolution of the entity's name.
- **value_type**: Value type indicates the statistical nature of the trait value recorded. Allowable value types are `mean`, `minimum`, `maximum`, `mode`, `range`, `raw`, and `bin`. See the [database structure vignette](database_structure.html#value_types) for definitions of these accepted value types. All categorical traits are generally scored as being a `mode`, the most commonly observed value. Note that for values that are `bins`, the two numbers are separated by a double-hyphen, `1 -- 10`.
- **basis_of_value**: Basis of value indicates how a value was determined. Allowable terms are `measurement`, `expert_score`, `model_derived`, and `literature`. See the [database structure vignette](database_structure.html#value_types) for definitions of these accepted `basis_of_value` values, but most categorical traits measurements are values that have been scored by an expert (`expert_score`) and most numeric trait values are `measurements`.
- **replicates**: Fill in with the appropriate number of measurements that comprise each value.
If the values are raw values (i.e. a measurement of an individual) `replicates: 1`.
If the values are, for instance, means of 5 leaves from an individual, `replicates: 5`.
If there is just a single population-level value for a trait, that comprises measurements on 5 individuals, `replicates: 5`.
For categorical variables, leave this as `.na`.
If there is a column that specifies replicate number, you can list the column name in the field.
- **methods**: This information can usually be copied verbatim from a manuscript and is a textual description of all components of the method used to measure the trait.
In general, methods sections extracted from pdfs include "special characters" (non-UTF-8 characters). Non-English alphabet characters are recognised (e.g. é, ö) and should remain unchanged. Other characters will be re-formatted during the study input process, so double check that degree symbols (º), en-dashes (--), em-dashes (---), and curly quotes (‘ ’ “ ”) have been maintained or reformatted with a suitable alternative. Greek letters and some other characters are replaced with their Unicode equivalent (e.g. \<U+03A8\> replaces Psi (Ψ)); for these it is best to replace the symbol with an interpretable English-character equivalent.
If there are two columns of data with measurements for the same trait using completely different methods, simply add the respective methods to the metadata for the respective columns. A `method_id` counter will be added to these during processing to ensure the correct trait values are linked to the correct methods. This is separate from `method_contexts`, which are minor tweaks to the methods between measurements that are expected to have concurrent effects on trait values (see below).
**NOTE**:
- If identical methods apply to a string of traits, for the first trait use the following syntax, where the `&leaf_length_method` notation assigns the remaining text in the field to the label `leaf_length_method`.
```
methods: &leaf_length_method All measurements were from dry herbarium
  collections, with leaf and bracteole measurements taken from the largest
  of these structures on each specimen.
```
Then for the next trait that uses this method you can just include the reference, as below. At the end of processing you can read/write the yml file and this will fill in the assigned text throughout.
```
methods: *leaf_length_method
```
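Putting these fields together, a completed trait entry might look like the following sketch (the trait name `leaf_area` and the methods text are hypothetical; check your database's `config/traits.yml` for supported trait concepts):
```
- var_in: leaf area (mm2)
  unit_in: mm2
  trait_name: leaf_area
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: Leaf area was measured on one fully expanded leaf per individual
    using a flatbed scanner.
```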
In addition to the automatically propagated fields, there are a number of optional fields you can add if appropriate.
- **life_stage** If all measurements in a dataset were made on plants of the same `life stage` a global value should be entered under [`metadata[["dataset"]]`](#metadata_dataset). However if different traits were measured at different life stages you can specify a unique `life stage` for each trait or indicate a column where this information is stored.
- **basis_of_record** If all measurements in a dataset represent the same `basis_of_record` a global value should be entered under [`metadata[["dataset"]]`](#metadata_dataset). However if different traits have different basis_of_record values you can specify a unique `basis_of_record` value for each trait or indicate a column where this information is stored.
- **measurement_remarks**: Measurement remarks is a field to indicate miscellaneous comments. If these comments only apply to specific trait(s), this field should be specified within those traits' metadata sections. This is meant to be information that is not captured by "methods" (which is fixed to a single value for a trait).
- **method_context** If different columns in a wide data.csv file indicate measurements of the same trait using different methods, this needs to be designated. At the bottom of the trait's metadata, add a `method_context_name` field (e.g. `method_context` or `leaf_age_type` are good options). Write a word or short phrase that indicates the method context property value that applies to that trait (data column). For instance, one trait might have `method_context: fully expanded leaves` and a second trait entry might have the same trait name and methods, but `method_context: leaves still expanding`. The method context details must also be added to the [contexts](#adding_contexts) section (see the sketch after this list).
- **temporal_context** If different columns in a wide data.csv file indicate measurements on the same trait, on the same individuals at different points in time, this needs to be designated. At the bottom of the trait's metadata, add a `temporal_context_name` field (e.g. `temporal_context` or `measurement_time_of_day` work well). Write a word or short phrase that indicates which temporal context applies to that trait (data column). For instance, one trait might have `temporal_context: dry season` and a second entry with the same trait name and method might have `temporal_context: after rain`. The temporal context details must also be added to the
[contexts](#adding_contexts) section.
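As a sketch, the trait-side half of a keyed-in method context might look like this (the trait and column names are hypothetical); the matching `context_property` entry sits under [contexts](#adding_contexts), as in the `measurement temperature` example above:
```
- var_in: photosynthesis_20C
  unit_in: umol{CO2}/m2/s
  trait_name: leaf_photosynthesis
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: Photosynthesis was measured on one leaf per individual.
  method_context: 20°C
- var_in: photosynthesis_25C
  unit_in: umol{CO2}/m2/s
  trait_name: leaf_photosynthesis
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: Photosynthesis was measured on one leaf per individual.
  method_context: 25°C
```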
### Adding substitutions {#add_substitutions}
It is very unlikely that a contributor will use categorical trait values entirely identical to the allowed trait values listed for the corresponding trait concept in the `traits.yml` file. You need to add substitutions for values that do not align exactly, to match the wording and syntax of the trait values in the trait dictionary.
`metadata[["substitutions"]]` entries are formatted as:
```
substitutions:
- trait_name: dispersal_appendage
  find: attached carpels
  replace: floral_parts
- trait_name: dispersal_appendage
  find: awn
  replace: bristles
- trait_name: dispersal_appendage
  find: awn bristles
  replace: bristles
```
The three elements it includes are:
- **trait_name** is the AusTraits defined trait name.
- **find** is the trait value used in the data.csv file.
- **replace** is the trait value supported by AusTraits.
You can manually type substitutions into the `metadata.yml` file, ensuring you have the syntax and spacing accurate.
Alternatively, the function `metadata_add_substitution` adds single substitutions directly into `metadata[["substitutions"]]`:
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_add_substitution(current_study, "trait_name", "find", "replace")
```
**Notes**:
- Combinations of multiple trait values are allowed - simply list them, space delimited (e.g. `shrub tree` for a species whose growth form includes both).
- Combinations of multiple trait values are reorganised into alphabetical order to collapse them into fewer combinations (e.g. "fire_killed resprouts" and "resprouts fire_killed" are alphabetised and hence collapsed into one combination, "fire_killed resprouts").
- If a trait value is `N` or `Y`, it needs to be in single, straight quotes (usually edited later, directly in the `metadata.yml` file).
If you have many substitutions to add, it is more efficient to create a spreadsheet with a list of all `trait_name` by `trait_value` combinations requiring substitutions. The spreadsheet should have four columns with headers `dataset_id`, `trait_name`, `find` and `replace`. This table can be read directly into the `metadata.yml` file using the function `metadata_add_substitutions_list`:
```{r, eval=FALSE, echo=TRUE}
substitutions_to_add <-
  readr::read_csv("data/dataset_id/raw/substitutions_required.csv")

traits.build::metadata_add_substitutions_list(current_study, substitutions_to_add)
```
Once you've built the new dataset (see below), you can quickly create a table of all values that require substitutions: