---
editor:
markdown:
wrap: 72
---
# Adding datasets, a lengthy guide
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
results = "asis",
echo = FALSE,
message = FALSE,
warning = FALSE
)
library(traits.build)
```
```{r, echo=FALSE, results='hide', message=FALSE}
## Loads austraits into global name space
austraits <- austraits:::austraits_5.0.0_lite
schema <- get_schema()
definitions <- austraits$definitions
```
This vignette is an exhaustive reference for adding datasets to a traits.build database.
If you are embarking on building a new database using the `traits.build` standard, a better place to get started is the series of 7 [tutorials](tutorial_dataset_1.html).
Then come back to this document for details and unusual dataset circumstances not covered in the tutorials.
Other chapters you may want to read include:
- an [overview of `traits.build`](overview.html),
- the instructions provided to [data contributors](contributing_data.html),
- the [structure of a compiled `traits.build` database](database_structure.html),
- the [structure of the raw data files](file_structure.html), and
- the [overview for adding data](adding_data_brief.html).
## Getting started
The `traits.build` package offers a workflow to build a harmonised trait database from disparate sources, with different data formats and containing varying metadata.
There are two key components required to merge datasets into a database with a common output structure:
1) A workflow to wrangle datasets into a standardised input format, using a combination of `{traits.build}` functions and manual steps.
2) A process to harmonise information across datasets and build them into a single database.
This document details all the steps to format datasets into a pair of standardised input files: a tabular data file and a structured metadata file. It includes examples of code you might use.
To begin, install the traits.build package.
```{r, echo=TRUE, eval=FALSE}
#remotes::install_github("traitecoevo/traits.build", quick=TRUE)
library(traits.build)
```
## Standardised input files required
## Create a dataset folder
Add a new folder within the `data` folder. Its name should be the study's unique `dataset_id`.
The preferred format for `dataset_id` is the surname of the first author of any corresponding publication, followed by the year, as `surname_year`, e.g. `Falster_2005`. Whenever there are multiple studies with the same id, we add a suffix `_2`, `_3`, etc., e.g. `Falster_2005`, `Falster_2005_2`.
`dataset_id` is one of the core identifiers within a `traits.build` database.
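Creating the folder is a single command; a minimal sketch (the `dataset_id` shown is hypothetical):
```{r, eval=FALSE, echo=TRUE}
# Create the dataset folder; its name is the study's dataset_id
dir.create("data/Falster_2005_2", recursive = TRUE)
```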
## Constructing the `data.csv` file {#csv_file}
The trait data for each study (`dataset_id`) must be in a single table, `data.csv`. The `data.csv` file can either be in a wide format (1 column for each trait, with the various `trait names` as the column headers) or long format (a single column for all `trait values` and an additional column for `trait name`).
### Required columns
- `taxon_name`
- `trait_name` (many columns for wide format; 1 column for long format)
- `value` (trait value; for long format only)
- `location_name` (if required)
- `contexts` (if required)
- `collection_date` (if required)
- `individual_id` (if required)
a. For all field studies, ensure there is a column for `location_name`. If all measurements were made at a single location, a `location_name` column can easily be mutated using [custom_R_code](#custom_R) within the metadata.yml file. See sections [adding locations](#adding_locations) and [adding contexts](#adding_contexts) below for more information on compiling location and context data.
b. If available, be sure to include a column with `collection date`. If possible, provide dates in `yyyy-mm-dd` (e.g. 2020-03-05) format or, if the day of the month isn't known, as `yyyy-mm` (e.g. 2020-03). However, any format is allowed and the column can be parsed to the proper yyyy-mm-dd format using `custom_R_code`. If the same `collection date` applies to the entire study it can be added directly into the metadata.yml file.
c. If applicable, ensure there are columns for all context properties, including experimental treatments, specific differences in method, a stratified sampling scheme within a plot, or sampling season. Additional context columns can be added through `custom_R_code` or keyed in where traits are added, but it is best to include a column in the data.csv file whenever possible. The protocol for adding context properties to the metadata file is under [adding contexts](#adding_contexts).
### Data may need to be summarised
Data submitted by a contributor should be in the rawest form possible; always request data with individual measurements over location/species means.
Some datasets include replicate measurements on an individual at a single point in time, such as the leaf area of 5 individual leaves. In AusTraits (the Australian plant trait database) we generally merge such measurements into an `individual mean` in the `data.csv` file, but the raw values are preserved in the contributor's raw data files. Be sure to calculate the number of replicates that contributed to each mean value.
When there is just a single column of trait values to summarise, use:
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/raw_data.csv") %>%
  # group by every categorical variable you want to retain
  dplyr::group_by(individual, `species name`, location, context) %>%
  dplyr::summarise(
    leaf_area_mean = mean(leaf_area),
    leaf_area_replicates = dplyr::n()
  ) %>%
  dplyr::ungroup()
```
*Make sure you `group_by` all categorical variables you want to retain, as only columns that are grouping variables will be kept.*
When you want to take the mean of multiple data columns simultaneously, use:
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/raw_data.csv") %>%
  # group by every categorical variable you want to retain
  dplyr::group_by(individual, `species name`, location, context) %>%
  dplyr::summarise(
    dplyr::across(c(leaf_area, `leaf N`), ~ mean(.x, na.rm = TRUE)),
    dplyr::across(c(growth_form, `photosynthetic pathway`), ~ dplyr::first(.x)),
    replicates = dplyr::n()
  ) %>%
  dplyr::ungroup()
```
`{dplyr}` hints:
- Categorical variables not included as grouping variables will return `NA`.
- Generally use the function `dplyr::first` for categorical variables - it simply retains the trait value in the first row of each group.
- You can identify runs of columns by position or by name. For instance, `across(c(5:25), ~ mean(.x, na.rm = TRUE))` or `across(c(leaf_area:leaf_N), ~ mean(.x, na.rm = TRUE))`.
- Be sure to `ungroup` at the end.
- Before summarising, ensure variables you expect to be numeric are indeed numeric: `utils::str(data)`.
### Merging multiple spreadsheets
If multiple spreadsheets of data are submitted these must be merged together.
- If the spreadsheets include different trait measurements made on the same individual (or location means for the same species), they are best merged using `dplyr::left_join`, specifying all conditions that need to be matched across spreadsheets (e.g. individual, species, location, context). Ensure the column names are identical between spreadsheets or specify columns that need to be matched.
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
readr::read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2

data_1 %>%
  dplyr::left_join(
    data_2,
    by = c("Individual", "Taxon" = "taxon", "Location", "Context")
  )
```
- If the spreadsheets include trait measurements for different individuals (or possibly data at different scales - such as individual level data for some traits and species means for other traits), they are best merged using `dplyr::bind_rows`. Ensure the column names for taxon name, location name, context, individual, and collection date are identical between spreadsheets. If there are data for the same traits in both spreadsheets, make sure those column headers are identical as well.
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
readr::read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2

data_1 %>%
  dplyr::bind_rows(data_2)
```
### Taxon names
Taxon names need to be complete names. If the main data file includes code names, with a key as a separate file, they are best merged now to avoid many individual replacements later.
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/species_key.csv") -> species_key
readr::read_csv("data/dataset_id/raw/data_file.csv") -> data

data %>%
  dplyr::left_join(species_key, by = "code")
```
### Unexpected hangups
- When Excel saves an `.xls` file as a `.csv` file it only preserves the number of significant figures displayed on the screen. This means that if a column has been set to display very few significant figures, or a column is very narrow, data precision is lost.
- If you're reading a file into R where there are lots of blanks at the beginning of a column of numeric data, the defaults for `readr::read_csv` fail to register the column as numeric. This is fixed by adding the argument `guess_max`:
```{r, eval=FALSE, echo=TRUE}
readr::read_csv("data/dataset_id/raw/raw_data.csv", guess_max = 10000)
```
This makes `readr` scan 10,000 rows of data before guessing whether the column is numeric.
(When `data.csv` files are read in through the `{traits.build}` workflow, `guess_max = 100000` is used.)
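Alternatively, you can declare the column type explicitly rather than rely on type guessing; a minimal sketch (the column name `leaf_area` is hypothetical):
```{r, eval=FALSE, echo=TRUE}
# Declaring the type up front means leading blanks cannot derail type guessing
readr::read_csv(
  "data/dataset_id/raw/raw_data.csv",
  col_types = readr::cols(leaf_area = readr::col_double())
)
```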
## Constructing the `metadata.yml` file {#metadata_file}
As described in detail [here](https://traitecoevo.github.io/traits.build-book/workflow.html), the `metadata.yml` file maps the meanings of the individual columns within the `data.csv` file and documents all additional dataset metadata.
Before beginning, it is a good idea to look at the two example dataset metadata files in the [`traits.build-template` repository](https://github.com/traitecoevo/traits.build-template/tree/master/data), to become familiar with the general structure.
The sections of the `metadata.yml` file are:
- [source](#source)
- [contributors](#contributors)
- [dataset](#metadata_dataset) (includes adding [custom R
code](#custom_R))
- [locations](#adding_locations)
- [contexts](#adding_contexts)
- [traits](#add_traits)
- [substitutions](#add_substitutions)
- [taxonomic_updates](#add_taxonomic_updates)
- [exclude_observations](#exclude_observations)
- [questions](#questions)
This document covers these metadata sections in sequence.
### Use a proper text editor
- Install a proper text editor, such as Visual Studio Code (our favorite), RStudio, TextMate, or Sublime Text. Using Microsoft Word will make a mess of the formatting.
### Source the `{traits.build}` functions
To assist you in constructing the `metadata.yml` file, we have developed functions to help propagate and fill in the different sections of the file.
If you haven't already, run:
```{r, eval=FALSE, echo=TRUE}
library(traits.build)
```
The functions for populating the metadata file all begin with `metadata_`.
A full list is available [here](https://traitecoevo.github.io/traits.build/reference/index.html#creating-metadata-files).
### Creating a template
The first step is to create a blank `metadata.yml` file.
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_create_template("Yang_2028")
```
As each function requires the `dataset_id` as an argument, it can be useful to assign the dataset's id to a variable you can use repeatedly:
```{r, eval=FALSE, echo=TRUE}
current_study <- "Yang_2028"
traits.build::metadata_create_template(current_study)
```
This function cycles through a series of user-input menus, querying about both the data format (long versus wide) and which columns contain which variables (taxon name, location name, individual identifiers, collection date). It then creates a relatively empty metadata file `data/dataset_id/metadata.yml`.
The questions are:
- Is the data long or wide format?
A wide dataset has each variable (i.e. trait) as a separate column. A long dataset has a single column containing all trait values, with a second column specifying the trait name.
- Select column for `taxon_name`
- Select column for `trait_name` (long datasets only)
- Select column for `trait values` (long datasets only)
- Select column for `location_name`
If your `data.csv` file does not yet have a `location_name` column, this information can later be added manually.
- Select column for `individual_id` (a column that links measurements on the same individual)
- Select column for `collection_date`
If your `data.csv` file does not have a `collection_date` column, you will be prompted to *Enter collection_date range in format '2007/2009'*. A fixed value in a `yyyy`, `yyyy-mm` or `yyyy-mm-dd` format is accepted, either as a single value or range of values. This information can be edited later.
- Indicate whether all traits need `repeat_measurements_id`'s
`repeat_measurements_id`'s are only required if the dataset documents response curve data (e.g. an A-ci or light response curve for plants; or a temperature response curve for animal or plant behaviour). They can also be added to individual traits (later). They are intended to capture multiple "sub-measurements" that together comprise a single "trait measurement".
### Adding a source {#source}
The skeletal `metadata.yml` file created by `metadata_create_template` included a section for the primary source with default fields for a journal article.
You can manually enter citation details, but whenever possible, use one of the three functions developed to automatically propagate citation details.
#### **Adding source from a doi**
If you have a `doi` for your study, use the function:
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi")
```
The different elements within the source will automatically be generated.
Double check the information added to ensure:
1. The title is in `sentence case`.
2. The information isn't in `all caps` (sources from a few journals get read in as all caps).
3. Page numbers are present and include `--` between page numbers (for example, `123 -- 134`).
4. If there is a colon (:) or apostrophe (') in a reference, the text for that line must be in quotes (").
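Once these checks are made, a cleanly formatted journal-article source might look like the following (a hypothetical entry, not a real reference):
```
source:
  primary:
    key: Smith_2020
    bibtype: Article
    year: 2020
    author: Jane Smith and Alex Nguyen
    title: Leaf trait variation across a rainfall gradient in eastern Australia
    journal: Australian Journal of Botany
    volume: 68
    pages: 123 -- 134
```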
By default, details are added as the primary source. If multiple sources are linked to a single `dataset_id`, you can specify a source as `secondary`.
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi",
                                      type = "secondary")
```
- Attempting to add a second primary source will overwrite the information already input. Instead, if there is a third source to add, use `type = "secondary_2"`.
- Always check the `key` field, as it can be incorrect for hyphenated last names.
- If the dataset being entered is a compilation of many original sources, you should add all the original sources, specifying, `type = "original_01"`, `type = "original_02"` etc. See [Richards_2008](https://github.com/traitecoevo/austraits.build/blob/master/data/Richards_2008/metadata.yml) for an example of a complex source list.
#### **Adding source from a bibtex file**
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_add_source_bibtex(dataset_id, file = "myref.bib")
```
(These options require the packages [rcrossref](https://github.com/ropensci/rcrossref) and [RefManageR](https://github.com/ropensci/RefManageR/) to be installed.)
#### **Proper formatting of different source types**
Different source types require different fields and formatting:
**Book:**
```
source:
  primary:
    key: Cooper_2013
    bibtype: Book
    year: 2013
    author: Wendy Cooper and William T. Cooper
    title: Australian rainforest fruits
    publisher: CSIRO Publishing
    pages: 272
```
**Online resource:**
```
source:
  primary:
    key: TMAG_2009
    bibtype: Online
    author: '{Tasmanian Herbarium}'
    year: 2009
    title: Flora of Tasmania Online
    publisher: Tasmanian Museum & Art Gallery (Hobart)
    url: http://www.tmag.tas.gov.au/floratasmania
```
**Thesis:**
```
source:
  primary:
    key: Kanowski_2000
    bibtype: Thesis
    year: 1999
    author: John Kanowski
    title: Ecological determinants of the distribution and abundance of the folivorous
      marsupials endemic to the rainforests of the Atherton uplands, north Queensland.
    type: PhD
    institution: James Cook University, Townsville
```
**Unpublished dataset:**
```
source:
  primary:
    key: Ooi_2018
    bibtype: Unpublished
    year: 2018
    author: Mark K. J. Ooi
    title: "Unpublished data: Herbivory survey within Royal National Park, University
      of New South Wales"
```
- Note the title of an unpublished dataset must begin with the words "Unpublished data" and include the data collector's affiliation.
### Adding contributors {#contributors}
The skeletal `metadata.yml` file created by the function `metadata_create_template` includes a template for entering details about data contributors. Edit this manually, duplicating if details for multiple people are required.
- `data_collectors` are people who played a key intellectual role in the study's experimental design and data collection. Most studies have 1-3 `data_collectors` listed. Four fields of information are required for each data collector: `last_name`, `given_name`, `affiliation` and `ORCID` (if available). Nominate a single data collector to be the dataset's point of contact.
- Additional field assistants can be listed under `assistants`.
- The data entry person is listed under `dataset_curators`.
- Email addresses for the `data_collectors` are not included in the `metadata.yml` file, but it is recommended that a database curator maintain a list of email addresses for all data collectors to whom authorship may be extended for a future database data paper. Authorship "rules" will vary across databases, but for AusTraits we extend authorship to all `data_collectors` whom we successfully contact.
For example, in Roderick_2002:
```
contributors:
  data_collectors:
  - last_name: Roderick
    given_name: Michael
    ORCID: 0000-0002-3630-7739
    affiliation: The Australian National University, Australia
    additional_role: contact
  assistants: Michelle Cochrane
  dataset_curators: Elizabeth Wenk
```
### Custom R code {#custom_R}
The goal is always to maintain `data.csv` files that are as similar as possible to the contributed dataset. However, for many studies there are minor changes we want to make to a dataset before the data.csv file is processed by the `{traits.build}` workflow. These may include applying a function to transform a particular column of data, a function to filter data, or a function to replace a contributor's "measurement missing" placeholder symbol with `NA`. In each case it is appropriate to leave the rawer data in `data.csv` and edit the data table as it is read into the `{traits.build}` workflow.
#### **Background**
To allow custom modifications to a particular dataset before the common pipeline of operations gets applied, the workflow permits some customised R code to be run as a first step in the processing pipeline. That pipeline (the function `process_custom_code` called within [`dataset_process`](https://github.com/traitecoevo/traits.build/blob/master/R/process.R)) looks like this:
```{r, eval=FALSE, echo=TRUE}
data <-
  readr::read_csv(filename_data_raw, col_types = cols(), guess_max = 100000,
                  progress = FALSE) %>%
  process_custom_code(metadata[["dataset"]][["custom_R_code"]])()
```
The final step in the pipe shows that the custom code gets applied right after the file is loaded.
#### **Overview of options and syntax**
- A copy of the file of functions the AusTraits team have developed explicitly for use within the custom_R_code field is available at [custom_R_code.R](https://github.com/traitecoevo/traits.build-template/blob/master/R/custom_R_code.R); it should be placed within the `R` folder of your database repository, then sourced (`source("R/custom_R_code.R")`).
- Place a single apostrophe (') at the start and end of your custom R code; this allows you to add line breaks between pipes.
- Begin your custom R code with `data %>%`, then apply whatever fixes are needed.
- Use functions from the packages [dplyr](https://dplyr.tidyverse.org), [tidyr](https://tidyr.tidyverse.org), [stringr](https://stringr.tidyverse.org) (e.g. `mutate`, `rename`, `summarise`, `str_detect`), but avoid other packages.
- Alternatively, use the functions we've created explicitly for pre-processing data, sourced through the file `custom_R_code.R`. You may choose to expand this file within your own database repository.
- Custom R code is not intended for reading in files. Any reading in and merging of multiple files should be done before creating the dataset's `data.csv` file.
- Use pipes to weave together a single statement where possible. If you need to manipulate/subset the data.csv file into multiple data frames and then bind them back together, use semicolons (`;`) at the end of each statement. (A complete `custom_R_code` entry is sketched below.)
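As a sketch, a minimal `custom_R_code` entry in `metadata.yml` might look like the following (the particular fix shown is hypothetical); note the wrapping apostrophes and the leading `data %>%`:
```
custom_R_code: '
  data %>%
    dplyr::mutate(location_name = "Broken Hill")
'
```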
##### Examples of appropriate use of custom R code
1. **Converting times to `NY` strings**
Most sources from herbaria record `flowering_time` and `fruiting_time` as a span of months, while AusTraits codes these variables as a sequence of 12 N's and Y's for the 12 months. A series of functions make this conversion in custom_R_code (see the sketch after this list). These include:
- '`format_flowering_months`' (Create flowering times from start to end pair)
- '`convert_month_range_string_to_binary`' (Converts flowering and fruiting month ranges to 12 element character strings of binary data)
- '`convert_month_range_vec_to_binary`' (Convert vectors of month range to 12 element character strings of binary data)
- '`collapse_multirow_phenology_data_to_binary_vec`' (Converts multi-row phenology data to a 12 digit binary string)
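A sketch of how one of these helpers might be applied (the column name `flowering` and the exact function signature are assumptions; check `custom_R_code.R` for the actual arguments):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    # e.g. "Sep-Nov" becomes "NNNNNNNNYYYN"
    flowering_time = convert_month_range_string_to_binary(flowering)
  )
```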
2. **Splitting ranges into min, max pairs**
Many datasets from herbaria record traits like `leaf_length`, `leaf_width`, `seed_length`, etc. as a range (e.g. `2-8`). The function `separate_range` separates these data into a pair of columns with `minimum` and `maximum` values, which is the preferable way to merge the data into a trait database; see the sketch below.
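If you need the same behaviour without the helper, here is a minimal sketch using `tidyr` directly (column names are hypothetical):
```{r, eval=FALSE, echo=TRUE}
data %>%
  # split "2-8" into leaf_length_min = 2 and leaf_length_max = 8;
  # single values without a dash land in the min column
  tidyr::separate(leaf_length, into = c("leaf_length_min", "leaf_length_max"),
                  sep = "-", fill = "right", convert = TRUE)
```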
3. **Removing duplicate values within a dataset**
Duplicate values within a study need to be filtered out using the custom function `replace_duplicates_with_NA`.
If a species-level trait value has been entered repeatedly on rows containing individual-level trait measurements, you need to filter out the duplicates. For instance, plant growth form is generally a species-level observation, with the same value on every row with individual-level trait measurements. There are also instances where a population-level numeric trait appears repeatedly, such as if nutrient analyses were performed on a bulked sample at each site.
Before applying the function, you must group by the variable(s) that contain the unique values. This might be at the species or population level. For instance, use `group_by(Species, Location)` if there are unique values at the species x location level.
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::group_by(Species) %>%
  dplyr::mutate(
    across(c(`leaf_percentN`, `plant growth form`), replace_duplicates_with_NA)
  ) %>%
  dplyr::ungroup()
```
4. **Removing duplicate values across datasets**
Values that were sourced from a different study need to be filtered out. See [Duplicates between studies](#duplicates_between_studies) below; functions to automate this process are in progress.
5. **Replacing "missing values" with NA's**
If missing data values in a dataset are represented by a symbol, such as `0` or `*`, these need to be converted to NA's:
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    across(c(`height (cm)`, `leaf area (mm2)`), ~ dplyr::na_if(.x, 0))
  )
```
6. **Mapping data from one trait to a second trait, part 1**
If a subset of data in a column are also `values` for a second trait in AusTraits, some data values can be duplicated into a second temporary column. In the example below, some data in the contributor's `fruit_type` column **also** apply to the trait `fruit_fleshiness` in AusTraits:
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    fruit_fleshiness = ifelse(`fruit type` == "pome", "fleshy", NA)
  )
```
The function `move_values_to_new_trait` is being developed to automate this and currently resides in the [`custom_R_code.R`](https://github.com/traitecoevo/austraits.build/blob/master/R/custom_R_code.R) file within the austraits.build repository.
7. **Mapping data from one trait to a second trait, part 2**
If a subset of data in a column are *instead* `values` for a second trait in AusTraits, some data values can be moved to a second column (second trait), also using the function `move_values_to_new_trait`. In the example below, some data in the contributor's `growth_form` column *only* apply to the trait `parasitic` in AusTraits. Note you need to create a blank variable before moving the trait values.
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(parasitic = NA_character_) %>%
  move_values_to_new_trait(
    original_trait = "growth form",
    new_trait = "parasitic",
    original_values = "parasitic",
    values_for_new_trait = "parasitic",
    values_to_keep = "xx"
  ) %>%
  dplyr::mutate(across(c(`growth form`), ~ dplyr::na_if(.x, "xx")))
```
or
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(dispersal_appendage = NA_character_) %>%
  move_values_to_new_trait(
    original_trait = "fruits",
    new_trait = "dispersal_appendage",
    original_values = c("dry & winged", "enclosed in aril"),
    values_for_new_trait = c("wings", "aril"),
    values_to_keep = c("xx", "enclosed")
  ) %>%
  dplyr::mutate(across(c(fruits), ~ dplyr::na_if(.x, "xx")))
```
- Note, the parameter `values_to_keep` doesn't accept `NA`, leading to the clunky coding. This bug is known, but we haven't managed to fix it.
8. **Mutating a new trait from other traits**
If the `data.csv` file includes raw data that you want to manipulate into a `trait`, or the contributor presents the data in a different formulation than AusTraits, you may choose to mutate a new column, containing a new `trait`.
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    root_mass_fraction = `root mass` / (`root mass` + `shoot mass`)
  )
```
9. **Mutating a location name column**
If the dataset has location information, but lacks unique location names (or any location name), you might mutate a `location name` column to map in. (See also [Adding location details](#adding_locations)).
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    location_name = ifelse(location_name == "Mt Field" & habitat == "Montane rainforest",
                           "Mt Field_wet", location_name),
    location_name = ifelse(location_name == "Mt Field" & habitat == "Dry sclerophyll",
                           "Mt Field_dry", location_name)
  )
```
or
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    location_name = dplyr::case_when(
      longitude == 151.233056 ~ "heath",
      longitude == 151.245833 ~ "terrace",
      longitude == 151.2917 ~ "diatreme"
    )
  )

# Note with `dplyr::case_when`,
# any rows that do not match any of the conditions become `NA`'s.
```
or
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    location_name = paste0("lat_", round(latitude, 3), "_long_", round(longitude, 3))
  )
```
10. **Generating `measurement_remarks`**
Sometimes there is a note column with abbreviated information about individual rows of data that is appropriate to map as a context. This could be included in the field `measurement_remarks`:
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    measurement_remarks = paste0("maternal lineage ", Mother)
  )
```
11. **Reformatting dates**
You can reformat `collection_dates` to conform to the `yyyy-mm-dd` format, or add a date column.
Converting from any `mdy` format to `yyyy-mm-dd` (e.g. `Dec 3 2015` to `2015-12-03`):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    Date = Date %>% lubridate::mdy()
  )
```
Converting from any `dmy` format to `yyyy-mm-dd` (e.g. `3-12-2015` to `2015-12-03`):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    Date = Date %>% lubridate::dmy()
  )
```
Converting from a `mmm-yyyy` (string) format to `yyyy-mm` (e.g. `Dec 2015` to `2015-12`):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    Date =
      lubridate::parse_date_time(Date, orders = "my") %>%
      base::format.Date("%Y-%m")
  )
```
Converting from a `mdy` format to `yyyy-mm` (e.g. Excel has reinterpreted the data as full dates `12-01-2015` but the resolution should be "month", `2015-12`):
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    Date =
      lubridate::parse_date_time(Date, orders = "mdy") %>%
      base::format.Date("%Y-%m")
  )
```
A particularly complicated example where some dates are presented as `yyyy-mm` and others as `yyyy-mm-dd`:
```{r, eval=FALSE, echo=TRUE}
data %>%
  dplyr::mutate(
    weird_date = ifelse(stringr::str_detect(gathering_date, "^[0-9]{4}"),
                        gathering_date, NA),
    gathering_date = gathering_date %>%
      lubridate::mdy(quiet = TRUE) %>% as.character(),
    gathering_date = dplyr::coalesce(gathering_date, weird_date)
  ) %>%
  dplyr::select(-weird_date)
```
#### **Testing your custom R code**
After you've added the custom R code to the metadata file, check that the output is indeed as intended:
```{r, eval=FALSE, echo=TRUE}
metadata_check_custom_R_code("Blackman_2010")
```
### Fill in `metadata[["dataset"]]` {#metadata_dataset}
The `dataset` section includes fields that are:
1. filled in automatically by the function `metadata_create_template()`
2. mandatory fields that need to be filled in manually for all datasets
3. optional fields that are included and filled in only for a subset of datasets
#### **fields automatically filled in**
- **data_is_long_format** yes/no
- **taxon_name**
- **location_name**
- **collection_date** If this is not read in as a specified column, it needs to be filled in manually as `start date/end date` in yyyy-mm-dd, yyyy-mm, or yyyy format, depending on the relevant resolution. If the collection dates are unknown, write `unknown/publication year`, as in `unknown/2022`.
- **individual_id** Individual_id is one of the fields that can be read in during `metadata_create_template`. However, you may instead mutate your own `individual_id` using `custom_R_code` and add it in manually. For a wide dataset individual_id is required anytime there are multiple rows of data for the same individual and you want to keep these linked. This field should only be included if it is required.
**WARNING** If you have an entry `individual_id: unknown` this assigns all rows of data to an individual named "unknown" and the entire dataset will be assumed to be from a single individual. This is why it is essential to omit this field if there isn't an actual row of data being read in.
**NOTE** For individual-level measurements, each row of data is presumed to be a different individual during dataset processing. Individual_id is only required if there are multiple rows of data (long or wide format) with information for the same individual.
- **repeat_measurements_id** `repeat_measurements_id`'s are sequential integer identifiers assigned to a sequence of measurements on a single trait that together represent a single observation (and are assigned a single `observation_id` by the `traits.build` pipeline). The assumption is that these are measurements that document points on a response curve. The function `metadata_create_template` offers an option to add it to `metadata[["dataset"]]`, but it can alternatively be specified under specific traits, as `repeat_measurements_id: TRUE`. (A sketch of a filled-in `dataset` section follows this list.)
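As a sketch, the automatically propagated part of a filled-in `dataset` section might look like this (all column names are hypothetical):
```
dataset:
  data_is_long_format: no
  taxon_name: Species
  location_name: site
  individual_id: tree_id
  collection_date: 2007/2009
```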
#### **required fields manually filled in**
- **description:** 1-2 sentence description of the study's goals. The abstract of a manuscript usually includes some good sentences/phrases to borrow.
- **basis_of_record:** Basis of record can be coded in as a fixed value for an entire dataset, by trait, by location or read in from a column in the `data.csv` file. If it is being read in from a column, list the column name in the field; otherwise input the fixed value. Allowable values are: `field`, `field_experiment`, `captive_cultivated`, `lab`, `preserved_specimen`, and `literature`. See the [database structure vignette](database_structure.html#basis_of_record) for definitions of these accepted basis_of_record values. If fixed values are specified for both the entire dataset under `metadata[["dataset"]]` and for specific locations/traits under `metadata[["locations"]]` or `metadata[["traits"]]`, the location/trait value overrides that entered under `metadata[["dataset"]]`.
- **life_stage:** Life stage can be coded in as a fixed value for an entire dataset, by trait, by location or read in from a column in the `data.csv` file. If it is being read in from a column, list the column name in the field; otherwise input the fixed value. Allowable values are: `adult`, `sapling`, `seedling`, `juvenile`. See the [database structure vignette](database_structure.html#life_stage) for definitions of these accepted life_stage values. If fixed values are specified for both the entire dataset under `metadata[["dataset"]]` and for specific locations/traits under `metadata[["locations"]]` or `metadata[["traits"]]`, the location/trait value overrides that entered under `metadata[["dataset"]]`.
- **sampling_strategy:** Often a quite long description of the sampling strategy, extracted verbatim from a manuscript whenever possible.
- **original_file:** The name of the file initially submitted to the database curators. It is generally archived in the dataset folder, in a subfolder named `raw`. For AusTraits, datasets are also usually archived in the project's Google Drive folder.
- **notes:** Notes about the study and processing of data, especially if there were complications or if some data were suspected to be duplicates of another study and were filtered out.
#### **optional fields manually filled in**
- **measurement_remarks**: Measurement remarks is a field to capture a miscellaneous notes column. This should be information that is not captured by trait methods (which is fixed to a single value for a trait) or as a `context`. Measurement_remarks can be coded in as a fixed value for an entire dataset, by trait, by location or read in from a column in the `data.csv` file.
- **entity_type** is standardly added to each trait, and is described below under traits, but a fixed value or column can be read in under `metadata[["dataset"]]`.
### Adding location details {#adding_locations}
Location data includes location names, latitude/longitude coordinates, verbal location descriptions, and any additional abiotic/biotic location variables provided by the contributor (or in the accompanying manuscript). For studies with more than a few locations, it is most efficient to create a table of this data that is automatically read into the `metadata.yml` file.
The function `metadata_add_locations` automatically propagates location information from a stand-alone location properties table into `metadata[["locations"]]`:
```{r, eval=FALSE, echo=TRUE}
locations <- read_csv("data/dataset_id/raw/locations.csv")
traits.build::metadata_add_locations(current_study, locations)
```
The function `metadata_add_locations` first prompts the user to identify the column with the location name and then to list all columns that contain location data. This automatically fills in the location component on the metadata file.
Rules for formatting a `locations` table to read in:
1. Location names must be identical (including syntax, case) to those in `data.csv`
2. Column headers for latitude and longitude data must read `latitude (deg)` and `longitude (deg)`
3. Latitude and longitude must be in decimal degrees (e.g. -46.5832). There are many online converters to convert from `degrees,minutes,seconds` format or `UTM`. Or use the following formula: `decimal_degrees = degrees + (minutes/60) + (seconds/3600)` (see the sketch after this list)
4. If there is a column with a general vegetation description (e.g. `rainforest`, `coastal heath`), it should be titled `description`
5. Although location properties are not restricted to a controlled vocabulary, newly added studies should use the same location property syntax as others whenever possible, to allow future discoverability. To generate a list of terms already used under `location_property`:
```{r, eval=FALSE, echo=TRUE}
database$locations %>% dplyr::distinct(location_property)
```
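For rule 3, a minimal sketch of that formula in R (the example values are hypothetical; negate the result for southern latitudes and western longitudes):
```{r, eval=FALSE, echo=TRUE}
# Convert degrees-minutes-seconds to decimal degrees
decimal_degrees <- function(degrees, minutes, seconds) {
  degrees + (minutes / 60) + (seconds / 3600)
}

-decimal_degrees(46, 34, 59.5) # a southern-hemisphere latitude: -46.5832
```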
Some examples of syntax to add `locations` data that exists in different formats.
- When the main data.csv file has columns for a few location properties:
```{r, eval=FALSE, echo=TRUE}
locations <-
  metadata_check_custom_R_code(current_study) %>%
  dplyr::distinct(location_name, latitude, longitude, `veg type`) %>%
  dplyr::rename(dplyr::all_of(c("latitude (deg)" = "latitude",
                                "longitude (deg)" = "longitude",
                                "description" = "veg type")))

traits.build::metadata_add_locations(current_study, locations)
```
- If you want to add or edit the data, it is probably easiest to save the `locations` table as a csv, edit it in Excel, then read it back into R.
- It is possible that you will want to specify `life_stage` or `basis_of_record` at the location level. When required, it is usually easiest to manually add these fields to some or all locations.
### Adding contexts {#adding_contexts}
The dictionary definition of a context is *the situation within which something exists or happens, and that can help explain it*. This is exactly what `context_properties` are in AusTraits, ancillary information that is important to explaining and understanding a trait value.
AusTraits recognises 5 categories of contexts:
- **treatment contexts** Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. Examples include soil nutrient manipulations, growing temperatures, or CO2 enhancement.
- **plot contexts** Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. Examples are a property that is stratified within a "geographic location", such as topographic position. `Plots` are of course `locations` themselves; what is a `location` vs `plot_context` depends on the geographic resolution a dataset collector has applied to their locations.
- **entity contexts** Context property that is information about an organismal entity (individual, population or taxon) that does not comprise a trait-centered observation but might affect the trait values measured on the entity. This might be the entity's sex, caste (for social insects), or host plant (for insects).
- **temporal contexts** Context property that is a feature of a "point in time" that might affect the trait values measured on an individual, population or species-level entity. They generally represent repeat measurements on the same entity across time and may simply be numbered observations or might be explicitly linked to growing season or time of day.
- **method contexts** Context property that records specific information about a measurement method that is modified between measurements. These might be samples from different canopy light environments, different leaf ages, or sapwood samples from different branch diameters.
Context properties are not restricted to a controlled vocabulary. However, newly added studies should use the same context property syntax as others whenever possible, to allow future discoverability. To generate a list of terms already used under `context_property`, use:
```{r, echo=TRUE, eval=FALSE}
database$contexts %>%
  dplyr::distinct(context_property, category)
```
Context properties are most easily read into the `metadata.yml` file with the dedicated function:
```{r, echo=TRUE, eval=FALSE}
traits.build::metadata_add_contexts(dataset_id)
```
The function first displays a list of all data columns (from the data.csv file) and prompts you to select those that are context properties.
1. For each column you are asked to indicate its `category` (those described above).
2. You are shown a list of the unique values present in the data column and asked if these require any substitutions. (y/n)
3. You are asked if descriptions are required for the context property values (y/n)
This function then adds the contexts to the `metadata[["contexts"]]` section.
If you selected both substitutions and descriptions required:
```
- context_property: unknown
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: unknown
    description: unknown
  - find: DEC
    value: unknown
    description: unknown
  - find: FEB
    value: unknown
    description: unknown
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: unknown
    description: unknown
  - find: added CO2
    value: unknown
    description: unknown
```
If you selected just substitutions required:
```
- context_property: unknown
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: unknown
  - find: DEC
    value: unknown
  - find: FEB
    value: unknown
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: unknown
  - find: added CO2
    value: unknown
```
If you selected neither substitutions nor descriptions required:
```
- context_property: unknown
  category: temporal_context
  var_in: month
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
```
- You must then manually fill in the fields designated as `unknown`.
- If there is a value in a column that is not a context property, set its value to `value: .na`.
If there are additional context properties that were designated in the traits section, these will have to be added manually, as this information is not captured in a column that is read in. A final output might be:
```
- context_property: sampling season
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: August
    description: August (late winter)
  - find: DEC
    value: December
    description: December (early summer)
  - find: FEB
    value: February
    description: February (late summer)
- context_property: CO2 treatment
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: 400 ppm
    description: Plants grown at ambient CO2 (400 ppm).
  - find: added CO2
    value: 640 ppm
    description: Plants grown at elevated CO2 (640 ppm); 240 ppm above ambient.
- context_property: measurement temperature
  category: method_context
  var_in: method_context # this field would be included in the relevant traits
  values:
  - value: 20°C # this value would be keyed in through the relevant traits
    description: Measurement made at 20°C
  - value: 25°C
    description: Measurement made at 25°C
```
### Adding traits {#add_traits}
The function `metadata_add_traits()` adds a scaffold for trait metadata to the skeletal `metadata.yml` file.
```{r, eval=FALSE, echo=TRUE}
metadata_add_traits(current_study)
```
You will be asked to indicate which columns include trait data.
This automatically propagates the following metadata fields for each trait selected into `metadata[["traits"]]`. `var_in` is the name of a column in the `data.csv` file (for wide datasets) or a unique trait name in the `trait_name` column (for a long dataset):
```
- var_in: leaf area (mm2)
  unit_in: .na
  trait_name: .na
  entity_type: .na
  value_type: .na
  basis_of_value: .na
  replicates: .na
  methods: .na
```
The trait details then need to be filled in manually.
- **units**: fill in the units associated with the trait values in the submitted dataset - such as mm2 in the example above. If you're uncertain about the syntax/format used for some more complex units, look through the traits definition file (`config/traits.yml`) or the file showing unit conversions (`config/unit_conversions.csv`). For categorical variables, leave this as `.na`.
AusTraits uses the Unified Code for Units of Measure (UCUM) standard for units (https://ucum.org/ucum), but each database using the `traits.build` workflow can select its own choices for unit abbreviations. The UCUM standard follows clear, simple rules, but also has a flexible syntax for documenting, in curly brackets, notes that are recorded as part of the 'unit' for specific traits yet are not formally units. For instance, {count}/mm2 or umol{CO2}/m2/s, where the actual units are 1/mm2 and umol/m2/s. There are a few not-very-intuitive units in UCUM; `a` is `year` (annum).
**Notes**:
- If the units start with a punctuation symbol, the units must be in single, straight quotes, such as: `unit_in: '{count}/mm2'`
- It is best not to start units with a `-` (negative sign). In AusTraits we've adopted the convention of using, for instance, `neg_MPa` instead of `-MPa`
- **trait_name**: This is the name of the appropriate trait concept from the database's trait dictionary (`config/traits.yml`). For currently unsupported traits, leave this as `.na` but fill in the rest of the metadata and flag this study as having a potential new trait concept. Then, in the future, if an appropriate trait concept is added to the `traits.yml` file, the data can be read into the database by simply replacing the `.na` with a trait name. Each database will have its own criteria/rules for adding traits to the trait dictionary, and likely rules that evolve as the trait database grows. In AusTraits, if no appropriate trait concept exists in the trait dictionary, a new trait must be defined within the accompanying AusTraits Plant Dictionary and should only be added if it is clearly a distinct trait concept, can be explicitly defined, and there exists sufficient trait data that the measurements have comparative value.
- **entity_type**: Entity type indicates "what" is being observed for the trait measurements - as in the organismal-level to which the trait measurements apply. As such, `entity_type` can be `individual`, `population`, `species`, `genus`, `family` or `order`. Metapopulation-level measurements are coded as `population` and infraspecific taxon-level measurements are coded as `species`. See the [database structure vignette](database_structure.html#entity_type) for definitions of these accepted `entity_type` values.
**Note**:
- `entity_type` is about the "organismal-level" to which the trait measurement refers; this is separate from the taxonomic resolution of the entity's name.
- **value_type**: Value type indicates the statistical nature of the trait value recorded. Allowable value types are `mean`, `minimum`, `maximum`, `mode`, `range`, `raw`, and `bin`. See the [database structure vignette](database_structure.html#value_types) for definitions of these accepted value types. All categorical traits are generally scored as being a `mode`, the most commonly observed value. Note that for values that are `bins`, the two numbers are separated by a double-hyphen, `1 -- 10`.
- **basis_of_value**: Basis of value indicates how a value was determined. Allowable terms are `measurement`, `expert_score`, `model_derived`, and `literature`. See the [database structure vignette](database_structure.html#value_types) for definitions of these accepted `basis_of_value` values, but most categorical traits measurements are values that have been scored by an expert (`expert_score`) and most numeric trait values are `measurements`.
- **replicates**: Fill in with the appropriate number of measurements that comprise each value.
If the values are raw values (i.e. a measurement of an individual) `replicates: 1`.
If the values are, for instance, means of 5 leaves from an individual, `replicates: 5`.
If there is just a single population-level value for a trait, that comprises measurements on 5 individuals, `replicates: 5`.
For categorical variables, leave this as `.na`.
If there is a column that specifies replicate number, you can list the column name in the field.
- **methods**: This information can usually be copied verbatim from a manuscript and is a textual description of all components of the method used to measure the trait.
In general, methods sections extracted from pdfs include "special characters" (non-UTF-8 characters). Non-English alphabet characters are recognised (e.g. é, ö) and should remain unchanged. Other characters will be re-formatted during the study input process, so double check that degree symbols (º), en-dashes (--), em-dashes (---), and curly quotes (‘ ’ “ ”) have been maintained or reformatted with a suitable alternative. Greek letters and some other characters are replaced with their Unicode equivalent (e.g. \<U+03A8\> replaces Psi (Ψ)); for these it is best to replace the symbol with an interpretable English-character equivalent.
If there are two columns of data with measurements for the same trait using completely different methods, simply add the respective methods to the metadata for the respective columns. A `method_id` counter will be added to these during processing to ensure the correct trait values are linked to the correct methods. This is separate from `method_contexts`, which are minor tweaks to the methods between measurements that are expected to have concurrent effects on trait values (see below).
**NOTE**:
- If identical methods apply to a string of traits, for the first trait use the following syntax, where the `&leaf_length_method` notation assigns the remaining text in the field to the label `leaf_length_method`.
```
methods: &leaf_length_method All measurements were from dry herbarium
  collections, with leaf and bracteole measurements taken from the largest
  of these structures on each specimen.
```
Then for the next trait that uses this method you can just include the reference, as below. At the end of processing you can read/write the yml file and this will fill in the assigned text throughout.
```
methods: *leaf_length_method
```
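Putting these fields together, a completed trait entry might look like the following sketch (the trait name `leaf_area` and the methods text are hypothetical; check your database's `config/traits.yml` for supported trait concepts):
```
- var_in: leaf area (mm2)
  unit_in: mm2
  trait_name: leaf_area
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: Leaf area was measured on one fully expanded leaf per individual
    using a flatbed scanner.
```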
In addition to the automatically propagated fields, there are a number of optional fields you can add if appropriate.
- **life_stage** If all measurements in a dataset were made on plants of the same `life stage` a global value should be entered under [`metadata[["dataset"]]`](#metadata_dataset). However if different traits were measured at different life stages you can specify a unique `life stage` for each trait or indicate a column where this information is stored.
- **basis_of_record** If all measurements in a dataset represent the same `basis_of_record` a global value should be entered under [`metadata[["dataset"]]`](#metadata_dataset). However if different traits have different basis_of_record values you can specify a unique `basis_of_record` value for each trait or indicate a column where this information is stored.
- **measurement_remarks**: Measurement remarks is a field to indicate miscellaneous comments. If these comments only apply to specific trait(s), this field should be specified within those traits' metadata sections. This is meant to be information that is not captured by "methods" (which is fixed to a single value for a trait).
- **method_context** If different columns in a wide data.csv file indicate measurements of the same trait using different methods, this needs to be designated. At the bottom of the trait's metadata, add a `method_context_name` field (e.g. `method_context` or `leaf_age_type` are good options). Write a word or short phrase that indicates the method context property value that applies to that trait (data column). For instance, one trait might have `method_context: fully expanded leaves` and a second trait entry might have the same trait name and methods, but `method_context: leaves still expanding`. The method context details must also be added to the [contexts](#adding_contexts) section (see the sketch after this list).
- **temporal_context** If different columns in a wide data.csv file indicate measurements on the same trait, on the same individuals at different points in time, this needs to be designated. At the bottom of the trait's metadata, add a `temporal_context_name` field (e.g. `temporal_context` or `measurement_time_of_day` work well). Write a word or short phrase that indicates which temporal context applies to that trait (data column). For instance, one trait might have `temporal_context: dry season` and a second entry with the same trait name and method might have `temporal_context: after rain`. The temporal context details must also be added to the
[contexts](#adding_contexts) section.
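As a sketch, the trait-side half of a keyed-in method context might look like this (the trait and column names are hypothetical); the matching `context_property` entry sits under [contexts](#adding_contexts), as in the `measurement temperature` example above:
```
- var_in: photosynthesis_20C
  unit_in: umol{CO2}/m2/s
  trait_name: leaf_photosynthesis
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: Photosynthesis was measured on one leaf per individual.
  method_context: 20°C
- var_in: photosynthesis_25C
  unit_in: umol{CO2}/m2/s
  trait_name: leaf_photosynthesis
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: Photosynthesis was measured on one leaf per individual.
  method_context: 25°C
```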
### Adding substitutions {#add_substitutions}
It is very unlikely that a contributor will use categorical trait values entirely identical to the allowed trait values listed for the corresponding trait concept in the `traits.yml` file. You need to add substitutions for values that do not align exactly, to match the wording and syntax of the trait values in the trait dictionary.
`metadata[["substitutions"]]` entries are formatted as:
```
substitutions:
- trait_name: dispersal_appendage
  find: attached carpels
  replace: floral_parts
- trait_name: dispersal_appendage
  find: awn
  replace: bristles
- trait_name: dispersal_appendage
  find: awn bristles
  replace: bristles
```
The three elements it includes are:
- **trait_name** is the AusTraits defined trait name.
- **find** is the trait value used in the data.csv file.
- **replace** is the trait value supported by AusTraits.
You can manually type substitutions into the `metadata.yml` file, ensuring you have the syntax and spacing accurate.
Alternatively, the function `metadata_add_substitution` adds single substitutions directly into `metadata[["substitutions"]]`:
```{r, eval=FALSE, echo=TRUE}
traits.build::metadata_add_substitution(current_study, "trait_name", "find", "replace")
```
**Notes**:
- Combinations of multiple trait values are allowed - simply list them, space delimited (e.g. `shrub tree` for a species whose growth form includes both).
- Combinations of multiple trait values are reorganised into alphabetical order to collapse them into fewer combinations (e.g. "fire_killed resprouts" and "resprouts fire_killed" are alphabetised and hence collapsed into one combination, "fire_killed resprouts").
- If a trait value is `N` or `Y`, it needs to be in single, straight quotes (usually edited later, directly in the `metadata.yml` file).
If you have many substitutions to add, it is more efficient to create a spreadsheet with a list of all `trait_name` by `trait_value` combinations requiring substitutions. The spreadsheet should have four columns with headers `dataset_id`, `trait_name`, `find` and `replace`. This table can be read directly into the `metadata.yml` file using the function `metadata_add_substitutions_list`:
```{r, eval=FALSE, echo=TRUE}
substitutions_to_add <-
  readr::read_csv("data/dataset_id/raw/substitutions_required.csv")

traits.build::metadata_add_substitutions_list(current_study, substitutions_to_add)
```
Once you've built the new dataset (see below), you can quickly create a table of all values that require substitutions: