diff --git a/04_basic_data_processing.qmd b/04_basic_data_processing.qmd index 682157d..a5f5d85 100644 --- a/04_basic_data_processing.qmd +++ b/04_basic_data_processing.qmd @@ -1,9 +1,6 @@ # Basic data processing -Now that we understand how **R** handles data, we can start working with pre-existing data files. These files need to be correctly formatted and in a file format that **R** can recognize. Don't worry, there are plenty of options. - -The first step when loading data in **R** is to locate our working directory. This is the default location where **R** will look for files we want to load and where it will put any files we save. The working directory will change on different computers. To find our current working directory, we run: - +Now we can apply our understanding of **R** to work with files of pre-existing data. The first step when loading data into **R** is to locate our working directory. This is the default location where **R** will look for files we want to load and where it will put any files we save. The working directory will change on different computers. To find our current working directory, we run: ```{r get working directory} #| eval: false getwd() @@ -15,14 +12,14 @@ print("C:/Users/user_name/workshop_folder/learning_r/code") ``` -We can move our working directory to any folder on your computer with the function `setwd()`. Just give `setwd()` the [file path](https://www.codecademy.com/resources/docs/general/file-paths) to your new working directory. I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. This way, all files related to my project are in the same place. For example: +We can move our working directory to any folder on our computer by writing a new [file path](https://www.codecademy.com/resources/docs/general/file-paths) inside the function `setwd()`. I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. 
This way, every file related to my project is in the same place. For example: ```{r} #| eval: false setwd("C:/Users/user_name/workshop_folder/learning_r/code") ``` -We can also change your working directory by clicking on Session > Set Working Directory > Choose Directory in the **R**Studio menu bar. The Windows and Mac graphical user interfaces have similar options. If we start **R** from a UNIX command line (as on Linux machines), the working directory will be whichever directory we were in when we called **R**. +We can also change our working directory by clicking on Session > Set Working Directory > Choose Directory in the **R**Studio menu bar. The Windows and Mac graphical user interfaces have similar options. If we start **R** from a UNIX command line (as on Linux machines), the working directory will be whichever directory we were in when we called **R**. `list.files()` will show us what files are in our working directory. If the file that we want to open is in our working directory, then we are ready to proceed. @@ -32,10 +29,10 @@ Once we know where to find data files in our computer, we can start loading them ### Plain text files -Plain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau, the Social Security Administration, etc.) publish their data as plain-text files. - A plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple symbol separates the cells within a row. This symbol is often a comma, but it can also be a tab, a pipe delimiter `|`, or any other character. Each file only uses one symbol to separate cells, which minimizes confusion. +Plain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau and the Social Security Administration) publish their data as plain-text files. 
+ We will work with data from [this](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower.csv)^[You can find the original file [here](https://alexd106.github.io/intro2R/data.html) courtesy of Douglas et al. (see references).] plain text file. Use `Ctrl+Shift+s` to download the file. I am going to save it in a folder called "data_files" inside my working directory under the name "flower.csv". But you can save it wherever you want as long as you can keep track of it. #### read.table @@ -45,9 +42,9 @@ We will work with data from [this](https://github.com/CSCAR/workshop-r-intro/blo flower_df <- read.table("data_files/flower.csv", header = TRUE, sep = ",") ``` -In the code above, I added two more arguments, `header` and `sep`. `header` tells **R** whether the first line of the file contains variable names instead of values. `sep` tells **R** the symbol that the file uses to separate the cells. +In the code above, I added arguments `header` and `sep`. `header` tells **R** whether the first line of the file contains variable names instead of values; this will help us identify the variables in the data frame. `sep` tells **R** the symbol that the file uses to separate the cells; this will help us preserve the correct location of the data cells. -Sometimes a plain-text file starts with text that is not part of the data set. Or, maybe we want to read only part of a data set. Argument `skip` tells **R** to skip a specific number of lines before it starts reading in values from the file. Argument `nrow` tells **R** to stop reading in values after it has read in a certain number of lines. Keep in mind that the header row doesn’t count towards the total rows allowed by `nrow`. +Sometimes a plain-text file starts with text that is not part of the data set. Or maybe we want to read only part of a data set. Argument `skip` tells **R** to skip a specific number of lines before it starts reading values from the file. 
Argument `nrow` tells **R** to only read a certain number of lines, starting from the top. Keep in mind that `nrow` does not count the header in the number of rows it reads. ```{r} flower_df_chunk <- read.table( @@ -74,11 +71,11 @@ flower_df_chunk #### HTML links -`read.table()` and its shortcuts allow us to load data files directly from a website. Instead of using the file's path or name, we can directly use a web address in the `file` argument of the function. Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file. +`read.table()` and its shortcuts allow us to load data files directly from a website. Instead of using the file's path or name, we can directly use a web address in the `file` argument of the function. Make sure to use the web address that links directly to the file, not to a web page that has a link to the file. #### read.fwf -*Fixed-width file* (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries. +A *fixed-width file* (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries. If our flowers data came in a fixed-width file, the first few lines would look like this: ```{flowers as a fwf} @@ -94,7 +91,7 @@ tip medium 1 9.8 10.08 12.2 72.7 9 Fixed-width files may be visually intuitive, but they are difficult to work with. 
This may explain why **R** has a function for reading fixed-width files, but not for saving them. -We can read fixed-width files into R with the function `read.fwf()`. This function adds another argument to the ones from `read.table()`: `widths`, which should be a vector of numbers. Each ith entry of the `widths` vector should state the width (in characters) of the ith column of the data set. +We can read fixed-width files into **R** with the function `read.fwf()`. This function adds another argument to the ones from `read.table()`: `widths`, which should be a vector of numbers. Each ith entry of the `widths` vector should state the width (in characters) of the ith column of the data set. ```{r} #| include: false @@ -109,7 +106,7 @@ flowers_fwf_df ### Excel files -The best way to load data from Excel files (.xlsx) into **R** is not to use Excel files. Instead, save these files as .csv or .txt files and then use `read.table`. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for **R** to read the files properly. Plain text files are simpler, so we can load and transfer them more easily. +The best way to load data from Excel files (.xlsx) into **R** is to first save these files as .csv or .txt files and then use `read.table`. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for **R** to read the files properly. Plain text files are simpler, so we can load and transfer them more easily. Still, there are ways to load Excel files into **R** if we *really* need to. **R** has no native way of loading these files, but we can use the package `readxl`, which works on all operating systems. We install it using `install.packages("readxl")`. Then we load it using `library(readxl)`. 
Once we load the package, we can use the function `read_excel()` to load files of the type .xls and .xlsx (see `help("read_excel")` for more information). @@ -117,7 +114,7 @@ Still, there are ways to load Excel files into **R** if we *really* need to. **R As with Excel files, I suggest that you first try to transform files from other programs to plain-text files. This transformation is usually the best way to verify that our data is transcribed properly, and allows us to customize the transformation. -But sometimes we can't transform the file to a plain-text format---maybe because we can't access the program that created the file (e.g. SAS). In these cases, we can resort to one of several libraries: +But sometimes we can't transform the file to a plain-text format---maybe because we can't access the program that created the file (e.g., SAS or SPSS). In these cases, we can resort to one of several libraries: + `haven`, for reading files from SAS, SPSS, and Stata. + `R.matlab` for reading files for versions MAT 4 and MAT 5. @@ -125,7 +122,7 @@ But sometimes we can't transform the file to a plain-text format---maybe because ## Cleaning data -Once we load our data files as data.frames in **R**, we want to make sure that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as "data cleaning". We will practice data cleaning using a "messy" version of the flower data that we loaded above. You can get this messy version from [here](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower_messy.csv). Again, you can use `Ctrl+Shift+s` to download the file. +Once we load our data files as data.frames in **R**, we should verify that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as "data cleaning". 
We will practice data cleaning using a "messy" version of the flower data that we loaded above. You can get this messy version from [here](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower_messy.csv). Again, you can use `Ctrl+Shift+s` to download the file. Since this is a .csv file, we can load it into **R** using: ```{r loading messy flower data} @@ -139,7 +136,7 @@ colnames(flower_messy_df) If we open the data file using something like Excel or Notepad, we can see that the names of columns 6 and 7 have blank spaces inside them. When loading the data, `read.csv()` automatically substitutes these blank spaces with periods `.`, so that the names conform to **R**'s conventions. `read.csv()` is pretty good at checking column names and other things, but it's not perfect. So, it's always a good idea to double-check everything ourselves. -The column names of `flower_messy_df` are legible, but unwieldy. They have a mix of upper and lower-case that we don't want to struggle with. Let's rewrite all the names in lower case, which is quick and easy if we use `tolower()`. +The column names of `flower_messy_df` are legible, but unwieldy. We don't want to struggle with their mix of upper and lower-case letters. Let's rewrite all the names in lower case, which is quick and easy if we use `tolower()`. ```{r colnames to lower case} new_colnames <- tolower(colnames(flower_messy_df)) # Modify column names new_colnames @@ -158,8 +155,7 @@ colnames(flower_clean_df) <- new_colnames # Replace column names in data frame colnames(flower_clean_df) # Check our work ``` - -The column names are almost ready. The last change will be to substitute the periods in the names with underscores. In **R**, this is purely out of personal preference, but it's a good excuse to meet the function `gsub()`, which substitutes patterns of strings: +The last change to these column names will be to substitute the periods in the names with underscores. 
In **R**, this is purely out of personal preference, but it's a good excuse to meet the function `gsub()`, which substitutes patterns of strings: ```{r substitute periods with underscores in colnames} colnames(flower_clean_df) <- gsub( pattern = "\\.", # What we want to remove @@ -171,7 +167,7 @@ colnames(flower_clean_df) Note that I had to use `"\\."` instead of simply `"."` to match the period. The reason is that `gsub()` interprets `"."` as saying "match any character". This may sound silly but it helps when working with [regular expressions](https://en.wikipedia.org/wiki/Regular_expression)---a syntax to find many different, complicated patterns in strings. Regular expressions are too complicated to explain here, but if you expect to work with text data regularly, I encourage you to learn more about them. -With our improved column names it will be easier to focus on giving every column an appropriate format. Numbers should be of type "double" or "integer", and text should be of type "character". Let's check the types of the columns in our current data set. +With our improved column names it will be easier to focus on giving every column an appropriate format: numbers should be of type "double" or "integer", and text should be of type "character" or "factor". Let's check the types of the columns in our current data set. 
```{r check column types} str(flower_clean_df) @@ -188,8 +184,8 @@ head(flower_clean_df[["flowers"]]) flower_clean_df["flowers"] <- gsub( pattern = "\"", # \" the backslash tells R to match quotes replacement = "", # This is how we write "nothing" - x = flower_clean_df$flowers # x needs to be a vector, so use - # double brackets or dollar sign + x = flower_clean_df$flowers # x needs to be a vector, so use + # double brackets or dollar sign ) head(flower_clean_df$flowers) ``` @@ -201,19 +197,19 @@ typeof(flower_clean_df$flowers) head(flower_clean_df$flowers) ``` -Columns "treat" and "nitrogen" are of type character, which is not wrong, but it will be easier to handle them if we convert them to factors. +Columns "treat" and "nitrogen" are of type character. This is not wrong, but it will be easier to handle them if we convert them to factors. ```{r} flower_clean_df["treat"] <- factor(flower_clean_df$treat) flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen) str(flower_clean_df) ``` -The transformation fixed column "flowers", but column "nitrogen" looks suspicious. It is supposed to have only three values ("low", "medium", and "high"), but its description counts eight values. Let's examine them more closely: +Column "flowers" looks fine, but column "nitrogen" looks suspicious. It is supposed to have only three values ("low", "medium", and "high"), but its description counts eight values. Let's examine them more closely: ```{r check levels of nitrogen column} levels(flower_clean_df$nitrogen) ``` -Remember that **R** is case sensitive, so it interprets each spelling as a different value. We can fix this using `tolower()` once more. Note that this will convert the "nitrogen" column back to a simple character type, so we have to reconvert it to factor. +Remember that **R** is case sensitive, so it interprets each spelling of "high", "low", and "medium" as a different value. We can fix this using `tolower()` once more. 
Note that this will convert the "nitrogen" column back to a simple character type, so we have to reconvert it to factor. ```{r nitrogen column to all lowercase} flower_clean_df["nitrogen"] <- tolower(flower_clean_df$nitrogen) flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen) @@ -259,11 +255,11 @@ boxplot( height ~ nitrogen, data = flower_clean_df, col = c("yellow", "blue", "pink"), - main = "No clear pattern between height and nitrogen" + main = "No clear association between height and nitrogen" ) ``` -Now let's say we want to investigate the relationship between shoot area and leaf area. And let's check whether that relationship differs depending on the value of treat. We can use a scatter plot with shoot area and leaf area, and we can color each point by their treat value. +Now let's investigate the relationship between shoot area and leaf area. And let's check whether that relationship changes depending on the value of treat. We can use a scatter plot with shoot area and leaf area, and we can color each point by their treat value. ```{r leaf area vs shoot area by treat} plot( @@ -283,7 +279,7 @@ legend( ) ``` -Now let's say we want to see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13. +Now let's see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13. ```{r mosaic plot for nitrogen vs treat} nitrogen_by_treat_table = xtabs( diff --git a/docs/04_basic_data_processing.html b/docs/04_basic_data_processing.html index 65452a2..34f7aeb 100644 --- a/docs/04_basic_data_processing.html +++ b/docs/04_basic_data_processing.html @@ -210,8 +210,7 @@

4 
getwd()
@@ -220,19 +219,19 @@

4  -

We can move our working directory to any folder on your computer with the function setwd(). Just give setwd() the file path to your new working directory. I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. This way, all files related to my project are in the same place. For example:

+

We can move our working directory to any folder on our computer by writing a new file path inside the function setwd(). I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. This way, every file related to my project is in the same place. For example:

setwd("C:/Users/user_name/workshop_folder/learning_r/code")
-

We can also change your working directory by clicking on Session > Set Working Directory > Choose Directory in the RStudio menu bar. The Windows and Mac graphical user interfaces have similar options. If we start R from a UNIX command line (as on Linux machines), the working directory will be whichever directory we were in when we called R.

+

We can also change our working directory by clicking on Session > Set Working Directory > Choose Directory in the RStudio menu bar. The Windows and Mac graphical user interfaces have similar options. If we start R from a UNIX command line (as on Linux machines), the working directory will be whichever directory we were in when we called R.

list.files() will show us what files are in our working directory. If the file that we want to open is in our working directory, then we are ready to proceed.
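
As a minimal sketch of the steps above, the snippet below moves the working directory to a new folder, lists the files there, and then moves back. The use of `tempdir()` and the file name `example.txt` are assumptions made only so that the code runs on any computer; in practice we would point `setwd()` at our own project folder.

```r
# Remember the current working directory so we can return to it later
old_wd <- getwd()

# tempdir() always exists, so it is a safe stand-in for a project folder
setwd(tempdir())

# Create an empty, hypothetical file so that there is something to list
file.create("example.txt")

# List the files in the (new) working directory
files_here <- list.files()
print(files_here)

# Go back to where we started
setwd(old_wd)
```

Saving the old working directory before calling `setwd()` is a small habit that makes it easy to undo the change.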

4.1 Loading data

Once we know where to find data files in our computer, we can start loading them into R. Note, however, that we need specific ways to open different file formats.

4.1.1 Plain text files

-

Plain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau, the Social Security Administration, etc.) publish their data as plain-text files.

A plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple symbol separates the cells within a row. This symbol is often a comma, but it can also be a tab, a pipe delimiter |, or any other character. Each file only uses one symbol to separate cells, which minimizes confusion.
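
To make the idea of a separator concrete, here is a small sketch that reads a pipe-delimited table. The three rows are made-up values, not the real flower data, and the `text` argument of `read.table()` lets us pass the table as a string so that the example does not depend on any file.

```r
# A tiny, made-up table where the pipe symbol separates the cells
pipe_text <- "treat|nitrogen|height
tip|medium|7.5
notip|low|10.0"

# sep tells R which symbol separates the cells;
# header = TRUE says the first line holds the column names
pipe_df <- read.table(text = pipe_text, header = TRUE, sep = "|")

str(pipe_df)
```

If we told **R** the wrong separator (say, a comma), every line would be read as a single cell, which is usually the first sign that `sep` needs to change.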

+

Plain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau and the Social Security Administration) publish their data as plain-text files.

We will work with data from this1 plain text file. Use Ctrl+Shift+s to download the file. I am going to save it in a folder called “data_files” inside my working directory under the name “flower.csv”. But you can save it wherever you want as long as you can keep track of it.

4.1.1.1 read.table

@@ -240,8 +239,8 @@

flower_df <- read.table("data_files/flower.csv", header = TRUE, sep = ",")
-

In the code above, I added two more arguments, header and sep. header tells R whether the first line of the file contains variable names instead of values. sep tells R the symbol that the file uses to separate the cells.

-

Sometimes a plain-text file starts with text that is not part of the data set. Or, maybe we want to read only part of a data set. Argument skip tells R to skip a specific number of lines before it starts reading in values from the file. Argument nrow tells R to stop reading in values after it has read in a certain number of lines. Keep in mind that the header row doesn’t count towards the total rows allowed by nrow.

+

In the code above, I added arguments header and sep. header tells R whether the first line of the file contains variable names instead of values; this will help us identify the variables in the data frame. sep tells R the symbol that the file uses to separate the cells; this will help us preserve the correct location of the data cells.

+

Sometimes a plain-text file starts with text that is not part of the data set. Or maybe we want to read only part of a data set. Argument skip tells R to skip a specific number of lines before it starts reading values from the file. Argument nrow tells R to only read a certain number of lines, starting from the top. Keep in mind that nrow does not count the header in the number of rows it reads.

flower_df_chunk <- read.table(
     "data_files/flower.csv", 
@@ -273,11 +272,11 @@ 

4.1.1.3 HTML links

-

read.table() and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file.

+

read.table() and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. Make sure to use the web address that links directly to the file, not to a web page that has a link to the file.
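
The sketch below shows what loading from a web address might look like. The "raw" address is an assumption derived from the GitHub page linked earlier and may not be the exact location of the file; `tryCatch()` is there so that the code fails gently if the address is wrong or there is no internet connection.

```r
# Assumed direct address of the file (the GitHub "blob" page is a web page,
# not the file itself, so read.csv() could not parse it)
flower_url <- "https://raw.githubusercontent.com/CSCAR/workshop-r-intro/main/data_files/flower.csv"

# Return NULL instead of stopping if the download fails
flower_web_df <- tryCatch(
    read.csv(flower_url, header = TRUE),
    error = function(e) NULL
)

if (is.null(flower_web_df)) {
    message("Could not load the file; check the address and the connection.")
} else {
    print(head(flower_web_df))
}
```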

4.1.1.4 read.fwf

-

Fixed-width file (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries.

+

A fixed-width file (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries.

If our flowers data came in a fixed-width file, the first few lines would look like this:

treat  nitrogen block  height  weight  leafarea  shootarea  flowers
@@ -289,18 +288,18 @@ 

Fixed-width files may be visually intuitive, but they are difficult to work with. This may explain why R has a function for reading fixed-width files, but not for saving them.

-

We can read fixed-width files into R with the function read.fwf(). This function adds another argument to the ones from read.table(): widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set.

+

We can read fixed-width files into R with the function read.fwf(). This function adds another argument to the ones from read.table(): widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set.
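
Here is a self-contained sketch of how `widths` lines up with the columns. The three rows and the temporary file are made up for illustration; they only mimic the layout of the flower data.

```r
# Made-up fixed-width rows: 6 characters for treat, 8 for nitrogen, 1 for block
fwf_lines <- c(
    "tip   medium  1",
    "tip   high    2",
    "notip low     1"
)

# Save the rows in a temporary file so that read.fwf() has something to read
fwf_path <- tempfile(fileext = ".txt")
writeLines(fwf_lines, fwf_path)

fwf_df <- read.fwf(
    fwf_path,
    widths = c(6, 8, 1),                          # width of each column, in characters
    col.names = c("treat", "nitrogen", "block"),  # passed on to read.table()
    strip.white = TRUE                            # drop the padding spaces
)

fwf_df
```

Counting characters by hand is error-prone, so it is worth printing the result and checking that no value was cut in half.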

4.1.2 Excel files

-

The best way to load data from Excel files (.xlsx) into R is not to use Excel files. Instead, save these files as .csv or .txt files and then use read.table. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for R to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.

+

The best way to load data from Excel files (.xlsx) into R is to first save these files as .csv or .txt files and then use read.table. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for R to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.

Still, there are ways to load Excel files into R if we really need to. R has no native way of loading these files, but we can use the package readxl, which works on all operating systems. We install it using install.packages("readxl"). Then we load it using library(readxl). Once we load the package, we can use the function read_excel() to load files of the type .xls and .xlsx (see help("read_excel") for more information).
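
A hedged sketch of that workflow follows. It assumes that `readxl` is already installed and uses the demo workbook that ships with the package (located with `readxl_example()`), not a file from this workshop.

```r
# Only run the example if readxl is installed
if (requireNamespace("readxl", quietly = TRUE)) {
    # Path to a demo workbook that comes with the package
    demo_path <- readxl::readxl_example("datasets.xlsx")

    # An Excel file can hold several sheets, so list their names first
    print(readxl::excel_sheets(demo_path))

    # Read one sheet, chosen here by its position in the workbook
    demo_df <- readxl::read_excel(demo_path, sheet = 1)
    print(head(demo_df))
} else {
    message("Package readxl is not installed; run install.packages(\"readxl\") first.")
}
```

Note that `read_excel()` returns a tibble, which behaves like a data frame for most purposes.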

4.1.3 Files from other programs

As with Excel files, I suggest that you first try to transform files from other programs to plain-text files. This transformation is usually the best way to verify that our data is transcribed properly, and allows us to customize the transformation.

-

But sometimes we can’t transform the file to a plain-text format—maybe because we can’t access the program that created the file (e.g. SAS). In these cases, we can resort to one of several libraries:

+

But sometimes we can’t transform the file to a plain-text format—maybe because we can’t access the program that created the file (e.g., SAS or SPSS). In these cases, we can resort to one of several libraries:

  • haven, for reading files from SAS, SPSS, and Stata.
  • R.matlab for reading files for versions MAT 4 and MAT 5.
  • @@ -310,7 +309,7 @@

    4.2 Cleaning data

    -

    Once we load our data files as data.frames in R, we want to make sure that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as “data cleaning”. We will practice data cleaning using a “messy” version of the flower data that we loaded above. You can get this messy version from here. Again, you can use Ctrl+Shift+s to download the file.

    +

    Once we load our data files as data.frames in R, we should verify that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as “data cleaning”. We will practice data cleaning using a “messy” version of the flower data that we loaded above. You can get this messy version from here. Again, you can use Ctrl+Shift+s to download the file.

    Since this is a .csv file, we can load it into R using:

    flower_messy_df = read.csv("data_files/flower_messy.csv", header = TRUE)
    @@ -324,7 +323,7 @@

    If we open the data file using something like Excel or Notepad, we can see that the names of columns 6 and 7 have blank spaces inside them. When loading the data, read.csv() automatically substitutes these blank spaces with periods ., so that the names conform to R’s conventions. read.csv() is pretty good at checking column names and other things, but it’s not perfect. So, it’s always a good idea to double-check everything ourselves.

    -

    The column names of flower_messy_df are legible, but unwieldy. They have a mix of upper and lower-case that we don’t want to struggle with. Let’s rewrite all the names in lower case, which is quick and easy if we use tolower().

    +

    The column names of flower_messy_df are legible, but unwieldy. We don’t want to struggle with their mix of upper and lower-case letters. Let’s rewrite all the names in lower case, which is quick and easy if we use tolower().

    new_colnames <- tolower(colnames(flower_messy_df)) # Modify column names
     new_colnames
    @@ -347,7 +346,7 @@

    -

    The column names are almost ready. The last change will be to substitute the periods in the names with underscores. In R, this is purely out of personal preference, but it’s a good excuse to meet the function gsub(), which substitutes patterns of strings:

    +

    The last change to these column names will be to substitute the periods in the names with underscores. In R, this is purely out of personal preference, but it’s a good excuse to meet the function gsub(), which substitutes patterns of strings:

    colnames(flower_clean_df) <- gsub(
         pattern = "\\.", # What we want to remove
    @@ -361,7 +360,7 @@ 

    Note that I had to use "\\." instead of simply "." to match the period. The reason is that gsub() interprets "." as saying “match any character”. This may sound silly but it helps when working with regular expressions—a syntax to find many different, complicated patterns in strings. Regular expressions are too complicated to explain here, but if you expect to work with text data regularly, I encourage you to learn more about them.
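
The difference between the two patterns is easy to see on a single string. The name `leaf.area` below is a made-up example, and `fixed = TRUE` is an alternative that switches off regular expressions altogether.

```r
example_name <- "leaf.area"  # made-up column name with one period

# Unescaped, "." matches every character, so everything is replaced
all_replaced <- gsub(pattern = ".", replacement = "_", x = example_name)
all_replaced   # "_________"

# Escaped, only the literal period is replaced
escaped <- gsub(pattern = "\\.", replacement = "_", x = example_name)
escaped        # "leaf_area"

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
fixed_match <- gsub(pattern = ".", replacement = "_", x = example_name, fixed = TRUE)
fixed_match    # "leaf_area"
```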

    -

    With our improved column names it will be easier to focus on giving every column an appropriate format. Numbers should be of type “double” or “integer”, and text should be of type “character”. Let’s check the types of the columns in our current data set.

    +

    With our improved column names it will be easier to focus on giving every column an appropriate format: numbers should be of type “double” or “integer”, and text should be of type “character” or “factor”. Let’s check the types of the columns in our current data set.

    str(flower_clean_df)
    @@ -388,8 +387,8 @@

    flower_clean_df["flowers"] <- gsub(
         pattern = "\"", # \" the backslash tells R to match quotes
         replacement = "", # This is how we write "nothing"
    -    x = flower_clean_df$flowers # x needs to be a vector, so use 
    -                                     # double brackets or dollar sign
    +    x = flower_clean_df$flowers # x needs to be a vector, so use
    +                                # double brackets or dollar sign
     )
     head(flower_clean_df$flowers)

    @@ -408,7 +407,7 @@

    [1] 1 10 10 1 4 9

    -

    Columns “treat” and “nitrogen” are of type character, which is not wrong, but it will be easier to handle them if we convert them to factors.

    +

    Columns “treat” and “nitrogen” are of type character. This is not wrong, but it will be easier to handle them if we convert them to factors.

    flower_clean_df["treat"] <- factor(flower_clean_df$treat)
     flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen)
    @@ -425,14 +424,14 @@ 

    -

    The transformation fixed column “flowers”, but column “nitrogen” looks suspicious. It is supposed to have only three values (“low”, “medium”, and “high”), but its description counts eight values. Let’s examine them more closely:

    +

    Column “flowers” looks fine, but column “nitrogen” looks suspicious. It is supposed to have only three values (“low”, “medium”, and “high”), but its description counts eight values. Let’s examine them more closely:

    levels(flower_clean_df$nitrogen)
    [1] "high"   "High"   "HIGH"   "low"    "lOw"    "Low"    "medium" "Medium"
    -

    Remember that R is case sensitive, so it interprets each spelling as a different value. We can fix this using tolower() once more. Note that this will convert the “nitrogen” column back to a simple character type, so we have to reconvert it to factor.

    +

    Remember that R is case sensitive, so it interprets each spelling of “high”, “low”, and “medium” as a different value. We can fix this using tolower() once more. Note that this will convert the “nitrogen” column back to a simple character type, so we have to reconvert it to factor.

    flower_clean_df["nitrogen"] <- tolower(flower_clean_df$nitrogen)
     flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen)
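
The round trip from factor to character and back can be checked on a toy factor. This is a minimal sketch; `toy_nitrogen` is a made-up example, and the explicit `levels` argument is an optional addition (not part of the tutorial's code) that puts the levels in a meaningful order instead of the default alphabetical one.

```r
# Hypothetical toy factor with inconsistent capitalization
toy_nitrogen <- factor(c("high", "High", "lOw", "medium", "Medium"))
print(nlevels(toy_nitrogen)) # each spelling counts as its own level

# tolower() returns a plain character vector, not a factor
lowered <- tolower(toy_nitrogen)
print(class(lowered))

# Rebuild the factor; the explicit level order is an assumption for illustration
toy_clean <- factor(lowered, levels = c("low", "medium", "high"))
print(levels(toy_clean))
```

Setting the level order by hand is optional, but it can make later tables and box plots appear in a low, medium, high sequence.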
    @@ -495,13 +494,13 @@ 

         height ~ nitrogen,
         data = flower_clean_df,
         col = c("yellow", "blue", "pink"),
    -    main = "No clear pattern between height and nitrogen"
    +    main = "No clear association between height and nitrogen"
     )

    -

    Now let’s say we want to investigate the relationship between shoot area and leaf area. And let’s check whether that relationship differs depending on the value of treat. We can use a scatter plot with shoot area and leaf area, and we can color each point by their treat value.

    +

    Now let’s investigate the relationship between shoot area and leaf area, and check whether that relationship changes depending on the value of treat. We can use a scatter plot of shoot area against leaf area, and we can color each point by its treat value.

    plot(
         x = flower_clean_df$leaf_area,
    @@ -522,7 +521,7 @@ 

    -

    Now let’s say we want to see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13.

    +

    Now let’s see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13.

    nitrogen_by_treat_table = xtabs(
         formula = ~ nitrogen + treat,
    diff --git a/docs/04_basic_data_processing_files/figure-html/height by nitrogen boxplots-1.png b/docs/04_basic_data_processing_files/figure-html/height by nitrogen boxplots-1.png
    index 3618151..a04ec76 100644
    Binary files a/docs/04_basic_data_processing_files/figure-html/height by nitrogen boxplots-1.png and b/docs/04_basic_data_processing_files/figure-html/height by nitrogen boxplots-1.png differ
    diff --git a/docs/search.json b/docs/search.json
    index 36cc0dc..d196ac6 100644
    --- a/docs/search.json
    +++ b/docs/search.json
    @@ -172,21 +172,21 @@
         "href": "04_basic_data_processing.html#loading-data",
         "title": "4  Basic data processing",
         "section": "4.1 Loading data",
    -    "text": "4.1 Loading data\nOnce we know where to find data files in our computer, we can start loading them into R. Note, however, that we need specific ways to open different file formats.\n\n4.1.1 Plain text files\nPlain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau, the Social Security Administration, etc.) publish their data as plain-text files.\nA plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple symbol separates the cells within a row. This symbol is often a comma, but it can also be a tab, a pipe delimiter |, or any other character. Each file only uses one symbol to separate cells, which minimizes confusion.\nWe will work with data from this1 plain text file. Use Ctrl+Shift+s to download the file. I am going to save it in a folder called “data_files” inside my working directory under the name “flower.csv”. But you can save it wherever you want as long as you can keep track of it.\n\n4.1.1.1 read.table\nread.table() can load plain-text files. The first argument of read.table() is the name of our file (if it is in your working directory), or the file path to our file (if it is not in our working directory).\n\nflower_df <- read.table(\"data_files/flower.csv\", header = TRUE, sep = \",\")\n\nIn the code above, I added two more arguments, header and sep. header tells R whether the first line of the file contains variable names instead of values. sep tells R the symbol that the file uses to separate the cells.\nSometimes a plain-text file starts with text that is not part of the data set. Or, maybe we want to read only part of a data set. Argument skip tells R to skip a specific number of lines before it starts reading in values from the file. Argument nrow tells R to stop reading in values after it has read in a certain number of lines. 
Keep in mind that the header row doesn’t count towards the total rows allowed by nrow.\n\nflower_df_chunk <- read.table(\n    \"data_files/flower.csv\", \n    header = TRUE, \n    sep = \",\", \n    skip = 0, \n    nrow = 3\n)\nflower_df_chunk\n\n  treat nitrogen block height weight leafarea shootarea flowers\n1   tip   medium     1    7.5   7.62     11.7      31.9       1\n2   tip   medium     1   10.7  12.14     14.1      46.0      10\n3   tip   medium     1   11.2  12.76      7.1      66.7      10\n\n\nread.table() has other arguments that we can tweak. You can consult the function’s help page to know more about it.\n\n\n4.1.1.2 Shortcuts for read.table\nR has shortcut functions that call read.table() in the background with different default values for popular types of files:\n\nread.table is the general purpose read function.\nread.csv reads comma-separated values (.csv) files.\nread.delim reads tab-delimited files.\nread.csv2 reads .csv files with European decimal format.\nread.delim2 reads tab-delimited files with European decimal format.\n\n\n\n4.1.1.3 HTML links\nread.table() and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file.\n\n\n4.1.1.4 read.fwf\nFixed-width file (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. 
To correctly position its data, the file adds an arbitrary number of character spaces between data entries.\nIf our flowers data came in a fixed-width file, the first few lines would look like this:\n\ntreat  nitrogen block  height  weight  leafarea  shootarea  flowers\ntip    medium   1      7.5     7.62    11.7      31.9       1\ntip    medium   1      10.7    12.14   14.1      46.0       10\ntip    medium   1      11.2    12.76   7.1       66.7       10\ntip    medium   1      10.4    8.78    11.9      20.3       1\ntip    medium   1      10.4    13.58   14.5      26.9       4\ntip    medium   1      9.8     10.08   12.2      72.7       9\n\nFixed-width files may be visually intuitive, but they are difficult to work with. This may explain why R has a function for reading fixed-width files, but not for saving them.\nWe can read fixed-width files into R with the function read.fwf(). This function adds another argument to the ones from read.table(): widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set.\n\n\n\n4.1.2 Excel files\nThe best way to load data from Excel files (.xlsx) into R is not to use Excel files. Instead, save these files as .csv or .txt files and then use read.table. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for R to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.\nStill, there are ways to load Excel files into R if we really need to. R has no native way of loading these files, but we can use the package readxl, which works on all operating systems. We install it using install.packages(\"readxl\"). Then we load it using library(readxl). 
Once we load the package, we can use the function read_excel() to load files of the type .xls and .xlsx (see help(\"read_excel\") for more information).\n\n\n4.1.3 Files from other programs\nAs with Excel files, I suggest that you first try to transform files from other programs to plain-text files. This transformation is usually the best way to verify that our data is transcribed properly, and allows us to customize the transformation.\nBut sometimes we can’t transform the file to a plain-text format—maybe because we can’t access the program that created the file (e.g. SAS). In these cases, we can resort to one of several libraries:\n\nhaven, for reading files from SAS, SPSS, and Stata.\nR.matlab for reading files for versions MAT 4 and MAT 5.\nforeign for reading minitab and Systat file formats. This library can also read files from SAS, SPSS, and Stata, but I prefer to use haven in these cases."
    +    "text": "4.1 Loading data\nOnce we know where to find data files in our computer, we can start loading them into R. Note, however, that we need specific ways to open different file formats.\n\n4.1.1 Plain text files\nA plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple symbol separates the cells within a row. This symbol is often a comma, but it can also be a tab, a pipe delimiter |, or any other character. Each file only uses one symbol to separate cells, which minimizes confusion.\nPlain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau and the Social Security Administration) publish their data as plain-text files.\nWe will work with data from this1 plain text file. Use Ctrl+Shift+s to download the file. I am going to save it in a folder called “data_files” inside my working directory under the name “flower.csv”. But you can save it wherever you want as long as you can keep track of it.\n\n4.1.1.1 read.table\nread.table() can load plain-text files. The first argument of read.table() is the name of our file (if it is in your working directory), or the file path to our file (if it is not in our working directory).\n\nflower_df <- read.table(\"data_files/flower.csv\", header = TRUE, sep = \",\")\n\nIn the code above, I added arguments header and sep. header tells R whether the first line of the file contains variable names instead of values; this will help us identify the variables in the data frame. sep tells R the symbol that the file uses to separate the cells; this will help us preserve the correct location of the data cells.\nSometimes a plain-text file starts with text that is not part of the data set. Or maybe we want to read only part of a data set. Argument skip tells R to skip a specific number of lines before it starts reading values from the file. 
Argument nrow tells R to only read a certain number of lines, starting from the top. Keep in mind that nrow does not count the header in the number of rows it reads.\n\nflower_df_chunk <- read.table(\n    \"data_files/flower.csv\", \n    header = TRUE, \n    sep = \",\", \n    skip = 0, \n    nrow = 3\n)\nflower_df_chunk\n\n  treat nitrogen block height weight leafarea shootarea flowers\n1   tip   medium     1    7.5   7.62     11.7      31.9       1\n2   tip   medium     1   10.7  12.14     14.1      46.0      10\n3   tip   medium     1   11.2  12.76      7.1      66.7      10\n\n\nread.table() has other arguments that we can tweak. You can consult the function’s help page to know more about it.\n\n\n4.1.1.2 Shortcuts for read.table\nR has shortcut functions that call read.table() in the background with different default values for popular types of files:\n\nread.table is the general purpose read function.\nread.csv reads comma-separated values (.csv) files.\nread.delim reads tab-delimited files.\nread.csv2 reads .csv files with European decimal format.\nread.delim2 reads tab-delimited files with European decimal format.\n\n\n\n4.1.1.3 HTML links\nread.table() and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. Make sure to use the web address that links directly to the file, not to a web page that has a link to the file.\n\n\n4.1.1.4 read.fwf\nA fixed-width file (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. 
To correctly position its data, the file adds an arbitrary number of character spaces between data entries.\nIf our flowers data came in a fixed-width file, the first few lines would look like this:\n\ntreat  nitrogen block  height  weight  leafarea  shootarea  flowers\ntip    medium   1      7.5     7.62    11.7      31.9       1\ntip    medium   1      10.7    12.14   14.1      46.0       10\ntip    medium   1      11.2    12.76   7.1       66.7       10\ntip    medium   1      10.4    8.78    11.9      20.3       1\ntip    medium   1      10.4    13.58   14.5      26.9       4\ntip    medium   1      9.8     10.08   12.2      72.7       9\n\nFixed-width files may be visually intuitive, but they are difficult to work with. This may explain why R has a function for reading fixed-width files, but not for saving them.\nWe can read fixed-width files into R with the function read.fwf(). This function adds another argument to the ones from read.table(): widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set.\n\n\n\n4.1.2 Excel files\nThe best way to load data from Excel files (.xlsx) into R is to first save these files as .csv or .txt files and then use read.table. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for R to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.\nStill, there are ways to load Excel files into R if we really need to. R has no native way of loading these files, but we can use the package readxl, which works on all operating systems. We install it using install.packages(\"readxl\"). Then we load it using library(readxl). 
Once we load the package, we can use the function read_excel() to load files of the type .xls and .xlsx (see help(\"read_excel\") for more information).\n\n\n4.1.3 Files from other programs\nAs with Excel files, I suggest that you first try to transform files from other programs to plain-text files. This transformation is usually the best way to verify that our data is transcribed properly, and allows us to customize the transformation.\nBut sometimes we can’t transform the file to a plain-text format—maybe because we can’t access the program that created the file (e.g., SAS or SPSS). In these cases, we can resort to one of several libraries:\n\nhaven, for reading files from SAS, SPSS, and Stata.\nR.matlab for reading files for versions MAT 4 and MAT 5.\nforeign for reading minitab and Systat file formats. This library can also read files from SAS, SPSS, and Stata, but I prefer to use haven in these cases."
       },
       {
         "objectID": "04_basic_data_processing.html#cleaning-data",
         "href": "04_basic_data_processing.html#cleaning-data",
         "title": "4  Basic data processing",
         "section": "4.2 Cleaning data",
    -    "text": "4.2 Cleaning data\nOnce we load our data files as data.frames in R, we want to make sure that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as “data cleaning”. We will practice data cleaning using a “messy” version of the flower data that we loaded above. You can get this messy version from here. Again, you can use Ctrl+Shift+s to download the file.\nSince this is a .csv file, we can load it into R using:\n\nflower_messy_df = read.csv(\"data_files/flower_messy.csv\", header = TRUE)\n\nFirst, we should ensure the column names to follow the rules we saw in section 1. This will facilitate accessing the data in the columns later. We can check these column names using the colnames() function:\n\ncolnames(flower_messy_df)\n\n[1] \"Treat\"      \"Nitrogen\"   \"block\"      \"Height\"     \"Weight\"    \n[6] \"leaf.area\"  \"shoot.area\" \"FLOWERS\"   \n\n\nIf we open the data file using something like Excel or Notepad, we can see that the names for columns 6 and 7 had blank spaces inside it. When loading the data, read.csv() automatically substitutes these blank spaces with periods ., so that the names conform to R’s conventions. read.csv() is pretty good at checking column names and other things, but it’s not perfect. So, it’s always a good idea to double-check everything ourselves.\nThe column names of flower_messy_df are legible, but unwieldy. They have a mix of upper and lower-case that we don’t want to struggle with. Let’s rewrite all the names in lower case, which is quick and easy if we use tolower().\n\nnew_colnames <- tolower(colnames(flower_messy_df)) # Modify column names\nnew_colnames\n\n[1] \"treat\"      \"nitrogen\"   \"block\"      \"height\"     \"weight\"    \n[6] \"leaf.area\"  \"shoot.area\" \"flowers\"   \n\n\nThese new column names are better, but we still need to change them inside flower_messy_df. 
Before moving on, let’s create a new data set called flower_clean_df.\n\nflower_clean_df <- flower_messy_df\n\nUsing a copy of the original data set makes it easier to track our changes because we can always look at the original version. It also eases backtracking when we make a mistake because we don’t have to reload our original data (which can take a long time with large files).\nNow we can use our improved column names.\n\ncolnames(flower_clean_df) <- new_colnames # Replace column names in data frame\ncolnames(flower_clean_df) # Check our work\n\n[1] \"treat\"      \"nitrogen\"   \"block\"      \"height\"     \"weight\"    \n[6] \"leaf.area\"  \"shoot.area\" \"flowers\"   \n\n\nThe column names are almost ready. The last change will be to substitute the periods in the names with underscores. In R, this is purely out of personal preference, but it’s a good excuse to meet the function gsub(), which substitutes patterns of strings:\n\ncolnames(flower_clean_df) <- gsub(\n    pattern = \"\\\\.\", # What we want to remove\n    replacement = \"_\", # What we want to have instead\n    x = colnames(flower_clean_df) # The object we want to modify\n)\ncolnames(flower_clean_df)\n\n[1] \"treat\"      \"nitrogen\"   \"block\"      \"height\"     \"weight\"    \n[6] \"leaf_area\"  \"shoot_area\" \"flowers\"   \n\n\nNote that I had to use \"\\\\.\" instead of simply \".\" to match the period. The reason is that gsub() interprets \".\" as saying “match any character”. This may sound silly but it helps when working with regular expressions—a syntax to find many different, complicated patterns in strings. Regular expressions are too complicated to explain here, but if you expect to work with text data regularly, I encourage you to learn more about them.\nWith our improved column names it will be easier to focus on giving every column an appropriate format. Numbers should be of type “double” or “integer”, and text should be of type “character”. 
Let’s check the types of the columns in our current data set.\n\nstr(flower_clean_df)\n\n'data.frame':   96 obs. of  8 variables:\n $ treat     : chr  \"tip\" \"tip\" \"tip\" \"tip\" ...\n $ nitrogen  : chr  \"medium\" \"medium\" \"medium\" \"Medium\" ...\n $ block     : int  1 1 1 1 1 1 1 1 2 2 ...\n $ height    : num  7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...\n $ weight    : num  7.62 12.14 12.76 8.78 13.58 ...\n $ leaf_area : num  11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...\n $ shoot_area: num  31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...\n $ flowers   : chr  \"\\\"1\\\"\" \"10\" \"10\" \"1\" ...\n\n\nColumn “flowers” seems to contain numbers but is classified as type “character”. The reason is that there are quotes around the first value in this column:\n\nhead(flower_clean_df[[\"flowers\"]])\n\n[1] \"\\\"1\\\"\" \"10\"    \"10\"    \"1\"     \"4\"     \"9\"    \n\n\nR recognizes that the value itself has quotes, so it adds a backslash \\ to differentiate them from the quotes it uses to print strings. 
We can manually coerce the column “flowers” to be of type double, but first we must remove those confusing quotes.\n\nflower_clean_df[\"flowers\"] <- gsub(\n    pattern = \"\\\"\", # \\\" the backlash tells R to match quotes\n    replacement = \"\", # This is how we write \"nothing\"\n    x = flower_clean_df$flowers # x needs to be a vector, so use \n                                     # double brackets or dollar sign\n)\nhead(flower_clean_df$flowers)\n\n[1] \"1\"  \"10\" \"10\" \"1\"  \"4\"  \"9\" \n\n\nNow we can transform the column to be of type “double”.\n\nflower_clean_df[\"flowers\"] <- as.numeric(flower_clean_df$flowers)\ntypeof(flower_clean_df$flowers)\n\n[1] \"double\"\n\nhead(flower_clean_df$flowers)\n\n[1]  1 10 10  1  4  9\n\n\nColumns “treat” and “nitrogen” are of type character, which is not wrong, but it will be easier to handle them if we convert them to factors.\n\nflower_clean_df[\"treat\"] <- factor(flower_clean_df$treat)\nflower_clean_df[\"nitrogen\"] <- factor(flower_clean_df$nitrogen)\nstr(flower_clean_df)\n\n'data.frame':   96 obs. of  8 variables:\n $ treat     : Factor w/ 2 levels \"notip\",\"tip\": 2 2 2 2 2 2 2 2 2 2 ...\n $ nitrogen  : Factor w/ 8 levels \"high\",\"High\",..: 7 7 7 8 7 7 8 7 7 7 ...\n $ block     : int  1 1 1 1 1 1 1 1 2 2 ...\n $ height    : num  7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...\n $ weight    : num  7.62 12.14 12.76 8.78 13.58 ...\n $ leaf_area : num  11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...\n $ shoot_area: num  31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...\n $ flowers   : num  1 10 10 1 4 9 7 6 5 8 ...\n\n\nThe transformation fixed column “flowers”, but column “nitrogen” looks suspicious. It is supposed to have only three values (“low”, “medium”, and “high”), but its description counts eight values. 
Let’s examine them more closely:\n\nlevels(flower_clean_df$nitrogen)\n\n[1] \"high\"   \"High\"   \"HIGH\"   \"low\"    \"lOw\"    \"Low\"    \"medium\" \"Medium\"\n\n\nRemember that R is case sensitive, so it interprets each spelling as a different value. We can fix this using tolower() once more. Note that this will convert the “nitrogen” column back to a simple character type, so we have to reconvert it to factor.\n\nflower_clean_df[\"nitrogen\"] <- tolower(flower_clean_df$nitrogen)\nflower_clean_df[\"nitrogen\"] <- factor(flower_clean_df$nitrogen)\nlevels(flower_clean_df$nitrogen)\n\n[1] \"high\"   \"low\"    \"medium\"\n\n\nUnless I have a good reason not to, I usually transform all character columns to have only lower case letters."
    +    "text": "4.2 Cleaning data\nOnce we load our data files as data.frames in R, we should verify that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as “data cleaning”. We will practice data cleaning using a “messy” version of the flower data that we loaded above. You can get this messy version from here. Again, you can use Ctrl+Shift+s to download the file.\nSince this is a .csv file, we can load it into R using:\n\nflower_messy_df = read.csv(\"data_files/flower_messy.csv\", header = TRUE)\n\nFirst, we should ensure the column names to follow the rules we saw in section 1. This will facilitate accessing the data in the columns later. We can check these column names using the colnames() function:\n\ncolnames(flower_messy_df)\n\n[1] \"Treat\"      \"Nitrogen\"   \"block\"      \"Height\"     \"Weight\"    \n[6] \"leaf.area\"  \"shoot.area\" \"FLOWERS\"   \n\n\nIf we open the data file using something like Excel or Notepad, we can see that the names for columns 6 and 7 had blank spaces inside it. When loading the data, read.csv() automatically substitutes these blank spaces with periods ., so that the names conform to R’s conventions. read.csv() is pretty good at checking column names and other things, but it’s not perfect. So, it’s always a good idea to double-check everything ourselves.\nThe column names of flower_messy_df are legible, but unwieldy. We don’t want to struggle with their mix of upper and lower-case letters. Let’s rewrite all the names in lower case, which is quick and easy if we use tolower().\n\nnew_colnames <- tolower(colnames(flower_messy_df)) # Modify column names\nnew_colnames\n\n[1] \"treat\"      \"nitrogen\"   \"block\"      \"height\"     \"weight\"    \n[6] \"leaf.area\"  \"shoot.area\" \"flowers\"   \n\n\nThese new column names are better, but we still need to change them inside flower_messy_df. 
Before moving on, let’s create a new data set called flower_clean_df.\n\nflower_clean_df <- flower_messy_df\n\nUsing a copy of the original data set makes it easier to track our changes because we can always look at the original version. It also eases backtracking when we make a mistake because we don’t have to reload our original data (which can take a long time with large files).\nNow we can use our improved column names.\n\ncolnames(flower_clean_df) <- new_colnames # Replace column names in data frame\ncolnames(flower_clean_df) # Check our work\n\n[1] \"treat\"      \"nitrogen\"   \"block\"      \"height\"     \"weight\"    \n[6] \"leaf.area\"  \"shoot.area\" \"flowers\"   \n\n\nThe last change to these column names will be to substitute the periods in the names with underscores. In R, this is purely out of personal preference, but it’s a good excuse to meet the function gsub(), which substitutes patterns of strings:\n\ncolnames(flower_clean_df) <- gsub(\n    pattern = \"\\\\.\", # What we want to remove\n    replacement = \"_\", # What we want to have instead\n    x = colnames(flower_clean_df) # The object we want to modify\n)\ncolnames(flower_clean_df)\n\n[1] \"treat\"      \"nitrogen\"   \"block\"      \"height\"     \"weight\"    \n[6] \"leaf_area\"  \"shoot_area\" \"flowers\"   \n\n\nNote that I had to use \"\\\\.\" instead of simply \".\" to match the period. The reason is that gsub() interprets \".\" as saying “match any character”. This may sound silly but it helps when working with regular expressions—a syntax to find many different, complicated patterns in strings. Regular expressions are too complicated to explain here, but if you expect to work with text data regularly, I encourage you to learn more about them.\nWith our improved column names it will be easier to focus on giving every column an appropriate format: numbers should be of type “double” or “integer”, and text should be of type “character” of “factor”. 
Let’s check the types of the columns in our current data set.\n\nstr(flower_clean_df)\n\n'data.frame':   96 obs. of  8 variables:\n $ treat     : chr  \"tip\" \"tip\" \"tip\" \"tip\" ...\n $ nitrogen  : chr  \"medium\" \"medium\" \"medium\" \"Medium\" ...\n $ block     : int  1 1 1 1 1 1 1 1 2 2 ...\n $ height    : num  7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...\n $ weight    : num  7.62 12.14 12.76 8.78 13.58 ...\n $ leaf_area : num  11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...\n $ shoot_area: num  31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...\n $ flowers   : chr  \"\\\"1\\\"\" \"10\" \"10\" \"1\" ...\n\n\nColumn “flowers” seems to contain numbers but is classified as type “character”. The reason is that there are quotes around the first value in this column:\n\nhead(flower_clean_df[[\"flowers\"]])\n\n[1] \"\\\"1\\\"\" \"10\"    \"10\"    \"1\"     \"4\"     \"9\"    \n\n\nR recognizes that the value itself has quotes, so it adds a backslash \\ to differentiate them from the quotes it uses to print strings. We can manually coerce the column “flowers” to be of type double, but first we must remove those confusing quotes.\n\nflower_clean_df[\"flowers\"] <- gsub(\n    pattern = \"\\\"\", # \\\" the backlash tells R to match quotes\n    replacement = \"\", # This is how we write \"nothing\"\n    x = flower_clean_df$flowers # x needs to be a vector, so use\n                                # double brackets or dollar sign\n)\nhead(flower_clean_df$flowers)\n\n[1] \"1\"  \"10\" \"10\" \"1\"  \"4\"  \"9\" \n\n\nNow we can transform the column to be of type “double”.\n\nflower_clean_df[\"flowers\"] <- as.numeric(flower_clean_df$flowers)\ntypeof(flower_clean_df$flowers)\n\n[1] \"double\"\n\nhead(flower_clean_df$flowers)\n\n[1]  1 10 10  1  4  9\n\n\nColumns “treat” and “nitrogen” are of type character. 
This is not wrong, but it will be easier to handle them if we convert them to factors.\n\nflower_clean_df[\"treat\"] <- factor(flower_clean_df$treat)\nflower_clean_df[\"nitrogen\"] <- factor(flower_clean_df$nitrogen)\nstr(flower_clean_df)\n\n'data.frame':   96 obs. of  8 variables:\n $ treat     : Factor w/ 2 levels \"notip\",\"tip\": 2 2 2 2 2 2 2 2 2 2 ...\n $ nitrogen  : Factor w/ 8 levels \"high\",\"High\",..: 7 7 7 8 7 7 8 7 7 7 ...\n $ block     : int  1 1 1 1 1 1 1 1 2 2 ...\n $ height    : num  7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...\n $ weight    : num  7.62 12.14 12.76 8.78 13.58 ...\n $ leaf_area : num  11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...\n $ shoot_area: num  31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...\n $ flowers   : num  1 10 10 1 4 9 7 6 5 8 ...\n\n\nColumn “flowers” looks fine, but column “nitrogen” looks suspicious. It is supposed to have only three values (“low”, “medium”, and “high”), but its description counts eight values. Let’s examine them more closely:\n\nlevels(flower_clean_df$nitrogen)\n\n[1] \"high\"   \"High\"   \"HIGH\"   \"low\"    \"lOw\"    \"Low\"    \"medium\" \"Medium\"\n\n\nRemember that R is case sensitive, so it interprets each of spelling “high” and “low” as a different value. We can fix this using tolower() once more. Note that this will convert the “nitrogen” column back to a simple character type, so we have to reconvert it to factor.\n\nflower_clean_df[\"nitrogen\"] <- tolower(flower_clean_df$nitrogen)\nflower_clean_df[\"nitrogen\"] <- factor(flower_clean_df$nitrogen)\nlevels(flower_clean_df$nitrogen)\n\n[1] \"high\"   \"low\"    \"medium\"\n\n\nUnless I have a good reason not to, I usually transform all character columns to have only lower case letters."
       },
       {
         "objectID": "04_basic_data_processing.html#data-summaries-and-visualizations",
         "href": "04_basic_data_processing.html#data-summaries-and-visualizations",
         "title": "4  Basic data processing",
         "section": "4.3 Data summaries and visualizations",
    -    "text": "4.3 Data summaries and visualizations\nNow that our data is clean, we can get more complete summaries to understand it better. Function summary() recognizes the type of each column and displays an intuitively appropriate summary:\n\nsummary(flower_clean_df)\n\n   treat      nitrogen      block         height           weight      \n notip:48   high  :32   Min.   :1.0   Min.   : 1.200   Min.   : 5.790  \n tip  :48   low   :32   1st Qu.:1.0   1st Qu.: 4.475   1st Qu.: 9.027  \n            medium:32   Median :1.5   Median : 6.450   Median :11.395  \n                        Mean   :1.5   Mean   : 6.840   Mean   :12.155  \n                        3rd Qu.:2.0   3rd Qu.: 9.025   3rd Qu.:14.537  \n                        Max.   :2.0   Max.   :17.200   Max.   :23.890  \n   leaf_area       shoot_area        flowers      \n Min.   : 5.80   Min.   :  5.80   Min.   : 1.000  \n 1st Qu.:11.07   1st Qu.: 39.05   1st Qu.: 4.000  \n Median :13.45   Median : 70.05   Median : 6.000  \n Mean   :14.05   Mean   : 79.78   Mean   : 7.062  \n 3rd Qu.:16.45   3rd Qu.:113.28   3rd Qu.: 9.000  \n Max.   :49.20   Max.   :189.60   Max.   :17.000  \n\n\nNow let’s imagine we want to study the distribution of values for weight. We can use a histogram to check the shape.\n\nhist(\n    flower_clean_df$weight, \n    breaks = 15,\n    xlab = \"Weight\",\n    main = \"Histogram for weight\"\n)\n\n\n\n\nOr we can get a simpler description using a box plot.\n\nboxplot(\n    flower_clean_df$weight, \n    xlab = \"height\", \n    col = \"darkgreen\",\n    main = \"Boxplot for weight\"\n)\n\n\n\n\nA single box plot has less information than a histogram. But it is easier to compare box plots to look for “big” differences between distributions. 
Let’s compare the distributions of height by nitrogen level:\n\nboxplot(\n    height ~ nitrogen,\n    data = flower_clean_df, \n    col = c(\"yellow\", \"blue\", \"pink\"),\n    main = \"No clear pattern between height and nitrogen\"\n)\n\n\n\n\nNow let’s say we want to investigate the relationship between shoot area and leaf area. And let’s check whether that relationship differs depending on the value of treat. We can use a scatter plot with shoot area and leaf area, and we can color each point by their treat value.\n\nplot(\n    x = flower_clean_df$leaf_area,\n    y = flower_clean_df$shoot_area, \n    col = flower_clean_df$treat,\n    main = \"Shoot area seems proportional to leaf area\",\n    xlab = \"Leaf area\",\n    ylab = \"Shoot area\"\n)\n# Add a legend to the plot\nlegend(\n    x = \"bottomright\", \n    legend = levels(flower_clean_df$treat), \n    col = 1:length(levels(flower_clean_df$treat)), \n    pch = 16\n)\n\n\n\n\nNow let’s say we want to see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13.\n\nnitrogen_by_treat_table = xtabs(\n    formula = ~ nitrogen + treat,\n    data = flower_clean_df[which(flower_clean_df$leaf_area > 13),]\n)\nnitrogen_by_treat_table\n\n        treat\nnitrogen notip tip\n  high      14  12\n  low        7   3\n  medium    10   7\n\nmosaicplot(nitrogen_by_treat_table, main = \"Nitrogen by treat table\")"
    +    "text": "4.3 Data summaries and visualizations\nNow that our data is clean, we can get more complete summaries to understand it better. The summary() function recognizes the type of each column and displays an appropriate summary for it:\n\nsummary(flower_clean_df)\n\n   treat      nitrogen      block         height           weight      \n notip:48   high  :32   Min.   :1.0   Min.   : 1.200   Min.   : 5.790  \n tip  :48   low   :32   1st Qu.:1.0   1st Qu.: 4.475   1st Qu.: 9.027  \n            medium:32   Median :1.5   Median : 6.450   Median :11.395  \n                        Mean   :1.5   Mean   : 6.840   Mean   :12.155  \n                        3rd Qu.:2.0   3rd Qu.: 9.025   3rd Qu.:14.537  \n                        Max.   :2.0   Max.   :17.200   Max.   :23.890  \n   leaf_area       shoot_area        flowers      \n Min.   : 5.80   Min.   :  5.80   Min.   : 1.000  \n 1st Qu.:11.07   1st Qu.: 39.05   1st Qu.: 4.000  \n Median :13.45   Median : 70.05   Median : 6.000  \n Mean   :14.05   Mean   : 79.78   Mean   : 7.062  \n 3rd Qu.:16.45   3rd Qu.:113.28   3rd Qu.: 9.000  \n Max.   :49.20   Max.   :189.60   Max.   :17.000  \n\n\nNow let’s imagine we want to study the distribution of values for weight. We can use a histogram to check the shape.\n\nhist(\n    flower_clean_df$weight, \n    breaks = 15,\n    xlab = \"Weight\",\n    main = \"Histogram for weight\"\n)\n\n\n\n\nOr we can get a simpler description using a box plot.\n\nboxplot(\n    flower_clean_df$weight, \n    xlab = \"Weight\", \n    col = \"darkgreen\",\n    main = \"Boxplot for weight\"\n)\n\n\n\n\nA single box plot has less information than a histogram. But it is easier to compare box plots to look for “big” differences between distributions. 
Let’s compare the distributions of height by nitrogen level:\n\nboxplot(\n    height ~ nitrogen,\n    data = flower_clean_df, \n    col = c(\"yellow\", \"blue\", \"pink\"),\n    main = \"No clear association between height and nitrogen\"\n)\n\n\n\n\nNow let’s investigate the relationship between shoot area and leaf area, and check whether that relationship changes depending on the value of treat. We can use a scatter plot of shoot area against leaf area and color each point by its treat value.\n\nplot(\n    x = flower_clean_df$leaf_area,\n    y = flower_clean_df$shoot_area, \n    col = flower_clean_df$treat,\n    main = \"Shoot area seems proportional to leaf area\",\n    xlab = \"Leaf area\",\n    ylab = \"Shoot area\"\n)\n# Add a legend to the plot\nlegend(\n    x = \"bottomright\", \n    legend = levels(flower_clean_df$treat), \n    col = 1:length(levels(flower_clean_df$treat)), \n    pch = 16\n)\n\n\n\n\nNow let’s see how often each combination of nitrogen and treat values occurs, but only for flowers with a leaf area greater than 13.\n\nnitrogen_by_treat_table = xtabs(\n    formula = ~ nitrogen + treat,\n    data = flower_clean_df[which(flower_clean_df$leaf_area > 13),]\n)\nnitrogen_by_treat_table\n\n        treat\nnitrogen notip tip\n  high      14  12\n  low        7   3\n  medium    10   7\n\nmosaicplot(nitrogen_by_treat_table, main = \"Nitrogen by treat table\")"
       },
       {
         "objectID": "04_basic_data_processing.html#success",