diff --git a/04_basic_data_processing.qmd b/04_basic_data_processing.qmd
index bbcba3d..bd724b8 100644
--- a/04_basic_data_processing.qmd
+++ b/04_basic_data_processing.qmd
@@ -2,13 +2,13 @@
Now that we understand how **R** handles data, we can start working with pre-existing data files. These files need to be correctly formatted and in a file format that **R** can recognize. Don't worry, there are plenty of options.
-The first step when loading data in **R** is to locate our working directory. This is the default location where **R** will look for files we want to load and where it will put any files we save. Working directory will vary on different computers. To determine which directory **R** is using as your working directory, run:
+The first step when loading data in **R** is to locate our working directory. This is the default location where **R** will look for files we want to load and where it will put any files we save. The working directory will differ from one computer to another. To determine which directory **R** is using as your working directory, run:
```{r get working directory}
getwd()
```
-You can move your working directory to any folder on your computer with the function `setwd()`. Just give `setwd()` the [file path](https://www.codecademy.com/resources/docs/general/file-paths) to your new working directory. I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. This way, all related data, scripts, graphs, and reports are in the same place. For example:
+You can move your working directory to any folder on your computer with the function `setwd()`. Just give `setwd()` the [file path](https://www.codecademy.com/resources/docs/general/file-paths) to your new working directory. I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. This way, all files related to my project are in the same place. For example:
```{r}
#| eval: false
@@ -29,7 +29,7 @@ Plain-text files are simple and many programs can read them. This is why many or
A plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple symbol separates the cells within a row. This symbol is often a comma, but it can also be a tab, a pipe delimiter `|`, or any other character. Each file only uses one symbol to separate cells, which minimizes confusion.
-We will work with data from [this](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower.csv)^[You can find the original file [here](https://alexd106.github.io/intro2R/data.html) courtesy of Douglas et al. (see references).] plain text file. Use `Ctrl+Shift+s` to download the file. Then save it in your working directory with the name "flower".
+We will work with data from [this](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower.csv)^[You can find the original file [here](https://alexd106.github.io/intro2R/data.html) courtesy of Douglas et al. (see references).] plain text file. Use `Ctrl+Shift+s` to download the file. I am going to save it in a folder called "data_files" inside my working directory under the name "flower.csv". But you can save it wherever you want as long as you can keep track of it.
#### read.table
@@ -65,9 +65,13 @@ flower_df_chunk
+ `read.csv2` reads .csv files with European decimal format.
+ `read.delim2` reads tab-delimited files with European decimal format.
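The European-format shortcuts are easier to appreciate with a small example. The sketch below uses a few made-up values (not the flower data) and the `text` argument, which lets `read.csv2()` read from a string instead of a file:

```r
# Made-up data in European format: semicolons separate cells,
# and commas mark the decimals
european_text <- "height;weight\n7,5;7,62\n10,7;12,14"

# read.csv2() assumes sep = ";" and dec = ","
european_df <- read.csv2(text = european_text)

european_df$height   # Stored as numbers: 7.5 and 10.7
str(european_df)
```

`read.delim2()` works the same way, but it expects tabs between cells.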
+#### HTML links
+
+`read.table()` and its shortcuts allow us to load data files directly from a website. Instead of using the file's path or name, we can directly use a web address in the `file` argument of the function. Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file.
+
#### read.fwf
-There is a type of plain-text file called *fixed-width file* (.fwf). Instead of a symbol, a fixed-width file uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries.
+A *fixed-width file* (.fwf) is a type of plain-text file that uses its layout, instead of a symbol, to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries.
If our flowers data came in a fixed-width file, the first few lines would look like this:
```{flowers as a fwf}
@@ -96,41 +100,194 @@ flowers_fwf_df <- read.fwf(
flowers_fwf_df
```
-#### HTML links
+### Excel files
+
+The best way to load data from Excel files (.xlsx) into **R** is not to use Excel files. Instead, save these files as .csv or .txt files and then use `read.table`. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for **R** to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.
+
+Still, there are ways to load Excel files into **R** if we *really* need to. **R** has no native way of loading these files, but we can use the package `readxl`, which works on all operating systems. You install it using `install.packages("readxl")`. Then load it using `library(readxl)`. Once you load the package, you can use the function `read_excel()` to load files of the type .xls and .xlsx (see `help("read_excel")` for more information).
+
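As a rough sketch of those steps, the code below reads one of the example workbooks that ship with `readxl`, so it doesn't depend on any file of ours. I am assuming the package is installed; the `requireNamespace()` check simply skips the example if it is not.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with readxl
  example_path <- readxl::readxl_example("datasets.xlsx")

  # An Excel file can hold several sheets, so list them first
  sheet_names <- readxl::excel_sheets(example_path)
  print(sheet_names)

  # Read the first sheet; the sheet argument also accepts a sheet name
  first_sheet_df <- readxl::read_excel(example_path, sheet = 1)
  print(head(first_sheet_df))
}
```

Note that `read_excel()` returns a tibble, a variant of the data.frame. If you prefer a plain data.frame, wrap the result in `as.data.frame()`.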
+### Files from other programs
-`read.table` and its shortcuts allow us to load data files directly from a website. Instead of using the file's path or name, we can directly use a web address in the `file` argument of the function. Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file.
+As with Excel files, I suggest that you first try to transform files from other programs to plain-text files. This transformation is usually the best way to verify that your data is transcribed properly, and allows us to customize the transformation.
+
+But sometimes we can't transform the file to a plain-text format---maybe because we can't access the program that created the file (e.g. SAS). In these cases, we can resort to one of several libraries:
+
++ `haven`, for reading files from SAS, SPSS, and Stata.
++ `R.matlab`, for reading MATLAB files in the MAT version 4 and version 5 formats.
++ `foreign`, for reading Minitab and Systat file formats. This library can also read files from SAS, SPSS, and Stata, but I prefer to use `haven` in these cases.
## Cleaning data
-First we want to make sure that the column names follow the rules we saw in section 1. This will facilitate working with different columns later.
+Once we load our data files as data.frames in **R**, we want to make sure that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as "data cleaning". To practice this cleaning, we will use a "messy" version of the flower data that we loaded above. You can get this messy version from here. Again, you can use `Ctrl+Shift+s` to download the file.
+
+Since this is a .csv file, we can load it using:
+```{r loading messy flower data}
+flower_messy_df <- read.csv("data_files/flower_messy.csv", header = TRUE)
+```
+
+First, we should ensure that the column names follow the rules we saw in section 1. This will facilitate working with different columns later. We can check these column names using the `colnames()` function:
+```{r check colnames}
+colnames(flower_messy_df)
+```
+
+If we open the data file using something like Excel or Notepad, we can see that the names for columns 6 and 7 have blank spaces inside them. When loading the data, `read.csv()` automatically substitutes these blank spaces with periods `.`, so that the names conform to **R**'s conventions. `read.csv()` checks for other things too, and it often does a pretty good job by itself. But it's not perfect, so it's always a good idea to double-check everything ourselves.
+
+The column names of `flower_messy_df` look fine, but they can be better. Note that some names have capital letters, while others only have lower-case letters. Remembering the exact mix of upper and lower case letters is a drag, so why don't we make them all lower case? A fast way to do this is to use the `tolower()` function, which changes all characters in a vector of strings to lower case:
+```{r colnames to lower case}
+new_colnames <- tolower(colnames(flower_messy_df)) # Modify column names
+new_colnames
+```
+
+These new column names are better, but we have not changed the column names inside `flower_messy_df`. Before moving on, let's create a new data set called `flower_clean_df`. Using a copy of the original data set makes it easier to track our changes because we can always look at the original version. It also eases backtracking when we make a mistake because we don't have to reload our original data---a lengthy process with large files. Creating a copy is easy:
+```{r create flower_clean_df}
+flower_clean_df <- flower_messy_df
+```
-Now we want to ensure that every column has the right format. Let's check the types of the columns in our current data set **R**.
+Now we can use our improved column names.
```{r}
-# str(flowers_df_clean)
+colnames(flower_clean_df) <- new_colnames # Replace column names in data frame
+colnames(flower_clean_df) # Check our work
+```
+
+
+The column names are almost ready. The last change will be to substitute the periods in the names with underscores. In **R**, this is purely out of personal preference. However, the change is a good excuse to get acquainted with function `gsub()`, which substitutes patterns of strings:
+```{r substitute periods with underscores in colnames}
+colnames(flower_clean_df) <- gsub(
+ pattern = "\\.", # What we want to substitute
+ replacement = "_", # What we want to have instead
+ x = colnames(flower_clean_df) # The object we want to modify
+)
+colnames(flower_clean_df)
```
-Columns INSERT NAMES HERE are of type character, which is not wrong, but it will be easier to handle them if we convert them to factors.
+Note that I had to use `"\\."` instead of simply `"."` to match the period. The reason is that `gsub()` interprets `"."` as saying "match any character". This may sound silly but it helps when using a [regular expression](https://en.wikipedia.org/wiki/Regular_expression)---a syntax to find many different, complicated patterns in strings. Regular expressions are too complicated to discuss here, but if you expect to work with text data regularly, I encourage you to learn more about them.
+
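To see the difference, here is a small sketch with two made-up column names. The last call uses the `fixed = TRUE` argument of `gsub()`, which turns off regular expressions so that the pattern is matched literally:

```r
example_names <- c("leaf.area", "shoot.area")  # Made-up names

# Unescaped: "." matches any character, so every character is replaced
unescaped <- gsub(pattern = ".", replacement = "_", x = example_names)
unescaped

# Escaped: only the literal periods are replaced
escaped <- gsub(pattern = "\\.", replacement = "_", x = example_names)
escaped   # "leaf_area" "shoot_area"

# fixed = TRUE: the pattern is plain text, so no escaping is needed
literal <- gsub(pattern = ".", replacement = "_", x = example_names, fixed = TRUE)
literal   # "leaf_area" "shoot_area"
```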
+Now we want to ensure that every column has the right format. Numbers should be of type "double" or "integer", and text should be of type "character". Let's check the types of the columns in our current data set.
+
+```{r check column types}
+str(flower_clean_df)
+```
+
+Column "flowers" seems to contain numbers but is classified as type "character". The reason is that there are quotes around the first value in this column:
```{r}
+head(flower_clean_df[["flowers"]])
+```
+
+**R** recognizes that the value itself has quotes, so it adds a backslash `\` to the quotes to differentiate them from the quotes it uses to print strings. Before we can manually coerce the column "flowers" to be of type double, we have to eliminate those confusing quotes.
+```{r eliminate quotes from flowers column}
+flower_clean_df["flowers"] <- gsub(
+ pattern = "\"", # The backslash escapes the quote so we can match it
+ replacement = "", # This is how we write "nothing"
+ x = flower_clean_df[["flowers"]] # x needs to be a vector, so use
+ # double brackets or dollar sign
+)
+head(flower_clean_df[["flowers"]])
```
-Next, let's substitute the "missing" values with
+Now we can transform the column to be of type "double".
+```{r coerce flowers column into double}
+flower_clean_df["flowers"] <- as.numeric(flower_clean_df[["flowers"]])
+typeof(flower_clean_df[["flowers"]])
+head(flower_clean_df[["flowers"]])
+```
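A word of caution: `as.numeric()` turns any value it cannot read as a number into `NA` and only gives a warning. The sketch below, with made-up values, shows one way to list the problematic values before converting a column:

```r
# Made-up values: one still has stray quotes, one is not a number
raw_values <- c("1", "10", "\"4\"", "nine")

# Values that fail to convert become NA (we silence the warning here)
converted <- suppressWarnings(as.numeric(raw_values))

# Show the original values that would be lost
problem_values <- raw_values[is.na(converted)]
problem_values
```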
-Notice that column INSERT NAME HERE is of type character, but it has numbers there. In this case, the reason is that one value in INSERT NAME HERE has quotation marks, so **R** coerces the entire column to be of type character. We can fix this by doing:
+Columns "treat" and "nitrogen" are of type character, which is not wrong, but it will be easier to handle them if we convert them to factors.
```{r}
+flower_clean_df["treat"] <- factor(flower_clean_df[["treat"]])
+flower_clean_df["nitrogen"] <- factor(flower_clean_df[["nitrogen"]])
+str(flower_clean_df)
+```
+The transformation worked, but column "nitrogen" looks suspicious. It is supposed to have only three values ("low", "medium", and "high"), but its description says it has eight levels. Let's examine them more closely:
+```{r check levels of nitrogen column}
+levels(flower_clean_df$nitrogen)
```
+Remember **R** is case sensitive, so it interprets each spelling as a different value. We can fix this using our friend `tolower()` once more. Note that this will convert the "nitrogen" column back to a simple character type, so we have to reconvert it to factor.
+```{r nitrogen column to all lowercase}
+flower_clean_df["nitrogen"] <- tolower(flower_clean_df$nitrogen)
+flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen)
+levels(flower_clean_df$nitrogen)
+```
+
+Unless I have a good reason not to, I usually transform all character columns to have only lower case letters.
+
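One possible way to do this for every character column at once is sketched below. It uses a small made-up data frame rather than the flower data, and the names `example_df` and `is_char` are only for illustration:

```r
example_df <- data.frame(
  treat = c("Tip", "NOTIP"),
  nitrogen = c("Medium", "LOW"),
  height = c(7.5, 10.7),
  stringsAsFactors = FALSE
)

# Flag the columns of type character
is_char <- vapply(example_df, is.character, logical(1))

# Apply tolower() to those columns only
example_df[is_char] <- lapply(example_df[is_char], tolower)
example_df
```

Columns that are already factors are skipped by this check, so run it before converting text columns to factors.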
## Data summaries and visualizations
Now that our data is clean, we can get more complete summaries to understand what is going on. Function `summary()` recognizes the type of each column and displays an intuitively appropriate summary:
-```{r}
-# summary(flower_df_clean)
+```{r summary of flower_clean_df}
+summary(flower_clean_df)
+```
+
+Now let's imagine we want to study the distribution of values for weight. We can use a histogram to check the shape.
+
+```{r histogram for weight}
+hist(
+ flower_clean_df$weight,
+ breaks = 15,
+ xlab = "Weight",
+ main = "Histogram for weight"
+)
+```
+
+Or we can get a simpler description using a box plot:
+```{r boxplot for weight}
+boxplot(
+ flower_clean_df$weight, xlab = "Weight",
+ col = "darkgreen",
+ main = "Boxplot for weight"
+)
+```
+
+
+A single box plot has little information compared to a histogram. But box plots make it easier to look for "big" differences in the distribution of values. Let's compare the distributions of height by nitrogen level:
+
+```{r height by nitrogen boxplots}
+boxplot(
+ height ~ nitrogen,
+ data = flower_clean_df,
+ col = c("yellow", "blue", "pink"),
+ main = "No clear pattern between height and nitrogen"
+)
+```
+
+Now let's say we want to investigate the relationship between shoot area and leaf area. And let's see whether that relationship differs depending on the value of treat. We can use a scatter plot with shoot area and leaf area, and we can color each point by its treat value.
+
+```{r leaf area vs shoot area by treat}
+plot(
+ x = flower_clean_df$leaf_area,
+ y = flower_clean_df$shoot_area,
+ col = flower_clean_df$treat,
+ main = "Shoot area seems proportional to leaf area",
+ xlab = "Leaf area",
+ ylab = "Shoot area"
+)
+# Add a legend to the plot
+legend(
+ x = "bottomright",
+ legend = levels(flower_clean_df$treat),
+ col = 1:length(levels(flower_clean_df$treat)),
+ pch = 16
+)
+```
+
+Now let's say we want to see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13.
+
+```{r mosaic plot for nitrogen vs treat}
+nitrogen_by_treat_table <- xtabs(
+ formula = ~ nitrogen + treat,
+ data = flower_clean_df[which(flower_clean_df$leaf_area > 13),]
+)
+nitrogen_by_treat_table
+mosaicplot(nitrogen_by_treat_table, main = "Nitrogen by treat table")
```
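The call to `which()` in the code above is a safeguard against missing values. If a column has any `NA`, the bare condition returns `NA` for them, and **R** adds rows full of `NA` to the subset. A small sketch with made-up data:

```r
example_df <- data.frame(
  leaf_area = c(15.2, NA, 9.8, 14.1),
  treat = c("tip", "tip", "notip", "notip")
)

# The bare condition is TRUE, NA, FALSE, TRUE: the NA becomes a row of NAs
with_na_rows <- example_df[example_df$leaf_area > 13, ]
nrow(with_na_rows)      # 3

# which() keeps only the positions where the condition is TRUE
without_na_rows <- example_df[which(example_df$leaf_area > 13), ]
nrow(without_na_rows)   # 2
```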
+## Success!
+Dear reader, you are now a capable user**R**. From this humble introduction, you can now choose your own adventure and learn more about many different topics in **R**. Be curious, be bold, and, above all, be patient. **R**ome wasn't built in a day. Best of luck, fellow traveler!
## References
diff --git a/data_files/flower_messy.csv b/data_files/flower_messy.csv
index 1606e83..45f6527 100644
--- a/data_files/flower_messy.csv
+++ b/data_files/flower_messy.csv
@@ -1,5 +1,5 @@
-Treat,Nitrogen,block,Height,Weight,leaf area,shoot area,Flowers
-tip,medium,1,7.5,7.62,11.7,31.9,1
+Treat,Nitrogen,block,Height,Weight,leaf area,shoot area,FLOWERS
+tip,medium,1,7.5,7.62,11.7,31.9,"""1"""
tip,medium,1,10.7,12.14,14.1,46,10
tip,medium,1,11.2,12.76,7.1,66.7,10
tip,Medium,1,10.4,8.78,11.9,20.3,1
diff --git a/docs/04_basic_data_processing.html b/docs/04_basic_data_processing.html
index 4e406a4..6684a28 100644
--- a/docs/04_basic_data_processing.html
+++ b/docs/04_basic_data_processing.html
@@ -223,10 +223,13 @@
You can move your working directory to any folder on your computer with the function setwd(). Just give setwd() the file path to your new working directory. I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. This way, all related data, scripts, graphs, and reports are in the same place. For example:
+
You can move your working directory to any folder on your computer with the function setwd(). Just give setwd() the file path to your new working directory. I prefer to set my working directory to a folder dedicated to whichever project I am currently working on. This way, all files related to my project are in the same place. For example:
Plain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau, the Social Security Administration, etc.) publish their data as plain-text files.
A plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple symbol separates the cells within a row. This symbol is often a comma, but it can also be a tab, a pipe delimiter |, or any other character. Each file only uses one symbol to separate cells, which minimizes confusion.
-
We will work with data from this1 plain text file. Use Ctrl+Shift+s to download the file. Then save it in your working directory with the name “flower”.
+
We will work with data from this1 plain text file. Use Ctrl+Shift+s to download the file. I am going to save it in a folder called “data_files” inside my working directory under the name “flower.csv”. But you can save it wherever you want as long as you can keep track of it.
4.1.1.1 read.table
read.table() can load plain-text files. The first argument of read.table() is the name of your file (if it is in your working directory), or the file path to your file (if it is not in your working directory).
@@ -309,9 +312,13 @@
-
4.1.1.3 read.fwf
-
There is a type of plain-text file called fixed-width file (.fwf). Instead of a symbol, a fixed-width file uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries.
+
+
4.1.1.3 HTML links
+
read.table() and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file.
+
+
+
4.1.1.4 read.fwf
+
Fixed-width file (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries.
If our flowers data came in a fixed-width file, the first few lines would look like this:
Fixed-width files may be visually intuitive, but they are difficult to work with. Perhaps because of this, R has a function for reading fixed-width files, but not for saving them.
You can read fixed-width files into R with the function read.fwf(). This function adds another argument to the ones from read.table(): widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set.
-
-
4.1.1.4 HTML links
-
read.table and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file.
+
+
4.1.2 Excel files
+
The best way to load data from Excel files (.xlsx) into R is not to use Excel files. Instead, save these files as .csv or .txt files and then use read.table. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for R to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.
+
Still, there are ways to load Excel files into R if we really need to. R has no native way of loading these files, but we can use the package readxl, which works on all operating systems. You install it using install.packages("readxl"). Then load it using library(readxl). Once you load the package, you can use the function read_excel() to load files of the type .xls and .xlsx (see help(“read_excel”) for more information).
+
+
+
4.1.3 Files from other programs
+
As with Excel files, I suggest that you first try to transform files from other programs to plain-text files. This transformation is usually the best way to verify that your data is transcribed properly, and allows us to customize the transformation.
+
But sometimes we can’t transform the file to a plain-text format—maybe because we can’t access the program that created the file (e.g. SAS). In these cases, we can resort to one of several libraries:
+
+
haven, for reading files from SAS, SPSS, and Stata.
+
R.matlab for reading files for versions MAT 4 and MAT 5.
+
foreign, for reading Minitab and Systat file formats. This library can also read files from SAS, SPSS, and Stata, but I prefer to use haven in these cases.
+
4.2 Cleaning data
-
First we want to make sure that the column names follow the rules we saw in section 1. This will facilitate working with different columns later.
-
Now we want to ensure that every column has the right format. Let’s check the types of the columns in our current data set R.
+
Once we load our data files as data.frames in R, we want to make sure that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as “data cleaning”. To practice this cleaning, we will use a “messy” version of the flower data that we loaded above. You can get this messy version from here. Again, you can use Ctrl+Shift+s to download the file.
First, we should ensure that the column names follow the rules we saw in section 1. This will facilitate working with different columns later. We can check these column names using the colnames() function:
If we open the data file using something like Excel or Notepad, we can see that the names for columns 6 and 7 have blank spaces inside them. When loading the data, read.csv() automatically substitutes these blank spaces with periods ., so that the names conform to R’s conventions. read.csv() checks for other things too, and it often does a pretty good job by itself. But it’s not perfect, so it’s always a good idea to double-check everything ourselves.
+
The column names of flower_messy_df look fine, but they can be better. Note that some names have capital letters, while others only have lower-case letters. Remembering the exact mix of upper and lower case letters is a drag, so why don’t we make them all lower case? A fast way to do this is to use the tolower() function, which changes all characters in a vector of strings to lower case:
These new column names are better, but we have not changed the column names inside flower_messy_df. Before moving on, let’s create a new data set called flower_clean_df. Using a copy of the original data set makes it easier to track our changes because we can always look at the original version. It also eases backtracking when we make a mistake because we don’t have to reload our original data—a lengthy process with large files. Creating a copy is easy:
+
+
flower_clean_df <- flower_messy_df
+
+
Now we can use our improved column names.
+
+
colnames(flower_clean_df) <- new_colnames # Replace column names in data frame
+colnames(flower_clean_df) # Check our work
The column names are almost ready. The last change will be to substitute the periods in the names with underscores. In R, this is purely out of personal preference. However, the change is a good excuse to get acquainted with function gsub(), which substitutes patterns of strings:
+
+
colnames(flower_clean_df) <-gsub(
+pattern ="\\.", # What we want to substitute
+replacement ="_", # What we want to have instead
+x =colnames(flower_clean_df) # The object we want to modify
+)
+colnames(flower_clean_df)
Note that I had to use "\\." instead of simply "." to match the period. The reason is that gsub() interprets "." as saying “match any character”. This may sound silly but it helps when using a regular expression—a syntax to find many different, complicated patterns in strings. Regular expressions are too complicated to discuss here, but if you expect to work with text data regularly, I encourage you to learn more about them.
+
Now we want to ensure that every column has the right format. Numbers should be of type “double” or “integer”, and text should be of type “character”. Let’s check the types of the columns in our current data set.
Column “flowers” seems to contain numbers but is classified as type “character”. The reason is that there are quotes around the first value in this column:
+
+
head(flower_clean_df[["flowers"]])
+
+
[1] "\"1\"" "10" "10" "1" "4" "9"
+
+
+
R recognizes that the value itself has quotes, so it adds a backslash \ to the quotes to differentiate them from the quotes it uses to print strings. Before we can manually coerce the column “flowers” to be of type double, we have to eliminate those confusing quotes.
+
+
flower_clean_df["flowers"] <-gsub(
+pattern ="\"", # The backslash escapes the quote so we can match it
+replacement ="", # This is how we write "nothing"
+x = flower_clean_df[["flowers"]] # x needs to be a vector, so use
+# double brackets or dollar sign
+)
+head(flower_clean_df[["flowers"]])
+
+
[1] "1" "10" "10" "1" "4" "9"
+
+
+
Now we can transform the column to be of type “double”.
The transformation worked, but column “nitrogen” looks suspicious. It is supposed to have only three values (“low”, “medium”, and “high”), but its description says it has eight levels. Let’s examine them more closely:
Columns INSERT NAMES HERE are of type character, which is not wrong, but it will be easier to handle them if we convert them to factors.
-
Next, let’s substitute the “missing” values with
-
Notice that column INSERT NAME HERE is of type character, but it has numbers there. In this case, the reason is that one value in INSERT NAME HERE has quotation marks, so R coerces the entire column to be of type character. We can fix this by doing:
+
Remember R is case sensitive, so it interprets each spelling as a different value. We can fix this using our friend tolower() once more. Note that this will convert the “nitrogen” column back to a simple character type, so we have to reconvert it to factor.
Unless I have a good reason not to, I usually transform all character columns to have only lower case letters.
4.3 Data summaries and visualizations
Now that our data is clean, we can get more complete summaries to understand what is going on. Function summary() recognizes the type of each column and displays an intuitively appropriate summary:
-
# summary(flower_df_clean)
+
summary(flower_clean_df)
+
+
treat nitrogen block height weight
+ notip:48 high :32 Min. :1.0 Min. : 1.200 Min. : 5.790
+ tip :48 low :32 1st Qu.:1.0 1st Qu.: 4.475 1st Qu.: 9.027
+ medium:32 Median :1.5 Median : 6.450 Median :11.395
+ Mean :1.5 Mean : 6.840 Mean :12.155
+ 3rd Qu.:2.0 3rd Qu.: 9.025 3rd Qu.:14.537
+ Max. :2.0 Max. :17.200 Max. :23.890
+ leaf_area shoot_area flowers
+ Min. : 5.80 Min. : 5.80 Min. : 1.000
+ 1st Qu.:11.07 1st Qu.: 39.05 1st Qu.: 4.000
+ Median :13.45 Median : 70.05 Median : 6.000
+ Mean :14.05 Mean : 79.78 Mean : 7.062
+ 3rd Qu.:16.45 3rd Qu.:113.28 3rd Qu.: 9.000
+ Max. :49.20 Max. :189.60 Max. :17.000
+
+
+
Now let’s imagine we want to study the distribution of values for weight. We can use a histogram to check the shape.
A single box plot has little information compared to a histogram. But box plots make it easier to look for “big” differences in the distribution of values. Let’s compare the distributions of height by nitrogen level:
Now let’s say we want to investigate the relationship between shoot area and leaf area. And let’s see whether that relationship differs depending on the value of treat. We can use a scatter plot with shoot area and leaf area, and we can color each point by its treat value.
+
+
plot(
+x = flower_clean_df$leaf_area,
+y = flower_clean_df$shoot_area,
+col = flower_clean_df$treat,
+main ="Shoot area seems proportional to leaf area",
+xlab ="Leaf area",
+ylab ="Shoot area"
+)
+# Add a legend to the plot
+legend(
+x ="bottomright",
+legend =levels(flower_clean_df$treat),
+col =1:length(levels(flower_clean_df$treat)),
+pch =16
+)
+
+
+
+
Now let’s say we want to see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13.
treat
+nitrogen notip tip
+ high 14 12
+ low 7 3
+ medium 10 7
+
+
mosaicplot(nitrogen_by_treat_table, main ="Nitrogen by treat table")
+
+
+
+
+
+
+
4.4 Success!
+
Dear reader, you are now a capable userR. From this humble introduction, you can now choose your own adventure and learn more about many different topics in R. Be curious, be bold, and, above all, be patient. Rome wasn’t built in a day. Best of luck, fellow traveler!
diff --git a/docs/04_basic_data_processing_files/figure-html/boxplot for weight-1.png b/docs/04_basic_data_processing_files/figure-html/boxplot for weight-1.png
new file mode 100644
index 0000000..ed9e31f
Binary files /dev/null and b/docs/04_basic_data_processing_files/figure-html/boxplot for weight-1.png differ
diff --git a/docs/04_basic_data_processing_files/figure-html/height by nitrogen boxplots-1.png b/docs/04_basic_data_processing_files/figure-html/height by nitrogen boxplots-1.png
new file mode 100644
index 0000000..3618151
Binary files /dev/null and b/docs/04_basic_data_processing_files/figure-html/height by nitrogen boxplots-1.png differ
diff --git a/docs/04_basic_data_processing_files/figure-html/histogram for weight-1.png b/docs/04_basic_data_processing_files/figure-html/histogram for weight-1.png
new file mode 100644
index 0000000..8600dd5
Binary files /dev/null and b/docs/04_basic_data_processing_files/figure-html/histogram for weight-1.png differ
diff --git a/docs/04_basic_data_processing_files/figure-html/leaf area vs shoot area by treat-1.png b/docs/04_basic_data_processing_files/figure-html/leaf area vs shoot area by treat-1.png
new file mode 100644
index 0000000..efc2e6a
Binary files /dev/null and b/docs/04_basic_data_processing_files/figure-html/leaf area vs shoot area by treat-1.png differ
diff --git a/docs/04_basic_data_processing_files/figure-html/mosaic plot for nitrogen vs treat-1.png b/docs/04_basic_data_processing_files/figure-html/mosaic plot for nitrogen vs treat-1.png
new file mode 100644
index 0000000..4bcf802
Binary files /dev/null and b/docs/04_basic_data_processing_files/figure-html/mosaic plot for nitrogen vs treat-1.png differ
diff --git a/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-7-1.png b/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-7-1.png
new file mode 100644
index 0000000..cc104a8
Binary files /dev/null and b/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-7-1.png differ
diff --git a/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-8-1.png b/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-8-1.png
new file mode 100644
index 0000000..a703ae5
Binary files /dev/null and b/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-8-1.png differ
diff --git a/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-9-1.png b/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-9-1.png
new file mode 100644
index 0000000..fb6b97d
Binary files /dev/null and b/docs/04_basic_data_processing_files/figure-html/unnamed-chunk-9-1.png differ
diff --git a/docs/search.json b/docs/search.json
index 7d0eb2b..ab67eac 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -319,28 +319,28 @@
"href": "04_basic_data_processing.html#loading-data",
"title": "4 Basic data processing",
"section": "4.1 Loading data",
- "text": "4.1 Loading data\nOnce we know where to find data files in our computer, we can start loading them into R. Note, however, that we need specific ways to open different file formats.\n\n4.1.1 Plain text files\nPlain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau, the Social Security Administration, etc.) publish their data as plain-text files.\nA plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple symbol separates the cells within a row. This symbol is often a comma, but it can also be a tab, a pipe delimiter |, or any other character. Each file only uses one symbol to separate cells, which minimizes confusion.\nWe will work with data from this1 plain text file. Use Ctrl+Shift+s to download the file. Then save it in your working directory with the name “flower”.\n\n4.1.1.1 read.table\nread.table() can load plain-text files. The first argument of read.table() is the name of your file (if it is in your working directory), or the file path to your file (if it is not in your working directory).\n\nflower_df <- read.table(\"data_files/flower.csv\", header = TRUE, sep = \",\")\n\nIn the code above, I added two more arguments, header and sep. header tells R whether the first line of the file contains variable names instead of values. sep tells R the symbol that the file uses to separate the cells.\nSometimes a plain-text file starts with text that is not part of the data set. Or, maybe we want to read only part of a data set. Argument skip tells R to skip a specific number of lines before it starts reading in values from the file. Argument nrow tells R to stop reading in values after it has read in a certain number of lines. 
Keep in mind that the header row doesn’t count towards the total rows allowed by nrow.\n\nflower_df_chunk <- read.table(\n \"data_files/flower.csv\", \n header = TRUE, \n sep = \",\", \n skip = 0, \n nrow = 3\n)\nflower_df_chunk\n\n treat nitrogen block height weight leafarea shootarea flowers\n1 tip medium 1 7.5 7.62 11.7 31.9 1\n2 tip medium 1 10.7 12.14 14.1 46.0 10\n3 tip medium 1 11.2 12.76 7.1 66.7 10\n\n\nread.table() has other arguments that you can tweak. You can consult the function’s help page to know more about it.\n\n\n4.1.1.2 Shortcuts for read.table\nR has shortcut functions that call read.table() in the background with different default values for popular types of files:\n\nread.table is the general purpose read function.\nread.csv reads comma-separated values (.csv) files.\nread.delim reads tab-delimited files.\nread.csv2 reads .csv files with European decimal format.\nread.delim2 reads tab-delimited files with European decimal format.\n\n\n\n4.1.1.3 Excel files\nThe best way to load Excel files (.xlsx) into R is not to use Excel files. Instead, save these files as .csv or .text files and then use read.table. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other hidden, complicated formats. All of these make it difficult for R to read the files properly. Plain text files are simpler, so we can load, and transfer them more easily.\nStill, there are ways to load Excel files into R if we really need to. R has no native way of loading these files, but we can use the package readxl. If you don’t have it installed, you can type install.packages(\"readxl\"). Then\n\n\n4.1.1.4 HTML links\nread.table and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. 
Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file.\n\n\n4.1.1.5 read.fwf\nFixed-width file (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. To correctly position its data, the file adds an arbitrary number of character spaces between data entries.\nIf our flowers data came in a fixed-width file, the first few lines would look like this:\n\ntreat nitrogen block height weight leafarea shootarea flowers\ntip medium 1 7.5 7.62 11.7 31.9 1\ntip medium 1 10.7 12.14 14.1 46.0 10\ntip medium 1 11.2 12.76 7.1 66.7 10\ntip medium 1 10.4 8.78 11.9 20.3 1\ntip medium 1 10.4 13.58 14.5 26.9 4\ntip medium 1 9.8 10.08 12.2 72.7 9\n\nFixed-width files may be visually intuitive, but they are difficult to work with. Perhaps because of this, R has a function for reading fixed-width files, but not for saving them.\nYou can read fixed-width files into R with the function read.fwf(). This function adds another argument to the ones from read.table(): widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set."
+ "text": "4.1 Loading data\nOnce we know where to find data files in our computer, we can start loading them into R. Note, however, that we need specific ways to open different file formats.\n\n4.1.1 Plain text files\nPlain-text files are simple and many programs can read them. This is why many organizations (e.g., the Census Bureau, the Social Security Administration, etc.) publish their data as plain-text files.\nA plain-text file stores a table of data in a text document. Each row of the table is saved on its own line, and a simple symbol separates the cells within a row. This symbol is often a comma, but it can also be a tab, a pipe delimiter |, or any other character. Each file only uses one symbol to separate cells, which minimizes confusion.\nWe will work with data from this1 plain text file. Use Ctrl+Shift+s to download the file. I am going to save it in a folder called “data_files” inside my working directory under the name “flower.csv”. But you can save it wherever you want as long as you can keep track of it.\n\n4.1.1.1 read.table\nread.table() can load plain-text files. The first argument of read.table() is the name of your file (if it is in your working directory), or the file path to your file (if it is not in your working directory).\n\nflower_df <- read.table(\"data_files/flower.csv\", header = TRUE, sep = \",\")\n\nIn the code above, I added two more arguments, header and sep. header tells R whether the first line of the file contains variable names instead of values. sep tells R the symbol that the file uses to separate the cells.\nSometimes a plain-text file starts with text that is not part of the data set. Or, maybe we want to read only part of a data set. Argument skip tells R to skip a specific number of lines before it starts reading in values from the file. Argument nrow tells R to stop reading in values after it has read in a certain number of lines. 
Keep in mind that the header row doesn’t count towards the total rows allowed by nrow.\n\nflower_df_chunk <- read.table(\n \"data_files/flower.csv\", \n header = TRUE, \n sep = \",\", \n skip = 0, \n nrow = 3\n)\nflower_df_chunk\n\n treat nitrogen block height weight leafarea shootarea flowers\n1 tip medium 1 7.5 7.62 11.7 31.9 1\n2 tip medium 1 10.7 12.14 14.1 46.0 10\n3 tip medium 1 11.2 12.76 7.1 66.7 10\n\n\nread.table() has other arguments that you can tweak. You can consult the function’s help page to know more about it.\n\n\n4.1.1.2 Shortcuts for read.table\nR has shortcut functions that call read.table() in the background with different default values for popular types of files:\n\nread.table is the general purpose read function.\nread.csv reads comma-separated values (.csv) files.\nread.delim reads tab-delimited files.\nread.csv2 reads .csv files with European decimal format.\nread.delim2 reads tab-delimited files with European decimal format.\n\n\n\n4.1.1.3 HTML links\nread.table() and its shortcuts allow us to load data files directly from a website. Instead of using the file’s path or name, we can directly use a web address in the file argument of the function. Do make sure that you are using the web address that links directly to the file, not to a web page that has a link to the file.\n\n\n4.1.1.4 read.fwf\nFixed-width file (.fwf) is a type of plain-text file that, instead of a symbol, uses its layout to separate data cells. Each row is still in a single line, and each column begins at a specific number of characters from the left-hand side of the document. 
To correctly position its data, the file adds an arbitrary number of character spaces between data entries.\nIf our flowers data came in a fixed-width file, the first few lines would look like this:\n\ntreat nitrogen block height weight leafarea shootarea flowers\ntip medium 1 7.5 7.62 11.7 31.9 1\ntip medium 1 10.7 12.14 14.1 46.0 10\ntip medium 1 11.2 12.76 7.1 66.7 10\ntip medium 1 10.4 8.78 11.9 20.3 1\ntip medium 1 10.4 13.58 14.5 26.9 4\ntip medium 1 9.8 10.08 12.2 72.7 9\n\nFixed-width files may be visually intuitive, but they are difficult to work with. Perhaps because of this, R has a function for reading fixed-width files, but not for saving them.\nYou can read fixed-width files into R with the function read.fwf(). This function adds another argument to the ones from read.table(): widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set.\n\n\n\n4.1.2 Excel files\nThe best way to load data from Excel files (.xlsx) into R is not to use Excel files. Instead, save these files as .csv or .txt files and then use read.table. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for R to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.\nStill, there are ways to load Excel files into R if we really need to. R has no native way of loading these files, but we can use the package readxl, which works on all operating systems. You install it using install.packages(\"readxl\"). Then load it using library(readxl). Once you load the package, you can use the function read_excel() to load files of the type .xls and .xlsx (see help(“read_excel”) for more information).\n\n\n4.1.3 Files from other programs\nAs with Excel files, I suggest that you first try to transform files from other programs to plain-text files. 
This transformation is usually the best way to verify that your data is transcribed properly, and allows us to customize the transformation.\nBut sometimes we can’t transform the file to a plain-text format, maybe because we can’t access the program that created the file (e.g., SAS). In these cases, we can resort to one of several libraries:\n\nhaven, for reading files from SAS, SPSS, and Stata.\nR.matlab for reading files for versions MAT 4 and MAT 5.\nforeign for reading Minitab and Systat file formats. This library can also read files from SAS, SPSS, and Stata, but I prefer to use haven in these cases."
},
{
"objectID": "04_basic_data_processing.html#cleaning-data",
"href": "04_basic_data_processing.html#cleaning-data",
"title": "4 Basic data processing",
"section": "4.2 Cleaning data",
- "text": "4.2 Cleaning data\nFirst we want to make sure that the column names follow the rules we saw in section 1. This will facilitate working with different columns later.\nNow we want to ensure that every column has the right format. Let’s check the types of the columns in our current data set R.\n\n# str(flowers_df_clean)\n\nColumns INSERT NAMES HERE are of type character, which is not wrong, but it will be easier to handle them if we convert them to factors.\nNext, let’s substitute the “missing” values with\nNotice that column INSERT NAME HERE is of type character, but it has numbers there. In this case, the reason is that one value in INSERT NAME HERE has quotation marks, so R coerces the entire column to be of type character. We can fix this by doing:"
+ "text": "4.2 Cleaning data\nOnce we load our data files as data.frames in R, we want to make sure that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as “data cleaning”. To practice this cleaning, we will use a “messy” version of the flower data that we loaded above. You can get this messy version from here. Again, you can use Ctrl+Shift+s to download the file.\nSince this is a .csv file, we can load it using:\n\nflower_messy_df = read.csv(\"data_files/flower_messy.csv\", header = TRUE)\n\nFirst, we should ensure that the column names follow the rules we saw in section 1. This will facilitate working with different columns later. We can check these column names using the colnames() function:\n\ncolnames(flower_messy_df)\n\n[1] \"Treat\" \"Nitrogen\" \"block\" \"Height\" \"Weight\" \n[6] \"leaf.area\" \"shoot.area\" \"FLOWERS\" \n\n\nIf we open the data file using something like Excel or Notepad, we can see that the names for columns 6 and 7 had blank spaces inside them. When loading the data, read.csv() automatically substitutes these blank spaces with periods ., so that the names conform to R’s conventions. read.csv() checks for other things too, and it often does a pretty good job by itself. But it’s not perfect, so it’s always a good idea to double-check everything ourselves.\nThe column names of flower_messy_df look fine, but they can be better. Note that some names have capital letters, while others only have lower-case letters. Remembering the exact mix of upper and lower case letters is a drag, so why don’t we make them all lower case? 
A fast way to do this is to use the tolower() function, which changes all characters in a vector of strings to lower case:\n\nnew_colnames <- tolower(colnames(flower_messy_df)) # Modify column names\nnew_colnames\n\n[1] \"treat\" \"nitrogen\" \"block\" \"height\" \"weight\" \n[6] \"leaf.area\" \"shoot.area\" \"flowers\" \n\n\nThese new column names are better, but we have not changed the column names inside flower_messy_df. Before moving on, let’s create a new data set called flower_clean_df. Using a copy of the original data set makes it easier to track our changes because we can always look at the original version. It also eases backtracking when we make a mistake because we don’t have to reload our original data—a lengthy process with large files. Creating a copy is easy:\n\nflower_clean_df <- flower_messy_df\n\nNow we can use our improved column names.\n\ncolnames(flower_clean_df) <- new_colnames # Replace column names in data frame\ncolnames(flower_clean_df) # Check our work\n\n[1] \"treat\" \"nitrogen\" \"block\" \"height\" \"weight\" \n[6] \"leaf.area\" \"shoot.area\" \"flowers\" \n\n\nThe column names are almost ready. The last change will be to substitute the periods in the names with underscores. In R, this is purely out of personal preference. However, the change is a good excuse to get acquainted with function gsub(), which substitutes patterns of strings:\n\ncolnames(flower_clean_df) <- gsub(\n pattern = \"\\\\.\", # What we want to substitute\n replacement = \"_\", # What we want to have instead\n x = colnames(flower_clean_df) # The object we want to modify\n)\ncolnames(flower_clean_df)\n\n[1] \"treat\" \"nitrogen\" \"block\" \"height\" \"weight\" \n[6] \"leaf_area\" \"shoot_area\" \"flowers\" \n\n\nNote that I had to use \"\\\\.\" instead of simply \".\" to match the period. The reason is that gsub() interprets \".\" as saying “match any character”. 
This may sound silly, but it helps when using a regular expression, a syntax to find many different, complicated patterns in strings. Regular expressions are too complicated to discuss here, but if you expect to work with text data regularly, I encourage you to learn more about them.\nNow we want to ensure that every column has the right format. Numbers should be of type “double” or “integer”, and text should be of type “character”. Let’s check the types of the columns in our current data set.\n\nstr(flower_clean_df)\n\n'data.frame': 96 obs. of 8 variables:\n $ treat : chr \"tip\" \"tip\" \"tip\" \"tip\" ...\n $ nitrogen : chr \"medium\" \"medium\" \"medium\" \"Medium\" ...\n $ block : int 1 1 1 1 1 1 1 1 2 2 ...\n $ height : num 7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...\n $ weight : num 7.62 12.14 12.76 8.78 13.58 ...\n $ leaf_area : num 11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...\n $ shoot_area: num 31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...\n $ flowers : chr \"\\\"1\\\"\" \"10\" \"10\" \"1\" ...\n\n\nColumn “flowers” seems to contain numbers but is classified as type “character”. The reason is that there are quotes around the first value in this column:\n\nhead(flower_clean_df[[\"flowers\"]])\n\n[1] \"\\\"1\\\"\" \"10\" \"10\" \"1\" \"4\" \"9\" \n\n\nR recognizes that the value itself has quotes, so it adds a backslash \ to the quotes to differentiate them from the quotes it uses to print strings. 
Before we can manually coerce the column “flowers” to be of type double, we have to eliminate those confusing quotes.\n\nflower_clean_df[\"flowers\"] <- gsub(\n pattern = \"\\\"\", # \\\" the backslash tells R to match quotes\n replacement = \"\", # This is how we write \"nothing\"\n x = flower_clean_df[[\"flowers\"]] # x needs to be a vector, so use \n # double brackets or dollar sign\n)\nhead(flower_clean_df[[\"flowers\"]])\n\n[1] \"1\" \"10\" \"10\" \"1\" \"4\" \"9\" \n\n\nNow we can transform the column to be of type “double”.\n\nflower_clean_df[\"flowers\"] <- as.numeric(flower_clean_df[[\"flowers\"]])\ntypeof(flower_clean_df[[\"flowers\"]])\n\n[1] \"double\"\n\nhead(flower_clean_df[[\"flowers\"]])\n\n[1] 1 10 10 1 4 9\n\n\nColumns “treat” and “nitrogen” are of type character, which is not wrong, but it will be easier to handle them if we convert them to factors.\n\nflower_clean_df[\"treat\"] <- factor(flower_clean_df[[\"treat\"]])\nflower_clean_df[\"nitrogen\"] <- factor(flower_clean_df[[\"nitrogen\"]])\nstr(flower_clean_df)\n\n'data.frame': 96 obs. of 8 variables:\n $ treat : Factor w/ 2 levels \"notip\",\"tip\": 2 2 2 2 2 2 2 2 2 2 ...\n $ nitrogen : Factor w/ 8 levels \"high\",\"High\",..: 7 7 7 8 7 7 8 7 7 7 ...\n $ block : int 1 1 1 1 1 1 1 1 2 2 ...\n $ height : num 7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...\n $ weight : num 7.62 12.14 12.76 8.78 13.58 ...\n $ leaf_area : num 11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...\n $ shoot_area: num 31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...\n $ flowers : num 1 10 10 1 4 9 7 6 5 8 ...\n\n\nThe transformation worked, but column “nitrogen” looks suspicious. It is supposed to have only three values (“low”, “medium”, and “high”), but its description says it has eight levels. 
Let’s examine them more closely:\n\nlevels(flower_clean_df$nitrogen)\n\n[1] \"high\" \"High\" \"HIGH\" \"low\" \"lOw\" \"Low\" \"medium\" \"Medium\"\n\n\nRemember R is case sensitive, so it interprets each spelling as a different value. We can fix this using our friend tolower() once more. Note that this will convert the “nitrogen” column back to a simple character type, so we have to reconvert it to factor.\n\nflower_clean_df[\"nitrogen\"] <- tolower(flower_clean_df$nitrogen)\nflower_clean_df[\"nitrogen\"] <- factor(flower_clean_df$nitrogen)\nlevels(flower_clean_df$nitrogen)\n\n[1] \"high\" \"low\" \"medium\"\n\n\nUnless I have a good reason not to, I usually transform all character columns to have only lower case letters."
},
{
"objectID": "04_basic_data_processing.html#data-summaries-and-visualizations",
"href": "04_basic_data_processing.html#data-summaries-and-visualizations",
"title": "4 Basic data processing",
"section": "4.3 Data summaries and visualizations",
- "text": "4.3 Data summaries and visualizations\nNow that our data is clean, we can get more complete summaries to understand what is going on. Function summary() recognizes the type of each column and displays an intuitively appropriate summary:\n\n# summary(flower_df_clean)"
+ "text": "4.3 Data summaries and visualizations\nNow that our data is clean, we can get more complete summaries to understand what is going on. Function summary() recognizes the type of each column and displays an intuitively appropriate summary:\n\nsummary(flower_clean_df)\n\n treat nitrogen block height weight \n notip:48 high :32 Min. :1.0 Min. : 1.200 Min. : 5.790 \n tip :48 low :32 1st Qu.:1.0 1st Qu.: 4.475 1st Qu.: 9.027 \n medium:32 Median :1.5 Median : 6.450 Median :11.395 \n Mean :1.5 Mean : 6.840 Mean :12.155 \n 3rd Qu.:2.0 3rd Qu.: 9.025 3rd Qu.:14.537 \n Max. :2.0 Max. :17.200 Max. :23.890 \n leaf_area shoot_area flowers \n Min. : 5.80 Min. : 5.80 Min. : 1.000 \n 1st Qu.:11.07 1st Qu.: 39.05 1st Qu.: 4.000 \n Median :13.45 Median : 70.05 Median : 6.000 \n Mean :14.05 Mean : 79.78 Mean : 7.062 \n 3rd Qu.:16.45 3rd Qu.:113.28 3rd Qu.: 9.000 \n Max. :49.20 Max. :189.60 Max. :17.000 \n\n\nNow let’s imagine we want to study the distribution of values for weight. We can use a histogram to check the shape.\n\nhist(\n flower_clean_df$weight, \n breaks = 15,\n xlab = \"Weight\",\n main = \"Histogram for weight\"\n)\n\n\n\n\nOr we can get a simpler description using a box plot:\n\nboxplot(\n flower_clean_df$weight, xlab = \"Weight\", \n col = \"darkgreen\",\n main = \"Boxplot for weight\"\n)\n\n\n\n\nA single box plot has little information compared to a histogram. But box plots make it easier to look for “big” differences in the distribution of values. Let’s compare the distributions of height by nitrogen level:\n\nboxplot(\n height ~ nitrogen,\n data = flower_clean_df, \n col = c(\"yellow\", \"blue\", \"pink\"),\n main = \"No clear pattern between height and nitrogen\"\n)\n\n\n\n\nNow let’s say we want to investigate the relationship between shoot area and leaf area. And let’s see whether that relationship differs depending on the value of treat. 
We can use a scatter plot with shoot area and leaf area, and we can color each point by their treat value.\n\nplot(\n x = flower_clean_df$leaf_area,\n y = flower_clean_df$shoot_area, \n col = flower_clean_df$treat,\n main = \"Shoot area seems proportional to leaf area\",\n xlab = \"Leaf area\",\n ylab = \"Shoot area\"\n)\n# Add a legend to the plot\nlegend(\n x = \"bottomright\", \n legend = levels(flower_clean_df$treat), \n col = 1:length(levels(flower_clean_df$treat)), \n pch = 16\n)\n\n\n\n\nNow let’s say we want to see how frequently the values of nitrogen and treat combine with each other, but only for flowers with a leaf area greater than 13.\n\nnitrogen_by_treat_table = xtabs(\n formula = ~ nitrogen + treat,\n data = flower_clean_df[which(flower_clean_df$leaf_area > 13),]\n)\nnitrogen_by_treat_table\n\n treat\nnitrogen notip tip\n high 14 12\n low 7 3\n medium 10 7\n\nmosaicplot(nitrogen_by_treat_table, main = \"Nitrogen by treat table\")"
},
{
"objectID": "04_basic_data_processing.html#references",
"href": "04_basic_data_processing.html#references",
"title": "4 Basic data processing",
- "section": "4.4 References",
- "text": "4.4 References\nMost of this section is based on “Hands-On Programming with R”, by Garret Grolemund; and on “An Introduction to R”, by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau."
+ "section": "4.5 References",
+ "text": "4.5 References\nMost of this section is based on “Hands-On Programming with R”, by Garret Grolemund; and on “An Introduction to R”, by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau."
},
{
"objectID": "04_basic_data_processing.html#footnotes",
@@ -355,5 +355,12 @@
"title": "2 Getting Started with R",
"section": "2.9 Acquiring external packages",
"text": "2.9 Acquiring external packages\nWe don’t need to reinvent the wheel every time we need to do something that is not available in R’s default version. We can easily download packages from CRAN’s online repositories to get many useful functions.\nTo install a package from CRAN, we can use the install.packages() function. For example, if we wanted to install the package readxl (for loading .xslx files), we would need:\n\ninstall.packages(\"readxl\", dependencies = TRUE)\n\nThe argument dependencies tells R whether it should also download other packages that readxl needs to work.\nR may ask you to select a CRAN mirror, which simply put refers to the location of the servers you want to download from. Choose a mirror close to where you are.\nAfter installing a package, we need to load it into R before we can use its functions. To load the package readxl, we need to use the function library(), which will also load any other packages required to load readxl and may print additional information.\n\nlibrary(\"readxl\")\n\nEvery time we start a new R session we need to load the packages we need. If we try to run a function without loading its package first, we will get an error message saying that R could not find it.\nWriting all our library() statements at the top of our R scripts is almost always a good idea. This helps us know that we need to load the libraries at the start our sessions; and it helps others know quickly that they will need to have those libraries installed to be able to use our code.\nSometimes only need one or two functions from a library. To avoid loading the entire library, we can access the specific function directly by specifying the package name followed by two colons and then the function name. For example:\n\nreadxl::read_xlsx(\"fake_data_file.xlsx\")"
+ },
+ {
+ "objectID": "04_basic_data_processing.html#success",
+ "href": "04_basic_data_processing.html#success",
+ "title": "4 Basic data processing",
+ "section": "4.4 Success!",
+ "text": "4.4 Success!\nDear reader, you are now a capable useR. From this humble introduction, you can now choose your own adventure and learn more about many different topics in R. Be curious, be bold, and, above all, be patient. Rome wasn’t built in a day. Best of luck, fellow traveler!"
}
]
\ No newline at end of file