update remixed instructions

rhodyrstats · Jan 9, 2020 · 3fa63ca
1 parent 7c238d5
commit 3fa63ca

Showing 3 changed files with 267 additions and 7 deletions.
diff --git a/Oz_fires_DC_remix.Rmd b/Oz_fires_DC_remix.Rmd
@@ -49,6 +49,15 @@ temperature <- read_csv('https://tinyurl.com/Oz-fire-temp')
 You can see some information about the data we have just loaded.
 The name of each column is shown along with the type of data in that column.
 The data are stored in a format we call a data frame.
+
+<p style="color:blue">
+You will see the message `Parsed with column specification`, followed by each column name and its data type.
+When you execute `read_csv` on a data file, it inspects the first 1000 rows of each column and
+guesses that column's data type as it reads the file into R. For example, in this dataset, `read_csv`
+reads some columns as `col_double` (a numeric data type) and others as `col_character` (text). You can
+specify the data type of a column manually with the `col_types` argument of `read_csv`.
+</p>
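+
+As a minimal sketch of the `col_types` argument, the chunk below reads a small inline CSV string. The two-row data and the object name `toy_temps` are made up for illustration and are not part of the fire dataset.
+
+```{r sketch_col_types}
+library(readr)
+
+# Hypothetical two-row CSV supplied as literal text (a string containing newlines).
+toy_temps <- read_csv(
+  "city_name,temperature\nPERTH,31.2\nSYDNEY,25.4\n",
+  col_types = cols(
+    city_name = col_character(),  # read as text
+    temperature = col_double()    # read as a number
+  )
+)
+
+# Show the column names and the types that were applied.
+str(toy_temps)
+```
+
+Because the types are stated explicitly, `read_csv` does not have to guess them.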
+
 For more details on this dataset see the [Tidy Tuesday site](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-07).
 
 <p style="color:blue">
@@ -84,7 +93,7 @@ yearly_temp <- read_csv('https://tinyurl.com/Oz-mean-temp')
 Now we'll plot the temperature as a function of time.
 
 <p style="color:blue">
-## Plotting with **`ggplot2`**
+### Plotting with **`ggplot2`**
 
 **`ggplot2`** is a plotting package that makes it simple to create complex plots
 from data in a data frame. It provides a more programmatic interface for
@@ -153,13 +162,180 @@ ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
   geom_point()+geom_line()
 ```
 
-In a few quick commands we can already plot temperature and observe how it's been increasing.
+And let's tidy this into a publication-quality plot.
+```{r t5}
+ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
+  geom_line() +
+  labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
+  theme_bw()
+```
 
+In a few quick commands we can already plot temperature and observe how it's been increasing.
 
 **Notes**
 
 <span style="color: blue">
 - Anything you put in the `ggplot()` function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in `aes()`.<br>
 - You can also specify mappings for a given geom independently of the mappings defined globally in the `ggplot()` function.<br>
 - The `+` sign used to add new layers must be placed at the end of the line containing the *previous* layer. If, instead, the `+` sign is added at the beginning of the line containing the new layer, **`ggplot2`** will not add the new layer and will return an error message.
-</span>
+</span>
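+
+The notes above can be sketched with a small made-up data frame (`toy_temp` is hypothetical, not the real `yearly_temp` table). The x and y mappings are set globally in `ggplot()`, while color is mapped only inside `geom_line()`.
+
+```{r sketch_layer_mapping}
+library(ggplot2)
+
+# Hypothetical toy data: three years of mean temperature for two cities.
+toy_temp <- data.frame(
+  year = rep(2017:2019, times = 2),
+  temperature = c(24.1, 24.6, 25.0, 21.3, 21.9, 22.4),
+  city_name = rep(c("PERTH", "SYDNEY"), each = 3)
+)
+
+# Each `+` sits at the END of the line of the previous layer.
+p <- ggplot(toy_temp, aes(year, temperature)) +  # global mapping, seen by every layer
+  geom_line(aes(color = city_name)) +            # color is mapped in this layer only
+  geom_point()                                   # inherits x and y, but not color
+
+p
+```
+
+In this sketch the lines are colored by city while the points stay black, because the color mapping belongs to the line layer alone.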
+
+## Manipulating data
+
+In the prior section I gave you a summary table of temperature data.
+Let's consider how you could generate this summary table and do other data manipulation
+given our original datasets.
+
+<p style="color:blue">
+### Data Manipulation using **`dplyr`** and **`tidyr`**
+
+Bracket subsetting is handy, but it can be cumbersome and difficult to read,
+especially for complicated operations. Enter **`dplyr`**. **`dplyr`** is a package for
+making tabular data manipulation easier. It pairs nicely with **`tidyr`**, which enables you to swiftly convert between different data formats for plotting and analysis.
+
+Packages in R are basically sets of additional functions that let you do more
+stuff. The functions we've been using so far, like `str()` or `data.frame()`,
+come built into R; packages give you access to more of them. Before you use a
+package for the first time you need to install it on your machine, and then you
+should import it in every subsequent R session when you need it. You should
+already have installed the **`tidyverse`** package. This is an
+"umbrella-package" that installs several packages useful for data analysis which
+work together well such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc.
+
+
+The **`tidyverse`** package tries to address 3 common issues that arise when
+doing data analysis with some of the functions that come with R:
+
+1. The results from a base R function sometimes depend on the type of data.
+2. R expressions are sometimes used in a non-standard way, which can be confusing for new
+   learners.
+3. Some arguments are hidden and have default behaviors that new learners are not aware
+   of.
+
+The package **`dplyr`** provides easy tools for the most common data manipulation
+tasks. It is built to work directly with data frames, with many common tasks
+optimized by being written in a compiled language (C++). An additional feature is the
+ability to work directly with data stored in an external database. The benefits of
+doing this are that the data can be managed natively in a relational database,
+queries can be conducted on that database, and only the results of the query are
+returned.
+
+This addresses a common problem with R in that all operations are conducted
+in-memory and thus the amount of data you can work with is limited by available
+memory. The database connections essentially remove that limitation in that you
+can connect to a database of many hundreds of GB, conduct queries on it directly, and pull
+back into R only what you need for analysis.
+
+The package **`tidyr`** addresses the common problem of wanting to reshape your data for plotting and for use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups, such as plots or aquaria. Moving back and forth between these formats is non-trivial, and **`tidyr`** gives you tools for this and for more sophisticated data manipulation.
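+
+As a rough illustration of moving between the two formats, the chunk below uses `pivot_wider()` and `pivot_longer()` from **`tidyr`** (version 1.0.0 or later) on a made-up table. The object names and values are hypothetical.
+
+```{r sketch_reshape}
+library(tidyr)
+
+# Hypothetical long-format data: one row per measurement.
+long_temp <- data.frame(
+  city_name = c("PERTH", "PERTH", "SYDNEY", "SYDNEY"),
+  temp_type = c("max", "min", "max", "min"),
+  temperature = c(31.2, 17.5, 25.4, 18.1),
+  stringsAsFactors = FALSE
+)
+
+# Wide format: one row per city, one column per measurement type.
+wide_temp <- pivot_wider(long_temp,
+                         names_from = temp_type,
+                         values_from = temperature)
+
+# And back to long format: one row per measurement again.
+back_to_long <- pivot_longer(wide_temp,
+                             cols = c(max, min),
+                             names_to = "temp_type",
+                             values_to = "temperature")
+```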
+
+To learn more about **`dplyr`** and **`tidyr`** after the workshop, you may want to check out this
+[handy data transformation with **`dplyr`** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) and this [one about **`tidyr`**](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf).
+
+We're going to learn some of the most common **`dplyr`** functions:
+
+- `select()`: subset columns
+- `filter()`: subset rows on conditions
+- `mutate()`: create new columns by using information from other columns
+- `group_by()` and `summarize()`: create summary statistics on grouped data
+- `arrange()`: sort results
+- `count()`: count discrete values
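+
+Before working through these on the real data, here is a small sketch of each function on a made-up data frame. The table `toy` and the Fahrenheit column are assumptions for illustration only.
+
+```{r sketch_dplyr_verbs}
+library(dplyr)
+
+# Hypothetical toy data.
+toy <- data.frame(
+  city_name = c("PERTH", "PERTH", "SYDNEY", "SYDNEY"),
+  temp_type = c("max", "min", "max", "min"),
+  temperature = c(31.2, 17.5, 25.4, 18.1),
+  stringsAsFactors = FALSE
+)
+
+maxs <- filter(toy, temp_type == "max")                          # keep rows that meet a condition
+maxs <- select(maxs, city_name, temperature)                     # keep only these columns
+maxs <- mutate(maxs, temperature_f = temperature * 9 / 5 + 32)   # new column from an existing one
+maxs <- arrange(maxs, desc(temperature))                         # sort, warmest first
+
+counts <- count(toy, city_name)                                  # number of rows per city
+
+by_city <- summarize(group_by(toy, city_name),                   # one summary row per city
+                     mean_temp = mean(temperature))
+```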
+
+### Selecting columns and filtering rows
+
+To choose rows based on a specific criterion, use `filter()`.
+</p>
+
+For our data let's speculate that we may be seeing an increase in fires due to 
+changes in maximum daily temperatures.
+After all, it could be spiking temperatures that allow fires to occur,
+and that wouldn't be reflected well in the mean temperature for a day.
+In this case we need to filter our data to keep only the rows where the column
+`temp_type` is `"max"`.
+
+```{r f}
+temperatures_maxs <- filter(temperature,temp_type == "max")
+```
+
+Now consider what summary information you want.
+You probably want the average maximum temperature for each city for each year
+to look at changes over time.
+Let's look at how we would calculate the average maximum temperature for one city for one year.
+Then we'll be able to extend this to all cities and all years. 
+
+First filter your data to include only data from 2019 at "PERTH AIRPORT".
+This is a challenge for you.
+
+```{r f1}
+temperatures_maxs_PerthAir <- filter(temperatures_maxs,site_name == "PERTH AIRPORT")
+
+```
+
+You probably got stuck on how to tell whether a row came from 2019.
+That information is in the `date` column, but you have to extract it.
+This is a great lesson.
+You first need to think about what you want your data to look like.
+Once you know what you want you can figure out how to communicate that to the computer.
+
+In this case we'll use the `lubridate` package to process date information.
+
+Challenge: load the lubridate library.
+
+```{r pack2, warning=FALSE}
+library(lubridate)
+```
+
+You can use the `year` function to extract the year from the date column.
+Assign that to a new column in your data frame.
+
+```{r d}
+temperatures_maxs_PerthAir$year <- year(temperatures_maxs_PerthAir$date)
+```
+
+Now you can filter for data from 2019.
+This is a challenge.
+
+```{r f2}
+temperatures_maxs_PerthAir_2019 <- filter(temperatures_maxs_PerthAir,year == 2019)
+
+```
+
+Now you can calculate the mean max temperature for this site in this year using the mean
+function.
+
+```{r m}
+mean(temperatures_maxs_PerthAir_2019$temperature)
+
+```
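+
+One way to extend this to all cities and all years is to group the data before taking the mean. The chunk below is a sketch on made-up data, not the code used to build the summary table. The table `toy_maxs` is hypothetical, and `na.rm = TRUE` is an assumption that some readings may be missing.
+
+```{r sketch_group_summary}
+library(dplyr)
+library(lubridate)
+
+# Hypothetical daily maximum temperatures for two cities.
+toy_maxs <- data.frame(
+  city_name = c("PERTH", "PERTH", "PERTH", "SYDNEY", "SYDNEY", "SYDNEY"),
+  date = as.Date(c("2018-01-01", "2019-01-01", "2019-01-02",
+                   "2018-01-01", "2019-01-01", "2019-01-02")),
+  temperature = c(30, 32, 34, 26, NA, 28),
+  stringsAsFactors = FALSE
+)
+
+# Extract the year, then group by city and year.
+toy_maxs <- mutate(toy_maxs, year = year(date))
+grouped <- group_by(toy_maxs, city_name, year)
+
+# One row per city and year; missing readings are dropped (assumption).
+yearly_maxs <- summarize(grouped, temperature = mean(temperature, na.rm = TRUE))
+```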
+
+<p style="color:blue">
+### Pipes
+
+What if you want to select and filter at the same time? There are three
+ways to do this: use intermediate steps, nested functions, or pipes.
+
+With intermediate steps, you create a temporary data frame and use
+that as input to the next function, like this:
+
+```{r, purl = FALSE}
+#surveys2 <- filter(surveys, weight < 5)
+#surveys_sml <- select(surveys2, species_id, sex, weight)
+```
+
+This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
+
+You can also nest functions (i.e. one function inside of another), like this:
+
+```{r, purl = FALSE}
+#surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
+```
+This is handy, but can be difficult to read if too many functions are nested, as
+R evaluates the expression from the inside out (in this case, filtering, then selecting).
+
+The last option, *pipes*, is a recent addition to R. Pipes let you take
+the output of one function and send it directly to the next, which is useful
+when you need to do many things to the same dataset. Pipes in R look like
+`%>%` and are made available via the **`magrittr`** package, installed automatically
+with **`dplyr`**.
+</p>
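+
+As a sketch of the same idea on temperature data, the chunk below chains the filtering steps from earlier with `%>%`. The data frame `toy_temperature` is made up for illustration.
+
+```{r sketch_pipes}
+library(dplyr)
+
+# Hypothetical readings, including one missing value.
+toy_temperature <- data.frame(
+  site_name = c("PERTH AIRPORT", "PERTH AIRPORT", "PERTH AIRPORT", "SYDNEY"),
+  temp_type = c("max", "max", "min", "max"),
+  temperature = c(32, NA, 18, 27),
+  stringsAsFactors = FALSE
+)
+
+# Each step receives the result of the previous one as its first argument.
+perth_mean_max <- toy_temperature %>%
+  filter(temp_type == "max", site_name == "PERTH AIRPORT") %>%
+  filter(!is.na(temperature)) %>%
+  summarize(temperature = mean(temperature))
+```
+
+No intermediate objects are created, and the steps read from top to bottom in the order they run.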
+
+
diff --git a/Oz_fires_DC_remix.html b/Oz_fires_DC_remix.html
diff --git a/data/data_mods.R b/data/data_mods.R
@@ -28,3 +28,8 @@ yearly_temp <- temperature %>%
 write_csv(yearly_temp,path="data/Oz_mean_temp.csv")
 ggplot(yearly_temp, aes(year,temperature,color=city_name)) +
   geom_line()
+
+temperatures_maxs <- filter(temperature,temp_type == "max")
+temperatures_maxs %>% filter(site_name == "PERTH AIRPORT") %>%
+  filter(!is.na(temperature)) %>%
+  summarize(temperature = mean(temperature))