update remixed instructions
rachelss committed Jan 9, 2020
1 parent 7c238d5 commit 3fa63ca
Showing 3 changed files with 267 additions and 7 deletions.
182 changes: 179 additions & 3 deletions Oz_fires_DC_remix.Rmd
@@ -49,6 +49,15 @@ temperature <- read_csv('https://tinyurl.com/Oz-fire-temp')
You can see some information about the data we have just loaded.
The name of each column is shown along with the type of data in that column.
The data are stored in a format we call a data frame.

<p style="color:blue">
You will see the message `Parsed with column specification`, followed by each column name and its data type.
When you execute `read_csv` on a data file, it looks through the first 1000 rows of each column and
guesses the data type for each column as it reads it into R. For example, in this dataset, `read_csv`
reads some columns as `col_double` (a numeric data type) and others as `col_character` (text). You have the
option to specify the data type for a column manually by using the `col_types` argument in `read_csv`.
</p>
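
As a minimal sketch of the `col_types` option mentioned above, the chunk below writes a tiny hypothetical CSV file (the file contents and the name `toy_temperature` are made up for illustration, not taken from the real dataset) and reads it back with the column types declared explicitly.

```r
library(readr)

# Hypothetical two-column file standing in for the real temperature data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("city_name,temperature", "PERTH,31.2", "SYDNEY,27.5"), tmp)

# Declare the column types up front instead of letting read_csv guess them.
toy_temperature <- read_csv(
  tmp,
  col_types = cols(
    city_name = col_character(),
    temperature = col_double()
  )
)

# Inspect the class of each column.
sapply(toy_temperature, class)
```

Declaring types this way also suppresses the column specification message, since nothing has to be guessed.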

For more details on this dataset see the [Tidy Tuesday site](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-07).

<p style="color:blue">
@@ -84,7 +93,7 @@ yearly_temp <- read_csv('https://tinyurl.com/Oz-mean-temp')
Now we'll plot the temperature as a function of time.

<p style="color:blue">
## Plotting with **`ggplot2`**
### Plotting with **`ggplot2`**

**`ggplot2`** is a plotting package that makes it simple to create complex plots
from data in a data frame. It provides a more programmatic interface for
@@ -153,13 +162,180 @@ ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
geom_point()+geom_line()
```

In a few quick commands we can already plot temperature and observe how it's been increasing.
Now let's tidy this into a publication-quality plot.
```{r t5}
ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
geom_line() +
labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
theme_bw()
```

In a few quick commands we can already plot temperature and observe how it's been increasing.

**Notes**

<span style="color: blue">
- Anything you put in the `ggplot()` function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in `aes()`.<br>
- You can also specify mappings for a given geom independently of the mappings defined globally in the `ggplot()` function.<br>
- The `+` sign used to add new layers must be placed at the end of the line containing the *previous* layer. If, instead, the `+` sign is added at the beginning of the line containing the new layer, **`ggplot2`** will not add the new layer and will return an error message.
</span>
</span>
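
The notes above can be sketched on a small example. The data frame `toy_temp` is hypothetical (made-up values that reuse the column names of `yearly_temp`); the point is only to show a global mapping, a geom-specific mapping, and where the `+` goes.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as yearly_temp.
toy_temp <- data.frame(
  year = rep(2017:2019, times = 2),
  temperature = c(24.1, 24.6, 25.3, 22.0, 22.4, 23.1),
  city_name = rep(c("PERTH", "SYDNEY"), each = 3)
)

# x and y are mapped globally in ggplot(), so both geoms see them.
# color is mapped only inside geom_line(), so the points stay black.
# Each `+` sits at the end of the line of the previous layer.
p <- ggplot(toy_temp, aes(year, temperature)) +
  geom_line(aes(color = city_name)) +
  geom_point()
```

Printing `p` draws the plot. Moving a `+` to the start of the next line would end the expression early, which is the error the last note warns about.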

## Manipulating data

In the prior section I gave you a summary table of temperature data.
Let's consider how you could generate this summary table and do other data manipulation
given our original datasets.

<p style="color:blue">
### Data Manipulation using **`dplyr`** and **`tidyr`**

Bracket subsetting is handy, but it can be cumbersome and difficult to read,
especially for complicated operations. Enter **`dplyr`**. **`dplyr`** is a package for
making tabular data manipulation easier. It pairs nicely with **`tidyr`** which enables you to swiftly convert between different data formats for plotting and analysis.

Packages in R are basically sets of additional functions that let you do more
stuff. The functions we've been using so far, like `str()` or `data.frame()`,
come built into R; packages give you access to more of them. Before you use a
package for the first time you need to install it on your machine, and then you
should import it in every subsequent R session when you need it. You should
already have installed the **`tidyverse`** package. This is an
"umbrella-package" that installs several packages useful for data analysis which
work together well such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc.


The **`tidyverse`** package tries to address 3 common issues that arise when
doing data analysis with some of the functions that come with R:

1. The results from a base R function sometimes depend on the type of data.
2. R expressions are sometimes used in a non-standard way, which can be
confusing for new learners.
3. Some functions have hidden arguments whose default behavior new learners
are not aware of.

The package **`dplyr`** provides easy tools for the most common data manipulation
tasks. It is built to work directly with data frames, with many common tasks
optimized by being written in a compiled language (C++). An additional feature is the
ability to work directly with data stored in an external database. The benefits of
doing this are that the data can be managed natively in a relational database,
queries can be conducted on that database, and only the results of the query are
returned.

This addresses a common problem with R in that all operations are conducted
in-memory and thus the amount of data you can work with is limited by available
memory. The database connections essentially remove that limitation in that you
can connect to a database of many hundreds of GB, conduct queries on it directly, and pull
back into R only what you need for analysis.

The package **`tidyr`** addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is non-trivial, and **`tidyr`** gives you tools for this and more sophisticated data manipulation.

To learn more about **`dplyr`** and **`tidyr`** after the workshop, you may want to check out this
[handy data transformation with **`dplyr`** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) and this [one about **`tidyr`**](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf).

We're going to learn some of the most common **`dplyr`** functions:

- `select()`: subset columns
- `filter()`: subset rows on conditions
- `mutate()`: create new columns by using information from other columns
- `group_by()` and `summarize()`: create summary statistics on grouped data
- `arrange()`: sort results
- `count()`: count discrete values
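
As a rough preview of the functions listed above, here is each one applied to a tiny hypothetical data frame. The object `toy` and its values are invented for illustration; only the column names echo the temperature data.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the temperature data frame.
toy <- tibble(
  city_name = c("PERTH", "PERTH", "SYDNEY", "SYDNEY"),
  temp_type = c("max", "min", "max", "min"),
  temperature = c(31.0, 18.0, 27.0, 19.0)
)

# select(): keep only some columns
cols_only <- select(toy, city_name, temperature)

# filter(): keep only rows that meet a condition
max_only <- filter(toy, temp_type == "max")

# mutate(): add a column computed from an existing one (Celsius to Fahrenheit)
with_f <- mutate(toy, temp_f = temperature * 9 / 5 + 32)

# group_by() and summarize(): one summary row per city
by_city <- summarize(group_by(toy, city_name), mean_temp = mean(temperature))

# arrange(): sort rows, here from hottest to coolest
sorted <- arrange(toy, desc(temperature))

# count(): number of rows for each value of temp_type
n_by_type <- count(toy, temp_type)
```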

### Selecting columns and filtering rows

To choose rows based on a specific criterion, use `filter()`.
</p>

For our data let's speculate that we may be seeing an increase in fires due to
changes in maximum daily temperatures.
After all, it could be spiking temperatures that allow fires to occur,
and that wouldn't be reflected well in the mean temperature for a day.
In this case we need to filter our data to include only rows where the
`temp_type` column is `"max"`.

```{r f}
temperatures_maxs <- filter(temperature,temp_type == "max")
```

Now consider what summary information you want.
You probably want the average maximum temperature for each city for each year
to look at changes over time.
Let's look at how we would calculate the average maximum temperature for one city for one year.
Then we'll be able to extend this to all cities and all years.

First filter your data to include only data from 2019 at "PERTH AIRPORT".
This is a challenge for you.

```{r f1}
temperatures_maxs_PerthAir <- filter(temperatures_maxs,site_name == "PERTH AIRPORT")
```

You probably got stuck on how to tell whether a row came from 2019.
That information is in the date column but you have to extract it.
This is a great lesson.
You first need to think about what you want your data to look like.
Once you know what you want you can figure out how to communicate that to the computer.

In this case we'll use the `lubridate` package to process date information.

Challenge: load the lubridate library.

```{r pack2, warning=FALSE}
library(lubridate)
```

You can use the `year` function to extract the year from the date column.
Assign that to a new column in your data frame.

```{r d}
temperatures_maxs_PerthAir$year <- year(temperatures_maxs_PerthAir$date)
```
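
The same step can also be written with `mutate()`, one of the **`dplyr`** functions listed earlier. This is only a sketch on a hypothetical data frame (`toy_perth` and its values are made up), not a replacement for the chunk above.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows standing in for temperatures_maxs_PerthAir.
toy_perth <- tibble(
  date = as.Date(c("2018-12-31", "2019-01-01", "2019-07-15")),
  temperature = c(35.2, 36.0, 18.4)
)

# year() pulls the calendar year out of each date;
# mutate() stores it in a new column called `year`.
toy_perth <- mutate(toy_perth, year = year(date))
```

Both versions add the same `year` column; `mutate()` simply avoids repeating the data frame name with `$`.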

Now you can filter for data from 2019.
This is a challenge.

```{r f2}
temperatures_maxs_PerthAir_2019 <- filter(temperatures_maxs_PerthAir,year == 2019)
```

Now you can calculate the mean max temperature for this site in this year using the mean
function.

```{r m}
mean(temperatures_maxs_PerthAir_2019$temperature)
```
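
One thing to watch for: `mean()` returns `NA` if any value is missing. The `data_mods.R` script in this commit drops missing temperatures before averaging, which suggests the real data contain some. A small sketch with made-up numbers (the vector `toy_max` is hypothetical) shows the difference:

```r
# Hypothetical daily maximum temperatures, one of them missing.
toy_max <- c(30.5, NA, 33.5)

# With a missing value present, the plain mean is NA.
mean_with_na <- mean(toy_max)

# na.rm = TRUE drops missing values before averaging.
mean_without_na <- mean(toy_max, na.rm = TRUE)
```

If your result above is `NA`, adding `na.rm = TRUE` to the call is one way to handle it.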

<p style="color:blue">
### Pipes

What if you want to select and filter at the same time? There are three
ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use
that as input to the next function, like this:

```{r, purl = FALSE}
#surveys2 <- filter(surveys, weight < 5)
#surveys_sml <- select(surveys2, species_id, sex, weight)
```

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

```{r, purl = FALSE}
#surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
```
This is handy, but can be difficult to read if too many functions are nested, as
R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, *pipes*, is a recent addition to R. Pipes let you take
the output of one function and send it directly to the next, which is useful
when you need to do many things to the same dataset. Pipes in R look like
`%>%` and are made available via the **`magrittr`** package, installed automatically
with **`dplyr`**.
</p>
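
To make this concrete for the temperature data, here is a sketch of the filtering and averaging steps chained with pipes. The data frame `toy` is hypothetical (invented values using the real column names), and the chain mirrors the one in `data_mods.R`.

```r
library(dplyr)

# Hypothetical rows using the column names of the temperature data.
toy <- tibble(
  site_name = c("PERTH AIRPORT", "PERTH AIRPORT", "PERTH AIRPORT",
                "PERTH AIRPORT", "SYDNEY AIRPORT"),
  temp_type = c("max", "max", "max", "min", "max"),
  temperature = c(30, 34, NA, 15, 27)
)

# Each %>% passes the result on its left to the function on its right,
# so the steps read top to bottom in the order they happen.
perth_mean_max <- toy %>%
  filter(temp_type == "max") %>%
  filter(site_name == "PERTH AIRPORT") %>%
  filter(!is.na(temperature)) %>%
  summarize(temperature = mean(temperature))
```

No intermediate objects are created, and the order of operations is easier to follow than in the nested version.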


87 changes: 83 additions & 4 deletions Oz_fires_DC_remix.html

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions data/data_mods.R
@@ -28,3 +28,8 @@ yearly_temp <- temperature %>%
write_csv(yearly_temp,path="data/Oz_mean_temp.csv")
ggplot(yearly_temp, aes(year,temperature,color=city_name)) +
geom_line()

temperatures_maxs <- filter(temperature,temp_type == "max")
temperatures_maxs %>% filter(site_name == "PERTH AIRPORT") %>%
filter(!is.na(temperature)) %>%
summarize(temperature = mean(temperature))
