Skip to content

Commit

Permalink
add code example
Browse files Browse the repository at this point in the history
  • Loading branch information
aammd committed May 29, 2024
1 parent 38583b3 commit 931419d
Show file tree
Hide file tree
Showing 11 changed files with 358 additions and 7 deletions.
4 changes: 2 additions & 2 deletions _freeze/index/execute-results/html.json
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
{
"hash": "9430ac80ce935232c59b2e0ecf0fbfaf",
"hash": "2bc3011fa8771e57561de03d64ebfe98",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"secret_weapon_workshop\"\neditor_options: \n chunk_output_type: console\n---\n\n\nThis is a Quarto website.\n\nTo learn more about Quarto websites visit <https://quarto.org/docs/websites>.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the ggplot2 library\nlibrary(ggplot2)\n\n# Create a data frame with a single row for the rectangle\ndata <- data.frame(\n xmin = 0,\n xmax = 1,\n ymin = 0,\n ymax = 1\n)\n\n# Create the plot\nggplot(data) +\n geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),\n fill = \"lightgreen\", color = \"darkgreen\", \n alpha = 0.5, \n inherit.aes = FALSE, \n size = 1, \n radius = 0.1) + # Adjust radius for rounded corners\n theme_void() +\n theme(panel.background = element_rect(fill = NA)) # Set background to transparent\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), :\nIgnoring unknown parameters: `radius`\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-1-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the required libraries\nlibrary(ggplot2)\nlibrary(gganimate)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nNo renderer backend detected. gganimate will default to writing frames to separate files\nConsider installing:\n- the `gifski` package for gif output\n- the `av` package for video output\nand restarting the R session\n```\n\n\n:::\n\n```{.r .cell-code}\n# Define the gap size\ngap_size <- 0.02\n\n# Create data for the initial state (one rectangle)\ndata_initial <- data.frame(\n xmin = 0,\n xmax = 1,\n ymin = 0,\n ymax = 1,\n id = 1\n)\n\n# Create data for the final state (five rectangles with gaps)\ndata_final <- data.frame(\n xmin = rep(0, 5),\n xmax = rep(1, 5),\n ymin = seq(0, 0.8, by = 0.2) + gap_size / 2,\n ymax = seq(0.2, 1, by = 0.2) - gap_size / 2,\n id = 1:5\n)\n\n# Combine both datasets and add a state column\ndata <- rbind(\n cbind(data_initial, state = \"one\"),\n cbind(data_final, state = \"five\")\n)\n\n# Create the plot\np <- ggplot(data) +\n geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, group = id),\n fill = \"lightgreen\", color = \"darkgreen\", \n alpha = 0.5, \n inherit.aes = FALSE, \n size = 1) + \n theme_void() +\n theme(panel.background = element_rect(fill = NA)) + # Set background to transparent\n transition_states(state, transition_length = 2, state_length = 1) +\n ease_aes('cubic-in-out')\n\n# Animate the plot\nanimate(p, nframes = 10, width = 600, height = 600)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: No renderer available. Please install the gifski, av, or magick package to\ncreate animated output\n```\n\n\n:::\n:::\n",
"markdown": "---\ntitle: \"Secret weapon workshop\"\neditor_options: \n chunk_output_type: console\n---\n\n\n## What is the secret weapon? \n\nIn many ways, ecology is the discipline of replication. \nMost ecological studies have multiple species, multiple sites, or multiple time points in their study -- almost by definition. \nAs a result, our analyses are almost always designed to include replication. The most common tool that people use is the multi-level model (AKA the hierarchical model). \nThese models are populuare, useful, and very flexible; but they have the weakness of being very complex, difficult to interpret, and often difficult to write correctly.\n\nAside from the technical complexity, there's also conceptual complexity when ecological data has many dimensions of space, time, species, etc. It can be quite hard to visualize and understand what's happening. Also, we may have hypotheses that operate at multiple levels. We may have one hypothesis for what happens within any group, for example, across individuals or species, and another hypothesis for what happens across sites or over time.\n\nEnter the secret weapon. The secret weapon is the name [Andrew Gelman ](https://statmodeling.stat.columbia.edu/2005/03/07/the_secret_weap/). Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.\n\nThen we can either analyze the summary statistics in a second study, which is a technique sometimes called \"statistics on statistics\", or go on to build a multilevel model. We may even decide that this is a sufficient endpoint for our study.\n\nThis is a technique that everyone in this room can use right now and which I can confidently recommend as advice to anybody in college who's beginning a quantitative analysis project.\n\nThis workshop will be divided into 3 parts. First, we'll do a worked example using the `mite` data from the package `vegan`. In the second, we're going to split into an exercise section where we will look at different examples from some long-term datasets that I've found. And third, we will have time to work on our own datasets. \n\n\n## Worked example\n\n### Data preparation -- Center and scale your predictors!\n\nThis example uses the famous `mite` dataset, collected by Daniel Borcard in 1989 and included in the famous `vegan` package.\n\n[Slides introducing the dataset](https://bios2.github.io/hiermod/slides/01_Data/#/oribatid-mite)\n\nFirst, we'll load the data and do some reorganization and preparation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(mite, package = \"vegan\")\ndata(\"mite.env\", package = \"vegan\")\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(cmdstanr)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nThis is cmdstanr version 0.7.1\n- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr\n- CmdStan path: /home/andrew/software/cmdstan\n- CmdStan version: 2.34.1\n```\n\n\n:::\n\n```{.r .cell-code}\n# combine data and environment\nmite_data_long <- bind_cols(mite.env, mite) |> \n pivot_longer(Brachy:Trimalc2, names_to = \"spp\", values_to = \"abd\")\n```\n:::\n\n\nBecause we're ecologists, we'll do one of the most classic ecological models: a Binomial GLM, modelling the presence or absence of a species based on the water content of the soil.\n\n$$\n\\begin{align}\n\\text{Presence} &\\sim \\text{Binomial}\\left(\\frac{e^a}{1 + e^a}, 1\\right) \\\\\na &= \\beta_0 + \\beta_xx\n\\end{align}\n$$\n\nWhere $x$ is some predictor variable and $\\frac{e^a}{1 + e^a}$ is the logit link function.\nTo keep things simple and univariate, let's consider only water for $x$. \n\nFirst, a quick word about centering and scaling a predictor variable:\n\n1. I center the predictor by subtracting the mean. This changes the *intercept* of my linear predictor. It becomes the mean log-odds of occurrance when the water content is average\n2. I divide water content by 100. The dataset has units of **grams per Litre** of water (see `?vegan::mite.env` for more details). This is fine, but I don't think mites are able to sense differences as precise as a millimeter of water either way. by dividing by 10 I transform this into centilitres, which is more informative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_data_long_transformed <- mite_data_long |> \n mutate(presabs = as.numeric(abd>0),\n # center predictors\n water = (WatrCont - mean(WatrCont)) / 100\n )\n```\n:::\n\n\n\nThis makes the model:\n\n$$\n\\begin{align}\n\\text{Presence} &\\sim \\text{Binomial}\\left(\\frac{e^a}{1 + e^a}, 1\\right) \\\\\na &= \\beta_0 + \\beta_{\\text{water}}\\frac{\\text{water} - \\overline{\\text{water}}}{100}\n\\end{align}\n$$\n\n\n### The secret weapon: visually\n\nWe can already get a very good visual expression of the Secret Weapon approach using the power of ggplot2 and `stat_smooth()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_data_long_transformed |> \n ggplot(aes(x = water, y = presabs)) + \n geom_point() + \n stat_smooth(method = \"glm\", method.args = list(family = \"binomial\")) + \n facet_wrap(~spp)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\nsome things to notice about this figure:\n\n- the x-axis scale has been transformed from \"grams per litre\" to \"centilitres away from average\"\n- there is a ton of variation in how different species respond to water!\n\n### The secret weapon: in tidyverse code\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_many_glms <- mite_data_long_transformed |> \n nest_by(spp) |> \n mutate(logistic_regressions = list(\n glm(presabs ~ water,\n family = \"binomial\",\n data = data))) |> \n mutate(coefs = list(broom::tidy(logistic_regressions)))\n```\n:::\n\n\n:::{.callout-note}\n## Split-Apply-Combine\n\nTo explore this kind of thinking, we are going to use an approach sometimes called [\"split-apply-combine\"](https://vita.had.co.nz/papers/plyr.pdf)\n\nThere are many possible ways to do this in practice. We are using a technique here from the tidyverse, which you can [read more about](https://dplyr.tidyverse.org/articles/rowwise.html).\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_many_glm_coefs <- mite_many_glms |> \n select(-data, -logistic_regressions) |> \n unnest(coefs)\n\nmite_many_glm_coefs |> \n ggplot(aes(x = estimate, y = spp,\n xmin = estimate - std.error,\n xmax = estimate + std.error)) + \n geom_pointrange() + \n facet_wrap(~term, scales = \"free\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nAs you can see, some of these estimates are high, others low. We could also plot these as histograms to see this distribution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_many_glm_coefs |> \n ggplot(aes(x = estimate)) + \n geom_histogram(binwidth = .5) + \n facet_wrap(~term, scales = \"free\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nOnce again, the two parameters of this model represent:\n\n- *Intercept* The probability (in log-odds) of a species being present at the average water concentration. some species are common, others are rare.\n- *water* this is the change in probability (in log-odds) as water increases by one centilitre per litre of substrate.\n",
"supporting": [
"index_files"
],
Expand Down
Binary file added _freeze/index/figure-html/unnamed-chunk-3-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added _freeze/index/figure-html/unnamed-chunk-5-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added _freeze/index/figure-html/unnamed-chunk-6-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 931419d

Please sign in to comment.