diff --git a/_freeze/index/execute-results/html.json b/_freeze/index/execute-results/html.json
index bd57d26..4f9ea5f 100644
--- a/_freeze/index/execute-results/html.json
+++ b/_freeze/index/execute-results/html.json
@@ -1,8 +1,8 @@
{
- "hash": "9430ac80ce935232c59b2e0ecf0fbfaf",
+ "hash": "2bc3011fa8771e57561de03d64ebfe98",
"result": {
"engine": "knitr",
- "markdown": "---\ntitle: \"secret_weapon_workshop\"\neditor_options: \n chunk_output_type: console\n---\n\n\nThis is a Quarto website.\n\nTo learn more about Quarto websites visit .\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the ggplot2 library\nlibrary(ggplot2)\n\n# Create a data frame with a single row for the rectangle\ndata <- data.frame(\n xmin = 0,\n xmax = 1,\n ymin = 0,\n ymax = 1\n)\n\n# Create the plot\nggplot(data) +\n geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),\n fill = \"lightgreen\", color = \"darkgreen\", \n alpha = 0.5, \n inherit.aes = FALSE, \n size = 1, \n radius = 0.1) + # Adjust radius for rounded corners\n theme_void() +\n theme(panel.background = element_rect(fill = NA)) # Set background to transparent\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), :\nIgnoring unknown parameters: `radius`\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-1-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the required libraries\nlibrary(ggplot2)\nlibrary(gganimate)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nNo renderer backend detected. gganimate will default to writing frames to separate files\nConsider installing:\n- the `gifski` package for gif output\n- the `av` package for video output\nand restarting the R session\n```\n\n\n:::\n\n```{.r .cell-code}\n# Define the gap size\ngap_size <- 0.02\n\n# Create data for the initial state (one rectangle)\ndata_initial <- data.frame(\n xmin = 0,\n xmax = 1,\n ymin = 0,\n ymax = 1,\n id = 1\n)\n\n# Create data for the final state (five rectangles with gaps)\ndata_final <- data.frame(\n xmin = rep(0, 5),\n xmax = rep(1, 5),\n ymin = seq(0, 0.8, by = 0.2) + gap_size / 2,\n ymax = seq(0.2, 1, by = 0.2) - gap_size / 2,\n id = 1:5\n)\n\n# Combine both datasets and add a state column\ndata <- rbind(\n cbind(data_initial, state = \"one\"),\n cbind(data_final, state = \"five\")\n)\n\n# Create the plot\np <- ggplot(data) +\n geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, group = id),\n fill = \"lightgreen\", color = \"darkgreen\", \n alpha = 0.5, \n inherit.aes = FALSE, \n size = 1) + \n theme_void() +\n theme(panel.background = element_rect(fill = NA)) + # Set background to transparent\n transition_states(state, transition_length = 2, state_length = 1) +\n ease_aes('cubic-in-out')\n\n# Animate the plot\nanimate(p, nframes = 10, width = 600, height = 600)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: No renderer available. Please install the gifski, av, or magick package to\ncreate animated output\n```\n\n\n:::\n:::\n",
+ "markdown": "---\ntitle: \"Secret weapon workshop\"\neditor_options: \n chunk_output_type: console\n---\n\n\n## What is the secret weapon? \n\nIn many ways, ecology is the discipline of replication. \nMost ecological studies have multiple species, multiple sites, or multiple time points in their study -- almost by definition. \nAs a result, our analyses are almost always designed to include replication. The most common tool that people use is the multi-level model (AKA the hierarchical model). \nThese models are populuare, useful, and very flexible; but they have the weakness of being very complex, difficult to interpret, and often difficult to write correctly.\n\nAside from the technical complexity, there's also conceptual complexity when ecological data has many dimensions of space, time, species, etc. It can be quite hard to visualize and understand what's happening. Also, we may have hypotheses that operate at multiple levels. We may have one hypothesis for what happens within any group, for example, across individuals or species, and another hypothesis for what happens across sites or over time.\n\nEnter the secret weapon. The secret weapon is the name [Andrew Gelman ](https://statmodeling.stat.columbia.edu/2005/03/07/the_secret_weap/). Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.\n\nThen we can either analyze the summary statistics in a second study, which is a technique sometimes called \"statistics on statistics\", or go on to build a multilevel model. We may even decide that this is a sufficient endpoint for our study.\n\nThis is a technique that everyone in this room can use right now and which I can confidently recommend as advice to anybody in college who's beginning a quantitative analysis project.\n\nThis workshop will be divided into 3 parts. First, we'll do a worked example using the `mite` data from the package `vegan`. In the second, we're going to split into an exercise section where we will look at different examples from some long-term datasets that I've found. And third, we will have time to work on our own datasets. \n\n\n## Worked example\n\n### Data preparation -- Center and scale your predictors!\n\nThis example uses the famous `mite` dataset, collected by Daniel Borcard in 1989 and included in the famous `vegan` package.\n\n[Slides introducing the dataset](https://bios2.github.io/hiermod/slides/01_Data/#/oribatid-mite)\n\nFirst, we'll load the data and do some reorganization and preparation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(mite, package = \"vegan\")\ndata(\"mite.env\", package = \"vegan\")\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(cmdstanr)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nThis is cmdstanr version 0.7.1\n- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr\n- CmdStan path: /home/andrew/software/cmdstan\n- CmdStan version: 2.34.1\n```\n\n\n:::\n\n```{.r .cell-code}\n# combine data and environment\nmite_data_long <- bind_cols(mite.env, mite) |> \n pivot_longer(Brachy:Trimalc2, names_to = \"spp\", values_to = \"abd\")\n```\n:::\n\n\nBecause we're ecologists, we'll do one of the most classic ecological models: a Binomial GLM, modelling the presence or absence of a species based on the water content of the soil.\n\n$$\n\\begin{align}\n\\text{Presence} &\\sim \\text{Binomial}\\left(\\frac{e^a}{1 + e^a}, 1\\right) \\\\\na &= \\beta_0 + \\beta_xx\n\\end{align}\n$$\n\nWhere $x$ is some predictor variable and $\\frac{e^a}{1 + e^a}$ is the logit link function.\nTo keep things simple and univariate, let's consider only water for $x$. \n\nFirst, a quick word about centering and scaling a predictor variable:\n\n1. I center the predictor by subtracting the mean. This changes the *intercept* of my linear predictor. It becomes the mean log-odds of occurrance when the water content is average\n2. I divide water content by 100. The dataset has units of **grams per Litre** of water (see `?vegan::mite.env` for more details). This is fine, but I don't think mites are able to sense differences as precise as a millimeter of water either way. by dividing by 10 I transform this into centilitres, which is more informative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_data_long_transformed <- mite_data_long |> \n mutate(presabs = as.numeric(abd>0),\n # center predictors\n water = (WatrCont - mean(WatrCont)) / 100\n )\n```\n:::\n\n\n\nThis makes the model:\n\n$$\n\\begin{align}\n\\text{Presence} &\\sim \\text{Binomial}\\left(\\frac{e^a}{1 + e^a}, 1\\right) \\\\\na &= \\beta_0 + \\beta_{\\text{water}}\\frac{\\text{water} - \\overline{\\text{water}}}{100}\n\\end{align}\n$$\n\n\n### The secret weapon: visually\n\nWe can already get a very good visual expression of the Secret Weapon approach using the power of ggplot2 and `stat_smooth()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_data_long_transformed |> \n ggplot(aes(x = water, y = presabs)) + \n geom_point() + \n stat_smooth(method = \"glm\", method.args = list(family = \"binomial\")) + \n facet_wrap(~spp)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\nsome things to notice about this figure:\n\n- the x-axis scale has been transformed from \"grams per litre\" to \"centilitres away from average\"\n- there is a ton of variation in how different species respond to water!\n\n### The secret weapon: in tidyverse code\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_many_glms <- mite_data_long_transformed |> \n nest_by(spp) |> \n mutate(logistic_regressions = list(\n glm(presabs ~ water,\n family = \"binomial\",\n data = data))) |> \n mutate(coefs = list(broom::tidy(logistic_regressions)))\n```\n:::\n\n\n:::{.callout-note}\n## Split-Apply-Combine\n\nTo explore this kind of thinking, we are going to use an approach sometimes called [\"split-apply-combine\"](https://vita.had.co.nz/papers/plyr.pdf)\n\nThere are many possible ways to do this in practice. We are using a technique here from the tidyverse, which you can [read more about](https://dplyr.tidyverse.org/articles/rowwise.html).\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_many_glm_coefs <- mite_many_glms |> \n select(-data, -logistic_regressions) |> \n unnest(coefs)\n\nmite_many_glm_coefs |> \n ggplot(aes(x = estimate, y = spp,\n xmin = estimate - std.error,\n xmax = estimate + std.error)) + \n geom_pointrange() + \n facet_wrap(~term, scales = \"free\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nAs you can see, some of these estimates are high, others low. We could also plot these as histograms to see this distribution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmite_many_glm_coefs |> \n ggplot(aes(x = estimate)) + \n geom_histogram(binwidth = .5) + \n facet_wrap(~term, scales = \"free\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nOnce again, the two parameters of this model represent:\n\n- *Intercept* The probability (in log-odds) of a species being present at the average water concentration. some species are common, others are rare.\n- *water* this is the change in probability (in log-odds) as water increases by one centilitre per litre of substrate.\n",
"supporting": [
"index_files"
],
diff --git a/_freeze/index/figure-html/unnamed-chunk-3-1.png b/_freeze/index/figure-html/unnamed-chunk-3-1.png
new file mode 100644
index 0000000..113ff7f
Binary files /dev/null and b/_freeze/index/figure-html/unnamed-chunk-3-1.png differ
diff --git a/_freeze/index/figure-html/unnamed-chunk-5-1.png b/_freeze/index/figure-html/unnamed-chunk-5-1.png
new file mode 100644
index 0000000..71f3b4c
Binary files /dev/null and b/_freeze/index/figure-html/unnamed-chunk-5-1.png differ
diff --git a/_freeze/index/figure-html/unnamed-chunk-6-1.png b/_freeze/index/figure-html/unnamed-chunk-6-1.png
new file mode 100644
index 0000000..15621b9
Binary files /dev/null and b/_freeze/index/figure-html/unnamed-chunk-6-1.png differ
diff --git a/docs/index.html b/docs/index.html
index ab5f916..3da78c1 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -20,6 +20,40 @@
margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */
vertical-align: middle;
}
+/* CSS for syntax highlighting */
+pre > code.sourceCode { white-space: pre; position: relative; }
+pre > code.sourceCode > span { line-height: 1.25; }
+pre > code.sourceCode > span:empty { height: 1.2em; }
+.sourceCode { overflow: visible; }
+code.sourceCode > span { color: inherit; text-decoration: inherit; }
+div.sourceCode { margin: 1em 0; }
+pre.sourceCode { margin: 0; }
+@media screen {
+div.sourceCode { overflow: auto; }
+}
+@media print {
+pre > code.sourceCode { white-space: pre-wrap; }
+pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
+}
+pre.numberSource code
+ { counter-reset: source-line 0; }
+pre.numberSource code > span
+ { position: relative; left: -4em; counter-increment: source-line; }
+pre.numberSource code > span > a:first-child::before
+ { content: counter(source-line);
+ position: relative; left: -1em; text-align: right; vertical-align: baseline;
+ border: none; display: inline-block;
+ -webkit-touch-callout: none; -webkit-user-select: none;
+ -khtml-user-select: none; -moz-user-select: none;
+ -ms-user-select: none; user-select: none;
+ padding: 0 4px; width: 4em;
+ }
+pre.numberSource { margin-left: 3em; padding-left: 4px; }
+div.sourceCode
+ { }
+@media screen {
+pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
+}
@@ -67,6 +101,35 @@
}
}
+
+
+
+
@@ -113,6 +176,12 @@
In many ways, ecology is the discipline of replication. Most ecological studies have multiple species, multiple sites, or multiple time points in their study – almost by definition. As a result, our analyses are almost always designed to include replication. The most common tool that people use is the multi-level model (AKA the hierarchical model). These models are populuare, useful, and very flexible; but they have the weakness of being very complex, difficult to interpret, and often difficult to write correctly.
Aside from the technical complexity, there’s also conceptual complexity when ecological data has many dimensions of space, time, species, etc. It can be quite hard to visualize and understand what’s happening. Also, we may have hypotheses that operate at multiple levels. We may have one hypothesis for what happens within any group, for example, across individuals or species, and another hypothesis for what happens across sites or over time.
-
Enter the secret weapon. The secret weapon is the name Andrew Gelman gives to a very simple idea. Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.
+
Enter the secret weapon. The secret weapon is the name Andrew Gelman. Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.
Then we can either analyze the summary statistics in a second study, which is a technique sometimes called “statistics on statistics”, or go on to build a multilevel model. We may even decide that this is a sufficient endpoint for our study.
This is a technique that everyone in this room can use right now and which I can confidently recommend as advice to anybody in college who’s beginning a quantitative analysis project.
This workshop will be divided into 3 parts. First, we’ll do a worked example using the mite data from the package vegan. In the second, we’re going to split into an exercise section where we will look at different examples from some long-term datasets that I’ve found. And third, we will have time to work on our own datasets.
+
+
+
Worked example
+
+
Data preparation – Center and scale your predictors!
+
This example uses the famous mite dataset, collected by Daniel Borcard in 1989 and included in the famous vegan package.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+✔ dplyr 1.1.4 ✔ readr 2.1.5
+✔ forcats 1.0.0 ✔ stringr 1.5.1
+✔ ggplot2 3.5.1 ✔ tibble 3.2.1
+✔ lubridate 1.9.3 ✔ tidyr 1.3.1
+✔ purrr 1.0.2
+── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+✖ dplyr::filter() masks stats::filter()
+✖ dplyr::lag() masks stats::lag()
+ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+
+
library(cmdstanr)
+
+
This is cmdstanr version 0.7.1
+- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
+- CmdStan path: /home/andrew/software/cmdstan
+- CmdStan version: 2.34.1
+
+
# combine data and environment
+mite_data_long <-bind_cols(mite.env, mite) |>
+pivot_longer(Brachy:Trimalc2, names_to ="spp", values_to ="abd")
+
+
Because we’re ecologists, we’ll do one of the most classic ecological models: a Binomial GLM, modelling the presence or absence of a species based on the water content of the soil.
Where \(x\) is some predictor variable and \(\frac{e^a}{1 + e^a}\) is the logit link function. To keep things simple and univariate, let’s consider only water for \(x\).
+
First, a quick word about centering and scaling a predictor variable:
+
+
I center the predictor by subtracting the mean. This changes the intercept of my linear predictor. It becomes the mean log-odds of occurrance when the water content is average
+
I divide water content by 100. The dataset has units of grams per Litre of water (see ?vegan::mite.env for more details). This is fine, but I don’t think mites are able to sense differences as precise as a millimeter of water either way. by dividing by 10 I transform this into centilitres, which is more informative.
Once again, the two parameters of this model represent:
+
+
Intercept The probability (in log-odds) of a species being present at the average water concentration. some species are common, others are rare.
+
water this is the change in probability (in log-odds) as water increases by one centilitre per litre of substrate.
+
+
diff --git a/index_files/figure-html/unnamed-chunk-1-1.png b/docs/index_files/figure-html/unnamed-chunk-1-1.png
similarity index 100%
rename from index_files/figure-html/unnamed-chunk-1-1.png
rename to docs/index_files/figure-html/unnamed-chunk-1-1.png
diff --git a/docs/index_files/figure-html/unnamed-chunk-3-1.png b/docs/index_files/figure-html/unnamed-chunk-3-1.png
new file mode 100644
index 0000000..113ff7f
Binary files /dev/null and b/docs/index_files/figure-html/unnamed-chunk-3-1.png differ
diff --git a/docs/index_files/figure-html/unnamed-chunk-5-1.png b/docs/index_files/figure-html/unnamed-chunk-5-1.png
new file mode 100644
index 0000000..71f3b4c
Binary files /dev/null and b/docs/index_files/figure-html/unnamed-chunk-5-1.png differ
diff --git a/docs/index_files/figure-html/unnamed-chunk-6-1.png b/docs/index_files/figure-html/unnamed-chunk-6-1.png
new file mode 100644
index 0000000..15621b9
Binary files /dev/null and b/docs/index_files/figure-html/unnamed-chunk-6-1.png differ
diff --git a/docs/search.json b/docs/search.json
index 183ace4..ba3d671 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -4,14 +4,14 @@
"href": "index.html",
"title": "Secret weapon workshop",
"section": "",
- "text": "In many ways, ecology is the discipline of replication. Most ecological studies have multiple species, multiple sites, or multiple time points in their study – almost by definition. As a result, our analyses are almost always designed to include replication. The most common tool that people use is the multi-level model (AKA the hierarchical model). These models are populuare, useful, and very flexible; but they have the weakness of being very complex, difficult to interpret, and often difficult to write correctly.\nAside from the technical complexity, there’s also conceptual complexity when ecological data has many dimensions of space, time, species, etc. It can be quite hard to visualize and understand what’s happening. Also, we may have hypotheses that operate at multiple levels. We may have one hypothesis for what happens within any group, for example, across individuals or species, and another hypothesis for what happens across sites or over time.\nEnter the secret weapon. The secret weapon is the name Andrew Gelman gives to a very simple idea. Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.\nThen we can either analyze the summary statistics in a second study, which is a technique sometimes called “statistics on statistics”, or go on to build a multilevel model. We may even decide that this is a sufficient endpoint for our study.\nThis is a technique that everyone in this room can use right now and which I can confidently recommend as advice to anybody in college who’s beginning a quantitative analysis project.\nThis workshop will be divided into 3 parts. First, we’ll do a worked example using the mite data from the package vegan. In the second, we’re going to split into an exercise section where we will look at different examples from some long-term datasets that I’ve found. And third, we will have time to work on our own datasets."
+ "text": "In many ways, ecology is the discipline of replication. Most ecological studies have multiple species, multiple sites, or multiple time points in their study – almost by definition. As a result, our analyses are almost always designed to include replication. The most common tool that people use is the multi-level model (AKA the hierarchical model). These models are populuare, useful, and very flexible; but they have the weakness of being very complex, difficult to interpret, and often difficult to write correctly.\nAside from the technical complexity, there’s also conceptual complexity when ecological data has many dimensions of space, time, species, etc. It can be quite hard to visualize and understand what’s happening. Also, we may have hypotheses that operate at multiple levels. We may have one hypothesis for what happens within any group, for example, across individuals or species, and another hypothesis for what happens across sites or over time.\nEnter the secret weapon. The secret weapon is the name Andrew Gelman. Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.\nThen we can either analyze the summary statistics in a second study, which is a technique sometimes called “statistics on statistics”, or go on to build a multilevel model. We may even decide that this is a sufficient endpoint for our study.\nThis is a technique that everyone in this room can use right now and which I can confidently recommend as advice to anybody in college who’s beginning a quantitative analysis project.\nThis workshop will be divided into 3 parts. First, we’ll do a worked example using the mite data from the package vegan. In the second, we’re going to split into an exercise section where we will look at different examples from some long-term datasets that I’ve found. And third, we will have time to work on our own datasets."
},
{
"objectID": "index.html#what-is-the-secret-weapon",
"href": "index.html#what-is-the-secret-weapon",
"title": "Secret weapon workshop",
"section": "",
- "text": "In many ways, ecology is the discipline of replication. Most ecological studies have multiple species, multiple sites, or multiple time points in their study – almost by definition. As a result, our analyses are almost always designed to include replication. The most common tool that people use is the multi-level model (AKA the hierarchical model). These models are populuare, useful, and very flexible; but they have the weakness of being very complex, difficult to interpret, and often difficult to write correctly.\nAside from the technical complexity, there’s also conceptual complexity when ecological data has many dimensions of space, time, species, etc. It can be quite hard to visualize and understand what’s happening. Also, we may have hypotheses that operate at multiple levels. We may have one hypothesis for what happens within any group, for example, across individuals or species, and another hypothesis for what happens across sites or over time.\nEnter the secret weapon. The secret weapon is the name Andrew Gelman gives to a very simple idea. Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.\nThen we can either analyze the summary statistics in a second study, which is a technique sometimes called “statistics on statistics”, or go on to build a multilevel model. We may even decide that this is a sufficient endpoint for our study.\nThis is a technique that everyone in this room can use right now and which I can confidently recommend as advice to anybody in college who’s beginning a quantitative analysis project.\nThis workshop will be divided into 3 parts. First, we’ll do a worked example using the mite data from the package vegan. In the second, we’re going to split into an exercise section where we will look at different examples from some long-term datasets that I’ve found. And third, we will have time to work on our own datasets."
+ "text": "In many ways, ecology is the discipline of replication. Most ecological studies have multiple species, multiple sites, or multiple time points in their study – almost by definition. As a result, our analyses are almost always designed to include replication. The most common tool that people use is the multi-level model (AKA the hierarchical model). These models are populuare, useful, and very flexible; but they have the weakness of being very complex, difficult to interpret, and often difficult to write correctly.\nAside from the technical complexity, there’s also conceptual complexity when ecological data has many dimensions of space, time, species, etc. It can be quite hard to visualize and understand what’s happening. Also, we may have hypotheses that operate at multiple levels. We may have one hypothesis for what happens within any group, for example, across individuals or species, and another hypothesis for what happens across sites or over time.\nEnter the secret weapon. The secret weapon is the name Andrew Gelman. Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.\nThen we can either analyze the summary statistics in a second study, which is a technique sometimes called “statistics on statistics”, or go on to build a multilevel model. We may even decide that this is a sufficient endpoint for our study.\nThis is a technique that everyone in this room can use right now and which I can confidently recommend as advice to anybody in college who’s beginning a quantitative analysis project.\nThis workshop will be divided into 3 parts. First, we’ll do a worked example using the mite data from the package vegan. In the second, we’re going to split into an exercise section where we will look at different examples from some long-term datasets that I’ve found. And third, we will have time to work on our own datasets."
},
{
"objectID": "about.html",
@@ -19,5 +19,12 @@
"title": "About",
"section": "",
"text": "About this site\n\n1 + 1\n\n[1] 2"
+ },
+ {
+ "objectID": "index.html#worked-example",
+ "href": "index.html#worked-example",
+ "title": "Secret weapon workshop",
+ "section": "Worked example",
+ "text": "Worked example\n\nData preparation – Center and scale your predictors!\nThis example uses the famous mite dataset, collected by Daniel Borcard in 1989 and included in the famous vegan package.\nSlides introducing the dataset\nFirst, we’ll load the data and do some reorganization and preparation.\n\ndata(mite, package = \"vegan\")\ndata(\"mite.env\", package = \"vegan\")\nlibrary(tidyverse)\n\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nlibrary(cmdstanr)\n\nThis is cmdstanr version 0.7.1\n- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr\n- CmdStan path: /home/andrew/software/cmdstan\n- CmdStan version: 2.34.1\n\n# combine data and environment\nmite_data_long <- bind_cols(mite.env, mite) |> \n pivot_longer(Brachy:Trimalc2, names_to = \"spp\", values_to = \"abd\")\n\nBecause we’re ecologists, we’ll do one of the most classic ecological models: a Binomial GLM, modelling the presence or absence of a species based on the water content of the soil.\n\\[\n\\begin{align}\n\\text{Presence} &\\sim \\text{Binomial}\\left(\\frac{e^a}{1 + e^a}, 1\\right) \\\\\na &= \\beta_0 + \\beta_xx\n\\end{align}\n\\]\nWhere \\(x\\) is some predictor variable and \\(\\frac{e^a}{1 + e^a}\\) is the logit link function. To keep things simple and univariate, let’s consider only water for \\(x\\).\nFirst, a quick word about centering and scaling a predictor variable:\n\nI center the predictor by subtracting the mean. This changes the intercept of my linear predictor. It becomes the mean log-odds of occurrance when the water content is average\nI divide water content by 100. The dataset has units of grams per Litre of water (see ?vegan::mite.env for more details). This is fine, but I don’t think mites are able to sense differences as precise as a millimeter of water either way. by dividing by 10 I transform this into centilitres, which is more informative.\n\n\nmite_data_long_transformed <- mite_data_long |> \n mutate(presabs = as.numeric(abd>0),\n # center predictors\n water = (WatrCont - mean(WatrCont)) / 100\n )\n\nThis makes the model:\n\\[\n\\begin{align}\n\\text{Presence} &\\sim \\text{Binomial}\\left(\\frac{e^a}{1 + e^a}, 1\\right) \\\\\na &= \\beta_0 + \\beta_{\\text{water}}\\frac{\\text{water} - \\overline{\\text{water}}}{100}\n\\end{align}\n\\]\n\n\nThe secret weapon: visually\nWe can already get a very good visual expression of the Secret Weapon approach using the power of ggplot2 and stat_smooth()\n\nmite_data_long_transformed |> \n ggplot(aes(x = water, y = presabs)) + \n geom_point() + \n stat_smooth(method = \"glm\", method.args = list(family = \"binomial\")) + \n facet_wrap(~spp)\n\n`geom_smooth()` using formula = 'y ~ x'\n\n\n\n\n\n\n\n\n\nsome things to notice about this figure:\n\nthe x-axis scale has been transformed from “grams per litre” to “centilitres away from average”\nthere is a ton of variation in how different species respond to water!\n\n\n\nThe secret weapon: in tidyverse code\n\nmite_many_glms <- mite_data_long_transformed |> \n nest_by(spp) |> \n mutate(logistic_regressions = list(\n glm(presabs ~ water,\n family = \"binomial\",\n data = data))) |> \n mutate(coefs = list(broom::tidy(logistic_regressions)))\n\n\n\n\n\n\n\nSplit-Apply-Combine\n\n\n\nTo explore this kind of thinking, we are going to use an approach sometimes called “split-apply-combine”\nThere are many possible ways to do this in practice. We are using a technique here from the tidyverse, which you can read more about.\n\n\n\nmite_many_glm_coefs <- mite_many_glms |> \n select(-data, -logistic_regressions) |> \n unnest(coefs)\n\nmite_many_glm_coefs |> \n ggplot(aes(x = estimate, y = spp,\n xmin = estimate - std.error,\n xmax = estimate + std.error)) + \n geom_pointrange() + \n facet_wrap(~term, scales = \"free\")\n\n\n\n\n\n\n\n\nAs you can see, some of these estimates are high, others low. We could also plot these as histograms to see this distribution.\n\nmite_many_glm_coefs |> \n ggplot(aes(x = estimate)) + \n geom_histogram(binwidth = .5) + \n facet_wrap(~term, scales = \"free\")\n\n\n\n\n\n\n\n\nOnce again, the two parameters of this model represent:\n\nIntercept The probability (in log-odds) of a species being present at the average water concentration. some species are common, others are rare.\nwater this is the change in probability (in log-odds) as water increases by one centilitre per litre of substrate."
}
]
\ No newline at end of file
diff --git a/index.qmd b/index.qmd
index 882027b..5aa768d 100644
--- a/index.qmd
+++ b/index.qmd
@@ -13,10 +13,132 @@ These models are populuare, useful, and very flexible; but they have the weaknes
Aside from the technical complexity, there's also conceptual complexity when ecological data has many dimensions of space, time, species, etc. It can be quite hard to visualize and understand what's happening. Also, we may have hypotheses that operate at multiple levels. We may have one hypothesis for what happens within any group, for example, across individuals or species, and another hypothesis for what happens across sites or over time.
-Enter the secret weapon. The secret weapon is the name Andrew Gelman gives to a very simple idea. Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.
+Enter the secret weapon. The secret weapon is the name [Andrew Gelman ](https://statmodeling.stat.columbia.edu/2005/03/07/the_secret_weap/). Instead of conducting one large analysis, divide your data into multiple small groups, fit a model to each small group, and compare one or more summary statistics across groups.
Then we can either analyze the summary statistics in a second study, which is a technique sometimes called "statistics on statistics", or go on to build a multilevel model. We may even decide that this is a sufficient endpoint for our study.
This is a technique that everyone in this room can use right now and which I can confidently recommend as advice to anybody in college who's beginning a quantitative analysis project.
-This workshop will be divided into 3 parts. First, we'll do a worked example using the `mite` data from the package `vegan`. In the second, we're going to split into an exercise section where we will look at different examples from some long-term datasets that I've found. And third, we will have time to work on our own datasets.
\ No newline at end of file
+This workshop will be divided into 3 parts. First, we'll do a worked example using the `mite` data from the package `vegan`. In the second, we're going to split into an exercise section where we will look at different examples from some long-term datasets that I've found. And third, we will have time to work on our own datasets.
+
+
+## Worked example
+
+### Data preparation -- Center and scale your predictors!
+
+This example uses the famous `mite` dataset, collected by Daniel Borcard in 1989 and included in the famous `vegan` package.
+
+[Slides introducing the dataset](https://bios2.github.io/hiermod/slides/01_Data/#/oribatid-mite)
+
+First, we'll load the data and do some reorganization and preparation.
+
+```{r}
+data(mite, package = "vegan")
+data("mite.env", package = "vegan")
+library(tidyverse)
+library(cmdstanr)
+
+# combine data and environment
+mite_data_long <- bind_cols(mite.env, mite) |>
+ pivot_longer(Brachy:Trimalc2, names_to = "spp", values_to = "abd")
+```
+
+Because we're ecologists, we'll do one of the most classic ecological models: a Binomial GLM, modelling the presence or absence of a species based on the water content of the soil.
+
+$$
+\begin{align}
+\text{Presence} &\sim \text{Binomial}\left(\frac{e^a}{1 + e^a}, 1\right) \\
+a &= \beta_0 + \beta_xx
+\end{align}
+$$
+
+Where $x$ is some predictor variable and $\frac{e^a}{1 + e^a}$ is the logit link function.
+To keep things simple and univariate, let's consider only water for $x$.
+
+First, a quick word about centering and scaling a predictor variable:
+
+1. I center the predictor by subtracting the mean. This changes the *intercept* of my linear predictor. It becomes the mean log-odds of occurrance when the water content is average
+2. I divide water content by 100. The dataset has units of **grams per Litre** of water (see `?vegan::mite.env` for more details). This is fine, but I don't think mites are able to sense differences as precise as a millimeter of water either way. by dividing by 10 I transform this into centilitres, which is more informative.
+
+```{r}
+mite_data_long_transformed <- mite_data_long |>
+ mutate(presabs = as.numeric(abd>0),
+ # center predictors
+ water = (WatrCont - mean(WatrCont)) / 100
+ )
+```
+
+
+This makes the model:
+
+$$
+\begin{align}
+\text{Presence} &\sim \text{Binomial}\left(\frac{e^a}{1 + e^a}, 1\right) \\
+a &= \beta_0 + \beta_{\text{water}}\frac{\text{water} - \overline{\text{water}}}{100}
+\end{align}
+$$
+
+
+### The secret weapon: visually
+
+We can already get a very good visual expression of the Secret Weapon approach using the power of ggplot2 and `stat_smooth()`
+
+```{r}
+mite_data_long_transformed |>
+ ggplot(aes(x = water, y = presabs)) +
+ geom_point() +
+ stat_smooth(method = "glm", method.args = list(family = "binomial")) +
+ facet_wrap(~spp)
+```
+
+some things to notice about this figure:
+
+- the x-axis scale has been transformed from "grams per litre" to "centilitres away from average"
+- there is a ton of variation in how different species respond to water!
+
+### The secret weapon: in tidyverse code
+
+```{r}
+mite_many_glms <- mite_data_long_transformed |>
+ nest_by(spp) |>
+ mutate(logistic_regressions = list(
+ glm(presabs ~ water,
+ family = "binomial",
+ data = data))) |>
+ mutate(coefs = list(broom::tidy(logistic_regressions)))
+```
+
+:::{.callout-note}
+## Split-Apply-Combine
+
+To explore this kind of thinking, we are going to use an approach sometimes called ["split-apply-combine"](https://vita.had.co.nz/papers/plyr.pdf)
+
+There are many possible ways to do this in practice. We are using a technique here from the tidyverse, which you can [read more about](https://dplyr.tidyverse.org/articles/rowwise.html).
+:::
+
+```{r}
+mite_many_glm_coefs <- mite_many_glms |>
+ select(-data, -logistic_regressions) |>
+ unnest(coefs)
+
+mite_many_glm_coefs |>
+ ggplot(aes(x = estimate, y = spp,
+ xmin = estimate - std.error,
+ xmax = estimate + std.error)) +
+ geom_pointrange() +
+ facet_wrap(~term, scales = "free")
+```
+
+As you can see, some of these estimates are high, others low. We could also plot these as histograms to see this distribution.
+
+```{r}
+mite_many_glm_coefs |>
+ ggplot(aes(x = estimate)) +
+ geom_histogram(binwidth = .5) +
+ facet_wrap(~term, scales = "free")
+```
+
+Once again, the two parameters of this model represent:
+
+- *Intercept* The probability (in log-odds) of a species being present at the average water concentration. some species are common, others are rare.
+- *water* this is the change in probability (in log-odds) as water increases by one centilitre per litre of substrate.