diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 0000000..e69de29
diff --git a/.quarto/xref/1893bfdc b/.quarto/xref/1893bfdc
index 6abc991..8cb276d 100644
--- a/.quarto/xref/1893bfdc
+++ b/.quarto/xref/1893bfdc
@@ -1 +1 @@
-{"entries":[],"headings":["instructor","outline","get-course-materials","install-required-software","get-the-notes","useful-resources","references","license"]}
\ No newline at end of file
+{"headings":["instructor","outline","get-course-materials","install-required-software","get-the-notes","useful-resources","references","license"],"entries":[]}
\ No newline at end of file
diff --git a/.quarto/xref/1b4954b9 b/.quarto/xref/1b4954b9
index 830aec5..13dafd2 100644
--- a/.quarto/xref/1b4954b9
+++ b/.quarto/xref/1b4954b9
@@ -1 +1 @@
-{"headings":["mise-en-place-development-environment","the-structure-flour-and-sugar","the-description-file","keeping-notes","useful-links"],"entries":[]}
\ No newline at end of file
+{"entries":[],"headings":["mise-en-place-development-environment","the-structure-flour-and-sugar","the-description-file","keeping-notes","useful-links"]}
\ No newline at end of file
diff --git a/.quarto/xref/a07a8c4b b/.quarto/xref/a07a8c4b
index 1acf430..9d3d9c7 100644
--- a/.quarto/xref/a07a8c4b
+++ b/.quarto/xref/a07a8c4b
@@ -1 +1 @@
-{"headings":["why-do-you-want-to-use-shiny","hello-shiny","how-a-shiny-app-works","building-blocks","ids","organisation","plots","user-interface","server","the-shiny-app","customising-the-theme","using-built-in-themes","using-a-custom-theme","customizing-a-theme","constructing-a-shiny-app-using-shinydashboards","taking-advantage-of-good-defaults","using-shinydashboard","populating-the-layout","challenge","see-the-completed-app","constructing-a-shiny-app-using-golem","golem-modules","selecting-the-volcanoes","barplot-of-continents"],"entries":[]}
\ No newline at end of file
+{"entries":[],"headings":["why-do-you-want-to-use-shiny","hello-shiny","how-a-shiny-app-works","building-blocks","ids","organisation","plots","user-interface","server","the-shiny-app","customising-the-theme","using-built-in-themes","using-a-custom-theme","customizing-a-theme","constructing-a-shiny-app-using-shinydashboards","taking-advantage-of-good-defaults","using-shinydashboard","populating-the-layout","challenge","see-the-completed-app","constructing-a-shiny-app-using-golem","golem-modules","selecting-the-volcanoes","barplot-of-continents"]}
\ No newline at end of file
diff --git a/_quarto.yml b/_quarto.yml
index 189b7d4..fe29c00 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -1,5 +1,6 @@
project:
type: website
+ output-dir: docs
website:
title: "BIOS2 Education resources"
diff --git a/docs/.nojekyll b/docs/.nojekyll
new file mode 100644
index 0000000..e69de29
diff --git a/docs/Bios2_reverse.png b/docs/Bios2_reverse.png
new file mode 100644
index 0000000..2ed28ee
Binary files /dev/null and b/docs/Bios2_reverse.png differ
diff --git a/docs/about.html b/docs/about.html
new file mode 100644
index 0000000..745bc86
--- /dev/null
+++ b/docs/about.html
@@ -0,0 +1,290 @@
+
+
This workshop will introduce participants to the logic behind modeling in biology, focusing on developing equations, finding equilibria, analyzing stability, and running simulations. Techniques will be illustrated with the software tools Mathematica and Maxima. This workshop was held in two parts: January 14 and January 16, 2020.
+
+
+
+
Technical
+
EN
+
+
+
+
+
+
Author
+
+
+
+ Dr Sarah P. Otto
+
+
+
+ University of British Columbia
+
+
+
+
+
+
+
+
+
Published
+
+
January 14, 2020
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
In this workshop, I introduce various modelling techniques, using mostly ecological and evolutionary examples, with a focus on how computer software programs can help biologists analyze such models.
+
+
Content
+
Part 1: Classic one-variable models in ecology and evolution
+Part 2: Equilibria and their stability
+Part 3: Beyond equilibria
+Part 4: Example of building a model from scratch
+Part 5: Extending to models with more than one variable
+Part 6: Another example of building a model from scratch
+
+
+
Software
+
In my research, I primarily use Mathematica, which is a powerful software package for organizing and conducting analytical modelling, but it is not free (at UBC, we have some licenses available). I will also show some example code and provide a translation of most of what I present into a free software package called Maxima.
+
+
Mathematica installation
+
There is a free trial version that you can use for 15 days if you don’t have a copy (click here to access), or you can buy a student version online. If you want to make sure that everything is working, copy the code below, put your cursor over each of the following lines and press enter (on some computers, “enter” is a separate button; on others, press “shift” and “return” at the same time):
You should see (a) \(3x^2\), (b) a plot of a line, (c) \({{x[t]->A^t x0}}\), and (d) \(\frac{e^\frac{-x^2}{2}}{\sqrt{2\pi }}\).
+
+
+
Maxima installation:
+
On a Mac, install using the instructions here. For other operating systems, download here.
+
+
+
Maxima testing
+
When you first open Maxima, it will give you a choice of GUIs; choose wxMaxima. Once wxMaxima is launched, type this command and hit return to see if it answers 4:
+
2+2;
+
If it doesn’t, then scan the installation document for the error that you run into.
+
If it does return 4, then type in and enter these commands:
This PDF was generated from the Mathematica notebook linked above. It doesn’t include dynamic plots, but it’s a good alternative if you want to print out or have a quick reference at hand.
+
+
+
+
+
+Stability analysis of a recursion equation in a discrete-time model.
+
+
+
+
+
+
Other resources
+
+
An Introduction to Mathematical Modeling in Ecology and Evolution (Otto and Day 2007).
Niki Love and Gil Henriques did a great job of translating the code into wxMaxima, with limited help from me. Thanks, Niki and Gil!!
+
+
+
+
+
+
References
+
+Otto, Sarah P, and Troy Day. 2007. A Biologist’s Guide to Mathematical Modeling in Ecology and Evolution. Vol. 13. Princeton University Press.
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/image.jpg b/docs/posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/image.jpg
new file mode 100644
index 0000000..24253d9
Binary files /dev/null and b/docs/posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/image.jpg differ
diff --git a/docs/posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html b/docs/posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html
new file mode 100644
index 0000000..94c8681
--- /dev/null
+++ b/docs/posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html
@@ -0,0 +1,414 @@
+
+
+
+
+
+
+
+
+
+
+
+
+BIOS2 Education resources - Sensibilisation aux réalités autochtones et recherche collaborative
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Sensibilisation aux réalités autochtones et recherche collaborative
+
+
+
A two-part webinar series on awareness of Indigenous realities and on research in collaboration with Indigenous Peoples, offered from April 28 to 30, 2020 by Catherine-Alexandra Gagnon, PhD.
+
+
+
+
Transversal competencies
+
FR
+
+
+
+
+
+
+
+
+
Author
+
+
Dr Catherine-Alexandra Gagnon
+
+
+
+
+
Published
+
+
April 28, 2020
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
1 Part 1 - Awareness of Indigenous realities
+
+
Training objectives:
+
+
Improve our understanding of the past and of its impacts on our relationships with Indigenous Peoples.
+
Develop concepts and skills for acting against prejudice and racism.
+
+
+
+
During this webinar, we will:
+
+
Give an overview of important historical events and of their impacts to this day (the Indian Act, assimilation policies, residential schools, etc.).
+
Learn about Indigenous terminology.
+
Review certain court cases and legal contexts and see how they affect our work in Indigenous territories.
+
In a spirit of reconciliation, become aware of persistent prejudices and discuss strategies for improving our relationships with communities.
+
+
+
+
+
2 Part 2 - Research in collaboration with Indigenous communities
+
+
Training objectives:
+
+
Begin a collective reflection on our research practices and on how to engage meaningfully with Indigenous communities.
+
Develop a better understanding of communities’ perceptions of, and expectations towards, research and researchers.
+
+
+
+
During this webinar, we will:
+
+
Better understand the need to take Indigenous knowledge into account in various aspects of environmental management in Canada;
+
Discuss communities’ desire for a greater presence in the research world: how can this be done?
+
Address and debate the different methodological approaches for building bridges between Indigenous and scientific knowledge.
Catherine-Alexandra Gagnon has expertise in collaborative work in Indigenous settings. She is particularly interested in bringing together local, Indigenous, and scientific knowledge. She holds a PhD in Environmental Sciences and a master’s degree in Wildlife Management from the Université du Québec à Rimouski, a bachelor’s degree in wildlife biology from McGill University, and a certificate in Indigenous Studies from the Université de Montréal. During her studies, she worked on the local and ancestral knowledge of Inuit, Inuvialuit, and Gwich’in Elders and hunters in Nunavut, the Northwest Territories, and the Yukon.
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/posts/2020-06-15-science-communication/image.jpg b/docs/posts/2020-06-15-science-communication/image.jpg
new file mode 100644
index 0000000..5cc1ef2
Binary files /dev/null and b/docs/posts/2020-06-15-science-communication/image.jpg differ
diff --git a/docs/posts/2020-06-15-science-communication/index.html b/docs/posts/2020-06-15-science-communication/index.html
new file mode 100644
index 0000000..7db424a
--- /dev/null
+++ b/docs/posts/2020-06-15-science-communication/index.html
@@ -0,0 +1,380 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+BIOS2 Education resources - Science Communication
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Science Communication
+
+
+
Recordings, content and handouts from a 6-hour Science Communication workshop held over two days on 15 and 16 June 2020.
+
+
+
+
Career
+
Fellow contributed
+
EN
+
+
+
+
+
+
+
+
+
Authors
+
+
Gracielle Higino
+
Katherine Hébert
+
+
+
+
+
Published
+
+
June 15, 2020
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
The objective of this training is to share and discuss the concepts and tools that contribute to effective science communication. The training is split into two sessions, which cover the basic concepts of effective science communication and how social media tools can be used to boost the signal of your research and extend your research network. Each session takes the form of a presentation interspersed with several short activity modules, where participants are invited to use the tools we discuss to kickstart their own science communication.
+
This training was given on June 15 and 16, 2020. You can view recordings of each session here:
+
+
Day 1
+
+
+
+
Day 2
+
+
+
+
Session 1: The basics of science communication
+
+
Objectives:
+
+
Discuss what science communication (or SciComm) can be, and its potential role in boosting the signal of your research
+
Give an overview of basic concepts and tools that you can use in any medium (blog posts, presentations, conversations, Twitter, etc.) to do effective science communication
+
+
During this session, we:
+
+
Discuss the potential pitfalls of science communication (notably, diversity and inclusivity problems).
+
Cover the basic concepts of science communication, including the Golden Circle method, the creation of personas, and storytelling techniques.
+
Have short activities where participants can try out some of the techniques covered, such as filling in their own Golden Circle and explaining a blog post as a storyboard.
+
+
+
+
+
+
+
Session 2: Social media as a science communication tool
+
+
Objectives:
+
+
Rethink the way we write about science by exploring the world of blog posts
+
Clarify the mechanics of Twitter and how it can be used effectively for science communication
+
+
During this session, we:
+
+
Discuss how to create a story structure using titles and the flow of ideas in blog posts, especially when we are used to writing scientific articles
+
Cover the basics of how Twitter works (retweets, threads, replies, hashtags, photo captions, etc.) and how to find helpful connections
+
Have short activities where participants will be invited to write their own Twitter biographies and to create a Twitter thread explaining a project of their choice.
General principles of visualization and graphic design, and techniques of tailored visualization. This training was developed and delivered by Alex Arkilanian and Katherine Hébert on September 21st and 22nd, 2020.
+
+
+
+
Technical
+
Fellow contributed
+
EN
+
+
+
+
+
+
+
+
+
Authors
+
+
Alex Arkilanian
+
Katherine Hébert
+
+
+
+
+
Published
+
+
September 21, 2020
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Welcome!
+
This training covers the general principles of visualization and graphic design, and techniques of tailored visualization. More specifically, the objectives of the training are:
+
+
Give an overview of basic data visualization principles, including shapes, sizes, colours, and fonts.
+
Discuss how to choose the right visualization for your data, what you want to communicate, and who you want to communicate to.
+
Present tools and principles to tailor visualizations, particularly for making interpretable, interactive, and honest visualizations.
+
+
+
Training material
+
Click on “Show code” to learn how to do each plot!
+
+
Interactive examples
+
+
+
+
+
Streamgraph
+
+
+Show code
+
# Script to make a streamgraph of the top 10 most popular dog breeds in
+# New York City from 1999 to 2015
+
+# load libraries
+library(lubridate) # dealing with dates
+library(dplyr) # data manipulation
+library(streamgraph) #devtools::install_github("hrbrmstr/streamgraph")
+library(htmlwidgets) # to save the widget!
+
+# load the dataset
+# more information about this dataset can be found here:
+# https://www.kaggle.com/smithaachar/nyc-dog-licensing-clean
+nyc_dogs <-read.csv("data/nyc_dogs.csv")
+
+# convert birth year to date format (and keep only the year)
+nyc_dogs$AnimalBirthYear <-mdy_hms(nyc_dogs$AnimalBirthMonth) %>%year()
+
+# identify 10 most common dogs
+topdogs <- nyc_dogs %>%count(BreedName)
+topdogs <- topdogs[order(topdogs$n, decreasing =TRUE),]
+# keep 10 most common breeds (and remove last year of data which is incomplete)
+df <-filter(nyc_dogs, BreedName %in% topdogs$BreedName[2:11] & AnimalBirthYear <2016) %>%
+group_by(AnimalBirthYear) %>%
+count(BreedName) %>%ungroup()
+
+# get some nice colours from viridis (magma)
+cols <- viridis::viridis_pal(option ="magma")(length(unique(df$BreedName)))
+
+# make streamgraph!
+pp <-streamgraph(df,
+key = BreedName, value = n, date = AnimalBirthYear,
+height="600px", width="1000px") %>%
+sg_legend(show=TRUE, label="names: ") %>%
+sg_fill_manual(values = cols)
+# saveWidget(pp, file=paste0(getwd(), "/figures/dogs_streamgraph.html"))
+
+# plot
+pp
+
+
+
+
+
+
+
+
+
+
+
Interactive plot
+
+
+Show code
+
# Script to generate plots to demonstrate how combinations of information dimensions
+# can become overwhelming and difficult to interpret.
+
+# set-up & data manipulation ---------------------------------------------------
+
+# load packages
+library(ggplot2) # for plots, built layer by layer
+library(dplyr) # for data manipulation
+library(magrittr) # for piping
+library(plotly) # interactive plots
+
+# set ggplot theme
+theme_set(theme_classic() +
+theme(axis.title =element_text(size =11, face ="bold"),
+axis.text =element_text(size =11),
+plot.title =element_text(size =13, face ="bold"),
+legend.title =element_text(size =11, face ="bold"),
+legend.text =element_text(size =10)))
+
+# import data
+# more info on this dataset: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-28/readme.md
+penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
+
+# get some nice colours from viridis (magma)
+sp_cols <- viridis::viridis_pal(option ="magma")(5)[2:4]
+
+
+#### Day 1 ####
+
+# 1. Similarity
+
+ggplot(penguins) +
+geom_point(aes(y = bill_length_mm, x = bill_depth_mm, col = species), size =2.5) +
+labs(x ="Bill depth (mm)", y ="Bill length (mm)", col ="Species") +# labels
+scale_color_manual(values = sp_cols) # sets the colour scale we created above
+
+
+
+
+
+Show code
+
ggsave("figures/penguins_similarity.png", width =6, height =3, units ="in")
+
+# 2. Proximity
+
+df <- penguins %>%group_by(sex, species) %>%
+summarise(mean_mass =mean(body_mass_g, na.rm =TRUE)) %>%na.omit()
+ggplot(df) +
+geom_bar(aes(y = mean_mass, x = species, fill = sex),
+position ="dodge", stat ="identity") +
+labs(x ="Species", y ="Mean body mass (g)", col ="Sex") +# labels
+scale_fill_manual(values = sp_cols) # sets the colour scale we created above
+
+
+
+
+
+Show code
+
ggsave("figures/penguins_proximity.png", width =6, height =3, units ="in")
+
+# 3. Enclosure (Ellipses over a fake PCA)
+ggplot(data = penguins,
+aes(y = bill_length_mm, x = bill_depth_mm)) +
+geom_point(size =2.1, col ="grey30") +
+stat_ellipse(aes(col = species), lwd = .7) +
+labs(x ="PCA1", y ="PCA2", col ="Species") +# labels
+scale_color_manual(values = sp_cols) +# sets the colour scale we created above
+theme(axis.text =element_blank(), axis.ticks =element_blank())
+
+
+
+
+
+Show code
+
ggsave("figures/penguins_enclosure.png", width =6, height =3, units ="in")
+
+# 4. Mismatched combination of principles
+temp_palette <-rev(c(sp_cols, "#1f78b4", "#33a02c"))
+ggplot(data = penguins,
+aes(y = bill_length_mm, x = bill_depth_mm)) +
+geom_point(aes(col = sex), size =2.1) +
+stat_ellipse(aes(col = species), lwd = .7) +
+labs(x ="Bill depth (mm)", y ="Bill length (mm)", col ="?") +# labels
+scale_color_manual(values = temp_palette) # sets the colour scale we created above
ggsave("figures/penguins_incompatible1.png", width =6, height =3, units ="in")
+
+# 2. Ineffective combinations: Sizes and shapes --------------------------------
+
+ggplot(penguins) +
+geom_point(aes(y = bill_length_mm, x = bill_depth_mm,
+shape = species, # shape
+size =log(body_mass_g)), alpha = .7) +# size
+scale_size(range =c(.1, 5)) +# make sure the sizes are scaled by area and not by radius
+labs(x ="Bill depth (mm)", y ="Bill length (mm)",
+shape ="Species", size ="Body mass (g)")
+
+
+
+
+
+Show code
+
ggsave("figures/penguins_incompatible2.png", width =6, height =3, units ="in")
+
+# 3. Cognitive overload --------------------------------------------------------
+
+# get some nice colours from viridis (magma)
+sex_cols <- viridis::viridis_pal(option ="magma")(8)[c(3,6)]
+
+ggplot(na.omit(penguins)) +
+geom_point(aes(y = bill_length_mm, # dimension 1: position along y scale
+x = bill_depth_mm, # dimension 2: position along x scale
+shape = species, # dimension 3: shape
+size =log(body_mass_g), # dimension 4: size
+col = sex), # dimension 5: hue
+alpha = .7) +# size
+scale_size(range =c(.1, 5)) +# make sure the sizes are scaled by area and not by radius
+labs(x ="Bill depth (mm)", y ="Bill length (mm)",
+shape ="Species", size ="Body mass (g)", col ="Sex") +
+scale_color_manual(values = sex_cols)
+
+
+
+
+
+Show code
+
ggsave("figures/penguins_5dimensions.png", width =7, height =4, units ="in")
+
+
+# 4. Panels -------------------------------------------------------------------
+
+ggplot(na.omit(penguins)) +
+geom_point(aes(y = bill_length_mm, # dimension 1: position along y scale
+x = bill_depth_mm, # dimension 2: position along x scale
+col =log(body_mass_g)), # dimension 3: hue
+alpha = .7, size =2) +
+facet_wrap(~ species) +# dimension 4: species!
+# this will create a separate panel for each species
+# note: this also automatically uses the same axes for all panels! If you want
+# axes to vary between panels, use the argument scales = "free"
+labs(x ="Bill depth (mm)", y ="Bill length (mm)", col ="Body mass (g)") +
+scale_color_viridis_c(option ="magma", end = .9, direction =-1) +
+theme_linedraw() +theme(panel.grid =element_blank()) # making the panels prettier
+
+
+
+
+
+Show code
+
ggsave("figures/penguins_dimensions_facets.png", width =7, height =4, units ="in")
+
+
+# 5. Interactive ---------------------------------------------------------------
+
+p <-na.omit(penguins) %>%
+ggplot(aes(y = bill_length_mm,
+x = bill_depth_mm,
+col =log(body_mass_g))) +
+geom_point(size =2, alpha = .7) +
+facet_wrap(~ species) +
+labs(x ="Bill depth (mm)", y ="Bill length (mm)", col ="Body mass (g)") +
+scale_color_viridis_c(option ="magma", end = .9, direction =-1) +
+theme_linedraw() +theme(panel.grid =element_blank()) # making the panels prettier
+p <-ggplotly(p)
+#setwd("figures")
+htmlwidgets::saveWidget(as_widget(p), "figures/penguins_interactive.html")
+p
+
+
+
+
+
+
+
+
+
+
Example figures
+
+
+Show code
+
# Script to make animated plot of volcano eruptions over time
+
+# Load libraries:
+library(dplyr) # data manipulation
+library(ggplot2) # plotting
+library(gganimate) # animation
+library(gifski) # creating gifs
+
+# set ggplot theme
+theme_set(theme_classic() +
+theme(axis.title =element_text(size =11, face ="bold"),
+axis.text =element_text(size =11),
+plot.title =element_text(size =13, face ="bold"),
+legend.title =element_text(size =11, face ="bold"),
+legend.text =element_text(size =10)))
+
+# function to floor a year to the decade
+floor_decade =function(value){return(value - value %%10)}
+
+# read data
+eruptions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/eruptions.csv')
+
+# select top 5 most frequently exploding volcanoes
+temp <-group_by(eruptions, volcano_name) %>%tally()
+temp <- temp[order(temp$n, decreasing =TRUE),]
+
+# make a time series dataset (number of eruptions per year)
+eruptions$start_decade =floor_decade(eruptions$start_year)
+
+# filter dataset to subset we want to visualize
+df <- eruptions %>%
+filter(between(start_decade, 1900, 2019)) %>%
+filter(volcano_name %in% temp$volcano_name[1:5]) %>%
+group_by(start_decade) %>%
+count(volcano_name) %>%ungroup()
+
+# plot!
+p <-ggplot(df, aes(x = start_decade, y = n, fill = volcano_name)) +
+geom_area() +
+geom_vline(aes(xintercept = start_decade)) +# line that follows the current decade
+scale_fill_viridis_d(option ="magma", end = .8) +
+labs(x ="", y ="Number of eruptions", fill ="Volcano",
+title ='Eruptions of the top 5 most frequently erupting volcanos worldwide') +
+# gganimate part: reveals each decade
+transition_reveal(start_decade)
+animate(p, duration =5, fps =20, width =800, height =300, renderer =gifski_renderer())
+
+
+
+
+
+Show code
+
#anim_save("figures/volcano_eruptions.gif")
+
+
+
+
+Show code
+
# Script to generate plots with various ways of representing uncertainty, based
+# Coffee & Code dataset from https://www.kaggle.com/devready/coffee-and-code/data
+
+# set-up & data manipulation ---------------------------------------------------
+
+# load packages
+library(ggplot2) # for plots, built layer by layer
+library(dplyr) # for data manipulation
+library(magrittr) # for piping
+library(ggridges) # for density ridge plots
+library(patchwork) # great package for "patching" plots together!
+
+# set ggplot theme
+theme_set(theme_classic() +
+theme(axis.title =element_text(size =11, face ="bold"),
+axis.text =element_text(size =11),
+plot.title =element_text(size =13, face ="bold"),
+legend.title =element_text(size =11, face ="bold"),
+legend.text =element_text(size =10)))
+
+# import data
+df <-read.csv("data/coffee_code.csv")
+
+# set labels to be used in all plots
+coffee_labels <-labs(title ="Does coffee help programmers code?",
+x ="Coffee while coding",
+y ="Time spent coding \n(hours/day)")
+
+# the variable CodingWithoutCoffee is negative, which is harder to understand
+# (i.e. "No" means they drink coffee...). So, let's transform it into a more
+# intuitive variable!
+df$CodingWithCoffee <-gsub("No", "Usually", df$CodingWithoutCoffee)
+df$CodingWithCoffee <-gsub("Yes", "Rarely\n or never", df$CodingWithCoffee)
+# convert to factor and set levels so they show up in a logical order
+df$CodingWithCoffee <-factor(df$CodingWithCoffee,
+levels =c("Rarely\n or never",
+"Sometimes",
+"Usually"))
+
+# calculate summary statistics for the variable of choice
+df_summary <-group_by(df, CodingWithCoffee) %>%
+summarise(
+# mean
+mean_codinghours =mean(CodingHours),
+# standard deviation
+sd_codinghours =sd(CodingHours),
+# standard error
+se_codinghours =sd(CodingHours)/sqrt(length(CodingHours)))
+
+
+# 1. Error bars (standard error) -----------------------------------------------
+
+ggplot(df_summary) +
+geom_errorbar(aes(x = CodingWithCoffee,
+ymin = mean_codinghours - se_codinghours,
+ymax = mean_codinghours + se_codinghours),
+width = .2) +
+geom_point(aes(x = CodingWithCoffee, y = mean_codinghours),
+size =3) +
+ coffee_labels +ylim(0,10)
ggsave("figures/coffee_violin_jitter.png", width =5, height =3, units ="in")
+
+
+# 5. Density ridge plot --------------------------------------------------------
+
+ggplot(df) +
+aes(y = CodingWithCoffee, x = CodingHours, fill =stat(x)) +
+geom_density_ridges_gradient(scale =1.9, size = .2, rel_min_height =0.005) +
+# colour palette (gradient according to CodingHours)
+scale_fill_viridis_c(option ="magma", direction =-1) +
+# remove legend - it's not necessary here!
+theme(legend.position ="none") +
+labs(title = coffee_labels$title,
+x = coffee_labels$y,
+y ="Coffee \nwhile coding") +
+theme(axis.title.y =element_text(angle=0, hjust =1, vjust = .9,
+margin =margin(t =0, r =-50, b =0, l =0)))
+
+
+
+
+
+Show code
+
ggsave("figures/coffee_density_ridges.png", width =5, height =3, units ="in")
+
+# 6. Jitter vs. Rug plot ------------------------------------------------------------------
+
+jitterplot <-ggplot(df, aes(x = CoffeeCupsPerDay, y = CodingHours)) +
+geom_jitter(alpha = .2) +
+geom_smooth(fill = error_cols[1], col ="black", method = lm, lwd = .7) +
+ coffee_labels +ylim(c(0,11)) +labs(x ="Cups of coffee (per day)")
+
+rugplot <-ggplot(df, aes(x = CoffeeCupsPerDay, y = CodingHours)) +
+geom_smooth(fill = error_cols[1], col ="black", method = lm, lwd = .7) +
+geom_rug(position="jitter", alpha = .7) +ylim(c(0,11)) +
+ coffee_labels +labs(x ="Cups of coffee (per day)")
+
+# patch the two plots together
+jitterplot + rugplot
+
+
+
+
+
+Show code
+
#ggsave("figures/coffee_jitter_vs_rugplot.png", width = 10, height = 4, units = "in")
+
+
+
+
+Show code
+
# Script to generate 95% confidence intervals of a generated random normal distribution
+# as an example in Day 2: Visualizing uncertainty.
+
+# load library
+library(ggplot2)
+library(magrittr)
+library(dplyr)
+
+# set ggplot theme
+theme_set(theme_classic() +
+theme(axis.title =element_text(size =11, face ="bold"),
+axis.text =element_text(size =11),
+plot.title =element_text(size =13, face ="bold"),
+legend.title =element_text(size =11, face ="bold"),
+legend.text =element_text(size =10)))
+
+# set random seed
+set.seed(22)
+
+# generate population (random normal distribution)
+df <-data.frame("value"=rnorm(50, mean =0, sd =1))
+
+# descriptive stats for each distribution
+desc_stats = df %>%
+summarise(mean_val =mean(value, na.rm =TRUE),
+se_val =sqrt(var(value)/length(value)))
+
+# build density plot!
+p <-ggplot(data = df, aes(x = value, y = ..count..)) +
+geom_density(alpha = .2, lwd = .3) +
+xlim(c(min(df$value-1), max(df$value+1)))
+# extract plotted values
+base_p <-ggplot_build(p)$data[[1]]
+# shade the 95% confidence interval
+p +
+geom_area(data =subset(base_p,
+between(x,
+left = (desc_stats$mean_val -1.96*desc_stats$se_val),
+right = (desc_stats$mean_val +1.96*desc_stats$se_val))),
+aes(x = x, y = y), fill ="cadetblue3", alpha = .6) +
+# add vertical line to show population mean
+geom_vline(aes(xintercept =0), lty =2) +
+annotate("text", x =0.9, y =19, label ="True mean", fontface ="italic") +
+# label axis!
+labs(x ="Variable of interest", y ="")
+
+
+
+
+
+Show code
+
#ggsave("figures/confidenceinterval_example.png", width = 5, height = 3.5, units = "in")
+
+
+
+
+
Annotated resource library
+
This is an annotated library of data visualization resources we used to build the BIOS² Data Visualization Training, as well as some bonus resources we didn’t have the time to include. Feel free to save this page as a reference for your data visualization adventures!
+
+
+
Books & articles
+
Fundamentals of Data Visualization A primer on making informative and compelling figures. This is the website for the book “Fundamentals of Data Visualization” by Claus O. Wilke, published by O’Reilly Media, Inc.
+
Data Visualization: A practical introduction An accessible primer on how to create effective graphics from data using R (mainly ggplot). This book provides a hands-on introduction to the principles and practice of data visualization, explaining what makes some graphs succeed while others fail, how to make high-quality figures from data using powerful and reproducible methods, and how to think about data visualization in an honest and effective way.
+
Data Science Design (Chapter 6: Visualizing Data) Covers the principles that make standard plot designs work, shows how they can be misleading if not properly used, and helps develop a sense of when graphs might be lying and how to construct better ones.
From Static to Interactive: Transforming Data Visualization to Improve Transparency Weissgerber TL, Garovic VD, Savic M, Winham SJ, Milic NM (2016) designed an interactive line graph that demonstrates how dynamic alternatives to static graphics for small sample size studies allow for additional exploration of empirical datasets. This simple, free, web-based tool demonstrates the overall concept and may promote widespread use of interactive graphics.
A collection of graphic pitfalls A collection of short articles about common issues with data visualizations that can mislead or obscure your message.
+
+
+
+
Choosing a visualization
+
Data Viz Project This is a great place to get inspiration and guidance about how to choose an appropriate visualization. There are many visualizations we are not used to seeing in ecology!
ColorBrewer: Color Advice for Maps Tool to generate colour palettes for visualizations with colorblind-friendly options. You can also use these palettes in R using the RColorBrewer package, and the scale_*_brewer() (for discrete palettes) or scale_*_distiller() (for continuous palettes) functions in ggplot2; a short usage sketch follows at the end of this list.
+
Color.review Tool to pick or verify colour palettes with high relative contrast between colours, to ensure your information is readable for everyone.
+
Coblis — Color Blindness Simulator Tool to upload an image and view it as they would appear to a colorblind person, with the option to simulate several color-vision deficiencies.
CartoDB/CartoColor CARTOColors are a set of custom color palettes built on top of well-known standards for color use on maps, with next generation enhancements for the web and CARTO basemaps. Choose from a selection of sequential, diverging, or qualitative schemes for your next CARTO powered visualization using their online module.
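+
+As promised in the ColorBrewer entry above, here is a minimal, illustrative sketch of the ggplot2 scale functions it mentions. It is not part of the original training code; it uses the built-in iris dataset and assumes ggplot2 and RColorBrewer are installed.
+
+# Illustrative sketch (assumes ggplot2 and RColorBrewer are installed; iris ships with R)
+library(ggplot2)
+
+# discrete ColorBrewer palette mapped to a categorical variable
+ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
+  geom_point(size = 2) +
+  scale_colour_brewer(palette = "Dark2")
+
+# continuous ("distilled") ColorBrewer palette mapped to a numeric variable
+ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Petal.Length)) +
+  geom_point(size = 2) +
+  scale_colour_distiller(palette = "YlGnBu", direction = 1)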
+
+
+
+
Tools
+
+
R
+
The R Graph Gallery A collection of charts made with the R programming language. Hundreds of charts are displayed in several sections, always with their reproducible code available. The gallery focuses on the tidyverse and ggplot2.
Customizing tick marks in base R Seems like a simple thing, but it can be so frustrating! This is a great post about customizing tick marks with base plot in R.
The Python Graph Gallery This website displays hundreds of charts, always providing the reproducible python code.
+
Python Tutorial: Intro to Matplotlib Introduction to basic functionalities of the Python’s library Matplotlib covering basic plots, plot attributes, subplots and plotting the iris dataset.
Chart Studio Web editor to create interactive plots with plotly. You can download the image as .html, or static images, without coding the figure yourself.
+
PhyloPic Vector images of living organisms. This is great for ecologists who want to add silhouettes of their organisms onto their plots - search anything, and you will likely find it!
+
Add icons on your R plot Add special icons to your plot as a great way to customize it, and save space with labels!
+
+
+
+
Inspiration (pretty things!)
+
Information is Beautiful Collection of beautiful original visualizations about a variety of topics!
+
TidyTuesday A weekly data project aimed at the R ecosystem, where people wrangle and visualize data in loads of creative ways. Browse what people have created (#TidyTuesday on Twitter is great too!), and the visualizations that have inspired each week’s theme.
+
+
+
+
\ No newline at end of file
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/animated.volcano-1.gif b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/animated.volcano-1.gif
new file mode 100644
index 0000000..3158806
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/animated.volcano-1.gif differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-1.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-1.png
new file mode 100644
index 0000000..70de925
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-1.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-2.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-2.png
new file mode 100644
index 0000000..618ac74
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-2.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-3.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-3.png
new file mode 100644
index 0000000..90ba722
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-3.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-4.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-4.png
new file mode 100644
index 0000000..654404e
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-4.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-5.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-5.png
new file mode 100644
index 0000000..f8cc82e
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-5.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-6.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-6.png
new file mode 100644
index 0000000..7588ffe
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/coffee.uncertainty-6.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/densiplot-1.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/densiplot-1.png
new file mode 100644
index 0000000..52825a0
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/densiplot-1.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-1.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-1.png
new file mode 100644
index 0000000..2060854
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-1.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-2.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-2.png
new file mode 100644
index 0000000..8be7dbd
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-2.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-3.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-3.png
new file mode 100644
index 0000000..56634af
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-3.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-4.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-4.png
new file mode 100644
index 0000000..0163852
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-4.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-5.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-5.png
new file mode 100644
index 0000000..3ac330a
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-5.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-6.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-6.png
new file mode 100644
index 0000000..d4bc38c
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-6.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-7.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-7.png
new file mode 100644
index 0000000..435335f
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-7.png differ
diff --git a/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-8.png b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-8.png
new file mode 100644
index 0000000..f85acfe
Binary files /dev/null and b/docs/posts/2020-09-21-data-visualization/index_files/figure-html/interactive-plot-8.png differ
diff --git a/docs/posts/2020-12-07-making-websites-with-hugo/image.jpg b/docs/posts/2020-12-07-making-websites-with-hugo/image.jpg
new file mode 100644
index 0000000..d32a2fe
Binary files /dev/null and b/docs/posts/2020-12-07-making-websites-with-hugo/image.jpg differ
diff --git a/docs/posts/2020-12-07-making-websites-with-hugo/index.html b/docs/posts/2020-12-07-making-websites-with-hugo/index.html
new file mode 100644
index 0000000..a87d7aa
--- /dev/null
+++ b/docs/posts/2020-12-07-making-websites-with-hugo/index.html
@@ -0,0 +1,850 @@
+
+
+
+
+
+
+
+
+
+
+
+BIOS2 Education resources - Making websites with HUGO
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Making websites with HUGO
+
+
+
This workshop provides a general introduction to HUGO, a popular open-source framework for building websites without requiring knowledge of HTML/CSS or web programming.
+
+
+
+
Technical
+
Transversal competencies
+
EN
+
+
+
+
+
+
+
+
+
Authors
+
+
Dominique Gravel
+
Guillaume Larocque
+
+
+
+
+
Published
+
+
December 7, 2020
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
1 Why this training workshop ?
+
I am only 10 hours of a crash course in web development ahead of you. As part of a major research project on setting up a biodiversity observation network, I had to develop a prototype of a portal for the project, for biodiversity information and a bunch of dashboards on biodiversity trends. I had never made a website before. I know how to code in a few languages, and I know that I hate playing with boxes and menus, importing images manually, and, most of all, dealing with a crash of the system and having to redo the whole thing because I made a mistake somewhere. Not that a bug when I try to compile is better, but at least it is more tractable.
+
Hugo made it very easy because of its fundamental feature (which is the same reason I edit papers with LaTeX): the distinction between the view and the content. Once you have set up the rules defining the visual aspects of the pages, you can focus on the content and let the software automatically construct the HTML code for you. It’s fast, accessible, scriptable and can be version-controlled. All qualities for open and reproducible science.
+
It took me a few hours to learn the basics (it is much harder to acquire the higher-level skills, especially writing your own Go templates). I picked up some tricks here and there from different templates and from looking at what others do, and that was it: I had my website. I realized that it could be a good entry-level course for BIOS2 fellows and decided to turn that experience into a training workshop.
+
You will find below basic instructions to install and run a template. The following is not a full tutorial; for that, I simply recommend taking the time to look at the documentation provided on the Hugo page (https://gohugo.io/). I also consulted the online book Hugo in Action (https://www.manning.com/books/hugo-in-action). There are many other references, each with their strengths and weaknesses, but it’s nice to have several because the description of a concept may be obscure in one reference and clearer in another, and it’s by comparing and switching between them that you make progress.
+
+
+
2 Make sure Hugo is installed and check version
+
First, you have to make sure that Hugo is properly installed on your computer. Type the following command in a terminal to check:
+
hugo version
+
You can access the help menu with the simple command:
+
hugo help
+
+
+
3 Be Timothée Poisot for fun
+
We will use Tim’s website, which is a simple but effective example of what we can achieve with Hugo. The strength of the website is that it automatically updates with the addition of new content, such as publications, lab members and projects. The only thing you have to do, once the template is properly set up, is to update the content. That way, you can focus on the material you want to put up, without struggling with how to place the boxes, format the fonts and all of the complicated stuff that comes with HTML and CSS. The content, written in markdown, is human-readable and can therefore be easily edited by lab members. Further, since it’s all scripted, it’s easy to maintain and to control versions.
+
Take few minutes to look at the final webpage at https://poisotlab.io/
+
Now you will clone the repository to your own computer so that you can start playing with the content, editing the files, modifying the list of papers, and so on.
+
You can either use the clone button at the top of the page or the following command:
We will take a few minutes to look at the content of the different folders. This structure is common to most Hugo templates. You will find multiple folders; it’s useful to understand what is located where, because the compiler expects this structure when it looks for specific information.
+
archetypes (not present here, but found in most templates). These are basic instructions to generate new content with the hugo new command. We won’t use this feature today, but information about it is easy to find.
+
assets contains the CSS files where the controls for the visual aspects of the pages are specified. That’s where you’ll look for the different items and how to specify things such as box sizes, font colours, dimensions, etc. Note: the assets directory is not created by default.
+
content holds all of the .md files where the main content of the pages is provided. It’s divided into several subfolders, corresponding to the different pages of the menu. Each top-level folder in Hugo is considered a content section (which is usually described in the config file). For instance, you have one folder called Research where the projects are described. You’ll find one .md file per project in this folder. Note also that each folder systematically contains an _index.md file where the metadata and the top-level information of the page are specified. We’ll come back to that later.
+
data stores specific information that will be consulted by the parser during compilation (configuration files). There are also data templates; at the moment, there is one json file where the papers are listed and two toml files with a list of the students, past and present. json files can be edited with a text editor (not so fun), but there are some tools to do it efficiently.
+
layouts contains the core files to compile the website. You will find in them instructions in a strange blend of HTML and the Go language. Not so easy and pleasant to play with, but looking at them tells you a bit about what the compiler does (a good example is for people): list.html, for instance, contains a loop that goes through the toml files in order to create the icons, the text and the link to the full markdown page with the description of each student. You will find layouts for the main pages, as well as for partials (like the header menu).
+
resources also contains css instructions for the template. We won’t work with this one.
+
static contains a bunch of little things that are called during compilation. You’ll find the logo for the lab, the pictures of students, pdf files for applications, images for each research project …
+
There is also one very important file in the main folder: the config.toml file. Inside, you will find a lot of the metadata that controls the structure of the main page. This file can be very simple for some templates and much more complicated for others. Note that for some templates, the config file may be in a distinct folder. Not all templates have exactly the same folder structure.
+
toml is a file format for configuration files; it contains key parameters for the webpage. It consists of key = “value” pairs, [section names], and # comments. Let’s open this one to have a closer look.
+
+
Exercise: Edit the toml file to include your own information.
+
You may want to change the section People to Collaborators and also provide a proper reference to your own GitHub page. You can also add or remove sections; this will affect the menu at the top of the page. For instance, you can add a blog section.
+
+
+
+
4 Build the static html files
+
+
Build for local development
+
Hugo will use all of the material to generate static HTML files that will be displayed in your browser. The command to run it on your own computer is really easy to use; you simply have to type the following in the main folder:
+
hugo server
+
And that’s it, it compiles and you can simply open it in your browser by clicking on the address indicated in the terminal. Congratulations on your first Hugo website!
+
There is useful information in the terminal about the building process.
+
+
+
Build for publishing your website
+
The command hugo server is very fast and useful to test your website while you develop it. But once you are ready to publish it, you’ll need all of the HTML files and related material to distribute the website. This is easily done with the even simpler command
+
hugo
+
You will find that a new folder named public has appeared in the directory, with all of the material needed to deploy the website. If you click on the index.html file, you’ll get to the home page of the website. It is interesting to open this file in your text editor; you’ll get a sense of the HTML code that Hugo generated automatically for you. You can also take a look at the other files.
+
+
+
+
5 Edit content
+
Editing content is the easiest thing to do. The first thing to do is to modify the content of the introduction paragraph on the main page. You’ll find it in the *_index.md* file in the content folder. Open it and modify the text. You can then build the main page again to see the update.
+
You can also add material with new .md files. We will do so with a new research project (note that the following could also be done manually):
+
hugo new research/chapter1.md
+
This will generate a new markdown file, in which you can start adding material. But those files do have a particular structure, so before editing it, we’ll take a quick look at another one, datascience.md.
+
The header section is typical of a markdown file with metadata (in toml or yaml format). You have to specify information to the parser about the title, the image and associated papers. Note that it will work if some of these (e.g. papers) are missing. You can modify the image as well.
+
The file here also has a particular structure, with a marker between two paragraphs. This marker indicates that only the first paragraph is displayed on the main page of the Research tab, and the full content follows if you click to learn more about the project.
+
Note that here you can use the basic features of markdown, with headers, bold, italics and so on. You can also include HTML code directly in the markdown and it should work. That said, it may conflict with higher-level instructions in the layout or in the theme and may cause difficulties when building. While it is feasible to add such commands, it is not recommended; people suggest instead using shortcodes (covered tomorrow) or modifying the layout of the website.
+
+
Exercise
+
Take 15 minutes to remove Tim’s material and replace it with the three chapters of your thesis.
+
+
+
+
6 Hosting the website on a server
+
There are many options to host your new website on a server. An easy option, which is free and can be coupled with version control, is to run it on GitHub. Full instructions are available here:
We will simply follow the instructions copied here for hosting a personal page. Note that you can also develop a page for a project.
+
+
GitHub User or Organization Pages
+
+
Step-by-step Instructions
+
+
Create a (e.g. blog) repository on GitHub. This repository will contain Hugo’s content and other source files.
+
Create a .github.io GitHub repository. This is the repository that will contain the fully rendered version of your Hugo website.
+
git clone && cd
+
Paste your existing Hugo project into the new local repository. Make sure your website works locally (hugo server or hugo server -t ) and open your browser to http://localhost:1313.
+
Once you are happy with the results: press Ctrl+C to kill the server. Before proceeding, run rm -rf public to completely remove the public directory.
+
git submodule add -b main https://github.com//.github.io.git public. This creates a git submodule. Now when you run the hugo command to build your site to public, the created public directory will have a different remote origin (i.e. hosted GitHub repository).
+
Make sure the baseURL in your config file is updated with: .github.io
+
+
+
+
Put it Into a Script
+
You’re almost done. In order to automate the next steps, create a deploy.sh script. You can also make it executable with chmod +x deploy.sh.
+
The following are the contents of the deploy.sh script:
+
#!/bin/sh
+
+# If a command fails then the deploy stops
+set -e
+
+printf "\033[0;32mDeploying updates to GitHub...\033[0m\n"
+
+# Build the project.
+hugo   # if using a theme, replace with `hugo -t <YOURTHEME>`
+
+# Go To Public folder
+cd public
+
+# Add changes to git.
+git add .
+
+# Commit changes.
+msg="rebuilding site $(date)"
+if [ -n "$*" ]; then
+    msg="$*"
+fi
+git commit -m "$msg"
+
+
+
+
+
7 Push source and build repos.
+
git push origin main
+
You can then run ./deploy.sh "Your optional commit message" to send changes to .github.io. Note that you likely will want to commit changes to your repository as well.
+
That’s it! Your personal page should be up and running at https://.github.io within a couple minutes.
+
+
Using a theme
+
It is usually a good idea not to modify a template directly, but to have the template and the site in separate folders. The basic concept when doing this is that the config.toml file of the site has to point to the proper folder of the theme.
+
For example
+
theme="template-site"
+themesDir="../.."
+
This means that the template site is in a folder named template-site which is a parent folder of the site folder. Other options are possible.
+
Usually, all the content should go in the site folder, not in the theme folder.
+
+
Exercise 1
+
+
Start modifying the theme to make it look like a website for a Zoo. Choose your preferred color scheme by changing the style= parameter in the config.toml file.
+
Feel free to download some images from Unsplash and save them in the static/img folder. You can then use these images in the carousel, as “testimonial” photos, or as background images for some of the sections. You can add or remove sections from the home page by editing the config.toml file and changing the enable= parameter in the params. segment at the bottom.
+
You can also try to create a new blog entry by adding a new file in the content/blog folder. This file will have a .md extension and will be written in markdown format.
+
+
+
+
+
Customizing a theme
+
+
+
Basics of HTML
+
Core structure of an HTML page
+
<!DOCTYPE html>
+<html>
+<head>
+<title>This is my great website</title>
+<style>
+.css_goes_here{
+
+}
+</style>
+</head>
+<body>
+<h1>Main title</h1>
+<div>Main content goes here</div>
+</body>
+</html>
+
+
A divider, used to organize content into blocks
+
<div></div>
+
+
+
A span, used to organize content or text into sections with different styles. Usually on the same line.
Should the item be on the same line, or in a separate block? This is controlled by the CSS display property, whose main values are:
+
inline, block, inline-block, flex, …
+
+
+
+
+
Exercise 2
+
+
Create a file named custom.css under template-site/my-site/static/css/.
+
Right-click on elements on the web page that you want to modify, then click on Inspect element and try to find CSS properties that you could modify to improve the look of the page. Then, choosing the proper class, add entries in the custom.css file that start with a dot (.) followed by the proper class names.
+
+
.this-class {
+font-size:28px;
+}
+
+
+
+
Partials
+
Partials are snippets of HTML code that could be reused on different places on the website. For example, you will see that the layouts/index.html file in the template-site folder lists all the partials that create the home page.
+
An important point to remember is that Hugo will look for files first in the site’s folders, and if it doesn’t find the files there, it will look for them in the theme’s folder. So site folder layouts and CSS take priority over the theme folder.
+
+
Exercise 3
+
+
Create a new folder template-site/my-site/layouts. In this folder, create a new file named index.html and copy the content of the template-site/layouts/index.html file into it. Remove the testimonials section from the newly created file.
+
Create a new folder template-site/my-site/layouts/partials. In this folder, create a new file named featured-species.html and put the following content into it, replacing the information with the species you selected.
+
+
<div class="featured-species">
+<img src="img/species/frog.jpg" class="species-image" alt="">
+<div class="species-description">
+<h3>Red-Eyed Tree Frog</h3>
+<p>This frog can be found in the tropical rain forests of Costa Rica.</p>
+</div>
+</div>
+
+
Then, add this section to the index.html file created above.
+
+
+
{{ partial "featured_species.html" . }}
+
+
You will probably need to restart the Hugo server to see the changes appear on the site.
+
Now, you need to edit the CSS! In your custom.css file, add the following lines.
Now, create a new folder /template-site/my-site/data/species.
+
In this folder, create a new file named frog.yaml with the following content.
+
+
enable: true
name: "Red-eyed tree frog"
description: "This frog can be found in the forests of Costa Rica"
image: "frog.jpg"
+
+
Find other species photos and add them to the img folder. Then you can add new .yaml files in the data/species folder for each species.
+
+
+
+
+
iFrames
+
An iFrame is an HTML tag that essentially allows you to embed another web page inside your site.
+
+
Exercise 5
+
Find a YouTube video and click on the share option below the video. Find the Embed option and copy the code that starts with <iframe> to a new partial that will be shown on a new page. Surround the iframe with a div tag with class="video". For example:
4-Day Training in Spatial Statistics with Philippe Marchand
+
+
+
Training session on the statistical analysis of spatial data in ecology, hosted by Philippe Marchand (UQAT).
BIOS² hosted an online training session about statistical analysis of spatial data in ecology, led by Prof. Philippe Marchand (UQAT). This 12-hour training was conducted in 4 sessions: January 12, 14, 19 & 21 (2021) from 1:00 to 4:00 pm EST.
+
The content included three types of spatial statistical analyses and their applications to ecology: (1) point pattern analysis to study the distribution of individuals or events in space; (2) geostatistical models to represent the spatial correlation of variables sampled at geolocated points; and (3) areal data models, which apply to measurements taken on areas in space and model spatial relationships as networks of neighbouring regions. The training also included practical exercises using the R statistical programming environment.
+
Philippe Marchand is a professor of ecology and biostatistics at the Institut de recherche sur les forêts, Université du Québec en Abitibi-Témiscamingue (UQAT), and a BIOS² academic member. His research focuses on modeling processes that influence the spatial distribution of populations, including seed dispersal and seedling establishment, animal movement, and the spread of forest diseases.
+
If you wish to consult the lesson materials and follow the exercises at your own pace, you can access them through this link. Basic knowledge of linear regression models and experience fitting them in R is recommended. Original repository can be found here.
In this training, we will discuss three types of spatial analyses: point pattern analysis, geostatistical models and models for areal data.
+
In point pattern analysis, we have point data representing the position of individuals or events in a study area and we assume that all individuals or events have been identified in that area. That analysis focuses on the distribution of the positions of the points themselves. Here are some typical questions for the analysis of point patterns:
+
+
Are the points randomly arranged or clustered?
+
Are two types of points arranged independently?
+
+
Geostatistical models represent the spatial distribution of continuous variables that are measured at certain sampling points. They assume that measurements of those variables at different points are correlated as a function of the distance between the points. Applications of geostatistical models include the smoothing of spatial data (e.g., producing a map of a variable over an entire region based on point measurements) and the prediction of those variables for non-sampled points.
+
Areal data are measurements taken not at points, but for regions of space represented by polygons (e.g. administrative divisions, grid cells). Models representing these types of data define a network linking each region to its neighbours and include correlations in the variable of interest between neighbouring regions.
+
+
+
Stationarity and isotropy
+
Several spatial analyses assume that the variables are stationary in space. As with stationarity in the time domain, this property means that summary statistics (mean, variance and correlations between measures of a variable) do not vary with translation in space. For example, the spatial correlation between two points may depend on the distance between them, but not on their absolute position.
+
In particular, there cannot be a large-scale trend (often called gradient in a spatial context), or this trend must be taken into account before modelling the spatial correlation of residuals.
+
In the case of point pattern analysis, stationarity (also called homogeneity) means that point density does not follow a large-scale trend.
+
In an isotropic statistical model, the spatial correlations between measurements at two points depend only on the distance between the points, not on the direction. In this case, the summary statistics do not change under a spatial rotation of the data.
+
+
+
Georeferenced data
+
Environmental studies increasingly use data from geospatial data sources, i.e. variables measured over a large part of the globe (e.g. climate, remote sensing). The processing of these data requires concepts related to Geographic Information Systems (GIS), which are not covered in this workshop, where we focus on the statistical aspects of spatially varying data.
+
The use of geospatial data does not necessarily mean that spatial statistics are required. For example, we will often extract values of geographic variables at study points to explain a biological response observed in the field. In this case, the use of spatial statistics is only necessary when there is a spatial correlation in the residuals, after controlling for the effect of the predictors.
+
+
+
+
3 Point pattern analysis
+
+
Point pattern and point process
+
A point pattern describes the spatial position (most often in 2D) of individuals or events, represented by points, in a given study area, often called the observation “window”.
+
It is assumed that each point has a negligible spatial extent relative to the distances between the points. More complex methods exist to deal with spatial patterns of objects that have a non-negligible width, but this topic is beyond the scope of this workshop.
+
A point process is a statistical model that can be used to simulate point patterns or explain an observed point pattern.
+
+
+
Complete spatial randomness
+
Complete spatial randomness (CSR) is one of the simplest point patterns, which serves as a null model for evaluating the characteristics of real point patterns. In this pattern, the presence of a point at a given position is independent of the presence of points in a neighbourhood.
+
The process creating this pattern is a homogeneous Poisson process. According to this model, the number of points in any area \(A\) follows a Poisson distribution: \(N(A) \sim \text{Pois}(\lambda A)\), where \(\lambda\) is the intensity of the process (i.e. the density of points per unit area). \(N\) is independent between two disjoint regions, no matter how those regions are defined.
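As a quick illustration (not part of the original lesson code), a CSR pattern can be simulated in spatstat with the rpoispp function; the window and intensity below are arbitrary choices.

library(spatstat)
# Homogeneous Poisson process: intensity lambda = 100 points per unit area
# in a 1 x 1 observation window (illustrative values)
csr_sim <- rpoispp(lambda = 100, win = owin(c(0, 1), c(0, 1)))
plot(csr_sim)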
+
In the graph below, only the pattern on the right is completely random. The pattern on the left shows point aggregation (higher probability of observing a point close to another point), while the pattern in the center shows repulsion (low probability of observing a point very close to another).
+
+
+
+
+
+
+
+
Exploratory or inferential analysis for a point pattern
+
Several summary statistics are used to describe the characteristics of a point pattern. The simplest is the intensity \(\lambda\), which as mentioned above represents the density of points per unit area. If the point pattern is heterogeneous, the intensity is not constant, but depends on the position: \(\lambda(x, y)\).
+
Compared to intensity, which is a first-order statistic, second-order statistics describe how the probability of the presence of a point in a region depends on the presence of other points. The Ripley’s \(K\) function presented in the next section is an example of a second-order summary statistic.
+
Statistical inferences on point patterns usually consist of testing the hypothesis that the point pattern corresponds to a given null model, such as CSR or a more complex null model. Even for the simplest null models, we rarely know the theoretical distribution for a summary statistic of the point pattern under the null model. Hypothesis tests on point patterns are therefore performed by simulation: a large number of point patterns are simulated from the null model and the distribution of the summary statistics of interest for these simulations is compared to their values for the observed point pattern.
+
+
+
Ripley’s K function
+
Ripley’s K function \(K(r)\) is defined as the mean number of points within a circle of radius \(r\) around a point in the pattern, standardized by the intensity \(\lambda\).
+
Under the CSR null model, the mean number of points in any circle of radius \(r\) is \(\lambda \pi r^2\), thus in theory \(K(r) = \pi r^2\) for that model. A higher value of \(K(r)\) means that there is an aggregation of points at the scale \(r\), whereas a lower value means that there is repulsion.
+
In practice, \(K(r)\) is estimated for a specific point pattern by the equation:
+
\[ K(r) = \frac{A}{n(n-1)} \sum_i \sum_{j > i} I \left( d_{ij} \le r \right) w_{ij}\]
+
where \(A\) is the area of the observation window and \(n\) is the number of points in the pattern, so \(n(n-1)\) is the number of distinct pairs of points. We take the sum for all pairs of points of the indicator function \(I\), which takes a value of 1 if the distance between points \(i\) and \(j\) is less than or equal to \(r\). Finally, the term \(w_{ij}\) is used to give extra weight to certain pairs of points to account for edge effects, as discussed in the next section.
+
For example, the graphs below show the estimated \(K(r)\) function for the patterns shown above, for values of \(r\) up to 1/4 of the window width. The red dashed curve shows the theoretical value for CSR and the gray area is an “envelope” produced by 99 simulations of that null pattern. The aggregated pattern shows an excess of neighbours up to \(r = 0.25\) and the pattern with repulsion shows a significant deficit of neighbours for small values of \(r\).
+
+
+
+
+
+
In addition to \(K\), there are other statistics to describe the second-order properties of point patterns, such as the mean distance between a point and its nearest \(N\) neighbours. You can refer to the Wiegand and Moloney (2013) textbook in the references to learn more about different summary statistics for point patterns.
+
+
+
Edge effects
+
In the context of point pattern analysis, edge effects are due to the fact that we have incomplete knowledge of the neighbourhood of points near the edge of the observation window, which can induce a bias in the calculation of statistics such as Ripley’s \(K\).
+
Different methods have been developed to correct the bias due to edge effects. In Ripley’s edge correction method, the contribution of a neighbour \(j\) located at a distance \(r\) from a point \(i\) receives a weight \(w_{ij} = 1/\phi_i(r)\), where \(\phi_i(r)\) is the fraction of the circle of radius \(r\) around \(i\) contained in the observation window. For example, if 2/3 of the circle is in the window, this neighbour counts as 3/2 neighbours in the calculation of a statistic like \(K\).
+
+
Ripley’s method is one of the simplest to correct for edge effects, but is not necessarily the most efficient; in particular, larger weights given to certain pairs of points tend to increase the variance of the calculated statistic. Other correction methods are presented in specialized textbooks, such as Wiegand and Moloney (2013).
+
+
+
Example
+
For this example, we use the dataset semis_xy.csv, which represents the \((x, y)\) coordinates for seedlings of two species (sp, B = birch and P = poplar) in a 15 x 15 m plot.
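A presumed first step is to read the file (the data/ path is an assumption, by analogy with the fir.csv example later in this section):

semis <- read.csv("data/semis_xy.csv")
head(semis)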
x y sp
+1 14.73 0.05 P
+2 14.72 1.71 P
+3 14.31 2.06 P
+4 14.16 2.64 P
+5 14.12 4.15 B
+6 9.88 4.08 B
+
+
+
The spatstat package provides tools for point pattern analysis in R. The first step consists in transforming our data frame into a ppp object (point pattern) with the function of the same name. In this function, we specify which columns contain the coordinates x and y as well as the marks, which here will be the species codes. We also need to specify an observation window (window) using the owin function, where we provide the plot limits in x and y.
+
+
library(spatstat)
+
+semis <- ppp(x = semis$x, y = semis$y, marks = as.factor(semis$sp),
+             window = owin(xrange = c(0, 15), yrange = c(0, 15)))
+semis
+
+
Marked planar point pattern: 281 points
+Multitype, with levels = B, P
+window: rectangle = [0, 15] x [0, 15] units
+
+
+
Marks can be numeric or categorical. Note that for categorical marks as is the case here, the variable must be explicitly converted to a factor.
+
The plot function applied to a point pattern shows a diagram of the pattern.
+
+
plot(semis)
+
+
+
+
+
The intensity function calculates the density of points of each species per unit area (here, per \(m^2\)).
+
+
intensity(semis)
+
+
B P
+0.6666667 0.5822222
+
+
+
To first analyze the distribution of each species separately, we split the pattern with split. Since the pattern contains categorical marks, it is automatically split according to the values of those marks. The result is a list of two point patterns.
+
+
+semis_split <- split(semis)
+plot(semis_split)
+
+
+
+
+
The Kest function calculates Ripley’s \(K\) for a series of distances up to (by default) 1/4 of the width of the window. Here we apply it to the first pattern (birch) by choosing semis_split[[1]]. Note that double square brackets are necessary to choose an item from a list in R.
+
The argument correction = "iso" tells the function to apply Ripley’s correction for edge effects.
+
+
+k <- Kest(semis_split[[1]], correction = "iso")
+plot(k)
+
+
+
+
+
According to this graph, there seems to be an excess of neighbours for distances of 1 m and above. To check if this is a significant difference, we produce a simulation envelope with the envelope function. The first argument of envelope is a point pattern to which the simulations will be compared, the second one is a function to be computed (here, Kest) for each simulated pattern, then we add the arguments of the Kest function (here, only correction).
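A presumed form of this call (defaults are used, so nsim stays at 99):

plot(envelope(semis_split[[1]], Kest, correction = "iso"))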
As indicated by the message, by default the function performs 99 simulations of the null model corresponding to complete spatial randomness (CSR).
+
The observed curve falls outside the envelope of the 99 simulations near \(r = 2\). We must be careful not to interpret too quickly a result that is outside the envelope. Although there is about a 1% probability of obtaining a more extreme result under the null hypothesis at a given distance, the envelope is calculated for a large number of values of \(r\) and is not corrected for multiple comparisons. Thus, a significant difference for a very small range of values of \(r\) may be simply due to chance.
+
+
Exercise 1
+
Looking at the graph of the second point pattern (poplar seedlings), can you predict where Ripley’s \(K\) will be in relation to the null hypothesis of complete spatial randomness? Verify your prediction by calculating Ripley’s \(K\) for this point pattern in R.
+
+
+
+
Effect of heterogeneity
+
The graph below illustrates a heterogeneous point pattern, i.e. it shows a density gradient (more points on the left than on the right).
+
+
+
+
+
+
A density gradient can be confused with an aggregation of points, as can be seen on the graph of the corresponding Ripley’s \(K\). In theory, these are two different processes:
+
+
Heterogeneity: The density of points varies in the study area, for example due to the fact that certain local conditions are more favorable to the presence of the species of interest.
+
Aggregation: The mean density of points is homogeneous, but the presence of one point increases the presence of other points in its vicinity, for example due to positive interactions between individuals.
+
+
However, it may be difficult to differentiate between the two in practice, especially since some patterns may be both heterogeneous and aggregated.
+
Let’s take the example of the poplar seedlings from the previous exercise. The density function applied to a point pattern performs a kernel density estimation of the density of the seedlings across the plot. By default, this function uses a Gaussian kernel with a standard deviation sigma specified in the function, which determines the scale at which density fluctuations are “smoothed”. Here, we use a value of 2 m for sigma and we first represent the estimated density with plot, before overlaying the points (add = TRUE means that the points are added to the existing plot rather than creating a new plot).
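A presumed form of this chunk, using the dens_p name that the text refers to below:

dens_p <- density(semis_split[[2]], sigma = 2)
plot(dens_p)
plot(semis_split[[2]], add = TRUE)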
To measure the aggregation or repulsion of points in a heterogeneous pattern, we must use the inhomogeneous version of the \(K\) statistic (Kinhom in spatstat). This statistic is still equal to the mean number of neighbours within a radius \(r\) of a point in the pattern, but rather than standardizing this number by the overall intensity of the pattern, it is standardized by the local estimated density. As above, we specify sigma = 2 to control the level of smoothing for the varying density estimate.
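A presumed form of the corresponding calculation:

plot(Kinhom(semis_split[[2]], sigma = 2, correction = "iso"))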
Taking into account the heterogeneity of the pattern at a scale sigma of 2 m, there seems to be a deficit of neighbours starting at a radius of about 1.5 m. We can now check whether this deviation is significant.
+
As before, we use envelope to simulate the Kinhom statistic under the null model. However, the null model here is not a homogeneous Poisson process (CSR). It is instead a heterogeneous Poisson process simulated by the function rpoispp(dens_p), i.e. the points are independent of each other, but their density is heterogeneous and given by dens_p. The simulate argument of the envelope function specifies the function used for simulations under the null model; this function must have one argument, here x, even if it is not used.
+
Finally, in addition to the arguments needed for Kinhom, i.e. sigma and correction, we also specify nsim = 199 to perform 199 simulations and nrank = 5 to eliminate the 5 most extreme results on each side of the envelope, i.e. the 10 most extreme results out of 199, to achieve an interval containing about 95% of the probability under the null hypothesis.
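A presumed reconstruction of that call, matching the khet_p object plotted below:

khet_p <- envelope(semis_split[[2]], Kinhom, sigma = 2, correction = "iso",
                   nsim = 199, nrank = 5, simulate = function(x) rpoispp(dens_p))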
Generating 199 simulations by evaluating function ...
+1, 2, 3, ..., 199.
+
+Done.
+
+
plot(khet_p)
+
+
+
+
+
Note: For a hypothesis test based on simulations of a null hypothesis, the \(p\)-value is estimated by \((m + 1)/(n + 1)\), where \(n\) is the number of simulations and \(m\) is the number of simulations where the value of the statistic is more extreme than that of the observed data. This is why the number of simulations is often chosen to be 99, 199, etc.
+
+
Exercise 2
+
Repeat the heterogeneous density estimation and Kinhom calculation with a standard deviation sigma of 5 rather than 2. How does the smoothing level for the density estimation influence the conclusions?
+
To differentiate a variation in the density of points from an interaction (aggregation or repulsion) between these points with this type of analysis, it is generally assumed that the two processes operate at different scales. Typically, we can test whether the points are aggregated at a small scale after accounting for a variation in density at a larger scale.
+
+
+
+
Relationship between two point patterns
+
Let’s consider a case where we have two point patterns, for example the position of trees of two species in a plot (orange and green points in the graph below). Each of the two patterns may or may not present an aggregation of points.
+
+
+
+
+
+
Regardless of whether points are aggregated at the species level, we want to determine whether the two species are arranged independently. In other words, does the probability of observing a tree of one species depend on the presence of a tree of the other species at a given distance?
+
The bivariate version of Ripley’s \(K\) allows us to answer this question. For two patterns noted 1 and 2, the function \(K_{12}(r)\) calculates the mean number of points in pattern 2 within a radius \(r\) from a point in pattern 1, standardized by the density of pattern 2.
+
In theory, this function is symmetrical, so \(K_{12}(r) = K_{21}(r)\) and the result would be the same whether the points of pattern 1 or 2 are chosen as “focal” points for the analysis. However, the estimation of the two quantities for an observed pattern may differ, in particular because of edge effects. The variance of \(K_{12}\) and \(K_{21}\) between simulations of a null model may also differ, so the null hypothesis test may have more or less power depending on the choice of the focal species.
+
The choice of an appropriate null model is important here. In order to determine whether there is a significant attraction or repulsion between the two patterns, the position of one of the patterns must be randomly moved relative to that of the other pattern, while keeping the spatial structure of each pattern taken in isolation.
+
One way to do this randomization is to shift one of the two patterns horizontally and/or vertically by a random distance. The part of the pattern that “comes out” on one side of the window is attached to the other side. This method is called a toroidal shift, because by connecting the top and bottom as well as the left and right of a rectangular surface, we obtain the shape of a torus (a three-dimensional “donut”).
+
+
+
+
+
+
The graph above shows a translation of the green pattern to the right, while the orange pattern remains in the same place. The green points in the shaded area are brought back on the other side. Note that while this method generally preserves the structure of each pattern while randomizing their relative position, it can have some drawbacks, such as dividing point clusters that are near the cutoff point.
+
Let's now check whether the position of the two species (birch and poplar) is independent in our plot. The function Kcross calculates the bivariate \(K_{ij}\); we must specify which type of point (mark) is considered as the focal species \(i\) and which as the neighbouring species \(j\).
+
+
plot(Kcross(semis, i = "P", j = "B", correction = "iso"))
+
+
+
+
+
Here, the observed \(K\) is lower than the theoretical value, indicating a possible repulsion between the two patterns.
+
To determine the envelope of the \(K\) under the null hypothesis of independence of the two patterns, we must specify that the simulations are based on a translation of the patterns. We indicate that the simulations use the function rshift (random translation) with the argument simulate = function(x) rshift(x, which = "B"); here, the x argument in simulate corresponds to the original point pattern and the which argument indicates which of the patterns is translated. As in the previous case, the arguments needed for Kcross, i.e. i, j and correction, must be repeated in the envelope function.
+
+
plot(envelope(semis, Kcross, i = "P", j = "B", correction = "iso",
+              nsim = 199, nrank = 5, simulate = function(x) rshift(x, which = "B")))
+
+
Generating 199 simulations by evaluating function ...
+1, 2, 3, ..., 199.
+
+Done.
+
+
+
+
+
+
Here, the observed curve is totally within the envelope, so we do not reject the null hypothesis of independence of the two patterns.
+
+
Questions
+
+
What would be one reason for our choice to translate the points of the birch rather than poplar?
+
Would the simulations generated by random translation be a good null model if the two patterns were heterogeneous?
+
+
+
+
+
Marked point patterns
+
The fir.csv dataset contains the \((x, y)\) coordinates of 822 fir trees in a 1 hectare plot and their status (A = alive, D = dead) following a spruce budworm outbreak.
+
+
fir <- read.csv("data/fir.csv")
+head(fir)
+
+
x y status
+1 31.50 1.00 A
+2 85.25 30.75 D
+3 83.50 38.50 A
+4 84.00 37.75 A
+5 83.00 33.25 A
+6 33.25 0.25 A
+
+
+
+
fir <- ppp(x = fir$x, y = fir$y, marks = as.factor(fir$status),
+           window = owin(xrange = c(0, 100), yrange = c(0, 100)))
+plot(fir)
+
+
+
+
+
Suppose that we want to check whether fir mortality is independent or correlated between neighbouring trees. How does this question differ from the previous example, where we wanted to know if the position of the points of two species was independent?
+
In the previous example, the independence or interaction between the species referred to the formation of the pattern itself (whether or not seedlings of one species establish near those of the other species). Here, the characteristic of interest (survival) occurs after the establishment of the pattern, assuming that all those trees were alive at first and that some died as a result of the outbreak. So we take the position of the trees as fixed and we want to know whether the distribution of status (dead, alive) among those trees is random or shows a spatial pattern.
+
In Wiegand and Moloney’s textbook, the first situation (establishment of seedlings of two species) is called a bivariate pattern, so it is really two interacting patterns, while the second is a single pattern with a qualitative mark. The spatstat package in R does not differentiate between the two in terms of pattern definition (types of points are always represented by the marks argument), but the analysis methods applied to the two questions differ.
+
In the case of a pattern with a qualitative mark, we can define a mark connection function \(p_{ij}(r)\). For two points separated by a distance \(r\), this function gives the probability that the first point has the mark \(i\) and the second the mark \(j\). Under the null hypothesis where the marks are independent, this probability is equal to the product of the proportions of each mark in the entire pattern, \(p_{ij}(r) = p_i p_j\) independently of \(r\).
+
In spatstat, the mark connection function is computed with the markconnect function, where the marks \(i\) and \(j\) and the type of edge correction must be specified. In our example, we see that two closely spaced points are less likely to have a different status (A and D) than expected under the assumption of random and independent distribution of marks (red dotted line).
+
+
plot(markconnect(fir, i = "A", j = "D", correction = "iso"))
+
+
+
+
+
In this graph, the fluctuations in the function are due to the estimation error of a continuous function of \(r\) from a limited number of discrete point pairs.
+
To simulate the null model in this case, we use the rlabel function, which randomly reassigns the marks among the points of the pattern, keeping the points’ positions fixed.
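A presumed form of that simulation envelope (consistent with the note on rlabel below):

plot(envelope(fir, markconnect, i = "A", j = "D", correction = "iso",
              nsim = 199, nrank = 5, simulate = rlabel))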
Generating 199 simulations by evaluating function ...
+1, 2, 3, ..., 199.
+
+Done.
+
+
+
+
+
+
Note that since the rlabel function has only one required argument corresponding to the original point pattern, it was not necessary to write simulate = function(x) rlabel(x); specifying simulate = rlabel is sufficient.
+
Here are the results for tree pairs of the same status A or D:
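These would presumably be computed with one call per status, for example:

plot(envelope(fir, markconnect, i = "A", j = "A", correction = "iso",
              nsim = 199, nrank = 5, simulate = rlabel))
plot(envelope(fir, markconnect, i = "D", j = "D", correction = "iso",
              nsim = 199, nrank = 5, simulate = rlabel))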
Generating 199 simulations by evaluating function ...
+1, 2, 3, ..., 199.
+
+Done.
+
+
+
+
+
+
It therefore appears that fir mortality due to this outbreak is spatially aggregated, since trees located in close proximity to each other have a greater probability of sharing the same status than predicted by the null hypothesis.
+
+
+
References
+
Fortin, M.-J. and Dale, M.R.T. (2005) Spatial Analysis: A Guide for Ecologists. Cambridge University Press: Cambridge, UK.
+
Wiegand, T. and Moloney, K.A. (2013) Handbook of Spatial Point-Pattern Analysis in Ecology, CRC Press.
+
The dataset in the last example is a subset of the Lake Duparquet Research and Teaching Forest (LDRTF) data, available on Dryad here.
Generating 199 simulations by evaluating function ...
+1, 2, 3, ..., 199.
+
+Done.
+
+
plot(khet_p)
+
+
+
+
+
Here, as we estimate density variations at a larger scale, even after accounting for this variation, the poplar seedlings seem to be aggregated at a small scale.
+
+
+
+
5 Spatial correlation of a variable
+
Correlation between measurements of a variable taken at nearby points often occurs in environmental data. This principle is sometimes referred to as the “first law of geography” and is expressed in the following quote from Waldo Tobler: “Everything is related to everything else, but near things are more related than distant things”.
+
In statistics, we often refer to autocorrelation as the correlation between measurements of the same variable taken at different times (temporal autocorrelation) or places (spatial autocorrelation).
+
+
Intrinsic or induced dependence
+
There are two basic types of spatial dependence on a measured variable \(y\): an intrinsic dependence on \(y\), or a dependence induced by external variables influencing \(y\), which are themselves spatially correlated.
+
For example, suppose that the abundance of a species is correlated between two sites located near each other:
+
+
this spatial dependence can be induced if it is due to a spatial correlation of habitat factors that are favorable or unfavorable to the species;
+
or it can be intrinsic if it is due to the dispersal of individuals to nearby sites.
+
+
In many cases, both types of dependence affect a given variable.
+
If the dependence is simply induced and the external variables that cause it are included in the model explaining \(y\), then the model residuals will be independent and we can use all the methods already seen that ignore spatial correlation.
+
However, if the dependence is intrinsic or due to unmeasured external factors, then the spatial correlation of the residuals in the model will have to be taken into account.
+
+
+
Different ways to model spatial effects
+
In this training, we will directly model the spatial correlations of our data. It is useful to compare this approach to other ways of including spatial aspects in a statistical model.
+
First, we could include predictors in the model that represent position (e.g., longitude, latitude). Such predictors may be useful for detecting a systematic large-scale trend or gradient, whether or not the trend is linear (e.g., with a generalized additive model).
+
In contrast to this approach, the models we will see now serve to model a spatial correlation in the random fluctuations of a variable (i.e., in the residuals after removing any systematic effect).
+
Mixed models use random effects to represent the non-independence of data on the basis of their grouping, i.e., after accounting for systematic fixed effects, data from the same group are more similar (their residual variation is correlated) than data from different groups. These groups were sometimes defined according to spatial criteria (observations grouped into sites).
+
However, in the context of a random group effect, all groups are equally different from each other: for example, two sites 100 km apart are no more or less similar than two sites 2 km apart.
+
The methods we will see here and in the next parts of the training therefore allow us to model non-independence on a continuous scale (closer = more correlated) rather than just discrete (hierarchy of groups).
+
+
+
+
6 Geostatistical models
+
Geostatistics refers to a group of techniques that originated in the earth sciences. Geostatistics is concerned with variables that are continuously distributed in space and where a number of points are sampled to estimate this distribution. A classic example of these techniques comes from the mining field, where the aim was to create a map of the concentration of ore at a site from samples taken at different points on the site.
+
For these models, we will assume that \(z(x, y)\) is a stationary spatial variable measured at points with coordinates \(x\) and \(y\).
+
+
Variogram
+
A central aspect of geostatistics is the estimation of the variogram \(\gamma_z\). The variogram is equal to half the mean square difference between the values of \(z\) for two points \((x_i, y_i)\) and \((x_j, y_j)\) separated by a distance \(h\):
\[\gamma_z(h) = \frac{1}{2} \text{E} \left[ \left( z(x_i, y_i) - z(x_j, y_j) \right)^2 \right]_{d_{ij} = h}\]
In this equation, the \(\text{E}\) function with the index \(d_{ij}=h\) designates the statistical expectation (i.e., the mean) of the squared deviation between the values of \(z\) for points separated by a distance \(h\).
+
If we want instead to express the autocorrelation \(\rho_z(h)\) between measures of \(z\) separated by a distance \(h\), it is related to the variogram by the equation:
+
\[\gamma_z = \sigma_z^2(1 - \rho_z)\] ,
+
where \(\sigma_z^2\) is the global variance of \(z\).
+
Note that \(\gamma_z = \sigma_z^2\) when we reach a distance where the measurements of \(z\) are independent, so \(\rho_z = 0\). In this case, we can see that \(\gamma_z\) is similar to a variance, although it is sometimes called “semivariogram” or “semivariance” because of the 1/2 factor in the above equation.
+
+
+
Theoretical models for the variogram
+
Several parametric models have been proposed to represent the spatial correlation as a function of the distance between sampling points. Let us first consider a correlation that decreases exponentially:
+
\[\rho_z(h) = e^{-h/r}\]
+
Here, \(\rho_z = 1\) for \(h = 0\) and the correlation is multiplied by \(1/e \approx 0.37\) each time the distance increases by \(r\). In this context, \(r\) is called the range of the correlation.
+
From the above equation, we can calculate the corresponding variogram.
+
\[\gamma_z(h) = \sigma_z^2 (1 - e^{-h/r})\]
+
Here is a graphical representation of this variogram.
+
+
+
+
+
+
Because of the exponential function, the value of \(\gamma\) at large distances approaches the global variance \(\sigma_z^2\) without exactly reaching it. This asymptote is called a sill in the geostatistical context and is represented by the symbol \(s\).
+
Finally, it is sometimes unrealistic to assume a perfect correlation when the distance tends towards 0, because of a possible variation of \(z\) at a very small scale. A nugget effect, denoted \(n\), can be added to the model so that \(\gamma\) approaches \(n\) (rather than 0) if \(h\) tends towards 0. The term nugget comes from the mining origin of these techniques, where a nugget could be the source of a sudden small-scale variation in the concentration of a mineral.
+
By adding the nugget effect, the remainder of the variogram is “compressed” to keep the same sill, resulting in the following equation.
+
\[\gamma_z(h) = n + (s - n) (1 - e^{-h/r})\]
+
In the gstat package that we use below, the term \((s-n)\) is called a partial sill or psill for the exponential portion of the variogram.
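As a quick sketch (not part of the original code), the shape of such a variogram can be visualized with gstat's vgm and variogramLine functions; the parameter values below are arbitrary.

library(gstat)
# Exponential variogram with nugget n = 20, partial sill s - n = 80 and range r = 10
v_exp <- vgm(psill = 80, model = "Exp", range = 10, nugget = 20)
v_line <- variogramLine(v_exp, maxdist = 50)
plot(v_line$dist, v_line$gamma, type = "l", xlab = "h", ylab = expression(gamma))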
+
+
+
+
+
+
In addition to the exponential model, two other common theoretical models for the variogram are the Gaussian model (where the correlation follows a half-normal curve), and the spherical model (where the variogram increases linearly at the start and then curves and reaches the plateau at a distance equal to its range \(r\)). The spherical model thus allows the correlation to be exactly 0 at large distances, rather than gradually approaching zero in the case of the other models.
For the spherical model, \(\rho = 0\) and \(\gamma = s\) if \(h \ge r\).
+
+
+
+
+
+
+
+
Empirical variogram
+
To estimate \(\gamma_z(h)\) from empirical data, we need to define distance classes, thus grouping different distances within a margin of \(\pm \delta\) around a distance \(h\), then calculating the mean square deviation for the pairs of points in that distance class.
The same framework applies to regression models with spatially correlated residuals, which take the general form
\[v = \beta_0 + \sum_i \beta_i u_i + z + \epsilon\]
Here, \(v\) designates the response variable and \(u\) the predictors, to avoid confusion with the spatial coordinates \(x\) and \(y\).
+
In addition to the residual \(\epsilon\) that is independent between observations, the model includes a term \(z\) that represents the spatially correlated portion of the residual variance.
+
Here are suggested steps to apply this type of model:
+
+
Fit the regression model without spatial correlation.
+
Verify the presence of spatial correlation from the empirical variogram of the residuals.
+
Fit one or more regression models with spatial correlation and select the one that shows the best fit to the data.
+
+
+
+
+
7 Geostatistical models in R
+
The gstat package contains functions related to geostatistics. For this example, we will use the oxford dataset from this package, which contains measurements of physical and chemical properties for 126 soil samples from a site, along with their coordinates XCOORD and YCOORD.
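A presumed form of the data loading and mapping step (the exact aesthetic mapping is an assumption):

library(gstat)
library(ggplot2)
data(oxford)  # soil dataset included with gstat
# Map the magnesium concentration as a function of position
# (axes swapped, as noted below)
ggplot(oxford, aes(x = YCOORD, y = XCOORD, size = MG1)) +
    geom_point() +
    coord_fixed()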
Note that the \(x\) and \(y\) axes have been inverted to save space. The coord_fixed() function of ggplot2 ensures that the scale is the same on both axes, which is useful for representing spatial data.
+
We can immediately see that these measurements were taken on a 100 m grid. It seems that the magnesium concentration is spatially correlated, although it may be a correlation induced by another variable. In particular, we know that the concentration of magnesium is negatively related to the soil pH (PH1).
+
+
ggplot(oxford, aes(x = PH1, y = MG1)) +
+geom_point()
+
+
+
+
+
The variogram function of gstat is used to estimate a variogram from empirical data. Here is the result obtained for the variable MG1.
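A presumed form of this call (the formula and the locations argument are described just below):

var_mg <- variogram(MG1 ~ 1, locations = ~ XCOORD + YCOORD, data = oxford)
var_mg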
The formula MG1 ~ 1 indicates that no linear predictor is included in this model, while the argument locations indicates which variables in the data frame correspond to the spatial coordinates.
+
In the resulting table, gamma is the value of the variogram for the distance class centered on dist, while np is the number of pairs of points in that class. Here, since the points are located on a grid, we obtain regular distance classes (e.g.: 100 m for neighboring points on the grid, 141 m for diagonal neighbors, etc.).
+
Here, we limit ourselves to the estimation of isotropic variograms, i.e. the variogram depends only on the distance between the two points and not on the direction. Although we do not have time to see it today, it is possible with gstat to estimate the variogram separately in different directions.
+
We can illustrate the variogram with plot.
+
+
plot(var_mg, col = "black")
+
+
+
+
+
If we want to estimate the residual spatial correlation of MG1 after including the effect of PH1, we can add that predictor to the formula.
+
+
var_mg <- variogram(MG1 ~ PH1, locations = ~ XCOORD + YCOORD, data = oxford)
+plot(var_mg, col = "black")
+
+
+
+
+
Including the effect of pH, the range of the spatial correlation seems to decrease, while the plateau is reached around 300 m. It even seems that the variogram decreases beyond 400 m. In general, we assume that the variance between two points does not decrease with distance, unless there is a periodic spatial pattern.
+
The function fit.variogram accepts as arguments a variogram estimated from the data, as well as a theoretical model described in a vgm function, and then estimates the parameters of that model according to the data. The fitting is done by the method of least squares.
+
For example, vgm("Exp") means we want to fit an exponential model.
+
+
vfit <- fit.variogram(var_mg, vgm("Exp"))
+vfit
+
+
model psill range
+1 Nug 0.000 0.00000
+2 Exp 1951.496 95.11235
+
+
+
There is no nugget effect, because psill = 0 for the Nug (nugget) part of the model. The exponential part has a sill at 1951 and a range of 95 m.
+
To compare different models, a vector of model names can be given to vgm. In the following example, we include the exponential, Gaussian (“Gau”) and spherical (“Sph”) models.
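A presumed form of this comparison, re-using the var_mg variogram from above:

vfit <- fit.variogram(var_mg, vgm(c("Exp", "Gau", "Sph")))
vfit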
model psill range
+1 Nug 0.000 0.00000
+2 Exp 1951.496 95.11235
+
+
+
The function gives us the result of the model with the best fit (lowest sum of squared deviations), which here is the same exponential model.
+
Finally, we can superimpose the theoretical model and the empirical variogram on the same graph.
+
+
plot(var_mg, vfit, col = "black")
+
+
+
+
+
+
Regression with spatial correlation
+
We have seen above that the gstat package allows us to estimate the variogram of the residuals of a linear model. In our example, the magnesium concentration was modeled as a function of pH, with spatially correlated residuals.
+
Another tool to fit this same type of model is the gls function of the nlme package, which is included with the installation of R.
+
This function applies the generalized least squares method to fit linear regression models when the residuals are not independent or when the residual variance is not the same for all observations. Since the estimates of the coefficients depend on the estimated correlations between the residuals and the residuals themselves depend on the coefficients, the model is fitted by an iterative algorithm:
+
+
A classical linear regression model (without correlation) is fitted to obtain residuals.
+
The spatial correlation model (variogram) is fitted with those residuals.
+
The regression coefficients are re-estimated, now taking into account the correlations.
+
+
Steps 2 and 3 are repeated until the estimates are stable at a desired precision.
+
Here is the application of this method to the same model for the magnesium concentration in the oxford dataset. In the correlation argument of gls, we specify an exponential correlation model as a function of our spatial coordinates and we include a possible nugget effect.
+
In addition to the exponential correlation corExp, the gls function can also estimate a Gaussian (corGaus) or spherical (corSpher) model.
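A presumed form of this model fit (the object name gls_mg is an assumption):

library(nlme)
gls_mg <- gls(MG1 ~ PH1, data = oxford,
              correlation = corExp(form = ~ XCOORD + YCOORD, nugget = TRUE))
summary(gls_mg)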
Generalized least squares fit by REML
+ Model: MG1 ~ PH1
+ Data: oxford
+ AIC BIC logLik
+ 1278.65 1292.751 -634.325
+
+Correlation Structure: Exponential spatial correlation
+ Formula: ~XCOORD + YCOORD
+ Parameter estimate(s):
+ range nugget
+478.0322964 0.2944753
+
+Coefficients:
+ Value Std.Error t-value p-value
+(Intercept) 391.1387 50.42343 7.757084 0
+PH1 -41.0836 6.15662 -6.673079 0
+
+ Correlation:
+ (Intr)
+PH1 -0.891
+
+Standardized residuals:
+ Min Q1 Med Q3 Max
+-2.1846957 -0.6684520 -0.3687813 0.4627580 3.1918604
+
+Residual standard error: 53.8233
+Degrees of freedom: 126 total; 124 residual
+
+
+
To compare this result with the adjusted variogram above, the parameters given by gls must be transformed. The range has the same meaning in both cases and corresponds to 478 m for the result of gls. The global variance of the residuals is the square of Residual standard error. The nugget effect here (0.294) is expressed as a fraction of that variance. Finally, to obtain the partial sill of the exponential part, the nugget effect must be subtracted from the total variance.
+
After performing these calculations, we can give these parameters to the vgm function of gstat to superimpose this variogram estimated by gls on our variogram of the residuals of the classical linear model.
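A presumed version of this conversion, using the estimates shown in the gls output above (the name gls_vgm reappears in the kriging section below):

gls_range <- 478.03            # range from the gls output
gls_var <- 53.823^2            # total residual variance (residual standard error squared)
gls_nugget <- 0.2945 * gls_var # nugget expressed as a fraction of that variance
gls_vgm <- vgm(psill = gls_var - gls_nugget, model = "Exp",
               range = gls_range, nugget = gls_nugget)
plot(var_mg, gls_vgm, col = "black")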
Does the model fit the data less well here? In fact, this empirical variogram represented by the points was obtained from the residuals of the linear model ignoring the spatial correlation, so it is a biased estimate of the actual spatial correlations. The method is still adequate to quickly check if spatial correlations are present. However, to simultaneously fit the regression coefficients and the spatial correlation parameters, the generalized least squares (GLS) approach is preferable and will produce more accurate estimates.
+
Finally, note that the result of the gls model also gives the AIC, which we can use to compare the fit of different models (with different predictors or different forms of spatial correlation).
+
+
+
Exercise
+
The bryo_belg.csv dataset is adapted from the data of this study:
+
+
Neyens, T., Diggle, P.J., Faes, C., Beenaerts, N., Artois, T. et Giorgi, E. (2019) Mapping species richness using opportunistic samples: a case study on ground-floor bryophyte species richness in the Belgian province of Limburg. Scientific Reports 9, 19122. https://doi.org/10.1038/s41598-019-55593-x
+
+
This data frame shows the species richness of ground bryophytes (richness) for different sampling points in the Belgian province of Limburg, with their position (x, y) in km, along with information on the proportion of forest (forest) and wetlands (wetland) in a 1 km\(^2\) cell containing the sampling point.
For this exercise, we will use the square root of the species richness as the response variable. The square root transformation often helps homogenize the variance of count data so that a linear regression can be applied.
+
+
Fit a linear model of the transformed species richness to the proportion of forest and wetlands, without taking into account spatial correlations. What is the effect of the two predictors in this model?
+
Calculate the empirical variogram of the model residuals in (a). Does there appear to be a spatial correlation between the points?
+
+
Note: The cutoff argument to the variogram function specifies the maximum distance at which the variogram is calculated. You can manually adjust this value to get a good view of the sill.
+
+
Re-fit the linear model in (a) with the gls function in the nlme package, trying different types of spatial correlations (exponential, Gaussian, spherical). Compare the models (including the one without spatial correlation) with the AIC.
+
What is the effect of the proportion of forests and wetlands according to the model in (c)? Explain the differences between the conclusions of this model and the model in (a).
+
+
+
+
+
8 Kriging
+
As mentioned before, a common application of geostatistical models is to predict the value of the response variable at unsampled locations, a form of spatial interpolation called kriging (pronounced with a hard “g”).
+
There are three basic types of kriging based on the assumptions made about the response variable:
+
+
Ordinary kriging: Stationary variable with an unknown mean.
+
Simple kriging: Stationary variable with a known mean.
+
Universal kriging: Variable with a trend given by a linear or non-linear model.
+
+
For all kriging methods, the predictions at a new point are a weighted mean of the values at known points. These weights are chosen so that kriging provides the best linear unbiased prediction of the response variable, if the model assumptions (in particular the variogram) are correct. That is, among all possible unbiased predictions, the weights are chosen to give the minimum mean square error. Kriging also provides an estimate of the uncertainty of each prediction.
+
While we will not present the detailed kriging equations here, the weights depend both on the correlations (estimated by the variogram) between the sampled points and the new point, and on the correlations between the sampled points themselves. In other words, sampled points near the new point are given more weight, but isolated sampled points are also given more weight, because sampled points close to each other provide redundant information.
+
Kriging is an interpolation method, so the prediction at a sampled point will always be equal to the measured value (the measurement is assumed to have no error, only spatial variation). However, in the presence of a nugget effect, any small displacement from the sampled location will show variability according to the nugget.
+
In the example below, we generate a new dataset composed of randomly-generated (x, y) coordinates within the study area as well as randomly-generated pH values based on the oxford data. We then apply the function krige to predict the magnesium values at these new points. Note that we specify the variogram derived from the GLS results in the model argument to krige.
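A presumed version of these steps (the number of points and the way pH values are drawn are assumptions; only the names new_points and gls_vgm are given by the text):

set.seed(42)
new_points <- data.frame(
    XCOORD = runif(50, min(oxford$XCOORD), max(oxford$XCOORD)),
    YCOORD = runif(50, min(oxford$YCOORD), max(oxford$YCOORD)),
    PH1 = runif(50, min(oxford$PH1), max(oxford$PH1))
)
kriged_mg <- krige(MG1 ~ PH1, locations = ~ XCOORD + YCOORD, data = oxford,
                   newdata = new_points, model = gls_vgm)
head(kriged_mg)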
The result of krige includes the new point coordinates, the prediction of the variable var1.pred along with its estimated variance var1.var. In the graph below, we show the mean MG1 predictions from kriging (triangles) along with the measurements (circles).
The estimated mean and variance from kriging can be used to simulate possible values of the variable at each new point, conditional on the sampled values. In the example below, we performed 4 conditional simulations by adding the argument nsim = 4 to the same krige instruction.
+
+
sim_mg <- krige(MG1 ~ PH1, locations = ~ XCOORD + YCOORD, data = oxford,
+                newdata = new_points, model = gls_vgm, nsim = 4)
+
+
drawing 4 GLS realisations of beta...
+[using conditional Gaussian simulation]
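The output that follows corresponds to the solution of the bryo_belg exercise above; a presumed first step (file path assumed from the exercise statement) is:

bryo_belg <- read.csv("data/bryo_belg.csv")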
bryo_lm <- lm(sqrt(richness) ~ forest + wetland, data = bryo_belg)
+summary(bryo_lm)
+
+
+Call:
+lm(formula = sqrt(richness) ~ forest + wetland, data = bryo_belg)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-1.8847 -0.4622 0.0545 0.4974 2.3116
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 2.34159 0.08369 27.981 < 2e-16 ***
+forest 1.11883 0.13925 8.034 9.74e-15 ***
+wetland -0.59264 0.17216 -3.442 0.000635 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 0.7095 on 417 degrees of freedom
+Multiple R-squared: 0.2231, Adjusted R-squared: 0.2193
+F-statistic: 59.86 on 2 and 417 DF, p-value: < 2.2e-16
+
+
+
The proportion of forest has a significant positive effect and the proportion of wetlands has a significant negative effect on bryophyte richness.
+
+
plot(variogram(sqrt(richness) ~ forest + wetland, locations = ~ x + y,
+               data = bryo_belg, cutoff = 50), col = "black")
+
+
+
+
+
The variogram is increasing from 0 to at least 40 km, so there appears to be spatial correlations in the model residuals.
+
+
bryo_exp <- gls(sqrt(richness) ~ forest + wetland, data = bryo_belg,
+                correlation = corExp(form = ~ x + y, nugget = TRUE))
+bryo_gaus <- gls(sqrt(richness) ~ forest + wetland, data = bryo_belg,
+                 correlation = corGaus(form = ~ x + y, nugget = TRUE))
+bryo_spher <- gls(sqrt(richness) ~ forest + wetland, data = bryo_belg,
+                  correlation = corSpher(form = ~ x + y, nugget = TRUE))
+
+
+
AIC(bryo_lm)
+
+
[1] 908.6358
+
+
AIC(bryo_exp)
+
+
[1] 867.822
+
+
AIC(bryo_gaus)
+
+
[1] 870.9592
+
+
AIC(bryo_spher)
+
+
[1] 866.9117
+
+
+
The spherical model has the smallest AIC.
+
+
summary(bryo_spher)
+
+
Generalized least squares fit by REML
+ Model: sqrt(richness) ~ forest + wetland
+ Data: bryo_belg
+ AIC BIC logLik
+ 866.9117 891.1102 -427.4558
+
+Correlation Structure: Spherical spatial correlation
+ Formula: ~x + y
+ Parameter estimate(s):
+ range nugget
+43.1727664 0.6063187
+
+Coefficients:
+ Value Std.Error t-value p-value
+(Intercept) 2.0368769 0.2481636 8.207800 0.000
+forest 0.6989844 0.1481690 4.717481 0.000
+wetland -0.2441130 0.1809118 -1.349348 0.178
+
+ Correlation:
+ (Intr) forest
+forest -0.251
+wetland -0.235 0.241
+
+Standardized residuals:
+ Min Q1 Med Q3 Max
+-1.75204183 -0.06568688 0.61415597 1.15240370 3.23322743
+
+Residual standard error: 0.7998264
+Degrees of freedom: 420 total; 417 residual
+
+
+
Both effects are smaller in magnitude and the effect of wetlands is no longer significant. As is the case for other types of non-independent residuals, the “effective sample size” here is less than the number of points, since points close to each other provide redundant information. Therefore, the relationship between the predictors and the response is less clear than suggested by the model that assumed all these points were independent.
+
Note that the results for all three gls models are quite similar, so the choice to include spatial correlations was more important than the exact shape assumed for the variogram.
+
+
+
10 Areal data
+
Areal data are variables measured for regions of space, defined by polygons. This type of data is more common in the social sciences, human geography and epidemiology, where data is often available at the scale of administrative divisions.
+
This type of data also appears frequently in natural resource management. For example, the following map shows the forest management units of the Ministère de la Forêt, de la Faune et des Parcs du Québec.
+
+
Suppose that a variable is available at the level of these management units. How can we model the spatial correlation between units that are spatially close together?
+
One option would be to apply the geostatistical methods seen before, for example by calculating the distance between the centers of the polygons.
+
Another option, which is better suited to areal data, is to define a network where each region is connected to neighbouring regions by a link. It is then assumed that the variables are directly correlated between neighbouring regions only. (Note, however, that direct correlations between immediate neighbours also generate indirect correlations for a chain of neighbours.)
+
In this type of model, the correlation is not necessarily the same from one link to another. In this case, each link in the network can be associated with a weight representing its importance for the spatial correlation. We represent these weights by a matrix \(W\) where \(w_{ij}\) is the weight of the link between regions \(i\) and \(j\). A region has no link with itself, so \(w_{ii} = 0\).
+
A simple choice for \(W\) is to assign a weight equal to 1 if the regions are neighbours, otherwise 0 (binary weight).
+
In addition to land divisions represented by polygons, another example of areal data consists of a grid where the variable is calculated for each cell of the grid. In this case, a cell generally has 4 or 8 neighbouring cells, depending on whether diagonals are included or not.
+
+
+
11 Moran’s I
+
Before discussing spatial autocorrelation models, we present Moran’s \(I\) statistic, which allows us to test whether a significant correlation is present between neighbouring regions.
+
Moran's \(I\) is a spatial autocorrelation coefficient of \(z\), weighted by the \(w_{ij}\). It therefore takes values between -1 and 1.
\[I = \frac{N}{\sum_i \sum_j w_{ij}} \frac{\sum_i \sum_j w_{ij} (z_i - \bar{z}) (z_j - \bar{z})}{\sum_i (z_i - \bar{z})^2}\]
In this equation, we recognize the expression of a correlation, which is the product of the deviations from the mean for two variables \(z_i\) and \(z_j\), divided by the product of their standard deviations (it is the same variable here, so we get the variance). The contribution of each pair \((i, j)\) is multiplied by its weight \(w_{ij}\) and the term on the left (the number of regions \(N\) divided by the sum of the weights) ensures that the result is bounded between -1 and 1.
+
Since the distribution of \(I\) is known in the absence of spatial autocorrelation, this statistic serves to test the null hypothesis that there is no spatial correlation between neighbouring regions.
+
Although we will not see an example in this course, Moran’s \(I\) can also be applied to point data. In this case, we divide the pairs of points into distance classes and calculate \(I\) for each distance class; the weight \(w_{ij} = 1\) if the distance between \(i\) and \(j\) is in the desired distance class, otherwise 0.
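As a small self-contained sketch (not from the original code), Moran's \(I\) can be computed and tested with the spdep package; the toy grid and random variable below are only illustrative.

library(spdep)
nb <- cell2nb(5, 5)               # neighbour list for a 5 x 5 grid (rook neighbours)
w <- nb2listw(nb, style = "B")    # binary weights: 1 for neighbours, 0 otherwise
z <- rnorm(25)                    # toy variable with no spatial structure
moran.test(z, listw = w)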
+
+
+
12 Spatial autoregression models
+
Let us recall the formula for a linear regression with spatial dependence:
\[v = \beta_0 + \sum_i \beta_i u_i + z + \epsilon\]
where \(z\) is the portion of the residual variance that is spatially correlated.
+
There are two main types of autoregressive models to represent the spatial dependence of \(z\): the conditional autoregressive (CAR) model and the simultaneous autoregressive (SAR) model.
+
+
Conditional autoregressive (CAR) model
+
In the conditional autoregressive model, the value of \(z_i\) for the region \(i\) follows a normal distribution: its mean depends on the values \(z_j\) of neighbouring regions, multiplied by the weights \(w_{ij}\) and a correlation coefficient \(\rho\), while its standard deviation \(\sigma_{z_i}\) may vary from one region to another.
\[z_i \sim \text{N}\left(\rho \sum_j w_{ij} z_j, \sigma_{z_i} \right)\]
In this model, if \(w_{ij}\) is a binary matrix (0 for non-neighbours, 1 for neighbours), then \(\rho\) is the coefficient of partial correlation between neighbouring regions. This is similar to a first-order autoregressive model in the context of time series, where the autoregression coefficient indicates the partial correlation.
+
+
+
Simultaneous autoregressive (SAR) model
+
In the simultaneous autoregressive model, the value of \(z_i\) is given directly by the sum of contributions from neighbouring values \(z_j\), multiplied by \(\rho w_{ij}\), with an independent residual \(\nu_i\) of standard deviation \(\sigma_z\).
+
\[z_i = \sum_j \rho w_{ij} z_j + \nu_i\]
+
At first glance, this looks like a temporal autoregressive model. However, there is an important conceptual difference. For temporal models, the causal influence is directed in only one direction: \(v(t-2)\) affects \(v(t-1)\) which then affects \(v(t)\). For a spatial model, each \(z_j\) that affects \(z_i\) depends in turn on \(z_i\). Thus, to determine the joint distribution of \(z\), a system of equations must be solved simultaneously (hence the name of the model).
+
For this reason, although this model resembles the formula of the CAR model, the solutions of the two models differ and, in the case of SAR, the coefficient \(\rho\) is not directly equal to the partial correlation due to each neighbouring region.
+
For more details on the mathematical aspects of these models, see the article by Ver Hoef et al. (2018) listed in the references.
+
For the moment, we will consider SAR and CAR as two types of possible models to represent a spatial correlation on a network. We can always fit several models and compare them with the AIC to choose the best form of correlation or the best weight matrix.
+
The CAR and SAR models share an advantage over geostatistical models in terms of efficiency. In a geostatistical model, spatial correlations are defined between each pair of points, although they become negligible as distance increases. For a CAR or SAR model, only neighbouring regions contribute and most weights are equal to 0, making these models faster to fit than a geostatistical model when the data are massive.
+
+
+
+
13 Analysis of areal data in R
+
To illustrate the analysis of areal data in R, we load the packages sf (to read geospatial data), spdep (to define spatial networks and calculate Moran’s \(I\)) and spatialreg (for SAR and CAR models).
+
+
library(sf)
+library(spdep)
+library(spatialreg)
+
+
As an example, we will use a dataset that presents some of the results of the 2018 provincial election in Quebec, along with population characteristics of each riding. These data are stored in a shapefile (.shp), which we can read with the read_sf function of the sf package.
Note: The dataset is actually composed of 4 files with the extensions .dbf, .prj, .shp and .shx, but it is sufficient to write the name of the .shp file in read_sf.
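A minimal sketch of this step (the path data/elect2018.shp is an assumption; adjust it to wherever the files are stored):

elect2018 <- read_sf("data/elect2018.shp")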
+
The columns of the dataset are, in order:
+
+
the name of the electoral riding (circ);
+
four characteristics of the population (age_moy = mean age, pct_frn = fraction of the population that speaks mainly French at home, pct_prp = fraction of households that own their home, rev_med = median income);
+
four columns showing the fraction of votes obtained by the main parties (CAQ, PQ, PLQ, QS);
+
a geometry column that contains the geometric object (multipolygon) corresponding to the riding.
+
+
To illustrate one of the variables on a map, we call the plot function with the name of the column in square brackets and quotation marks.
+
+
plot(elect2018["rev_med"])
+
+
+
+
+
In this example, we want to model the fraction of votes obtained by the CAQ based on the characteristics of the population in each riding and taking into account the spatial correlations between neighbouring ridings.
+
+
Definition of the neighbourhood network
+
The poly2nb function of the spdep package defines a neighbourhood network from polygons. The result vois is a list of 125 elements where each element contains the indices of the neighbouring (bordering) polygons of a given polygon.
+
+
vois <-poly2nb(elect2018)
+vois[[1]]
+
+
[1] 2 37 63 88 101 117
+
+
+
Thus, the first riding (Abitibi-Est) has 6 neighbouring ridings, for which the names can be found as follows:
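For example, a sketch of how to retrieve those names by indexing the circ column with the neighbour indices:

elect2018$circ[vois[[1]]]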
We can illustrate this network by extracting the coordinates of the center of each district, creating a blank map with plot(elect2018["geometry"]), then adding the network as an additional layer with plot(vois, add = TRUE, coords = coords).
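A sketch of those steps; using st_centroid and st_coordinates from the sf package to obtain the centroids is an assumption, while the two plot calls follow the description above.

# Coordinates of each riding's centroid
coords <- st_coordinates(st_centroid(elect2018))
# Blank map of the ridings, then overlay the neighbourhood network
plot(elect2018["geometry"])
plot(vois, add = TRUE, coords = coords)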
We still have to add weights to each network link with the nb2listw function. The style of weights “B” corresponds to binary weights, i.e. 1 for the presence of a link and 0 for the absence of a link between two ridings.
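A sketch of that step (the object name poids matches the weights object shown in the test output below):

poids <- nb2listw(vois, style = "B")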
+
Once these weights are defined, we can verify with Moran’s test whether there is a significant autocorrelation of votes obtained by the CAQ between neighbouring ridings.
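The test can be run as follows (the column propCAQ matches the variable named in the output below):

moran.test(elect2018$propCAQ, poids)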
+ Moran I test under randomisation
+
+data: elect2018$propCAQ
+weights: poids
+
+Moran I statistic standard deviate = 13.148, p-value < 2.2e-16
+alternative hypothesis: greater
+sample estimates:
+Moran I statistic Expectation Variance
+ 0.680607768 -0.008064516 0.002743472
+
+
+
The value \(I = 0.68\) is very significant judging by the \(p\)-value of the test.
+
Let’s verify if the spatial correlation persists after taking into account the four characteristics of the population, therefore by inspecting the residuals of a linear model including these four predictors.
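The model call below matches the summary that follows; the object name elect_lm is the one reused later with moran.test.

elect_lm <- lm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, data = elect2018)
summary(elect_lm)

moran.test(residuals(elect_lm), poids)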
+Call:
+lm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
+ data = elect2018)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-30.9890 -4.4878 0.0562 6.2653 25.8146
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 1.354e+01 1.836e+01 0.737 0.463
+age_moy -9.170e-01 3.855e-01 -2.378 0.019 *
+pct_frn 4.588e+01 5.202e+00 8.820 1.09e-14 ***
+pct_prp 3.582e+01 6.527e+00 5.488 2.31e-07 ***
+rev_med -2.624e-05 2.465e-04 -0.106 0.915
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 9.409 on 120 degrees of freedom
+Multiple R-squared: 0.6096, Adjusted R-squared: 0.5965
+F-statistic: 46.84 on 4 and 120 DF, p-value: < 2.2e-16
+
+
moran.test(residuals(elect_lm), poids)
+
+
+ Moran I test under randomisation
+
+data: residuals(elect_lm)
+weights: poids
+
+Moran I statistic standard deviate = 6.7047, p-value = 1.009e-11
+alternative hypothesis: greater
+sample estimates:
+Moran I statistic Expectation Variance
+ 0.340083290 -0.008064516 0.002696300
+
+
+
Moran’s \(I\) has decreased but remains significant, so some of the previous correlation was induced by these predictors, but there remains a spatial correlation due to other factors.
+
+
+
Spatial autoregression models
+
Finally, we fit SAR and CAR models to these data with the spautolm (spatial autoregressive linear model) function of spatialreg. Here is the code for a SAR model including the effect of the same four predictors.
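A sketch of the call, consistent with the summary shown below; the object name elect_sar is hypothetical.

elect_sar <- spautolm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
                      data = elect2018, listw = poids)
summary(elect_sar)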
+Call: spautolm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
+ data = elect2018, listw = poids)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-23.08342 -4.10573 0.24274 4.29941 23.08245
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 15.09421119 16.52357745 0.9135 0.36098
+age_moy -0.70481703 0.32204139 -2.1886 0.02863
+pct_frn 39.09375061 5.43653962 7.1909 6.435e-13
+pct_prp 14.32329345 6.96492611 2.0565 0.03974
+rev_med 0.00016730 0.00023209 0.7208 0.47101
+
+Lambda: 0.12887 LR test value: 42.274 p-value: 7.9339e-11
+Numerical Hessian standard error of lambda: 0.012069
+
+Log likelihood: -433.8862
+ML residual variance (sigma squared): 53.028, (sigma: 7.282)
+Number of observations: 125
+Number of parameters estimated: 7
+AIC: 881.77
+
+
+
The value given by Lambda in the summary corresponds to the coefficient \(\rho\) in our description of the model. The likelihood-ratio test (LR test) confirms that this residual spatial correlation (after controlling for the effect of predictors) is significant.
+
The estimated effects for the predictors are similar to those of the linear model without spatial correlation. The effects of mean age, fraction of francophones and fraction of homeowners remain significant, although their magnitude has decreased somewhat.
+
To fit a CAR rather than SAR model, we must specify family = "CAR".
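For example (the object name elect_car is hypothetical):

elect_car <- spautolm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
                      data = elect2018, listw = poids, family = "CAR")
summary(elect_car)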
+Call: spautolm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
+ data = elect2018, listw = poids, family = "CAR")
+
+Residuals:
+ Min 1Q Median 3Q Max
+-21.73315 -4.24623 -0.24369 3.44228 23.43749
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 16.57164696 16.84155327 0.9840 0.325128
+age_moy -0.79072151 0.32972225 -2.3981 0.016478
+pct_frn 38.99116707 5.43667482 7.1719 7.399e-13
+pct_prp 17.98557474 6.80333470 2.6436 0.008202
+rev_med 0.00012639 0.00023106 0.5470 0.584364
+
+Lambda: 0.15517 LR test value: 40.532 p-value: 1.9344e-10
+Numerical Hessian standard error of lambda: 0.0026868
+
+Log likelihood: -434.7573
+ML residual variance (sigma squared): 53.9, (sigma: 7.3416)
+Number of observations: 125
+Number of parameters estimated: 7
+AIC: 883.51
+
+
+
For a CAR model with binary weights, the value of Lambda (which we called \(\rho\)) directly gives the partial correlation coefficient between neighbouring ridings. Note that the AIC here is slightly higher than for the SAR model, so the latter provided a slightly better fit.
+
+
+
Exercise
+
The rls_covid dataset, in shapefile format, contains data on detected COVID-19 cases (cas), number of cases per 1000 people (taux_1k) and the population density (dens_pop) in each of Quebec’s local health service networks (RLS) (Source: Data downloaded from the Institut national de santé publique du Québec as of January 17, 2021).
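A sketch of reading and inspecting the dataset (the path is an assumption), which produces the preview shown below:

rls_covid <- read_sf("data/rls_covid.shp")
head(rls_covid)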
Simple feature collection with 6 features and 5 fields
+Geometry type: MULTIPOLYGON
+Dimension: XY
+Bounding box: xmin: 785111.2 ymin: 341057.8 xmax: 979941.5 ymax: 541112.7
+Projected CRS: Conique_conforme_de_Lambert_du_MTQ_utilis_e_pour_Adresse_Qu_be
+# A tibble: 6 × 6
+ RLS_code RLS_nom cas taux_1k dens_…¹ geometry
+ <chr> <chr> <dbl> <dbl> <dbl> <MULTIPOLYGON [m]>
+1 0111 RLS de Kamouraska 152 7.34 6.76 (((827028.3 412772.4, 82…
+2 0112 RLS de Rivière-du-Lo… 256 7.34 19.6 (((855905 452116.9, 8557…
+3 0113 RLS de Témiscouata 81 4.26 4.69 (((911829.4 441311.2, 91…
+4 0114 RLS des Basques 28 3.3 5.35 (((879249.6 471975.6, 87…
+5 0115 RLS de Rimouski 576 9.96 15.5 (((917748.1 503148.7, 91…
+6 0116 RLS de La Mitis 76 4.24 5.53 (((951316 523499.3, 9525…
+# … with abbreviated variable name ¹dens_pop
+
+
+
Fit a linear model of the number of cases per 1000 as a function of population density (it is suggested to apply a logarithmic transform to the latter). Check whether the model residuals are correlated between bordering RLS with a Moran’s test and then model the same data with a conditional autoregressive model.
+
+
+
Reference
+
Ver Hoef, J.M., Peterson, E.E., Hooten, M.B., Hanks, E.M. and Fortin, M.-J. (2018) Spatial autoregressive models for statistical inference from ecological data. Ecological Monographs 88: 36-59.
+
+
+
+
14 GLMM with spatial Gaussian process
+
+
Data
+
The gambia dataset found in the geoR package presents the results of a study of malaria prevalence among children of 65 villages in The Gambia. We will use a slightly transformed version of the data found in the file gambia.csv.
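The sketch below shows how such a village-level summary might be obtained; the file path and the exact aggregation are assumptions, based on how gambia_agg is used in the plot that follows (village coordinates x and y in km, and prev = proportion of positive malaria tests per village).

library(dplyr)
library(ggplot2)
library(geoR)  # provides the gambia.borders dataset used below

gambia <- read.csv("data/gambia.csv")

# Aggregate by village: prevalence = proportion of positive tests
gambia_agg <- gambia %>%
    group_by(id_village, x, y) %>%
    summarize(prev = mean(pos), .groups = "drop")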
ggplot(gambia_agg, aes(x = x, y = y)) +
+geom_point(aes(color = prev)) +
+geom_path(data = gambia.borders, aes(x = x /1000, y = y /1000)) +
+coord_fixed() +
+theme_minimal() +
+scale_color_viridis_c()
+
+
+
+
+
We use the gambia.borders dataset from the geoR package to trace the country boundaries with geom_path. Since those boundaries are in meters, we divide by 1000 to get the same scale as our points. We also use coord_fixed to ensure a 1:1 aspect ratio between the axes and use the viridis color scale, which makes it easier to visualize a continuous variable compared with the default gradient scale in ggplot2.
+
Based on this map, there seems to be spatial correlation in malaria prevalence, with the eastern cluster of villages showing more high prevalence values (yellow-green) and the middle cluster showing more low prevalence values (purple).
+
+
+
Non-spatial GLMM
+
For this first example, we will ignore the spatial aspect of the data and model the presence of malaria (pos) as a function of the use of a bed net (netuse) and the presence of a public health centre (phc). Since we have a binary response, we need to use a logistic regression model (a GLM). Since we have predictors at both the individual and village level, and we expect that children of the same village have more similar probabilities of having malaria even after accounting for those predictors, we need to add a random effect of the village. The result is a GLMM that we fit using the glmer function in the lme4 package.
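A sketch of the model call (the object name mod_glmm is an assumption, reused in later snippets):

library(lme4)

mod_glmm <- glmer(pos ~ netuse + phc + (1 | id_village), data = gambia,
                  family = binomial)
summary(mod_glmm)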
Generalized linear mixed model fit by maximum likelihood (Laplace
+ Approximation) [glmerMod]
+ Family: binomial ( logit )
+Formula: pos ~ netuse + phc + (1 | id_village)
+ Data: gambia
+
+ AIC BIC logLik deviance df.resid
+ 2428.0 2450.5 -1210.0 2420.0 2031
+
+Scaled residuals:
+ Min 1Q Median 3Q Max
+-2.1286 -0.7120 -0.4142 0.8474 3.3434
+
+Random effects:
+ Groups Name Variance Std.Dev.
+ id_village (Intercept) 0.8149 0.9027
+Number of obs: 2035, groups: id_village, 65
+
+Fixed effects:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 0.1491 0.2297 0.649 0.5164
+netuse -0.6044 0.1442 -4.190 2.79e-05 ***
+phc -0.4985 0.2604 -1.914 0.0556 .
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Correlation of Fixed Effects:
+ (Intr) netuse
+netuse -0.422
+phc -0.715 -0.025
+
+
+
According to these results, both netuse and phc result in a decrease of malaria prevalence, although the effect of phc is not significant at a threshold \(\alpha = 0.05\). The intercept (0.149) is the logit of the probability of malaria presence for a child with no bednet and no public health centre, but it is the mean intercept across all villages, and there is a lot of variation between villages, based on the random effect standard deviation of 0.90. We can get the estimated intercept for each village with the function coef:
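For example, assuming the model object is named mod_glmm as above:

head(coef(mod_glmm)$id_village)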
So for example, the intercept for village 1 is around 0.94, equivalent to a probability of 72%:
+
+
plogis(0.937)
+
+
[1] 0.7184933
+
+
+
while the intercept in village 2 is equivalent to a probability of 52%:
+
+
plogis(0.092)
+
+
[1] 0.5229838
+
+
+
The DHARMa package provides a general method for checking whether the residuals of a GLMM are distributed according to the specified model and whether there is any residual trend. The package works by simulating replicates of each observation according to the fitted model and then determining a “standardized residual”, which is the relative position of the observed value with respect to the simulated values, e.g. 0 if the observation is smaller than all the simulations, 0.5 if it is in the middle, etc. If the model represents the data well, each value of the standardized residual between 0 and 1 should be equally likely, so the standardized residuals should produce a uniform distribution between 0 and 1.
+
The simulateResiduals function performs the calculation of the standardized residuals, then the plot function plots the diagnostic graphs with the results of certain tests.
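A sketch of those two steps (res_glmm is the name reused when the residuals are re-aggregated by village below):

library(DHARMa)

res_glmm <- simulateResiduals(mod_glmm)
plot(res_glmm)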
The graph on the left is a quantile-quantile plot of standardized residuals. The results of three statistical tests are also shown: a Kolmogorov-Smirnov (KS) test which checks whether there is a deviation from the theoretical distribution, a dispersion test that checks whether there is underdispersion or overdispersion, and an outlier test based on the number of residuals that are more extreme than all the simulations. Here, we get a significant result for the outliers, though the message indicates that this result might have an inflated type I error rate in this case.
+
On the right, we generally get a graph of standardized residuals (in y) as a function of the rank of the predicted values, in order to check for any leftover trend in the residual. Here, the predictions are binned by quartile, so it might be better to instead aggregate the predictions and residuals by village, which we can do with the recalculateResiduals function.
+
+
plot(recalculateResiduals(res_glmm, group = gambia$id_village))
+
+
DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
+
+
+
+
+
+
The plot to the right now shows individual points, along with a quantile regression for the 1st quartile, the median and the 3rd quartile. In theory, these three curves should be horizontal straight lines (no leftover trend in the residuals vs. predictions). The curve for the 3rd quartile (in red) is significantly different from a horizontal line, which could indicate some systematic effect that is missing from the model.
+
+
+
Spatial GLMM with spaMM
+
The spaMM (spatial mixed models) package is a relatively new R package that can perform approximate maximum likelihood estimation of parameters for GLMM with spatial dependence, modelled either as a Gaussian process or with a CAR (we will see the latter in the last section). The package implements different algorithms, but there is a single fitme function that chooses the appropriate algorithm for each model type. For example, here is the same (non-spatial) model as above fit with spaMM.
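For example (a sketch; the object name is an assumption):

library(spaMM)

mod_spamm_glmm <- fitme(pos ~ netuse + phc + (1 | id_village),
                        data = gambia, family = binomial)
summary(mod_spamm_glmm)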
formula: pos ~ netuse + phc + (1 | id_village)
+Estimation of lambda by ML (p_v approximation of logL).
+Estimation of fixed effects by ML (p_v approximation of logL).
+family: binomial( link = logit )
+ ------------ Fixed effects (beta) ------------
+ Estimate Cond. SE t-value
+(Intercept) 0.1491 0.2287 0.6519
+netuse -0.6045 0.1420 -4.2567
+phc -0.4986 0.2593 -1.9231
+ --------------- Random effects ---------------
+Family: gaussian( link = identity )
+ --- Variance parameters ('lambda'):
+lambda = var(u) for u ~ Gaussian;
+ id_village : 0.8151
+ --- Coefficients for log(lambda):
+ Group Term Estimate Cond.SE
+ id_village (Intercept) -0.2045 0.2008
+# of obs: 2035; # of groups: id_village, 65
+ ------------- Likelihood values -------------
+ logLik
+logL (p_v(h)): -1210.016
+
+
+
Note that the estimates of the fixed effects as well as the variance of random effects are nearly identical to those obtained by glmer above.
+
We can now use spaMM to fit the same model with the addition of spatial correlations between villages. In the formula of the model, this is represented as a random effect Matern(1 | x + y), which means that the intercepts are spatially correlated between villages following a Matérn correlation function of coordinates (x, y). The Matérn function is a flexible function for spatial correlation that includes a shape parameter \(\nu\) (nu), so that when \(\nu = 0.5\) it is equivalent to the exponential correlation but as \(\nu\) grows to large values, it approaches a Gaussian correlation. We could let the function estimate \(\nu\), but here we will fix it to 0.5 with the fixed argument of fitme.
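A sketch of the call, consistent with the summary shown below (mod_spamm is the name used when the model is summarized and reused later):

mod_spamm <- fitme(pos ~ netuse + phc + Matern(1 | x + y) + (1 | id_village),
                   data = gambia, family = binomial, fixed = list(nu = 0.5))
summary(mod_spamm)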
Increase spaMM.options(separation_max=<.>) to at least 21 if you want to check separation (see 'help(separation)').
+
+
summary(mod_spamm)
+
+
formula: pos ~ netuse + phc + Matern(1 | x + y) + (1 | id_village)
+Estimation of corrPars and lambda by ML (p_v approximation of logL).
+Estimation of fixed effects by ML (p_v approximation of logL).
+Estimation of lambda by 'outer' ML, maximizing logL.
+family: binomial( link = logit )
+ ------------ Fixed effects (beta) ------------
+ Estimate Cond. SE t-value
+(Intercept) 0.06861 0.3352 0.2047
+netuse -0.51719 0.1407 -3.6757
+phc -0.44416 0.2052 -2.1648
+ --------------- Random effects ---------------
+Family: gaussian( link = identity )
+ --- Correlation parameters:
+ 1.nu 1.rho
+0.50000000 0.05128692
+ --- Variance parameters ('lambda'):
+lambda = var(u) for u ~ Gaussian;
+ x + y : 0.6421
+ id_village : 0.1978
+# of obs: 2035; # of groups: x + y, 65; id_village, 65
+ ------------- Likelihood values -------------
+ logLik
+logL (p_v(h)): -1197.968
+
+
+
Let’s first check the random effects of the model. The spatial correlation function has a parameter rho equal to 0.0513. This parameter in spaMM is the inverse of the range, so here the range of exponential correlation is 1/0.0513 or around 19.5 km. There are now two variance parameters: the one identified as x + y is the long-range variance (i.e. sill) of the exponential correlation model, whereas the one identified as id_village shows the non-spatially correlated portion of the variation between villages.
+
In fact, while we left the random effects (1 | id_village) in the formula to represent the non-spatial portion of variation between villages, we could also represent this with a nugget effect in the geostatistical model. In both cases, it would represent the idea that even two villages very close to each other would have different baseline prevalences in the model.
+
By default, the Matern function has no nugget effect, but we can add one by specifying a non-zero Nugget in the initial parameter list init.
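For example (a sketch; the initial Nugget value of 0.1 is an arbitrary starting point):

mod_spamm2 <- fitme(pos ~ netuse + phc + Matern(1 | x + y),
                    data = gambia, family = binomial,
                    init = list(Nugget = 0.1), fixed = list(nu = 0.5))
summary(mod_spamm2)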
Increase spaMM.options(separation_max=<.>) to at least 21 if you want to check separation (see 'help(separation)').
+
+
summary(mod_spamm2)
+
+
formula: pos ~ netuse + phc + Matern(1 | x + y)
+Estimation of corrPars and lambda by ML (p_v approximation of logL).
+Estimation of fixed effects by ML (p_v approximation of logL).
+Estimation of lambda by 'outer' ML, maximizing logL.
+family: binomial( link = logit )
+ ------------ Fixed effects (beta) ------------
+ Estimate Cond. SE t-value
+(Intercept) 0.06861 0.3352 0.2047
+netuse -0.51719 0.1407 -3.6757
+phc -0.44416 0.2052 -2.1648
+ --------------- Random effects ---------------
+Family: gaussian( link = identity )
+ --- Correlation parameters:
+ 1.nu 1.Nugget 1.rho
+0.50000000 0.23551027 0.05128692
+ --- Variance parameters ('lambda'):
+lambda = var(u) for u ~ Gaussian;
+ x + y : 0.8399
+# of obs: 2035; # of groups: x + y, 65
+ ------------- Likelihood values -------------
+ logLik
+logL (p_v(h)): -1197.968
+
+
+
As you can see, all estimates are the same, except that the variance of the spatial portion (sill) is now 0.84 and the nugget is equal to a fraction 0.235 of that sill, so a variance of 0.197, which is the same as the id_village random effect in the version above. Thus the two formulations are equivalent.
+
Now, recall the coefficients we obtained for the non-spatial GLMM:
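For example, with the hypothetical object name used above:

summary(mod_glmm)$coefficients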
In the spatial version, both fixed effects have moved slightly towards zero, but the standard error of the effect of phc has decreased. It is interesting that the inclusion of spatial dependence has allowed us to estimate more precisely the effect of having a public health centre in the village. This would not always be the case: for a predictor that is also strongly correlated in space, spatial correlation in the response makes it harder to estimate the effect of this predictor, since it is confounded with the spatial effect. However, for a predictor that is not correlated in space, including the spatial effect reduces the residual (non-spatial) variance and may thus increase the precision of the predictor’s effect.
+
The spaMM package is also compatible with DHARMa for residual diagnostics. (You can in fact ignore the warning that it is not in the class of supported models, this is due to using the fitme function rather than a specific algorithm function in spaMM.)
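A sketch of those diagnostics, applied here to the nugget formulation of the model:

res_spamm <- simulateResiduals(mod_spamm2)
plot(res_spamm)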
DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
+
+
+
+
+
plot(recalculateResiduals(res_spamm, group = gambia$id_village))
+
+
DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
+
+
+
+
+
+
Finally, while we will show how to make and visualize spatial predictions below, we can produce a quick map of the estimated spatial effects in a spaMM model with the filled.mapMM function.
+
+
filled.mapMM(mod_spamm2)
+
+
+
+
+
+
+
Gaussian process models vs. smoothing splines
+
If you are familiar with generalized additive models (GAM), you might think that the spatial variation in malaria prevalence (as shown in the map above) could be represented by a 2D smoothing spline (as a function of \(x\) and \(y\)) within a GAM.
+
The code below fits the GAM equivalent of our Gaussian process GLMM above with the gam function in the mgcv package. The spatial effect is represented by the 2D spline s(x, y) whereas the non-spatial random effect of village is represented by s(id_village, bs = "re"), which is the same as (1 | id_village) in the previous models. Note that for the gam function, categorical variables must be explicitly converted to factors.
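A sketch of that model call, assuming the village identifier is converted to a factor first:

library(mgcv)

gambia$id_village <- as.factor(gambia$id_village)

mod_gam <- gam(pos ~ netuse + phc + s(x, y) + s(id_village, bs = "re"),
               data = gambia, family = binomial)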
To visualize the 2D spline, we will use the gratia package.
+
+
library(gratia)
+draw(mod_gam)
+
+
+
+
+
Note that the plot of the spline s(x, y) (top right) does not extend too far from the locations of the data (other areas are blank). In this graph, we can also see that the village random effects follow the expected Gaussian distribution (top left).
+
Next, we will use both the spatial GLMM from the previous section and this GAMM to predict the mean prevalence on a spatial grid of points contained in the file gambia_pred.csv. The graph below adds those prediction points (in black) on the previous map of the data points.
+
+
gambia_pred <-read.csv("data/gambia_pred.csv")
+
+ggplot(gambia_agg, aes(x = x, y = y)) +
+geom_point(data = gambia_pred) +
+geom_point(aes(color = prev)) +
+geom_path(data = gambia.borders, aes(x = x /1000, y = y /1000)) +
+coord_fixed() +
+theme_minimal() +
+scale_color_viridis_c()
+
+
+
+
+
To make predictions from the GAMM model at those points, the code below goes through the following steps:
+
+
All predictors in the model must be in the prediction data frame, so we add constant values of netuse and phc (both equal to 1) for all points. Thus, we will make predictions of malaria prevalence in the case where a net is used and a public health centre is present. We also add a constant id_village, although it will not be used in predictions (see below).
+
We call the predict function on the output of gam to produce predictions at the new data points (argument newdata), including standard errors (se.fit = TRUE) and excluding the village random effects, so the prediction is made for an “average village”. The resulting object gam_pred will have columns fit (mean prediction) and se.fit (standard error). Those predictions and standard errors are on the link (logit) scale.
+
We add the original prediction data frame to gam_pred with cbind.
+
We add columns for the mean prediction and 50% confidence interval boundaries (mean \(\pm\) 0.674 standard error), converted from the logit scale to the probability scale with plogis. We choose a 50% interval since a 95% interval may be too wide here to contrast the different predictions on the map at the end of this section.
Note: The reason we do not make predictions directly on the probability (response) scale is that the normal formula for confidence intervals applies more accurately on the logit scale. Adding a certain number of standard errors around the mean on the probability scale would lead to less accurate intervals and maybe even confidence intervals outside the possible range (0, 1) for a probability.
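The sketch below follows the steps listed above; the constant predictor values, the exclude argument used to drop the village random effect, and the column names pred, lo and hi (which match their use when the predictions are combined later) are the only choices made here.

library(dplyr)

# Constant predictor values: bed net used and health centre present;
# id_village is only added so that predict() finds all model variables
gambia_pred <- mutate(gambia_pred, netuse = 1, phc = 1,
                      id_village = gambia$id_village[1])

# Predictions on the logit scale, excluding the village random effect
gam_pred <- predict(mod_gam, newdata = gambia_pred, se.fit = TRUE,
                    exclude = "s(id_village)")
gam_pred <- cbind(gambia_pred, as.data.frame(gam_pred))

# Mean and 50% interval bounds, converted to the probability scale
gam_pred <- mutate(gam_pred,
                   pred = plogis(fit),
                   lo = plogis(fit - 0.674 * se.fit),
                   hi = plogis(fit + 0.674 * se.fit))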
+
We apply the same strategy to make predictions from the spaMM spatial GLMM model. There are a few differences in the predict method compared with the GAMM case.
+
+
The argument binding = "fit" means that mean predictions (fit column) will be attached to the prediction dataset and returned as spamm_pred.
+
The variances = list(linPred = TRUE) tells predict to calculate the variance of the linear predictor (so the square of the standard error). However, it appears as an attribute predVar in the output data frame rather than a se.fit column, so we move it to a column on the next line.
Finally, we combine both sets of predictions as different rows of a pred_all dataset with bind_rows. The name of the dataset each prediction originates from (gam or spamm) will appear in the “model” column (argument .id). To simplify production of the next plot, we then use pivot_longer in the tidyr package to change the three columns “pred”, “lo” and “hi” to two columns, “stat” and “value” (pred_tall has thus three rows for every row in pred_all).
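A sketch of those steps; it uses the nugget formulation mod_spamm2 so that no village identifier is needed at the new locations, which is an assumption on my part, while binding = "fit", variances = list(linPred = TRUE), pred_all, pred_tall and the .id = "model" column follow the description above.

library(dplyr)
library(tidyr)

# Mean prediction bound to the prediction points (link scale),
# with the variance of the linear predictor as the predVar attribute
spamm_pred <- predict(mod_spamm2, newdata = gambia_pred, type = "link",
                      binding = "fit", variances = list(linPred = TRUE))
spamm_pred$se.fit <- sqrt(attr(spamm_pred, "predVar"))

spamm_pred <- mutate(spamm_pred,
                     pred = plogis(fit),
                     lo = plogis(fit - 0.674 * se.fit),
                     hi = plogis(fit + 0.674 * se.fit))

# Combine both sets of predictions, then reshape to long format
pred_all <- bind_rows(gam = select(gam_pred, x, y, pred, lo, hi),
                      spamm = select(spamm_pred, x, y, pred, lo, hi),
                      .id = "model")

pred_tall <- pivot_longer(pred_all, cols = c(pred, lo, hi),
                          names_to = "stat", values_to = "value")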
Having done these steps, we can finally look at the prediction maps (mean, lower and upper bounds of the 50% confidence interval) with ggplot. The original data points are shown in red.
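A sketch of such a map, with one panel per model and statistic; the exact aesthetics are assumptions.

library(ggplot2)

ggplot(pred_tall, aes(x = x, y = y)) +
    geom_point(aes(color = value)) +
    geom_point(data = gambia_agg, color = "red", size = 0.5) +
    coord_fixed() +
    facet_grid(stat ~ model) +
    scale_color_viridis_c() +
    theme_minimal()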
While both models agree that there is a higher prevalence near the eastern cluster of villages, the GAMM also estimates a higher prevalence at a few points (western edge and around the center) where there is no data. This is an artifact of the shape of the spline fit around the data points, since a spline is meant to fit a global, although nonlinear, trend. In contrast, the geostatistical model represents the spatial effect as local correlations and reverts to the overall mean prevalence when far from any data points, which is a safer assumption. This is one reason to choose a geostatistical / Gaussian process model in this case.
+
+
+
Bayesian methods for GLMMs with Gaussian processes
+
Bayesian models provide a flexible framework to express models with complex dependence structure among the data, including spatial dependence. However, fitting a Gaussian process model with a fully Bayesian approach can be slow, due to the need to compute a spatial covariance matrix between all pairs of points at each iteration.
+
The INLA (integrated nested Laplace approximation) method performs an approximate calculation of the Bayesian posterior distribution, which makes it suitable for spatial regression problems. We do not cover it in this course, but I recommend the textbook by Paula Moraga (in the references section below) that provides worked examples of using INLA for various geostatistical and areal data models, in the context of epidemiology, including models with both space and time dependence. The book presents the same Gambia malaria data as an example of a geostatistical dataset, which inspired its use in this course.
+
+
+
+
15 GLMM with spatial autoregression
+
We return to the last example of the previous part, where we modelled the rate of COVID-19 cases (cases / 1000) for administrative health network divisions (RLS) in Quebec as a function of their population density. The rate is given by the “taux_1k” column in the rls_covid shapefile.
Previously, we modelled the logarithm of this rate as a linear function of the logarithm of population density, with the residual variance correlated among neighbouring units via a CAR (conditional autoregression) structure, as shown in the code below.
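A hedged reconstruction of that code, consistent with the model call and the weights object rls_w shown in the output, and with the poly2nb / nb2listw steps described after it; the file path and the object names rls_nb and car_lm are assumptions.

rls_covid <- read_sf("data/rls_covid.shp")

# Neighbourhood network and binary weights between bordering RLS
rls_nb <- poly2nb(rls_covid)
rls_w <- nb2listw(rls_nb, style = "B")

car_lm <- spautolm(log(taux_1k) ~ log(dens_pop), data = rls_covid,
                   listw = rls_w, family = "CAR")
summary(car_lm)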
+Call: spautolm(formula = log(taux_1k) ~ log(dens_pop), data = rls_covid,
+ listw = rls_w, family = "CAR")
+
+Residuals:
+ Min 1Q Median 3Q Max
+-1.201858 -0.254084 -0.053348 0.281482 1.427053
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 1.702068 0.168463 10.1035 < 2.2e-16
+log(dens_pop) 0.206623 0.032848 6.2903 3.169e-10
+
+Lambda: 0.15762 LR test value: 23.991 p-value: 9.6771e-07
+Numerical Hessian standard error of lambda: 0.0050486
+
+Log likelihood: -80.68953
+ML residual variance (sigma squared): 0.2814, (sigma: 0.53048)
+Number of observations: 95
+Number of parameters estimated: 4
+AIC: 169.38
+
+
+
As a reminder, the poly2nb function in the spdep package creates a list of neighbours based on bordering polygons in a shapefile, then the nb2listw function converts it to a list of weights, here binary weights (style = "B") so that each bordering region receives the same weight of 1 in the autoregressive model.
+
Instead of using the rates, it would be possible to model the cases directly (column “cas” in the dataset) with a Poisson regression, which is appropriate for count data. To account for the fact that if the risk per person were equal, cases would be proportional to population, we can add the unit’s population pop as an offset in the Poisson regression. Therefore, the model would look like: cas ~ log(dens_pop) + offset(log(pop)). Note that since the Poisson regression uses a logarithmic link, that model with log(pop) as an offset assumes that log(cas / pop) (so the log rate) is proportional to log(dens_pop), just like the linear model above, but it has the advantage of modelling the stochasticity of the raw data (the number of cases) directly with a Poisson distribution.
+
We do not have the population in this data, but we can estimate it from the cases and the rate (cases / 1000) as follows:
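Since taux_1k is the number of cases per 1000 people, the population can be recovered as follows:

rls_covid$pop <- rls_covid$cas * 1000 / rls_covid$taux_1k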
To define a CAR model in spaMM, we need a weights matrix rather than a list of weights as in the spatialreg package. Fortunately, the spdep package also includes a function nb2mat to convert the neighbours list to a matrix of weights, here again using binary weights. To avoid a warning, we specify the row and column names of that matrix to be equal to the IDs associated with each unit (RLS_code). Then, we add a term adjacency(1 | RLS_code) to the model to specify that the residual variation between different groups defined by RLS_code is spatially correlated with a CAR structure (here, each group has only one observation since we have one data point per RLS unit).
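A sketch of these steps, consistent with the model formula in the output below; passing the weight matrix through the adjMatrix argument of fitme and the object names are assumptions.

# Binary weight matrix with RLS codes as row and column names
rls_mat <- nb2mat(rls_nb, style = "B")
rownames(rls_mat) <- rls_covid$RLS_code
colnames(rls_mat) <- rls_covid$RLS_code

mod_car <- fitme(cas ~ log(dens_pop) + offset(log(pop)) + adjacency(1 | RLS_code),
                 data = rls_covid, adjMatrix = rls_mat, family = poisson)
summary(mod_car)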
formula: cas ~ log(dens_pop) + offset(log(pop)) + adjacency(1 | RLS_code)
+Estimation of corrPars and lambda by ML (p_v approximation of logL).
+Estimation of fixed effects by ML (p_v approximation of logL).
+Estimation of lambda by 'outer' ML, maximizing logL.
+family: poisson( link = log )
+ ------------ Fixed effects (beta) ------------
+ Estimate Cond. SE t-value
+(Intercept) -5.1618 0.16855 -30.625
+log(dens_pop) 0.1999 0.03267 6.119
+ --------------- Random effects ---------------
+Family: gaussian( link = identity )
+ --- Correlation parameters:
+ 1.rho
+0.1576605
+ --- Variance parameters ('lambda'):
+lambda = var(u) for u ~ Gaussian;
+ RLS_code : 0.266
+# of obs: 95; # of groups: RLS_code, 95
+ ------------- Likelihood values -------------
+ logLik
+logL (p_v(h)): -709.3234
+
+
+
Note that the spatial correlation coefficient rho (0.158) is similar to the equivalent quantity in the spautolm model above, where it was called Lambda. The effect of log(dens_pop) is also approximately 0.2 in both models.
+
+
Reference
+
Moraga, Paula (2019) Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. Chapman & Hall/CRC Biostatistics Series. Available online at https://www.paulamoraga.com/book-geospatial/.
+
+
+
+
+
16 Spatial statistics in ecology
+
BIOS² organized an online training session on the statistical analysis of spatial data in ecology, led by Prof. Philippe Marchand (UQAT). This 12-hour training took place in 4 sessions: January 12, 14, 19 & 21, 2021, from 1:00 to 4:00 pm EST.
+
The content covered three types of spatial statistical analyses and their applications in ecology: (1) point pattern analysis, which studies the distribution of individuals or events in space; (2) geostatistical models, which represent the spatial correlation of variables sampled at georeferenced points; and (3) models for areal data, which apply to measurements taken over regions of space and represent spatial relationships through neighbourhood networks. The training also included practical exercises using the R statistical programming environment.
+
Philippe Marchand is a professor of ecology and biostatistics at the Institut de recherche sur les forêts, Université du Québec en Abitibi-Témiscamingue (UQAT) and an academic member of BIOS². His research focuses on modelling processes that influence the spatial distribution of populations, including seed dispersal and seedling establishment, animal movement, and the spread of forest epidemics.
+
If you wish to consult the course material and work through the exercises at your own pace, you can access it through this link. A basic knowledge of linear regression models and experience fitting them in R are recommended. The original repository can be found here.
+
17 Introduction
+
In this training, we will discuss three types of spatial analyses: point pattern analysis, geostatistical models and models for areal data.
+
In point pattern analysis, we have point data representing the positions of individuals or events in a study region, and we assume that all individuals or events within that region have been recorded. This analysis focuses on the distribution of the point positions themselves. Typical questions in point pattern analysis include:
+
+
Are the points arranged randomly or are they clustered?
+
Are two types of points arranged independently of each other?
+
+
Geostatistical models aim to represent the spatial distribution of continuous variables measured at certain sampling points. They assume that measurements of these variables at different points are correlated as a function of the distance between the points. Applications of geostatistical models include smoothing spatial data (e.g., producing a map of a variable over an entire region based on point measurements) and predicting these variables at unsampled points.
+
Areal data are measurements taken not at points, but over regions of space represented by polygons (e.g., land divisions, grid cells). Models for this type of data define a neighbourhood network linking the regions and include a spatial correlation between neighbouring regions.
+
+
+
Stationarity and isotropy
+
Several spatial analyses assume that the variables are stationary in space. As with stationarity in the time domain, this property means that the summary statistics (mean, variance and correlations between measurements of a variable) do not vary with a translation in space. For example, the spatial correlation between two points may depend on the distance separating them, but not on their absolute position.
+
In particular, there cannot be a large-scale trend (often called a gradient in a spatial context), or this trend must be accounted for in order to model the spatial correlation of the residuals.
+
In the case of point pattern analysis, stationarity (also called homogeneity in this context) means that the density of points does not follow a large-scale trend.
+
In an isotropic statistical model, the spatial correlations between measurements at two points depend only on the distance between those points, not on the direction. In that case, the summary statistics do not change under a rotation in space.
+
+
+
Georeferenced data
+
Environmental studies increasingly use data from geospatial sources, i.e., variables measured over a large part of the globe (e.g., climate, remote sensing). Processing these data requires concepts related to geographic information systems (GIS), which are not covered in this workshop; here we focus on the statistical aspects of spatially varying data.
+
Using geospatial data does not necessarily mean that spatial statistics are required. For example, it is common to extract the values of these geographic variables at study points in order to explain a biological response observed in the field. In that case, spatial statistics are only needed if there is a spatial correlation in the residuals after accounting for the effects of the predictors.
+
+
+
+
18 Point pattern analysis
+
+
Point pattern and point process
+
A point pattern describes the spatial position (most often in 2D) of individuals or events, represented by points, within a given study area, often called the observation window.
+
It is assumed that each point has a negligible spatial extent relative to the distances between points. More complex methods exist to deal with spatial patterns of objects of non-negligible width, but this topic is beyond the scope of this workshop.
+
A point process is a statistical model that can be used to simulate point patterns or to explain an observed point pattern.
+
+
+
Complete spatial randomness
+
Complete spatial randomness is one of the simplest point patterns and serves as a null model for evaluating the characteristics of real point patterns. In this pattern, the presence of a point at a given position is independent of the presence of points in its neighbourhood.
+
The process creating this pattern is a homogeneous Poisson process. Under this model, the number of points in any region of area \(A\) follows a Poisson distribution: \(N(A) \sim \text{Pois}(\lambda A)\), where \(\lambda\) is the intensity of the process (i.e., the density of points). \(N\) is independent between two disjoint regions, no matter how those regions are defined.
+
In the graph below, only the pattern on the right is completely random. The pattern on the left shows point aggregation (a higher probability of observing a point near another point), while the pattern in the centre shows repulsion (a low probability of observing a point very close to another).
+
+
+
+
+
+
+
+
Exploratory or inferential analysis of a point pattern
+
Several summary statistics are used to describe the characteristics of a point pattern. The simplest is the intensity \(\lambda\), which, as mentioned above, represents the density of points per unit area. If the point pattern is heterogeneous, the intensity is not constant but depends on the position: \(\lambda(x, y)\).
+
Whereas the intensity is a first-order statistic, second-order statistics describe how the probability of the presence of a point in a region depends on the presence of other points. Ripley’s \(K\) index, presented in the next section, is an example of a second-order summary statistic.
+
Statistical inference on point patterns usually consists of testing the hypothesis that the point pattern corresponds to a given null model, such as complete spatial randomness or a more complex null model. Even for the simplest null models, we rarely know the theoretical distribution of a summary statistic of the point pattern under the null model. Hypothesis tests on point patterns are therefore performed by simulation: a large number of point patterns are simulated from the null model, and the distribution of the summary statistics of interest across those simulations is compared with the value of those statistics for the observed point pattern.
+
+
+
Ripley’s \(K\) index
+
Ripley’s index \(K(r)\) is defined as the mean number of points within a circle of a given radius \(r\) around a point of the pattern, standardized by the intensity \(\lambda\).
+
For a completely random pattern, the mean number of points in a circle of radius \(r\) is \(\lambda \pi r^2\), so in theory \(K(r) = \pi r^2\) for this null model. A higher value of \(K(r)\) means that the points are aggregated at scale \(r\), while a lower value indicates repulsion.
+
In practice, \(K(r)\) is estimated for a given point pattern by the equation:
+
\[ K(r) = \frac{A}{n(n-1)} \sum_i \sum_{j > i} I \left( d_{ij} \le r \right) w_{ij}\]
+
where \(A\) is the area of the observation window and \(n\) is the number of points in the pattern, so \(n(n-1)\) is the number of distinct pairs of points. The indicator function \(I\) is summed over all pairs of points; it takes a value of 1 if the distance between points \(i\) and \(j\) is less than or equal to \(r\). Finally, the term \(w_{ij}\) gives extra weight to certain pairs of points to account for edge effects, as discussed in the next section.
+
For example, the graphs below show the estimated \(K(r)\) for the patterns illustrated above, for values of \(r\) up to 1/4 of the width of the window. The red dashed curve shows the theoretical value for complete spatial randomness and the grey area is an “envelope” produced by 99 simulations of that null model. The aggregated pattern shows an excess of neighbours up to \(r = 0.25\), while the pattern with repulsion shows a significant deficit of neighbours for small values of \(r\).
+
+
+
+
+
+
Besides \(K\), there are other statistics describing the second-order properties of a pattern, such as the mean distance between a point and its \(N\) nearest neighbours. See the textbook by Wiegand and Moloney (2013), suggested in the references, to learn more about the different summary statistics for point patterns.
+
+
+
Edge effects
+
In the context of point pattern analysis, edge effects are due to the fact that we have incomplete knowledge of the neighbourhood of points near the edge of the observation window, which can bias the calculation of statistics such as Ripley’s \(K\).
+
Different methods have been developed to correct the bias due to edge effects. With Ripley’s method, the contribution of a neighbour \(j\) located at a distance \(r\) from a point \(i\) receives a weight \(w_{ij} = 1/\phi_i(r)\), where \(\phi_i(r)\) is the fraction of the circle of radius \(r\) around \(i\) contained within the observation window. For example, if 2/3 of the circle lies within the window, this neighbour counts as 3/2 neighbours in the calculation of a statistic such as \(K\).
+
+
Ripley’s method is one of the simplest ways to correct for edge effects, but it is not necessarily the most efficient; in particular, the larger weights given to certain pairs of points tend to increase the variance of the calculated statistic. Other correction methods are presented in specialized textbooks, such as Wiegand and Moloney (2013) in the references.
+
+
+
Example
+
For this example, we use the semis_xy.csv dataset, which contains the \((x, y)\) coordinates of seedlings of two species (sp, B = birch and P = poplar) in a 15 x 15 m plot.
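A sketch of reading and inspecting the file (the data/ path is an assumption), producing the preview below:

semis <- read.csv("data/semis_xy.csv")
head(semis)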
x y sp
+1 14.73 0.05 P
+2 14.72 1.71 P
+3 14.31 2.06 P
+4 14.16 2.64 P
+5 14.12 4.15 B
+6 9.88 4.08 B
+
+
+
The spatstat package allows point pattern analyses to be carried out in R. The first step is to transform our data frame into a ppp (point pattern) object with the function of the same name. In this function, we specify which columns contain the coordinates x and y as well as the marks, which here will be the species codes. We also need to specify an observation window (window) using the owin function, to which we give the plot limits in x and y.
+
+
library(spatstat)
+
+semis <-ppp(x = semis$x, y = semis$y, marks =as.factor(semis$sp),
+window =owin(xrange =c(0, 15), yrange =c(0, 15)))
+semis
+
+
Marked planar point pattern: 281 points
+Multitype, with levels = B, P
+window: rectangle = [0, 15] x [0, 15] units
+
+
+
Marks can be either numeric or categorical. Note that for categorical marks, as is the case here, the variable must be explicitly converted to a factor.
+
The plot function applied to a point pattern shows a diagram of the pattern.
+
+
plot(semis)
+
+
+
+
+
The intensity function calculates the density of points of each species per unit area, here in \(m^2\).
+
+
intensity(semis)
+
+
B P
+0.6666667 0.5822222
+
+
+
To analyze the distribution of each species separately, we split the pattern with split. Since the pattern contains categorical marks, the split is done automatically according to the value of the marks. The result is a list of two point patterns.
+
+
semis_split <-split(semis)
+plot(semis_split)
+
+
+
+
+
The Kest function calculates Ripley’s \(K\) for a series of distances up to (by default) 1/4 of the width of the window. Here we apply it to the first pattern (birch) by selecting semis_split[[1]]. Note that double square brackets are needed to select an element from a list in R.
+
The argument correction = "iso" tells the function to apply Ripley’s method to correct for edge effects.
+
+
k <-Kest(semis_split[[1]], correction ="iso")
+plot(k)
+
+
+
+
+
According to this graph, there seems to be an excess of neighbours starting at a radius of about 1 m. To check whether this deviation is significant, we produce a simulation envelope with the envelope function. The first argument of envelope is the point pattern to which the simulations will be compared, the second is a function to compute (here, Kest) for each simulated pattern, followed by any additional arguments of that function (here, only correction).
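For example (a sketch):

plot(envelope(semis_split[[1]], Kest, correction = "iso"))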
As indicated by the message, this function performs by default 99 simulations of the null hypothesis corresponding to complete spatial randomness (CSR).
+
The observed curve leaves the envelope of the 99 simulations near \(r = 2\). Care must be taken not to over-interpret a result that falls outside the envelope. Although there is roughly a 1% probability of obtaining a more extreme result under the null hypothesis at any given distance, the envelope is computed over a large number of distance values and no correction is made for multiple comparisons. A significant deviation over a very small range of \(r\) values may therefore simply be due to chance.
+
+
Exercise 1
+
Looking at the plot of the second point pattern (poplar seedlings), can you predict where Ripley’s \(K\) will lie relative to the null hypothesis of complete spatial randomness? Check your prediction by calculating Ripley’s \(K\) for this point pattern in R.
+
+
+
+
Effect of heterogeneity
+
The graph below illustrates a heterogeneous point pattern, i.e., a pattern showing an intensity gradient (more points on the left than on the right).
+
+
+
+
+
+
A density gradient can be confounded with point aggregation, as can be seen in the corresponding plot of Ripley’s \(K\). In theory, these are two different processes:
+
+
Heterogeneity: the density of points varies across the study region, for example because certain local conditions are more favourable to the presence of the species under study.
+
Aggregation: the mean density of points is homogeneous, but the presence of one point increases the probability of observing other points in its neighbourhood, for example because of positive interactions between individuals.
+
+
However, it can be difficult to distinguish between the two in practice, especially since some patterns are both heterogeneous and aggregated.
+
Let’s take the example of the poplar seedlings from the previous exercise. The density function applied to a point pattern performs a kernel density estimation of the seedling density across the plot. By default, this function uses a Gaussian kernel with a standard deviation sigma specified in the call, which determines the scale at which density fluctuations are “smoothed”. Here, we use a value of 2 m for sigma and we first plot the estimated density with plot, before overlaying the points (add = TRUE means that the points are added to the existing plot rather than creating a new one).
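A sketch of those steps (dens_p is the name used for the null-model simulations below):

dens_p <- density(semis_split[[2]], sigma = 2)
plot(dens_p)
plot(semis_split[[2]], add = TRUE)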
To measure the aggregation or repulsion of points in a heterogeneous pattern, we must use the inhomogeneous version of the \(K\) statistic (Kinhom in spatstat). This statistic is still equal to the mean number of neighbours within a radius \(r\) of a point in the pattern, but rather than standardizing this number by the overall intensity of the pattern, it is standardized by the local estimate of the point density. As above, we specify sigma = 2 to control the level of smoothing of the varying density estimate.
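For example:

plot(Kinhom(semis_split[[2]], sigma = 2, correction = "iso"))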
Taking into account the heterogeneity of the pattern at a sigma scale of 2 m, there thus seems to be a deficit of neighbours starting at about 1.5 m from the points of the pattern. It remains to be determined whether this deviation is significant.
+
As before, we use envelope to simulate the Kinhom statistic under the null model. However, the null model here is not a homogeneous Poisson process (complete spatial randomness). It is instead a heterogeneous Poisson process simulated by the function rpoispp(dens_p), i.e., the points are independent of each other, but their density is heterogeneous and given by dens_p. The simulate argument of the envelope function specifies the function to be used for the simulations under the null model; that function must take one argument, here x, even if it is not used.
+
Finally, in addition to the arguments required by Kinhom, i.e., sigma and correction, we also specify nsim = 199 to perform 199 simulations and nrank = 5 to exclude the 5 most extreme results on each side of the envelope, i.e., the 10 most extreme out of 199, giving an interval containing approximately 95% of the probability under the null hypothesis.
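A sketch of that call (khet_p is the object plotted below):

khet_p <- envelope(semis_split[[2]], Kinhom, sigma = 2, correction = "iso",
                   nsim = 199, nrank = 5,
                   simulate = function(x) rpoispp(dens_p))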
Generating 199 simulations by evaluating function ...
+1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40
+.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80
+.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120
+.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160
+.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.
+
+Done.
+
+
plot(khet_p)
+
+
+
+
+
Note: For a hypothesis test based on simulations of a null hypothesis, the \(p\)-value is estimated by \((m + 1)/(n + 1)\), where \(n\) is the number of simulations and \(m\) is the number of simulations in which the statistic is more extreme than in the observed data. This is why the number of simulations is chosen to be 99, 199, and so on.
+
+
Exercise 2
+
Repeat the heterogeneous density estimation and the Kinhom calculation with a sigma standard deviation of 5 rather than 2. How does the smoothing level of the density estimate influence the conclusions?
+
To differentiate between a variation in point density and an interaction (aggregation or repulsion) between points with this type of analysis, we generally have to assume that the two processes operate at different scales. Typically, we can test whether the points are aggregated at a small scale after accounting for a variation in density at a larger scale.
+
+
+
+
Relationship between two point patterns
+
Consider a case where we have two point patterns, for example the positions of trees of two species in a plot (orange and green points in the graph below). Each of the two patterns may or may not show point aggregation.
+
+
+
+
+
+
Regardless of this aggregation at the species level, we want to determine whether the two species are arranged independently. In other words, does the probability of observing a tree of one species depend on the presence of a tree of the other species at a given distance?
+
The bivariate version of Ripley’s \(K\) allows us to answer this question. For two patterns labelled 1 and 2, the index \(K_{12}(r)\) calculates the mean number of points of pattern 2 within a radius \(r\) of a point of pattern 1, standardized by the density of pattern 2.
+
In theory, this index is symmetric, so \(K_{12}(r) = K_{21}(r)\), and the result would be the same whether the points of pattern 1 or pattern 2 are chosen as the “focal” points of the analysis. However, the estimates of the two quantities for an observed pattern may differ, in particular because of edge effects. The variability of \(K_{12}\) and \(K_{21}\) between simulations of a null model may also differ, so a test of the null hypothesis may have different power depending on the choice of focal species.
+
The choice of an appropriate null model is important here. In order to determine whether there is a significant attraction or repulsion between the two patterns, the position of one pattern must be randomly shifted relative to the other pattern, while preserving the spatial structure of each pattern taken in isolation.
+
One way to perform this randomization is to shift one of the two patterns horizontally and/or vertically by a random distance. The portion of the pattern that “exits” one side of the window is reattached on the opposite side. This method is called a toroidal shift, because by connecting the top and bottom as well as the left and right edges of a rectangular surface, we obtain the shape of a torus (a three-dimensional “donut”).
+
+
+
+
+
+
Le graphique ci-dessus illustre une translation du patron vert vers la droite, tandis que le patron orange reste au même endroit. Les points verts dans la zone ombragée sont ramenés de l’autre côté. Notez que si cette méthode préserve de façon générale la structure de chaque patron tout en randomisant leur position relative, elle peut comporter certains inconvénients, comme de diviser des amas de points qui se trouvent près du point de coupure.
+
Vérifions maintenant s’il y a une dépendance entre la position des deux espèces (bouleau et peuplier) dans notre placette. La fonction Kcross calcule l’indice bivarié \(K_{ij}\); il faut spécifier quel type de point est considéré comme l’espèce focale \(i\) et l’espèce voisine \(j\).
+
+
plot(Kcross(semis, i ="P", j ="B", correction ="iso"))
+
+
+
+
+
Ici, le \(K\) observé est inférieur à la valeur théorique, indiquant une répulsion possible des deux patrons.
+
Pour déterminer l’enveloppe du \(K\) selon l’hypothèse nulle d’indépendance des deux patrons, nous devons spécifier que les simulations doivent être basées sur une translation des patrons. Nous indiquons que les simulations doivent utiliser la fonction rshift (translation aléatoire) avec l’argument simulate = function(x) rshift(x, which = "B"); ici, l’argument x de simulate correspond au patron de points original et l’argument which indique quel type de points subit la translation. Comme pour le cas précédent, il faut répéter dans la fonction envelope les arguments nécessaires pour Kcross, soit i, j et correction.
+
+
plot(envelope(semis, Kcross, i = "P", j = "B", correction = "iso",
              nsim = 199, nrank = 5, simulate = function(x) rshift(x, which = "B")))
+
+
Generating 199 simulations by evaluating function ...
+
+Done.
+
+
+
+
+
+
Ici, la courbe observée se situe totalement dans l’enveloppe, donc nous ne rejetons pas l’hypothèse nulle d’indépendance des deux patrons.
+
+
Questions
+
+
Quelle raison pourrait justifier ici notre choix d’effectuer la translation des points du bouleau plutôt que du peuplier?
+
Est-ce que les simulations générées par translation aléatoire constitueraient un bon modèle nul si les deux patrons étaient hétérogènes?
+
+
+
+
+
Patrons de points marqués
+
Le jeu de données fir.csv contient les coordonnées \((x, y)\) de 822 sapins dans une placette d’un hectare et leur statut (A = vivant, D = mort) suivant une épidémie de tordeuse des bourgeons de l’épinette.
+
+
fir <-read.csv("data/fir.csv")
+head(fir)
+
+
x y status
+1 31.50 1.00 A
+2 85.25 30.75 D
+3 83.50 38.50 A
+4 84.00 37.75 A
+5 83.00 33.25 A
+6 33.25 0.25 A
+
+
+
+
fir <- ppp(x = fir$x, y = fir$y, marks = as.factor(fir$status),
           window = owin(xrange = c(0, 100), yrange = c(0, 100)))
plot(fir)
+
+
+
+
+
Supposons que nous voulons vérifier si la mortalité des sapins est indépendante ou corrélée entre arbres rapprochés. En quoi cette question diffère-t-elle de l’exemple précédent où nous voulions savoir si la position des points de deux espèces était indépendante?
+
Dans l’exemple précédent, l’indépendance ou l’interaction entre les espèces référait à la formation du patron lui-même (que des semis d’une espèce s’établissent ou non à proximité de ceux de l’autre espèce). Ici, la caractéristique qui nous intéresse (survie des sapins) est postérieure à l’établissement du patron, en supposant que tous ces arbres étaient vivants d’abord et que certains sont morts suite à l’épidémie. Donc nous prenons la position des arbres comme fixe et nous voulons savoir si la distribution des statuts (mort, vivant) entre ces arbres est aléatoire ou présente un patron spatial.
+
Dans le manuel de Wiegand et Moloney, la première situation (établissement de semis de deux espèces) est appelée patron bivarié, donc il s’agit vraiment de deux patrons qui interagissent, tandis que la deuxième est un seul patron avec une marque qualitative. Le package spatstat dans R ne fait pas de différence entre les deux au niveau de la définition du patron (les types de points sont toujours représentés par l’argument marks), mais les méthodes d’analyse appliquées aux deux questions diffèrent.
+
Dans le cas d’un patron avec une marque qualitative, nous pouvons définir une fonction de connexion de marques (mark connection function) \(p_{ij}(r)\). Pour deux points séparés par une distance \(r\), cette fonction donne la probabilité que le premier point porte la marque \(i\) et le deuxième la marque \(j\). Selon l’hypothèse nulle où les marques sont indépendantes, cette probabilité est égale au produit des proportions de chaque marque dans le patron entier, \(p_{ij}(r) = p_i p_j\) indépendamment de \(r\).
+
Dans spatstat, la fonction de connexion de marques est calculée avec la fonction markconnect, où il faut spécifier les marques \(i\) et \(j\) ainsi que le type de correction des effets de bord. Dans notre exemple, nous voyons que deux points rapprochés ont moins de chance d’avoir un statut différent (A et D) que prévu selon l’hypothèse de distribution aléatoire et indépendante des marques (ligne rouge pointillée).
+
+
plot(markconnect(fir, i ="A", j ="D", correction ="iso"))
+
+
+
+
+
Dans ce graphique, les ondulations dans la fonction sont dues à l’erreur d’estimation d’une fonction continue de \(r\) à partir d’un nombre limité de paires de points discrètes.
+
Pour simuler le modèle nul dans ce cas-ci, nous utilisons la fonction rlabel qui réassigne aléatoirement les marques parmi les points du patron, en maintenant la position des points.
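Voici une esquisse de l'appel correspondant (le choix de nrank = 5 est supposé, par analogie avec les exemples précédents) :

```r
# Enveloppe de la connexion de marques A-D sous réassignation aléatoire
# des marques (rlabel)
plot(envelope(fir, markconnect, i = "A", j = "D", correction = "iso",
              nsim = 199, nrank = 5, simulate = rlabel))
```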
Generating 199 simulations by evaluating function ...
+
+Done.
+
+
+
+
+
+
Notez que puisque la fonction rlabel a un seul argument obligatoire correspondant au patron de points original, il n’était pas nécessaire de spécifier au long: simulate = function(x) rlabel(x).
+
Voici les résultats pour les paires d’arbres du même statut A ou D:
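À titre indicatif, ces résultats peuvent être obtenus avec des appels analogues (esquisse) :

```r
# Paires de même statut : A-A, puis D-D
plot(envelope(fir, markconnect, i = "A", j = "A", correction = "iso",
              nsim = 199, nrank = 5, simulate = rlabel))
plot(envelope(fir, markconnect, i = "D", j = "D", correction = "iso",
              nsim = 199, nrank = 5, simulate = rlabel))
```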
Generating 199 simulations by evaluating function ...
+
+Done.
+
+
+
+
+
+
Il semble donc que la mortalité des sapins due à cette épidémie est agrégée spatialement, puisque les arbres situés à proximité l’un de l’autre ont une plus grande probabilité de partager le même statut que prévu par l’hypothèse nulle.
+
+
+
Références
+
Fortin, M.-J. et Dale, M.R.T. (2005) Spatial Analysis: A Guide for Ecologists. Cambridge University Press: Cambridge, UK.
+
Wiegand, T. et Moloney, K.A. (2013) Handbook of Spatial Point-Pattern Analysis in Ecology, CRC Press.
+
Le jeu de données du dernier exemple est tiré des données de la Forêt d’enseignement et de recherche du Lac Duparquet (FERLD), disponibles sur Dryad en suivant ce lien.
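Les résultats ci-dessous correspondent à la reprise de l'analyse de Kinhom avec un lissage plus grand (exercice 2, sigma = 5). Voici une esquisse du code correspondant; les noms dens_p5 et semis_p sont supposés.

```r
# Esquisse : densité estimée avec sigma = 5, puis enveloppe de Kinhom
dens_p5 <- density(semis_p, sigma = 5)
khet_p <- envelope(semis_p, Kinhom, sigma = 5, correction = "iso",
                   nsim = 199, nrank = 5,
                   simulate = function(x) rpoispp(dens_p5))
```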
Generating 199 simulations by evaluating function ...
+
+Done.
+
+
plot(khet_p)
+
+
+
+
+
Ici, puisque nous estimons la variation de densité à une plus grande échelle, même après avoir tenu compte de cette variation, les semis de peuplier semblent agrégés à petite échelle.
+
+
+
+
20 Corrélation spatiale d’une variable
+
La corrélation entre les mesures d’une variable prises à des points rapprochés est une caractéristique dans de nombreux jeux de données. Ce principe est parfois appelé “première loi de la géographie” et exprimé par la citation de Waldo Tobler: “Everything is related to everything else, but near things are more related than distant things.” (Tout est relié, mais les choses rapprochées le sont davantage que celles éloignées).
+
En statistique, nous parlons souvent d’autocorrélation pour désigner la corrélation qui existe entre les mesures d’une même variable prises à différents moments (autocorrélation temporelle) ou différents lieux (autocorrélation spatiale).
+
+
Dépendance intrinsèque ou induite
+
Il existe deux types fondamentaux de dépendance spatiale sur une variable mesurée \(y\): une dépendance intrinsèque à \(y\), ou une dépendance induite par des variables externes influençant \(y\), qui sont elles-mêmes corrélées dans l’espace.
+
Par exemple, supposons que l’abondance d’une espèce soit corrélée entre deux sites rapprochés:
+
+
cette dépendance spatiale peut être induite si elle est due à une corrélation spatiale des facteurs d’habitat qui favorisent ou défavorisent l’espèce;
+
ou elle peut être intrinsèque si elle est due à la dispersion d’individus entre sites rapprochés.
+
+
Dans plusieurs cas, les deux types de dépendance affectent une variable donnée.
+
Si la dépendance est simplement induite et que les variables externes qui en sont la cause sont incluses dans le modèle expliquant \(y\), alors les résidus du modèle seront indépendants et nous pouvons utiliser toutes les méthodes déjà vues qui ignorent la dépendance spatiale.
+
Cependant, si la dépendance est intrinsèque ou due à des influences externes non-mesurées, alors il faudra tenir compte de la dépendance spatiale des résidus dans le modèle.
+
+
+
Différentes façons de modéliser les effets spatiaux
+
Dans cette formation, nous modéliserons directement les corrélations spatiales de nos données. Il est utile de comparer cette approche à d’autres façons d’inclure des aspects spatiaux dans un modèle statistique.
+
D’abord, nous pourrions inclure des prédicteurs dans le modèle qui représentent la position (ex.: longitude, latitude). De tels prédicteurs peuvent être utiles pour détecter une tendance ou un gradient systématique à grande échelle, que cette tendance soit linéaire ou non (par exemple, avec un modèle additif généralisé).
+
En contraste à cette approche, les modèles que nous verrons maintenant servent à modéliser une corrélation spatiale dans les fluctuations aléatoires d’une variable (i.e., dans les résidus après avoir enlevé tout effet systématique).
+
Les modèles mixtes utilisent des effets aléatoires pour représenter la non-indépendance de données sur la base de leur groupement, c’est-à-dire qu’après avoir tenu compte des effets fixes systématiques, les données d’un même groupe sont plus semblables (leur variation résiduelle est corrélée) par rapport aux données de groupes différents. Ces groupes étaient parfois définis selon des critères spatiaux (observations regroupées en sites).
+
Cependant, dans un contexte d’effet aléatoire de groupe, tous les groupes sont aussi différents les uns des autres, ex.: deux sites à 100 km l’un de l’autre ne sont pas plus ou moins semblables que deux sites distants de 2 km.
+
Les méthodes que nous verrons ici et dans les prochaines parties de la formation nous permettent donc de modéliser la non-indépendance sur une échelle continue (plus proche = plus corrélé) plutôt que seulement discrète (hiérarchie de groupements).
+
+
+
+
21 Modèles géostatistiques
+
La géostatistique désigne un groupe de techniques tirant leur origine en sciences de la Terre. Elle s’intéresse à des variables distribuées de façon continue dans l’espace, dont on cherche à estimer la distribution en échantillonnant un nombre de points. Un exemple classique de ces techniques provient du domaine minier, où l’on cherchait à créer une carte de la concentration du minerai sur un site à partir d’échantillons pris à différents points du site.
+
Pour ces modèles, nous supposerons que \(z(x, y)\) est une variable spatiale stationnaire mesurée selon les coordonnées \(x\) et \(y\).
+
+
Variogramme
+
Un aspect central de la géostatistique est l’estimation du variogramme \(\gamma_z\) de la variable \(z\). Le variogramme est égal à la moitié de l’écart carré moyen entre les valeurs de \(z\) pour deux points \((x_i, y_i)\) et \((x_j, y_j)\) séparés par une distance \(h\).

\[\gamma_z(h) = \frac{1}{2} \text{E}\left[\left(z(x_i, y_i) - z(x_j, y_j)\right)^2\right]_{d_{ij} = h}\]

Dans cette équation, la fonction \(\text{E}\) avec l’indice \(d_{ij}=h\) désigne l’espérance statistique (autrement dit, la moyenne) de l’écart au carré entre les valeurs de \(z\) pour les points séparés par une distance \(h\).
+
Si on préfère exprimer l’autocorrélation \(\rho_z(h)\) entre mesures de \(z\) séparées par une distance \(h\), celle-ci est reliée au variogramme par l’équation:
+
\[\gamma_z = \sigma_z^2(1 - \rho_z)\] ,
+
où \(\sigma_z^2\) est la variance globale de \(z\).
+
Notez que \(\gamma_z = \sigma_z^2\) si nous sommes à une distance où les mesures de \(z\) sont indépendantes, donc \(\rho_z = 0\). Dans ce cas, on voit bien que \(\gamma_z\) s’apparente à une variance, même s’il est parfois appelé “semivariogramme” ou “semivariance” en raison du facteur 1/2 dans l’équation ci-dessus.
+
+
+
Modèles théoriques du variogramme
+
Plusieurs modèles paramétriques ont été proposés pour représenter la corrélation spatiale en fonction de la distance entre points d’échantillonnage. Considérons d’abord une corrélation qui diminue de façon exponentielle:
+
\[\rho_z(h) = e^{-h/r}\]
+
Ici, \(\rho_z = 1\) pour \(h = 0\) et la corrélation est multipliée par \(1/e \approx 0.37\) chaque fois que la distance augmente de \(r\). Dans ce contexte, \(r\) se nomme la portée (range) de la corrélation.
+
À partir de l’équation ci-dessus, nous pouvons calculer le variogramme correspondant.
+
\[\gamma_z(h) = \sigma_z^2 (1 - e^{-h/r})\]
+
Voici une représentation graphique de ce variogramme.
+
+
+
+
+
+
En raison de la fonction exponentielle, la valeur de \(\gamma\) à des grandes distances s’approche de la variance globale \(\sigma_z^2\) sans exactement l’atteindre. Cette asymptote est appelée palier (sill) dans le contexte géostatistique et représentée par le symbole \(s\).
+
Finalement, il n’est parfois pas réaliste de supposer une corrélation parfaite lorsque la distance tend vers 0, en raison d’une variation possible de \(z\) à très petite échelle. On peut ajouter au modèle un effet de pépite (nugget), noté \(n\), pour que \(\gamma\) s’approche de \(n\) (plutôt que 0) si \(h\) tend vers 0. Le terme pépite provient de l’origine minière de ces techniques, où une pépite d’un minerai pourrait être la source d’une variation abrupte de la concentration à petite échelle.
+
En ajoutant l’effet de pépite, le reste du variogramme est “compressé” pour conserver le même palier, ce qui résulte en l’équation suivante.
+
\[\gamma_z(h) = n + (s - n) (1 - e^{-h/r})\]
+
Dans le package gstat que nous utiliserons ci-dessous, le terme \((s - n)\) est le palier partiel (partial sill, ou psill) pour la partie exponentielle.
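Par exemple, voici comment ce modèle exponentiel avec pépite pourrait être spécifié avec la fonction vgm de gstat (valeurs purement illustratives) :

```r
library(gstat)

s <- 2000  # palier (valeur illustrative)
n <- 300   # effet de pépite (valeur illustrative)
r <- 100   # portée (valeur illustrative)

# psill correspond au palier partiel (s - n)
vgm(psill = s - n, model = "Exp", range = r, nugget = n)
```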
+
+
+
+
+
+
En plus du modèle exponentiel, deux autres modèles théoriques courants pour le variogramme sont le modèle gaussien (où la corrélation suit une courbe demi-normale), ainsi que le modèle sphérique (où le variogramme augmente de façon linéaire au départ pour ensuite courber et atteindre le palier à une distance égale à sa portée \(r\)). Le modèle sphérique permet donc à la corrélation d’être exactement 0 à grande distance, plutôt que de s’approcher graduellement de zéro dans le cas des autres modèles.
* Pour le modèle sphérique, \(\rho = 0\) et \(\gamma = s\) si \(h \ge r\).
+
+
+
+
+
+
+
+
Variogramme empirique
+
Pour estimer \(\gamma_z(h)\) à partir de données empiriques, nous devons définir des classes de distance, donc grouper différentes distances dans une marge \(\pm \delta\) autour d’une distance \(h\), puis calculer l’écart-carré moyen pour les paires de points dans cette classe de distance.
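Cette description correspond à l'estimateur suivant, où \(N_h\) est le nombre de paires de points dont la distance appartient à la classe centrée sur \(h\) :

\[\hat{\gamma_z}(h) = \frac{1}{2 N_h} \sum_{d_{ij} \approx h} \left(z_i - z_j\right)^2\]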
Nous pouvons ensuite inclure cette dépendance spatiale dans un modèle de régression linéaire de la forme suivante.

\[v = \beta_0 + \sum_i \beta_i u_i + z + \epsilon\]

Ici, \(v\) désigne la variable réponse et \(u\) les prédicteurs, pour ne pas confondre avec les coordonnées spatiales \(x\) et \(y\).
+
En plus du résidu \(\epsilon\) qui est indépendant entre les observations, le modèle inclut un terme \(z\) qui représente la portion spatialement corrélée de la variance résiduelle.
+
Voici une suggestion d’étapes à suivre pour appliquer ce type de modèle:
+
+
Ajuster le modèle de régression sans corrélation spatiale.
+
Vérifier la présence de corrélation spatiale à partir du variogramme empirique des résidus.
+
Ajuster un ou plusieurs modèles de régression avec corrélation spatiale et choisir celui qui montre le meilleur ajustement aux données.
+
+
+
+
+
22 Modèles géostatistiques dans R
+
Le package gstat contient des fonctions liées à la géostatistique. Pour cet exemple, nous utiliserons le jeu de données oxford de ce package, qui contient des mesures de propriétés physiques et chimiques pour 126 échantillons du sol d’un site, ainsi que leurs coordonnées XCOORD et YCOORD.
Supposons que nous souhaitons modéliser la concentration de magnésium (MG1), représentée en fonction de la position spatiale dans le graphique suivant.
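Voici une esquisse du code produisant ce graphique; le chargement des packages et le choix des paramètres graphiques sont supposés.

```r
library(gstat)
library(ggplot2)

data(oxford)  # jeu de données inclus avec gstat

# Les axes sont inversés (YCOORD en x), comme indiqué dans la note ci-dessous
ggplot(oxford, aes(x = YCOORD, y = XCOORD)) +
    geom_point(aes(color = MG1)) +
    coord_fixed()
```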
Notez que les axes \(x\) et \(y\) ont été inversés par souci d’espace. La fonction coord_fixed() de ggplot2 assure que l’échelle soit la même sur les deux axes, ce qui est utile pour représenter des données spatiales.
+
Nous voyons tout de suite que ces mesures ont été prises sur une grille de 100 m de côté. Il semble que la concentration de magnésium soit spatialement corrélée, bien qu’il puisse s’agir d’une corrélation induite par une autre variable. Nous savons notamment que la concentration de magnésium est reliée négativement au pH du sol (PH1).
+
+
ggplot(oxford, aes(x = PH1, y = MG1)) +
+geom_point()
+
+
+
+
+
La fonction variogram de gstat sert à estimer un variogramme à partir de données empiriques. Voici le résultat obtenu pour la variable MG1.
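Voici l'appel correspondant; le nom var_mg correspond à celui utilisé plus loin.

```r
var_mg <- variogram(MG1 ~ 1, locations = ~ XCOORD + YCOORD, data = oxford)
head(var_mg)
```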
La formule MG1 ~ 1 indique qu’aucun prédicteur linéaire n’est inclus dans ce modèle, tandis que l’argument locations indique quelles variables du tableau correspondent aux coordonnées spatiales.
+
Dans le tableau obtenu, gamma est la valeur du variogramme pour la classe de distance centrée sur dist, tandis que np est le nombre de paires de points dans cette classe. Ici, puisque les points sont situés sur une grille, nous obtenons des classes de distance régulières (ex.: 100 m pour les points voisins sur la grille, 141 m pour les voisins en diagonale, etc.).
+
Nous nous limitons ici à l’estimation de variogrammes isotropiques, c’est-à-dire que le variogramme dépend seulement de la distance entre les deux points et non de la direction. Bien que nous n’ayons pas le temps de le voir aujourd’hui, il est possible avec gstat d’estimer séparément le variogramme dans différentes directions.
+
Nous pouvons illustrer le variogramme avec plot.
+
+
plot(var_mg, col ="black")
+
+
+
+
+
Si nous voulons estimer la corrélation spatiale résiduelle de MG1 après avoir inclus l’effet de PH1, nous pouvons ajouter ce prédicteur à la formule.
+
+
var_mg <- variogram(MG1 ~ PH1, locations = ~ XCOORD + YCOORD, data = oxford)
plot(var_mg, col = "black")
+
+
+
+
+
En incluant l’effet du pH, la portée de la corrélation spatiale semble diminuer, le palier étant atteint autour de 300 m. Il semble même que le variogramme diminue au-delà de 400 m. En général, nous supposons que la variance entre deux points ne diminue pas avec la distance, à moins d’avoir un patron spatial périodique.
+
La fonction fit.variogram accepte comme arguments un variogramme estimé à partir des données, ainsi qu’un modèle théorique décrit dans une fonction vgm, puis estime les paramètres de ce modèle en fonction des données. L’ajustement se fait par la méthode des moindres carrés.
+
Par exemple, vgm("Exp") indique d’ajuster un modèle exponentiel.
+
+
vfit <-fit.variogram(var_mg, vgm("Exp"))
+vfit
+
+
model psill range
+1 Nug 0.000 0.00000
+2 Exp 1951.496 95.11235
+
+
+
Il n’y a aucun effet de pépite, car psill = 0 pour la partie Nug (nugget) du modèle. La partie exponentielle a un palier à 1951 et une portée de 95 m.
+
Pour comparer différents modèles, on peut donner un vecteur de noms de modèles à vgm. Dans l’exemple suivant, nous incluons les modèles exponentiel, gaussien (“Gau”) et sphérique (“Sph”).
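Esquisse de l'appel correspondant :

```r
vfit <- fit.variogram(var_mg, vgm(c("Exp", "Gau", "Sph")))
vfit
```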
model psill range
+1 Nug 0.000 0.00000
+2 Exp 1951.496 95.11235
+
+
+
La fonction nous donne le résultat du modèle le mieux ajusté (plus faible somme des écarts au carré), qui est ici le même modèle exponentiel.
+
Finalement, nous pouvons superposer le modèle théorique et le variogramme empirique sur un même graphique.
+
+
plot(var_mg, vfit, col ="black")
+
+
+
+
+
+
Régression avec corrélation spatiale
+
Nous avons vu ci-dessus que le package gstat permet d’estimer le variogramme des résidus d’un modèle linéaire. Dans notre exemple, la concentration de magnésium était modélisée en fonction du pH, avec des résidus spatialement corrélés.
+
Un autre outil pour ajuster ce même type de modèle est la fonction gls du package nlme, qui est inclus avec l’installation de R.
+
Cette fonction applique la méthode des moindres carrés généralisés (generalized least squares) pour ajuster des modèles de régression linéaire lorsque les résidus ne sont pas indépendants ou lorsque la variance résiduelle n’est pas la même pour toutes les observations. Comme les estimés des coefficients dépendent de l’estimé des corrélations entre les résidus et que ces derniers dépendent eux-mêmes des coefficients, le modèle est ajusté par un algorithme itératif:
+
+
On ajuste un modèle de régression linéaire classique (sans corrélation) pour obtenir des résidus.
+
On ajuste le modèle de corrélation spatiale (variogramme) à partir de ces résidus.
+
On ré-estime les coefficients de la régression en tenant compte maintenant des corrélations.
+
+
Les étapes 2 et 3 sont répétées jusqu’à ce que les estimés soient stables à une précision voulue.
+
Voici l’application de cette méthode au même modèle pour la concentration de magnésium dans le jeu de données oxford. Dans l’argument correlation de gls, nous spécifions un modèle de corrélation exponentielle en fonction de nos coordonnées spatiales et indiquons que nous voulons aussi estimer un effet de pépite.
+
En plus de la corrélation exponentielle corExp, la fonction gls peut aussi estimer un modèle gaussien (corGaus) ou sphérique (corSpher).
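Voici une esquisse de cet appel, cohérente avec le sommaire présenté ci-dessous; le nom de l'objet (gls_mg) est supposé.

```r
library(nlme)

gls_mg <- gls(MG1 ~ PH1, data = oxford,
              correlation = corExp(form = ~ XCOORD + YCOORD, nugget = TRUE))
summary(gls_mg)
```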
Generalized least squares fit by REML
+ Model: MG1 ~ PH1
+ Data: oxford
+ AIC BIC logLik
+ 1278.65 1292.751 -634.325
+
+Correlation Structure: Exponential spatial correlation
+ Formula: ~XCOORD + YCOORD
+ Parameter estimate(s):
+ range nugget
+478.0322964 0.2944753
+
+Coefficients:
+ Value Std.Error t-value p-value
+(Intercept) 391.1387 50.42343 7.757084 0
+PH1 -41.0836 6.15662 -6.673079 0
+
+ Correlation:
+ (Intr)
+PH1 -0.891
+
+Standardized residuals:
+ Min Q1 Med Q3 Max
+-2.1846957 -0.6684520 -0.3687813 0.4627580 3.1918604
+
+Residual standard error: 53.8233
+Degrees of freedom: 126 total; 124 residual
+
+
+
Pour comparer ce résultat au variogramme ajusté ci-dessus, il faut transformer les paramètres donnés par gls. La portée (range) a le même sens dans les deux cas et correspond à 478 m pour le résultat de gls. La variance globale des résidus est le carré de Residual standard error. L’effet de pépite ici (0.294) est exprimé comme fraction de cette variance. Finalement, pour obtenir le palier partiel de la partie exponentielle, il faut soustraire l’effet de pépite de la variance totale.
+
Après avoir réalisé ces calculs, nous pouvons donner ces paramètres à la fonction vgm de gstat pour superposer ce variogramme estimé par gls à notre variogramme des résidus du modèle linéaire classique.
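Voici une esquisse de ces calculs, basée sur les paramètres rapportés dans le sommaire ci-dessus; le nom gls_vgm correspond à celui utilisé plus loin.

```r
# Variance globale des résidus = (Residual standard error)^2
sigma2 <- 53.8233^2

# Palier partiel = variance totale moins l'effet de pépite (fraction 0.2945)
gls_vgm <- vgm(psill = (1 - 0.2945) * sigma2, model = "Exp",
               range = 478, nugget = 0.2945 * sigma2)

plot(var_mg, gls_vgm, col = "black")
```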
Est-ce que le modèle est moins bien ajusté aux données ici? En fait, ce variogramme empirique représenté par les points avait été obtenu à partir des résidus du modèle linéaire ignorant la corrélation spatiale, donc c’est un estimé biaisé des corrélations spatiales réelles. La méthode est quand même adéquate pour vérifier rapidement s’il y a présence de corrélations spatiales. Toutefois, pour ajuster simultanément les coefficients de la régression et les paramètres de corrélation spatiale, l’approche des moindres carrés généralisés (GLS) est préférable et produira des estimés plus justes.
+
Finalement, notez que le résultat du modèle gls donne aussi l’AIC, que nous pouvons utiliser pour comparer l’ajustement de différents modèles (avec différents prédicteurs ou différentes formes de corrélation spatiale).
+
+
+
Exercice
+
Le fichier bryo_belg.csv est adapté des données de l’étude:
+
+
Neyens, T., Diggle, P.J., Faes, C., Beenaerts, N., Artois, T. et Giorgi, E. (2019) Mapping species richness using opportunistic samples: a case study on ground-floor bryophyte species richness in the Belgian province of Limburg. Scientific Reports 9, 19122. https://doi.org/10.1038/s41598-019-55593-x
+
+
Ce tableau de données indique la richesse spécifique des bryophytes au sol (richness) pour différents points d’échantillonnage de la province belge de Limbourg, avec leur position (x, y) en km, en plus de l’information sur la proportion de forêts (forest) et de milieux humides (wetland) dans une cellule de 1 km\(^2\) contenant le point d’échantillonnage.
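Le tableau peut être chargé comme suit (le chemin du fichier est supposé) :

```r
bryo_belg <- read.csv("data/bryo_belg.csv")
head(bryo_belg)
```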
Pour cet exercice, nous utiliserons la racine carrée de la richesse spécifique comme variable réponse. La transformation racine carrée permet souvent d’homogénéiser la variance des données de comptage afin d’y appliquer une régression linéaire.
+
+
Ajustez un modèle linéaire de la richesse spécifique transformée en fonction de la fraction de forêt et de milieux humides, sans tenir compte des corrélations spatiales. Quel est l’effet des deux prédicteurs selon ce modèle?
+
Calculez le variogramme empirique des résidus du modèle en (a). Semble-t-il y avoir une corrélation spatiale entre les points?
+
+
Note: L’argument cutoff de la fonction variogram spécifie la distance maximale à laquelle le variogramme est calculé. Vous pouvez ajuster manuellement cette valeur pour bien voir le palier.
+
+
Ré-ajustez le modèle linéaire en (a) avec la fonction gls du package nlme, en essayant différents types de corrélations spatiales (exponentielle, gaussienne, sphérique). Comparez les modèles (incluant celui sans corrélation spatiale) avec l’AIC.
+
Quel est l’effet de la fraction de forêts et de milieux humides selon le modèle en (c)? Expliquez les différences entre les conclusions de ce modèle et du modèle en (a).
+
+
+
+
+
23 Krigeage
+
Tel que mentionné précédemment, une application courante des modèles géostatistiques consiste à prédire la valeur de la variable de réponse à des points non-échantillonnés, une forme d’interpolation spatiale appelée krigeage (kriging).
+
Il existe trois principaux types de krigeage selon les suppositions faites au sujet de la variable réponse:
+
+
Krigeage ordinaire: variable stationnaire avec une moyenne inconnue.
+
Krigeage simple: Variable stationnaire avec une moyenne connue.
+
Krigeage universel: Variable dont la tendance est donnée par un modèle linéaire ou non linéaire.
+
+
Pour toutes les méthodes de krigeage, les prédictions à un nouveau point sont une moyenne pondérée des valeurs à des points connus. Ces pondérations sont choisies de manière à ce que le krigeage fournisse la meilleure prédiction linéaire non biaisée de la variable de réponse, si les hypothèses du modèle (en particulier le variogramme) sont correctes. C’est-à-dire que, parmi toutes les prédictions non biaisées possibles, les poids sont choisis de manière à donner l’erreur quadratique moyenne minimale. Le krigeage fournit également une estimation de l’incertitude de chaque prédiction.
+
Bien que nous ne présentions pas ici les équations détaillées du krigeage, les poids dépendent à la fois des corrélations (estimées par le variogramme) entre les points échantillonnés et le nouveau point, ainsi que des corrélations entre les points échantillonnés eux-mêmes. Autrement dit, les points échantillonnés proches du nouveau point ont plus de poids, mais les points échantillonnés isolés ont également plus de poids, car les points échantillonnés proches les uns des autres fournissent une information redondante.
+
Le krigeage est une méthode d’interpolation, donc la prédiction à un point échantillonné sera toujours égale à la valeur mesurée (la variable est supposée être mesurée sans erreur, elle varie seulement entre les points). Cependant, en présence d’un effet de pépite, tout petit déplacement par rapport à l’endroit échantillonné présentera une variabilité en fonction de la pépite.
+
Dans l’exemple ci-dessous, nous générons un nouvel ensemble de données composé de coordonnées (x, y) générées de façon aléatoire dans la zone d’étude ainsi que des valeurs de pH générées de façon aléatoire sur la base des données oxford. Nous appliquons ensuite la fonction krige pour prédire les valeurs de magnésium à ces nouveaux points. Notez que nous spécifions le variogramme dérivé des résultats du gls dans l’argument model de krige.
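Voici une esquisse de ces étapes; le nombre de nouveaux points, la façon précise de générer les valeurs de PH1 et le nom pred_mg sont supposés, tandis que new_points, gls_vgm et la forme de l'appel à krige correspondent à ce qui est décrit dans le texte.

```r
set.seed(42)

# Nouveaux points aléatoires dans la zone d'étude, avec un pH plausible
new_points <- data.frame(
    XCOORD = runif(50, min(oxford$XCOORD), max(oxford$XCOORD)),
    YCOORD = runif(50, min(oxford$YCOORD), max(oxford$YCOORD)),
    PH1 = runif(50, min(oxford$PH1), max(oxford$PH1))
)

pred_mg <- krige(MG1 ~ PH1, locations = ~ XCOORD + YCOORD, data = oxford,
                 newdata = new_points, model = gls_vgm)
head(pred_mg)
```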
Le résultat de krige comprend les nouvelles coordonnées du point, la prédiction de la variable var1.pred ainsi que sa variance estimée var1.var. Dans le graphique ci-dessous, nous montrons les prédictions moyennes de MG1 à partir du krigeage (triangles) ainsi que les mesures (cercles).
La moyenne et la variance estimées par krigeage peuvent être utilisées pour simuler les valeurs possibles de la variable à chaque nouveau point, conditionnellement aux valeurs échantillonnées. Dans l’exemple ci-dessous, nous avons effectué 4 simulations conditionnelles en ajoutant l’argument nsim = 4 à la même instruction krige.
+
+
sim_mg <- krige(MG1 ~ PH1, locations = ~ XCOORD + YCOORD, data = oxford,
                newdata = new_points, model = gls_vgm, nsim = 4)
+
+
drawing 4 GLS realisations of beta...
+[using conditional Gaussian simulation]
Voici maintenant une solution possible de l’exercice précédent sur les données bryo_belg.

bryo_lm <- lm(sqrt(richness) ~ forest + wetland, data = bryo_belg)
summary(bryo_lm)
+
+
+Call:
+lm(formula = sqrt(richness) ~ forest + wetland, data = bryo_belg)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-1.8847 -0.4622 0.0545 0.4974 2.3116
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 2.34159 0.08369 27.981 < 2e-16 ***
+forest 1.11883 0.13925 8.034 9.74e-15 ***
+wetland -0.59264 0.17216 -3.442 0.000635 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 0.7095 on 417 degrees of freedom
+Multiple R-squared: 0.2231, Adjusted R-squared: 0.2193
+F-statistic: 59.86 on 2 and 417 DF, p-value: < 2.2e-16
+
+
+
La proportion de forêts a un effet positif significatif et la proportion de milieux humides a un effet négatif significatif sur la richesse des bryophytes.
+
+
plot(variogram(sqrt(richness) ~ forest + wetland, locations = ~ x + y,
               data = bryo_belg, cutoff = 50), col = "black")
+
+
+
+
+
Le variogramme augmente au moins jusqu’à une distance de 40 km; il semble donc y avoir des corrélations spatiales dans les résidus du modèle.
+
+
bryo_exp <- gls(sqrt(richness) ~ forest + wetland, data = bryo_belg,
                correlation = corExp(form = ~ x + y, nugget = TRUE))
bryo_gaus <- gls(sqrt(richness) ~ forest + wetland, data = bryo_belg,
                 correlation = corGaus(form = ~ x + y, nugget = TRUE))
bryo_spher <- gls(sqrt(richness) ~ forest + wetland, data = bryo_belg,
                  correlation = corSpher(form = ~ x + y, nugget = TRUE))
+
+
+
AIC(bryo_lm)
+
+
[1] 908.6358
+
+
AIC(bryo_exp)
+
+
[1] 867.822
+
+
AIC(bryo_gaus)
+
+
[1] 870.9592
+
+
AIC(bryo_spher)
+
+
[1] 866.9117
+
+
+
Le modèle sphérique a l’AIC le plus faible.
+
+
summary(bryo_spher)
+
+
Generalized least squares fit by REML
+ Model: sqrt(richness) ~ forest + wetland
+ Data: bryo_belg
+ AIC BIC logLik
+ 866.9117 891.1102 -427.4558
+
+Correlation Structure: Spherical spatial correlation
+ Formula: ~x + y
+ Parameter estimate(s):
+ range nugget
+43.1727664 0.6063187
+
+Coefficients:
+ Value Std.Error t-value p-value
+(Intercept) 2.0368769 0.2481636 8.207800 0.000
+forest 0.6989844 0.1481690 4.717481 0.000
+wetland -0.2441130 0.1809118 -1.349348 0.178
+
+ Correlation:
+ (Intr) forest
+forest -0.251
+wetland -0.235 0.241
+
+Standardized residuals:
+ Min Q1 Med Q3 Max
+-1.75204183 -0.06568688 0.61415597 1.15240370 3.23322743
+
+Residual standard error: 0.7998264
+Degrees of freedom: 420 total; 417 residual
+
+
+
La magnitude des deux effets est moins importante et l’effet des milieux humides n’est plus significatif. Comme c’est le cas pour d’autres types de résidus non indépendants, la “taille effective” de l’échantillon est ici inférieure au nombre de points, car des points proches les uns des autres fournissent une information redondante. Par conséquent, la relation entre les prédicteurs et la réponse est moins claire que celle donnée par le modèle supposant que tous ces points étaient indépendants.
+
Notez que les résultats pour les trois modèles gls sont assez similaires, donc le choix d’inclure des corrélations spatiales était plus important que la forme exacte supposée pour le variogramme.
+
+
+
25 Données aréales
+
Les données aréales sont des variables mesurées pour des régions de l’espace; ces régions sont définies par des polygones. Ce type de données est plus courant en sciences sociales, en géographie humaine et en épidémiologie, où les données sont souvent disponibles à l’échelle de divisions administratives du territoire.
+
Ce type de données apparaît aussi fréquemment dans la gestion des ressources naturelles. Par exemple, la carte suivante montre les unités d’aménagement forestier du Ministère des Forêts, de la Faune et des Parcs du Québec.
+
+
Supposons qu’une certaine variable soit disponible au niveau de ces divisions du territoire. Comment pouvons-nous modéliser la corrélation spatiale entre les unités qui sont spatialement rapprochées?
+
Une option serait d’appliquer les méthodes géostatistiques vues précédemment, en calculant par exemple la distance entre les centres des polygones.
+
Une autre option, qui est davantage privilégiée pour les données aréales, consiste à définir un réseau où chaque région est connectée aux régions voisines par un lien. On suppose ensuite que les variables sont directement corrélées entre régions voisines seulement. (Notons toutefois que les corrélations directes entre voisins immédiats génèrent aussi des corrélations indirectes pour une chaîne de voisins.)
+
Dans ce type de modèle, la corrélation n’est pas nécessairement la même d’un lien à un autre. Dans ce cas, chaque lien du réseau peut être associé à un poids représentant son importance pour la corrélation spatiale. Nous représentons ces poids par une matrice \(W\) où \(w_{ij}\) est le poids du lien entre les régions \(i\) et \(j\). Une région n’a pas de lien avec elle-même, donc \(w_{ii} = 0\).
+
Un choix simple pour \(W\) consiste à assigner un poids égal à 1 si les régions sont voisines, sinon 0 (poids binaires).
+
Outre les divisions du territoire en polygones, un autre exemple de données aréales consiste en une grille où la variable est compilée pour chaque cellule de la grille. Dans ce cas, une cellule a généralement 4 ou 8 cellules voisines, selon que les diagonales soient incluses ou non.
+
+
+
26 Indice de Moran
+
Avant de discuter des modèles d’autocorrélation spatiale, nous présentons l’indice \(I\) de Moran, qui permet de tester si une corrélation significative est présente entre régions voisines.
+
L’indice de Moran est un coefficient d’autocorrélation spatiale des \(z\), pondéré par les poids \(w_{ij}\). Il prend donc des valeurs entre -1 et 1.

\[I = \frac{N}{\sum_i \sum_j w_{ij}} \frac{\sum_i \sum_j w_{ij} (z_i - \bar{z}) (z_j - \bar{z})}{\sum_i (z_i - \bar{z})^2}\]

Dans cette équation, nous reconnaissons l’expression d’une corrélation, soit le produit des écarts à la moyenne de deux variables \(z_i\) et \(z_j\), divisé par le produit de leurs écarts-types (qui est le même, donc on obtient la variance). La contribution de chaque paire \((i, j)\) est multipliée par son poids \(w_{ij}\) et le terme à gauche (le nombre de régions \(N\) divisé par la somme des poids) assure que le résultat soit borné entre -1 et 1.
+
Puisque la distribution de \(I\) est connue en l’absence d’autocorrélation spatiale, cette statistique permet de tester l’hypothèse nulle selon laquelle il n’y a pas de corrélation spatiale entre régions voisines.
+
Bien que nous ne verrons pas d’exemple dans ce cours-ci, l’indice de Moran peut aussi être appliqué aux données ponctuelles. Dans ce cas, on divise les paires de points en classes de distance et on calcule \(I\) pour chaque classe de distance; le poids \(w_{ij} = 1\) si la distance entre \(i\) et \(j\) se trouve dans la classe de distance voulue, 0 autrement.
+
+
+
27 Modèles d’autorégression spatiale
+
Rappelons-nous la formule pour une régression linéaire avec dépendance spatiale:

\[v = \beta_0 + \sum_i \beta_i u_i + z + \epsilon\]

où \(z\) est la portion de la variance résiduelle qui est spatialement corrélée.
+
Il existe deux principaux types de modèles autorégressifs pour représenter la dépendance spatiale de \(z\): l’autorégression conditionnelle (CAR) et l’autorégression simultanée (SAR).
+
+
Autorégression conditionnelle (CAR)
+
Dans le modèle d’autorégression conditionnelle, la valeur de \(z_i\) pour la région \(i\) suit une distribution normale: sa moyenne dépend de la valeur \(z_j\) des régions voisines, multipliée par le poids \(w_{ij}\) et un coefficient de corrélation \(\rho\); son écart-type \(\sigma_{z_i}\) peut varier d’une région à l’autre.

\[z_i \sim \text{N}\left(\sum_j \rho w_{ij} z_j, \, \sigma_{z_i} \right)\]

Dans ce modèle, si \(w_{ij}\) est une matrice binaire (0 pour les non-voisins, 1 pour les voisins), alors \(\rho\) est le coefficient de corrélation partielle entre régions voisines. Cela est semblable à un modèle autorégressif d’ordre 1 dans le contexte de séries temporelles, où le coefficient d’autorégression indique la corrélation partielle.
+
+
+
Autorégression simultanée (SAR)
+
Dans le modèle d’autorégression simultanée, la valeur de \(z_i\) est donnée directement par la somme de contributions des valeurs voisines \(z_j\), multipliées par \(\rho w_{ij}\), avec un résidu indépendant \(\nu_i\) d’écart-type \(\sigma_z\).
+
\[z_i = \sum_j \rho w_{ij} z_j + \nu_i\]
+
À première vue, cela ressemble à un modèle autorégressif temporel. Il existe cependant une différence conceptuelle importante. Pour les modèles temporels, l’influence causale est dirigée dans une seule direction: \(v(t-2)\) affecte \(v(t-1)\) qui affecte ensuite \(v(t)\). Pour un modèle spatial, chaque \(z_j\) qui affecte \(z_i\) dépend à son tour de \(z_i\). Ainsi, pour déterminer la distribution conjointe des \(z\), il faut résoudre simultanément (d’où le nom du modèle) un système d’équations.
+
Pour cette raison, même si ce modèle ressemble à la formule du modèle conditionnel (CAR), les solutions des deux modèles diffèrent et dans le cas du SAR, le coefficient \(\rho\) n’est pas directement égal à la corrélation partielle due à chaque région voisine.
+
Pour plus de détails sur les aspects mathématiques de ces modèles, vous pouvez consulter l’article de Ver Hoef et al. (2018) suggéré en référence.
+
Pour l’instant, nous considérerons les SAR et les CAR comme deux types de modèles possibles pour représenter une corrélation spatiale sur un réseau. Nous pouvons toujours ajuster plusieurs modèles et les comparer avec l’AIC pour choisir la meilleure forme de la corrélation ou la meilleure matrice de poids.
+
Les modèles CAR et SAR partagent un avantage sur les modèles géostatistiques au niveau de l’efficacité. Dans un modèle géostatistique, les corrélations spatiales sont définies entre chaque paire de points, même si elles deviennent négligeables lorsque la distance augmente. Pour un modèle CAR ou SAR, seules les régions voisines contribuent et la plupart des poids sont égaux à 0, ce qui rend ces modèles plus rapides à ajuster qu’un modèle géostatistique lorsque les données sont massives.
+
+
+
+
28 Analyse des données aréales dans R
+
Pour illustrer l’analyse de données aréales dans R, nous chargeons les packages sf (pour lire des données géospatiales), spdep (pour définir des réseaux spatiaux et calculer l’indice de Moran) et spatialreg (pour les modèles SAR et CAR).
+
+
library(sf)
+library(spdep)
+library(spatialreg)
+
+
Nous utiliserons comme exemple un jeu de données qui présente une partie des résultats de l’élection provinciale de 2018 au Québec, avec des caractéristiques de la population de chaque circonscription. Ces données sont incluses dans un fichier de type shapefile (.shp), que nous pouvons lire avec la fonction read_sf du package sf.
Note: Le jeu de données est en fait composé de 4 fichiers avec les extensions .dbf, .prj, .shp et .shx, mais il suffit d’inscrire le nom du fichier .shp dans read_sf.
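Voici l'appel de lecture correspondant (le chemin du fichier est supposé) :

```r
elect2018 <- read_sf("data/elect2018.shp")
head(elect2018)
```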
+
Les colonnes du jeu de données sont dans l’ordre:
+
+
le nom de la circonscription électorale;
+
quatre caractéristiques de la population (âge moyen, fraction de la population qui parle principalement français à la maison, fraction des ménages qui sont propriétaires de leur logement, revenu médian);
+
quatre colonnes montrant la fraction des votes obtenues par les principaux partis (CAQ, PQ, PLQ, QS);
+
une colonne geometry qui contient l’objet géométrique (multipolygone) correspondant à la circonscription.
+
+
Pour illustrer une des variables sur une carte, nous appelons la fonction plot avec le nom de la colonne entre crochets et guillemets.
+
+
plot(elect2018["rev_med"])
+
+
+
+
+
Dans cet exemple, nous voulons modéliser la fraction des votes obtenue par la CAQ en fonction des caractéristiques de la population dans chaque circonscription et en tenant compte des corrélations spatiales entre circonscriptions voisines.
+
+
Définition du réseau de voisinage
+
La fonction poly2nb du package spdep définit un réseau de voisinage à partir de polygones. Le résultat vois est une liste de 125 éléments où chaque élément contient les indices des polygones voisins (limitrophes) d’un polygone donné.
+
+
vois <-poly2nb(elect2018)
+vois[[1]]
+
+
[1] 2 37 63 88 101 117
+
+
+
Ainsi, la première circonscription (Abitibi-Est) a 6 circonscriptions voisines, dont on peut trouver les noms ainsi:
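À titre indicatif, en supposant que la colonne des noms de circonscriptions s'appelle circ, on pourrait obtenir ces noms ainsi :

```r
elect2018$circ[vois[[1]]]
```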
Nous pouvons illustrer ce réseau en faisant l’extraction des coordonnées du centre de chaque circonscription, en créant une carte muette avec plot(elect2018["geometry"]), puis en ajoutant le réseau comme couche additionnelle avec plot(vois, add = TRUE, coords = coords).
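Voici une esquisse de ces étapes; l'extraction des centroïdes avec st_centroid et st_coordinates est une façon possible de procéder.

```r
# Coordonnées du centre de chaque circonscription
coords <- st_coordinates(st_centroid(elect2018))

# Carte muette, puis réseau de voisinage en surimpression
plot(elect2018["geometry"])
plot(vois, add = TRUE, coords = coords)
```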
Il nous reste à ajouter des poids à chaque lien du réseau avec la fonction nb2listw. Le style de poids “B” correspond aux poids binaires, soit 1 pour la présence de lien et 0 pour l’absence de lien entre deux circonscriptions.
+
Une fois ces poids définis, nous pouvons vérifier avec le test de Moran s’il y a une autocorrélation significative des votes obtenus par la CAQ entre circonscriptions voisines.
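Voici le code correspondant à ces deux étapes; les noms poids et propCAQ apparaissent dans les résultats plus bas.

```r
poids <- nb2listw(vois, style = "B")
moran.test(elect2018$propCAQ, poids)
```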
+ Moran I test under randomisation
+
+data: elect2018$propCAQ
+weights: poids
+
+Moran I statistic standard deviate = 13.148, p-value < 2.2e-16
+alternative hypothesis: greater
+sample estimates:
+Moran I statistic Expectation Variance
+ 0.680607768 -0.008064516 0.002743472
+
+
+
La valeur de \(I = 0.68\) est très significative à en juger par la valeur \(p\) du test.
+
Vérifions si la corrélation spatiale persiste après avoir tenu compte des quatre caractéristiques de la population, donc en inspectant les résidus d’un modèle linéaire incluant ces quatre prédicteurs.
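Voici une esquisse de ce modèle; le nom elect_lm correspond à celui utilisé dans le test de Moran plus bas.

```r
elect_lm <- lm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, data = elect2018)
summary(elect_lm)
```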
+Call:
+lm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
+ data = elect2018)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-30.9890 -4.4878 0.0562 6.2653 25.8146
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 1.354e+01 1.836e+01 0.737 0.463
+age_moy -9.170e-01 3.855e-01 -2.378 0.019 *
+pct_frn 4.588e+01 5.202e+00 8.820 1.09e-14 ***
+pct_prp 3.582e+01 6.527e+00 5.488 2.31e-07 ***
+rev_med -2.624e-05 2.465e-04 -0.106 0.915
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 9.409 on 120 degrees of freedom
+Multiple R-squared: 0.6096, Adjusted R-squared: 0.5965
+F-statistic: 46.84 on 4 and 120 DF, p-value: < 2.2e-16
+
+
moran.test(residuals(elect_lm), poids)
+
+
+ Moran I test under randomisation
+
+data: residuals(elect_lm)
+weights: poids
+
+Moran I statistic standard deviate = 6.7047, p-value = 1.009e-11
+alternative hypothesis: greater
+sample estimates:
+Moran I statistic Expectation Variance
+ 0.340083290 -0.008064516 0.002696300
+
+
+
L’indice de Moran a diminué mais demeure significatif, donc une partie de la corrélation précédente était induite par ces prédicteurs, mais il reste une corrélation spatiale due à d’autres facteurs.
+
+
+
Modèles d’autorégression spatiale
+
Finalement, nous ajustons des modèles SAR et CAR à ces données avec la fonction spautolm (spatial autoregressive linear model) de spatialreg. Voici le code pour un modèle SAR incluant l’effet des mêmes quatre prédicteurs.
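Esquisse de cet appel, cohérente avec le sommaire présenté ci-dessous; le nom de l'objet (elect_sar) est supposé.

```r
elect_sar <- spautolm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
                      data = elect2018, listw = poids)
summary(elect_sar)
```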
+Call: spautolm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
+ data = elect2018, listw = poids)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-23.08342 -4.10573 0.24274 4.29941 23.08245
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 15.09421119 16.52357745 0.9135 0.36098
+age_moy -0.70481703 0.32204139 -2.1886 0.02863
+pct_frn 39.09375061 5.43653962 7.1909 6.435e-13
+pct_prp 14.32329345 6.96492611 2.0565 0.03974
+rev_med 0.00016730 0.00023209 0.7208 0.47101
+
+Lambda: 0.12887 LR test value: 42.274 p-value: 7.9339e-11
+Numerical Hessian standard error of lambda: 0.012069
+
+Log likelihood: -433.8862
+ML residual variance (sigma squared): 53.028, (sigma: 7.282)
+Number of observations: 125
+Number of parameters estimated: 7
+AIC: 881.77
+
+
+
La valeur donnée par Lambda dans le sommaire correspond au coefficient \(\rho\) dans notre description du modèle. Le test du rapport de vraisemblance (LR test) confirme que cette corrélation spatiale résiduelle (après avoir tenu compte de l’effet des prédicteurs) est significative.
+
Les effets estimés pour les prédicteurs sont semblables à ceux du modèle linéaire sans corrélation spatiale. Les effets de l’âge moyen, de la fraction de francophones et la fraction de propriétaires demeurent significatifs, bien que leur magnitude ait un peu diminué.
+
Pour évaluer un modèle CAR plutôt que SAR, nous devons spécifier family = "CAR".
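Esquisse de l'appel correspondant (nom elect_car supposé) :

```r
elect_car <- spautolm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
                      data = elect2018, listw = poids, family = "CAR")
summary(elect_car)
```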
+Call: spautolm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med,
+ data = elect2018, listw = poids, family = "CAR")
+
+Residuals:
+ Min 1Q Median 3Q Max
+-21.73315 -4.24623 -0.24369 3.44228 23.43749
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 16.57164696 16.84155327 0.9840 0.325128
+age_moy -0.79072151 0.32972225 -2.3981 0.016478
+pct_frn 38.99116707 5.43667482 7.1719 7.399e-13
+pct_prp 17.98557474 6.80333470 2.6436 0.008202
+rev_med 0.00012639 0.00023106 0.5470 0.584364
+
+Lambda: 0.15517 LR test value: 40.532 p-value: 1.9344e-10
+Numerical Hessian standard error of lambda: 0.0026868
+
+Log likelihood: -434.7573
+ML residual variance (sigma squared): 53.9, (sigma: 7.3416)
+Number of observations: 125
+Number of parameters estimated: 7
+AIC: 883.51
+
+
+
Pour un modèle CAR avec des poids binaires, la valeur de Lambda (que nous avions appelé \(\rho\)) donne directement le coefficient de corrélation partielle entre circonscriptions voisines. Notez que l’AIC ici est légèrement supérieur au modèle SAR, donc ce dernier donnait un meilleur ajustement.
+
+
+
Exercice
+
Le jeu de données rls_covid, en format shapefile, contient des données sur les cas de COVID-19 détectés, le nombre de cas par 1000 personnes (taux_1k) et la densité de population (dens_pop) dans chacun des réseaux locaux de service de santé (RLS) du Québec. (Source: Données téléchargées de l’Institut national de santé publique du Québec en date du 17 janvier 2021.)
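Le jeu de données peut être lu comme suit (chemin supposé) :

```r
rls_covid <- read_sf("data/rls_covid.shp")
head(rls_covid)
```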
Simple feature collection with 6 features and 5 fields
+Geometry type: MULTIPOLYGON
+Dimension: XY
+Bounding box: xmin: 785111.2 ymin: 341057.8 xmax: 979941.5 ymax: 541112.7
+Projected CRS: Conique_conforme_de_Lambert_du_MTQ_utilis_e_pour_Adresse_Qu_be
+# A tibble: 6 × 6
+ RLS_code RLS_nom cas taux_1k dens_…¹ geometry
+ <chr> <chr> <dbl> <dbl> <dbl> <MULTIPOLYGON [m]>
+1 0111 RLS de Kamouraska 152 7.34 6.76 (((827028.3 412772.4, 82…
+2 0112 RLS de Rivière-du-Lo… 256 7.34 19.6 (((855905 452116.9, 8557…
+3 0113 RLS de Témiscouata 81 4.26 4.69 (((911829.4 441311.2, 91…
+4 0114 RLS des Basques 28 3.3 5.35 (((879249.6 471975.6, 87…
+5 0115 RLS de Rimouski 576 9.96 15.5 (((917748.1 503148.7, 91…
+6 0116 RLS de La Mitis 76 4.24 5.53 (((951316 523499.3, 9525…
+# … with abbreviated variable name ¹dens_pop
+
+
+
Ajustez un modèle linéaire du nombre de cas par 1000 en fonction de la densité de population (il est suggéré d’appliquer une transformation logarithmique à cette dernière). Vérifiez si les résidus du modèle sont corrélés entre RLS limitrophes avec un test de Moran, puis modélisez les mêmes données avec un modèle autorégressif conditionnel.
+
+
+
Référence
+
Ver Hoef, J.M., Peterson, E.E., Hooten, M.B., Hanks, E.M. et Fortin, M.-J. (2018) Spatial autoregressive models for statistical inference from ecological data. Ecological Monographs 88: 36-59.
+
+
+
+
29 GLMM avec processus spatial gaussien
+
Dans les parties précédentes, nous avons vu comment tenir compte de la dépendance spatiale dans les modèles de régression linéaire avec des modèles géostatistiques (également appelés processus gaussiens) ou des modèles d’autocorrélation spatiale (CAR/SAR). Dans cette dernière partie, nous verrons comment combiner ces caractéristiques avec des modèles de régression plus complexes, en particulier les modèles linéaires généralisés à effets mixtes (GLMM).
+
+
Données
+
Le jeu de données gambia inclus avec le package geoR présente les résultats d’une étude sur la prévalence du paludisme chez les enfants de 65 villages en Gambie. Nous utiliserons une version légèrement transformée des données contenues dans le fichier gambia.csv.
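Voici une façon possible de charger ces données (le chemin du fichier est supposé); le package geoR est aussi chargé afin d'utiliser plus loin le jeu de données gambia.borders. Les variables du tableau sont décrites ci-dessous.

```r
library(geoR)

gambia <- read.csv("data/gambia.csv")
head(gambia)
```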
x et y: Coordonnées spatiales du village (en km, basées sur les coordonnées UTM).
+
pos: Réponse binaire, si l’enfant a eu un test positif du paludisme.
+
age: Âge de l’enfant en jours.
+
netuse: Si l’enfant dort sous un moustiquaire ou non.
+
treated: Si le moustiquaire est traité ou non.
+
green: Mesure de la végétation basée sur les données de télédétection (disponible à l’échelle du village).
+
phc: Présence ou absence d’un centre de santé publique pour le village.
+
+
Nous pouvons compter le nombre de cas positifs et le nombre total d’enfants testés par village pour cartographier la fraction des cas positifs (ou prévalence, prev).
+
+
# Jeu de données à l'échelle du village
gambia_agg <- group_by(gambia, id_village, x, y, green, phc) %>%
    summarize(pos = sum(pos), total = n()) %>%
    mutate(prev = pos / total) %>%
    ungroup()
+
+
`summarise()` has grouped output by 'id_village', 'x', 'y', 'green'. You can
+override using the `.groups` argument.
ggplot(gambia_agg, aes(x = x, y = y)) +
+geom_point(aes(color = prev)) +
+geom_path(data = gambia.borders, aes(x = x /1000, y = y /1000)) +
+coord_fixed() +
+theme_minimal() +
+scale_color_viridis_c()
+
+
+
+
+
Nous utilisons le jeu de données gambia.borders du package geoR pour tracer les frontières des pays avec geom_path. Comme ces frontières sont en mètres, nous les divisons par 1000 pour obtenir la même échelle que nos points. Nous utilisons également coord_fixed pour assurer un rapport d’aspect de 1:1 entre les axes et utilisons la palette de couleur viridis, qui permet de visualiser plus facilement une variable continue par rapport à la palette par défaut dans ggplot2.
+
Sur la base de cette carte, il semble y avoir une corrélation spatiale dans la prévalence du paludisme, le groupe de villages de l’est montrant des valeurs de prévalence plus élevées (jaune-vert) et le groupe du milieu montrant des valeurs de prévalence plus faibles (violet).
+
+
+
GLMM non spatial
+
Pour ce premier exemple, nous allons ignorer l’aspect spatial des données et modéliser la présence du paludisme (pos) en fonction de l’utilisation d’une moustiquaire (netuse) et de la présence d’un centre de santé publique (phc). Comme nous avons une réponse binaire, nous devons utiliser un modèle de régression logistique (un GLM). Comme nous avons des prédicteurs au niveau individuel et au niveau du village et que nous nous attendons à ce que les enfants d’un même village aient une probabilité plus similaire d’avoir le paludisme même après avoir pris en compte ces prédicteurs, nous devons ajouter un effet aléatoire du village. Le résultat est un GLMM que nous ajustons en utilisant la fonction glmer du package lme4.
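Voici une esquisse de cet ajustement; le nom du modèle (mod_glmm) est supposé.

```r
library(lme4)

mod_glmm <- glmer(pos ~ netuse + phc + (1 | id_village),
                  data = gambia, family = binomial)
summary(mod_glmm)
```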
Generalized linear mixed model fit by maximum likelihood (Laplace
+ Approximation) [glmerMod]
+ Family: binomial ( logit )
+Formula: pos ~ netuse + phc + (1 | id_village)
+ Data: gambia
+
+ AIC BIC logLik deviance df.resid
+ 2428.0 2450.5 -1210.0 2420.0 2031
+
+Scaled residuals:
+ Min 1Q Median 3Q Max
+-2.1286 -0.7120 -0.4142 0.8474 3.3434
+
+Random effects:
+ Groups Name Variance Std.Dev.
+ id_village (Intercept) 0.8149 0.9027
+Number of obs: 2035, groups: id_village, 65
+
+Fixed effects:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 0.1491 0.2297 0.649 0.5164
+netuse -0.6044 0.1442 -4.190 2.79e-05 ***
+phc -0.4985 0.2604 -1.914 0.0556 .
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Correlation of Fixed Effects:
+ (Intr) netuse
+netuse -0.422
+phc -0.715 -0.025
+
+
+
D’après ces résultats, les variables netuse et phc sont toutes deux associées à une diminution de la prévalence du paludisme, bien que l’effet de phc ne soit pas significatif à un seuil \(\alpha = 0.05\). L’ordonnée à l’origine (0.149) est le logit de la probabilité de présence du paludisme pour un enfant sans moustiquaire et sans centre de santé publique, mais c’est l’ordonnée à l’origine moyenne pour tous les villages. Il y a beaucoup de variation entre les villages selon l’écart-type de l’effet aléatoire (0.90). Nous pouvons obtenir l’ordonnée à l’origine estimée pour chaque village avec la fonction coef:
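Par exemple (en reprenant le nom de modèle supposé ci-dessus) :

```r
head(coef(mod_glmm)$id_village)
```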
Par exemple, l’ordonnée à l’origine pour le village 1 est environ 0.94, équivalente à une probabilité de 72%:
+
+
plogis(0.937)
+
+
[1] 0.7184933
+
+
+
tandis que celle pour le village 2 est équivalente à une probabilité de 52%:
+
+
plogis(0.092)
+
+
[1] 0.5229838
+
+
+
Le package DHARMa fournit une méthode générale pour vérifier si les résidus d’un GLMM sont distribués selon le modèle spécifié et s’il existe une tendance résiduelle. Il simule des réplicats de chaque observation selon le modèle ajusté et détermine ensuite un “résidu standardisé”, qui est la position relative de la valeur observée par rapport aux valeurs simulées, par exemple 0 si l’observation est plus petite que toutes les simulations, 0.5 si elle se trouve au milieu, etc. Si le modèle représente bien les données, chaque valeur du résidu standardisé entre 0 et 1 doit avoir la même probabilité, de sorte que les résidus standardisés doivent produire une distribution uniforme entre 0 et 1.
+
La fonction simulateResiduals effectue le calcul des résidus standardisés, puis la fonction plot trace les graphiques de diagnostic avec les résultats de certains tests.
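Voici ces deux étapes; le nom res_glmm correspond à celui utilisé plus bas, tandis que le nom du modèle (mod_glmm) demeure une supposition.

```r
library(DHARMa)

res_glmm <- simulateResiduals(mod_glmm)
plot(res_glmm)
```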
Le graphique de gauche est un graphique quantile-quantile des résidus standardisés. Les résultats de trois tests statistiques sont également présentés: un test de Kolmogorov-Smirnov (KS) qui vérifie s’il y a un écart par rapport à la distribution théorique, un test de dispersion qui vérifie s’il y a une sous-dispersion ou une surdispersion et un test de valeurs aberrantes (outlier) basé sur le nombre de résidus qui sont plus extrêmes que toutes les simulations. Ici, nous obtenons un résultat significatif pour les valeurs aberrantes, bien que le message indique que ce résultat pourrait avoir un taux d’erreur de type I plus grand que prévu dans ce cas.
+
À droite, nous obtenons généralement un graphique des résidus standardisés (en y) en fonction du rang des valeurs prédites, afin de vérifier l’absence de tendance résiduelle. Ici, les prédictions sont regroupées par quartile, il serait donc préférable d’agréger les prédictions et les résidus par village, ce que nous pouvons faire avec la fonction recalculateResiduals.
+
+
plot(recalculateResiduals(res_glmm, group = gambia$id_village))
+
+
DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
+
+
+
+
+
+
The plot on the right shows the individual points, along with a quantile regression for the 1st quartile, the median and the 3rd quartile. In theory, these three curves should be horizontal straight lines (no trend of the residuals with respect to the predictions). The curve for the 3rd quartile (in red) is significantly different from a horizontal line, which could indicate a systematic effect missing from the model.
+
+
+
Spatial GLMM with spaMM
+
The spaMM package (spatial mixed models) is a relatively recent R package that performs approximate maximum likelihood estimation of the parameters of GLMMs with spatial dependence, modelled either as a Gaussian process or with a CAR structure (we will see the latter in the last section). The package implements several algorithms, but a single function, fitme, chooses the appropriate algorithm for each model type. For example, here is the same (non-spatial) model as above, fitted with spaMM.
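The fitting call is not shown in this rendering; a sketch of it, assuming the gambia data frame from earlier (the object name mod_nonspat is illustrative):

library(spaMM)
# Non-spatial binomial GLMM, same structure as the glmer fit above
mod_nonspat <- fitme(pos ~ netuse + phc + (1 | id_village),
                     data = gambia, family = binomial)
summary(mod_nonspat)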
formula: pos ~ netuse + phc + (1 | id_village)
+Estimation of lambda by ML (p_v approximation of logL).
+Estimation of fixed effects by ML (p_v approximation of logL).
+family: binomial( link = logit )
+ ------------ Fixed effects (beta) ------------
+ Estimate Cond. SE t-value
+(Intercept) 0.1491 0.2287 0.6519
+netuse -0.6045 0.1420 -4.2567
+phc -0.4986 0.2593 -1.9231
+ --------------- Random effects ---------------
+Family: gaussian( link = identity )
+ --- Variance parameters ('lambda'):
+lambda = var(u) for u ~ Gaussian;
+ id_village : 0.8151
+ --- Coefficients for log(lambda):
+ Group Term Estimate Cond.SE
+ id_village (Intercept) -0.2045 0.2008
+# of obs: 2035; # of groups: id_village, 65
+ ------------- Likelihood values -------------
+ logLik
+logL (p_v(h)): -1210.016
+
+
+
Note that the fixed-effect estimates and the random-effect variance are nearly identical to those obtained from glmer above.
+
We can now use spaMM to fit the same model with the addition of spatial correlations between villages. In the model formula, this is represented as a random effect Matern(1 | x + y), which means that the intercepts are spatially correlated between villages following a Matérn correlation function of the coordinates (x, y). The Matérn function is a flexible spatial correlation function that includes a shape parameter \(\nu\) (nu), such that when \(\nu = 0.5\) it is equivalent to the exponential correlation, while for large values of \(\nu\) it approaches a Gaussian correlation. We could let the function estimate \(\nu\), but here we fix it at 0.5 with the fixed argument of fitme.
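A sketch of that call, assuming x and y are coordinate columns of gambia and that the result is the mod_spamm summarized below:

# Spatial GLMM: Matérn-correlated intercepts plus an uncorrelated village effect
mod_spamm <- fitme(pos ~ netuse + phc + Matern(1 | x + y) + (1 | id_village),
                   data = gambia, family = binomial, fixed = list(nu = 0.5))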
Increase spaMM.options(separation_max=<.>) to at least 21 if you want to check separation (see 'help(separation)').
+
+
summary(mod_spamm)
+
+
formula: pos ~ netuse + phc + Matern(1 | x + y) + (1 | id_village)
+Estimation of corrPars and lambda by ML (p_v approximation of logL).
+Estimation of fixed effects by ML (p_v approximation of logL).
+Estimation of lambda by 'outer' ML, maximizing logL.
+family: binomial( link = logit )
+ ------------ Fixed effects (beta) ------------
+ Estimate Cond. SE t-value
+(Intercept) 0.06861 0.3352 0.2047
+netuse -0.51719 0.1407 -3.6757
+phc -0.44416 0.2052 -2.1648
+ --------------- Random effects ---------------
+Family: gaussian( link = identity )
+ --- Correlation parameters:
+ 1.nu 1.rho
+0.50000000 0.05128692
+ --- Variance parameters ('lambda'):
+lambda = var(u) for u ~ Gaussian;
+ x + y : 0.6421
+ id_village : 0.1978
+# of obs: 2035; # of groups: x + y, 65; id_village, 65
+ ------------- Likelihood values -------------
+ logLik
+logL (p_v(h)): -1197.968
+
+
+
Let us start by checking the model's random effects. The spatial correlation function has a parameter rho equal to 0.0513. In spaMM, this parameter is the inverse of the range, so here the range of the exponential correlation is 1/0.0513, or about 19.5 km. There are now two variance parameters: the one labelled x + y is the long-range variance (i.e. the sill) of the exponential correlation model, while the one labelled id_village shows the uncorrelated portion of the between-village variation.
+
Although we kept the (1 | id_village) random effect in the formula to represent the non-spatial part of the between-village variation, we could also represent it with a nugget effect in the geostatistical model. In both cases, it captures the idea that even two villages very close to each other would have different baseline prevalences in the model.
+
By default, the Matern function has no nugget effect, but we can add one by specifying a non-zero nugget in the initial parameter list init.
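A sketch of that call, matching the mod_spamm2 summarized below (the 0.1 starting value for the nugget is an assumption):

# Same spatial model, but the non-spatial village variation is absorbed by a nugget
mod_spamm2 <- fitme(pos ~ netuse + phc + Matern(1 | x + y),
                    data = gambia, family = binomial,
                    fixed = list(nu = 0.5), init = list(Nugget = 0.1))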
Increase spaMM.options(separation_max=<.>) to at least 21 if you want to check separation (see 'help(separation)').
+
+
summary(mod_spamm2)
+
+
formula: pos ~ netuse + phc + Matern(1 | x + y)
+Estimation of corrPars and lambda by ML (p_v approximation of logL).
+Estimation of fixed effects by ML (p_v approximation of logL).
+Estimation of lambda by 'outer' ML, maximizing logL.
+family: binomial( link = logit )
+ ------------ Fixed effects (beta) ------------
+ Estimate Cond. SE t-value
+(Intercept) 0.06861 0.3352 0.2047
+netuse -0.51719 0.1407 -3.6757
+phc -0.44416 0.2052 -2.1648
+ --------------- Random effects ---------------
+Family: gaussian( link = identity )
+ --- Correlation parameters:
+ 1.nu 1.Nugget 1.rho
+0.50000000 0.23551027 0.05128692
+ --- Variance parameters ('lambda'):
+lambda = var(u) for u ~ Gaussian;
+ x + y : 0.8399
+# of obs: 2035; # of groups: x + y, 65
+ ------------- Likelihood values -------------
+ logLik
+logL (p_v(h)): -1197.968
+
+
+
As you can see, all the estimates are the same, except that the variance of the spatial portion (the sill) is now 0.84 and the nugget is equal to a fraction 0.235 of that sill, i.e. a variance of 0.197, which is identical to the id_village random effect in the version above. The two formulations are therefore equivalent.
+
Now, let us recall the coefficients we had obtained for the non-spatial GLMM:
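A sketch of that comparison, again assuming the glmer fit is stored in mod_glmm (hypothetical name):

# Fixed-effect table (estimates and standard errors) of the non-spatial GLMM
summary(mod_glmm)$coefficients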
In the spatial version, both fixed effects moved slightly closer to zero, but the standard error of the phc effect decreased. Interestingly, including spatial dependence allowed us to estimate more precisely the effect of having a public health centre in the village. This would not always be the case: for a predictor that is itself strongly spatially correlated, spatial correlation in the response makes it harder to estimate the effect of that predictor, since it is confounded with the spatial effect. However, for a predictor that is not spatially correlated, including the spatial effect reduces the residual (non-spatial) variance and can therefore increase the precision of that predictor's effect.
+
The spaMM package is also compatible with DHARMa for residual diagnostics. (You can ignore the warning that it is not among the supported model classes; this is due to using the fitme function rather than an algorithm-specific function in spaMM.)
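A sketch of the diagnostic call whose output follows, assuming the spatial fit mod_spamm2 from above (res_spamm matches the object reused below):

# Standardized residuals for the spaMM fit, then the default DHARMa diagnostic plots
res_spamm <- simulateResiduals(mod_spamm2)
plot(res_spamm)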
DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
+
+
+
+
+
plot(recalculateResiduals(res_spamm, group = gambia$id_village))
+
+
DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
+
+
+
+
+
+
Finally, although we will show how to compute and visualize spatial predictions below, we can produce a quick map of the spatial effects estimated by a spaMM model with the filled.mapMM function.
+
+
filled.mapMM(mod_spamm2)
+
+
+
+
+
+
+
Gaussian processes vs. smoothing splines
+
If you are familiar with generalized additive models (GAMs), you might have thought of representing the spatial variation in malaria prevalence (as shown in the map above) by a 2D smoothing spline (as a function of \(x\) and \(y\)) in a GAM.
+
The code below is the GAM equivalent of our Gaussian-process GLMM above, fitted with the gam function of the mgcv package. The spatial effect is represented by the 2D spline s(x, y), while the non-spatial village random effect is represented by s(id_village, bs = "re"), which is equivalent to (1 | id_village) in the previous models. Note that for the gam function, categorical variables must be explicitly converted to factors.
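The code block was lost in rendering; a sketch of it, assuming gambia contains x, y and id_village (the object name mod_gam matches the one drawn below):

library(mgcv)
# gam requires the grouping variable to be a factor for the "re" smooth
gambia$id_village <- factor(gambia$id_village)
mod_gam <- gam(pos ~ netuse + phc + s(x, y) + s(id_village, bs = "re"),
               data = gambia, family = binomial)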
To visualize the 2D spline, we will use the gratia package.
+
+
library(gratia)
+draw(mod_gam)
+
+
+
+
+
Note that the plot of the s(x, y) spline (top right) does not extend very far from the data locations (the other areas are blank). In this plot, we can also see that the village random effects follow the expected Gaussian distribution (top left).
+
Next, we will use both the spatial GLMM from the previous section and this GAMM to predict the mean prevalence on a spatial grid of points contained in the file gambia_pred.csv. The plot below adds these prediction points (in black) to the previous map of the data points.
+
+
gambia_pred <- read.csv("data/gambia_pred.csv")
+
+ggplot(gambia_agg, aes(x = x, y = y)) +
+  geom_point(data = gambia_pred) +
+  geom_point(aes(color = prev)) +
+  geom_path(data = gambia.borders, aes(x = x / 1000, y = y / 1000)) +
+  coord_fixed() +
+  theme_minimal() +
+  scale_color_viridis_c()
+
+
+
+
+
To make predictions from the GAMM model at these locations, the code below (sketched after this list) performs the following steps:
+
+
All the predictors in the model must be present in the prediction data frame, so we add constant values of netuse and phc (both equal to 1) for all points. We will therefore predict malaria prevalence in the case where a bed net is used and a public health centre is present. We also add a constant id_village, although it will not be used in the predictions (see below).
+
We call the predict function on the gam output to produce predictions at the new data points (newdata argument), including standard errors (se.fit = TRUE) and excluding the village random effects, so the prediction is made for an "average village". The resulting object gam_pred will have columns fit (mean prediction) and se.fit (standard error). These predictions and standard errors are on the link (logit) scale.
+
We attach the original prediction dataset to gam_pred with cbind.
+
We add columns for the mean prediction and the bounds of a 50% confidence interval (mean \(\pm\) 0.674 standard errors), converted from the logit scale to the probability scale with plogis. We choose a 50% interval because a 95% interval may be too wide here to contrast the different predictions on the map at the end of this section.
Note: The reason we do not make predictions directly on the probability (response) scale is that the usual formula for confidence intervals applies more accurately on the logit scale. Adding a given number of standard errors around the mean on the probability scale would lead to less accurate intervals, and possibly to confidence intervals outside the possible range (0, 1) for a probability.
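A sketch of those steps under the assumptions just listed (the exclude label and the way id_village is filled are assumptions; gam_pred matches the name used in the text):

library(dplyr)

# Constant predictor values; reuse an existing factor level so predict.gam accepts it
gambia_pred <- mutate(gambia_pred, netuse = 1, phc = 1,
                      id_village = gambia$id_village[1])

# Link-scale predictions and standard errors, excluding the village random effect
gam_pred <- predict(mod_gam, newdata = gambia_pred, se.fit = TRUE,
                    exclude = "s(id_village)")
gam_pred <- cbind(gambia_pred, as.data.frame(gam_pred))

# Mean and 50% interval, back-transformed to the probability scale
gam_pred <- mutate(gam_pred,
                   pred = plogis(fit),
                   lo = plogis(fit - 0.674 * se.fit),
                   hi = plogis(fit + 0.674 * se.fit))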
+
We apply the same strategy to make predictions from the spatial GLMM with spaMM. There are a few differences in the predict method compared to the GAMM case (a sketch follows below).
+
+
The argument binding = "fit" means that the mean predictions (column fit) will be attached to the prediction dataset and returned as a data frame spamm_pred.
+
The argument variances = list(linPred = TRUE) tells predict to compute the variance of the linear predictor (thus the square of the standard error). However, it appears as a predVar attribute of the output data frame rather than in an se.fit column, so we move it to a column on the next line.
Finally, we combine the two sets of predictions as different rows of a data frame pred_all with bind_rows. The name of the data frame each prediction comes from (gam or spamm) will appear in the "model" column (.id argument). To simplify the production of the next plot, we then use pivot_longer from the tidyr package to change the three columns "pred", "lo" and "hi" into two columns, "stat" and "value" (pred_tall therefore has three rows for each row of pred_all).
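A sketch of those steps, assuming mod_spamm2 and gam_pred from above (spamm_pred, pred_all and pred_tall are the names used in the text; the back-transformation assumes, as for the GAMM, that fit is on the logit scale here):

library(tidyr)

# Mean predictions bound to the prediction data, with linear-predictor variances
spamm_pred <- predict(mod_spamm2, newdata = gambia_pred,
                      binding = "fit", variances = list(linPred = TRUE))
spamm_pred$se.fit <- sqrt(attr(spamm_pred, "predVar"))
spamm_pred <- mutate(spamm_pred,
                     pred = plogis(fit),
                     lo = plogis(fit - 0.674 * se.fit),
                     hi = plogis(fit + 0.674 * se.fit))

# Stack the two prediction sets, then reshape for faceted plotting
pred_all <- bind_rows(gam = gam_pred, spamm = spamm_pred, .id = "model")
pred_tall <- pivot_longer(pred_all, cols = c(pred, lo, hi),
                          names_to = "stat", values_to = "value")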
Having gone through these steps, we can finally look at the prediction maps (mean, lower and upper bounds of the 50% confidence interval) with a ggplot graph. The original data points are shown in red.
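A sketch of that map (facet layout and point sizes are illustrative choices):

ggplot(pred_tall, aes(x = x, y = y)) +
  geom_point(aes(color = value)) +
  geom_point(data = gambia_agg, color = "red", size = 0.5) +
  facet_grid(stat ~ model) +
  coord_fixed() +
  theme_minimal() +
  scale_color_viridis_c()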
While both models agree that prevalence is higher near the eastern cluster of villages, the GAMM also estimates a higher prevalence at a few points (western edge and around the centre) where there are no data. This is an artefact of the shape of the spline around the data points, since a spline is meant to fit a global, albeit nonlinear, trend. In contrast, the geostatistical model represents the spatial effect as local correlations and reverts to the overall mean prevalence far from any data point, which is a safer assumption. This is one reason to prefer a geostatistical / Gaussian process model in this case.
+
+
+
Bayesian methods for GLMMs with Gaussian processes
+
Bayesian models provide a flexible framework for expressing models with complex dependence structures among the data, including spatial dependence. However, fitting a Gaussian process model with a fully Bayesian approach can be slow, owing to the need to compute a spatial covariance matrix between all pairs of points at each iteration.
+
The INLA method (integrated nested Laplace approximation) performs an approximate computation of the Bayesian posterior distribution, which makes it well suited to spatial regression problems. We do not cover it in this course, but I recommend the textbook by Paula Moraga (in the reference section below), which provides worked examples of using INLA for various geostatistical and areal data models in the context of epidemiology, including models with both spatial and temporal dependence. The book presents the same Gambia malaria data as an example of a geostatistical dataset, which inspired its use in this course.
+
+
+
+
30 GLMM with spatial autoregression
+
We return to the last example of the previous part, where we modelled the rate of COVID-19 cases (cases / 1000 people) for the health-network administrative divisions (RLS) in Quebec as a function of their population density. The rate is given by the column "taux_1k" in the rls_covid shapefile.
Previously, we modelled the logarithm of this rate as a linear function of the logarithm of population density, with the residual variance correlated among neighbouring units via a CAR (conditional autoregression) structure, as shown in the code below.
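A sketch of that model, assuming rls_covid has already been read in as a spatial (sf) object (rls_w matches the weights list named in the output below; rls_nb and car_mod are illustrative names):

library(spdep)
library(spatialreg)
# Neighbour list from bordering polygons, then binary spatial weights
rls_nb <- poly2nb(rls_covid)
rls_w <- nb2listw(rls_nb, style = "B")
# Linear model with CAR-correlated residuals
car_mod <- spautolm(log(taux_1k) ~ log(dens_pop), data = rls_covid,
                    listw = rls_w, family = "CAR")
summary(car_mod)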
+Call: spautolm(formula = log(taux_1k) ~ log(dens_pop), data = rls_covid,
+ listw = rls_w, family = "CAR")
+
+Residuals:
+ Min 1Q Median 3Q Max
+-1.201858 -0.254084 -0.053348 0.281482 1.427053
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 1.702068 0.168463 10.1035 < 2.2e-16
+log(dens_pop) 0.206623 0.032848 6.2903 3.169e-10
+
+Lambda: 0.15762 LR test value: 23.991 p-value: 9.6771e-07
+Numerical Hessian standard error of lambda: 0.0050486
+
+Log likelihood: -80.68953
+ML residual variance (sigma squared): 0.2814, (sigma: 0.53048)
+Number of observations: 95
+Number of parameters estimated: 4
+AIC: 169.38
+
+
+
Reminder: The poly2nb function of the spdep package creates a neighbours list based on bordering polygons in a shapefile, then nb2listw converts it to a weights list, here with binary weights (style = "B") so that each bordering region receives the same weight of 1 in the autoregressive model.
+
Instead of using the rates, we could model the case counts directly with a Poisson regression, which is appropriate for count data. To account for the fact that, if the risk per person were equal, the number of cases would be proportional to the population, we can add the unit's population pop as an offset in the Poisson regression. The model would therefore look like: cas ~ log(dens_pop) + offset(log(pop)). Note that since Poisson regression uses a log link, this model with log(pop) as an offset assumes that log(cas / pop) (i.e. the log rate) is proportional to log(dens_pop), just like the linear model above, but it has the advantage of modelling the variability of the raw data (the number of cases) directly with a Poisson distribution.
+
We do not have the population in these data, but we can estimate it from the cases and the rate (cases / 1000) as follows:
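A sketch of that computation (cas and taux_1k are the column names used in this dataset):

# taux_1k is cases per 1000 people, so population = cases * 1000 / rate
rls_covid$pop <- rls_covid$cas * 1000 / rls_covid$taux_1k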
To define a CAR model in spaMM, we need a weights matrix rather than a weights list as in the spatialreg package. Fortunately, the spdep package also includes an nb2mat function to convert the neighbours list to a weights matrix, again using binary weights. To avoid a warning in R, we specify that the row and column names of this matrix must match the IDs associated with each unit (RLS_code). We then add an adjacency(1 | RLS_code) term to the model to specify that the residual variation between the groups defined by RLS_code is spatially correlated with a CAR structure (here, each group has only one observation since we have one data point per RLS unit).
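A sketch of those two steps, matching the formula shown in the output below (rls_mat and mod_car are illustrative names; the adjMatrix argument is how spaMM is passed the weights matrix here, under the assumptions above):

# Binary weights matrix with row/column names set to the unit IDs
rls_mat <- nb2mat(rls_nb, style = "B")
rownames(rls_mat) <- colnames(rls_mat) <- rls_covid$RLS_code

# Poisson GLMM with a CAR-structured random effect and a population offset
mod_car <- fitme(cas ~ log(dens_pop) + offset(log(pop)) + adjacency(1 | RLS_code),
                 data = rls_covid, adjMatrix = rls_mat, family = poisson)
summary(mod_car)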
formula: cas ~ log(dens_pop) + offset(log(pop)) + adjacency(1 | RLS_code)
+Estimation of corrPars and lambda by ML (p_v approximation of logL).
+Estimation of fixed effects by ML (p_v approximation of logL).
+Estimation of lambda by 'outer' ML, maximizing logL.
+family: poisson( link = log )
+ ------------ Fixed effects (beta) ------------
+ Estimate Cond. SE t-value
+(Intercept) -5.1618 0.16855 -30.625
+log(dens_pop) 0.1999 0.03267 6.119
+ --------------- Random effects ---------------
+Family: gaussian( link = identity )
+ --- Correlation parameters:
+ 1.rho
+0.1576605
+ --- Variance parameters ('lambda'):
+lambda = var(u) for u ~ Gaussian;
+ RLS_code : 0.266
+# of obs: 95; # of groups: RLS_code, 95
+ ------------- Likelihood values -------------
+ logLik
+logL (p_v(h)): -709.3234
+
+
+
Note that the spatial correlation coefficient rho (0.158) is similar to the equivalent quantity in the spautolm model above, where it was called Lambda. The effect of log(dens_pop) is also about 0.2 in both models.
+
+
Reference
+
Moraga, Paula (2019) Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. Chapman & Hall/CRC Biostatistics Series. Available online: https://www.paulamoraga.com/book-geospatial/.
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-10-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-10-1.png
new file mode 100644
index 0000000..e827710
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-10-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-100-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-100-1.png
new file mode 100644
index 0000000..847c620
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-100-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-102-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-102-1.png
new file mode 100644
index 0000000..1579e74
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-102-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-103-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-103-1.png
new file mode 100644
index 0000000..62429cd
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-103-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-104-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-104-1.png
new file mode 100644
index 0000000..678ea3c
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-104-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-105-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-105-1.png
new file mode 100644
index 0000000..e40ccbd
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-105-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-106-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-106-1.png
new file mode 100644
index 0000000..0bc1352
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-106-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-107-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-107-1.png
new file mode 100644
index 0000000..8c30a64
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-107-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-108-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-108-1.png
new file mode 100644
index 0000000..c897569
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-108-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-109-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-109-1.png
new file mode 100644
index 0000000..6512cb3
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-109-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-11-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-11-1.png
new file mode 100644
index 0000000..847c620
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-11-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-111-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-111-1.png
new file mode 100644
index 0000000..3d21a0f
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-111-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-112-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-112-1.png
new file mode 100644
index 0000000..f80ce79
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-112-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-113-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-113-1.png
new file mode 100644
index 0000000..f04fb55
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-113-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-114-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-114-1.png
new file mode 100644
index 0000000..c6d591f
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-114-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-115-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-115-1.png
new file mode 100644
index 0000000..3462250
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-115-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-116-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-116-1.png
new file mode 100644
index 0000000..18b37bb
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-116-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-116-2.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-116-2.png
new file mode 100644
index 0000000..d46c839
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-116-2.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-117-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-117-1.png
new file mode 100644
index 0000000..c256353
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-117-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-118-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-118-1.png
new file mode 100644
index 0000000..da04932
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-118-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-119-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-119-1.png
new file mode 100644
index 0000000..27d6996
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-119-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-121-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-121-1.png
new file mode 100644
index 0000000..5ec164e
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-121-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-122-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-122-1.png
new file mode 100644
index 0000000..eaa9c35
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-122-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-124-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-124-1.png
new file mode 100644
index 0000000..b3e984a
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-124-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-125-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-125-1.png
new file mode 100644
index 0000000..a9a8154
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-125-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-128-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-128-1.png
new file mode 100644
index 0000000..1c86bb1
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-128-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-13-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-13-1.png
new file mode 100644
index 0000000..1579e74
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-13-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-130-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-130-1.png
new file mode 100644
index 0000000..4d9a258
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-130-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-133-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-133-1.png
new file mode 100644
index 0000000..1112bc2
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-133-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-135-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-135-1.png
new file mode 100644
index 0000000..a150946
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-135-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-137-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-137-1.png
new file mode 100644
index 0000000..6403054
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-137-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-14-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-14-1.png
new file mode 100644
index 0000000..62429cd
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-14-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-143-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-143-1.png
new file mode 100644
index 0000000..cfeb3c9
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-143-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-146-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-146-1.png
new file mode 100644
index 0000000..6845184
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-146-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-147-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-147-1.png
new file mode 100644
index 0000000..87c7739
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-147-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-15-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-15-1.png
new file mode 100644
index 0000000..678ea3c
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-15-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-155-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-155-1.png
new file mode 100644
index 0000000..291b9ff
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-155-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-16-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-16-1.png
new file mode 100644
index 0000000..e40ccbd
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-16-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-160-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-160-1.png
new file mode 100644
index 0000000..029b711
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-160-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-161-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-161-1.png
new file mode 100644
index 0000000..896bd8e
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-161-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-166-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-166-1.png
new file mode 100644
index 0000000..cecdab5
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-166-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-166-2.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-166-2.png
new file mode 100644
index 0000000..8da716d
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-166-2.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-167-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-167-1.png
new file mode 100644
index 0000000..3018796
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-167-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-169-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-169-1.png
new file mode 100644
index 0000000..052f19c
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-169-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-17-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-17-1.png
new file mode 100644
index 0000000..0bc1352
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-17-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-170-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-170-1.png
new file mode 100644
index 0000000..b4040f8
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-170-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-174-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-174-1.png
new file mode 100644
index 0000000..fa24509
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-174-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-175-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-175-1.png
new file mode 100644
index 0000000..2cc612d
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-175-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-18-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-18-1.png
new file mode 100644
index 0000000..1e0adb8
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-18-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-19-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-19-1.png
new file mode 100644
index 0000000..c897569
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-19-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-2-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-2-1.png
new file mode 100644
index 0000000..be5f7aa
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-2-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-20-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-20-1.png
new file mode 100644
index 0000000..6512cb3
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-20-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-22-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-22-1.png
new file mode 100644
index 0000000..3d21a0f
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-22-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-23-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-23-1.png
new file mode 100644
index 0000000..f80ce79
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-23-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-24-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-24-1.png
new file mode 100644
index 0000000..f04fb55
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-24-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-25-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-25-1.png
new file mode 100644
index 0000000..c6d591f
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-25-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-26-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-26-1.png
new file mode 100644
index 0000000..3462250
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-26-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-27-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-27-1.png
new file mode 100644
index 0000000..18b37bb
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-27-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-27-2.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-27-2.png
new file mode 100644
index 0000000..d46c839
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-27-2.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-28-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-28-1.png
new file mode 100644
index 0000000..115f34f
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-28-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-29-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-29-1.png
new file mode 100644
index 0000000..a0d2cc1
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-29-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-30-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-30-1.png
new file mode 100644
index 0000000..91357a9
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-30-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-32-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-32-1.png
new file mode 100644
index 0000000..1cb19de
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-32-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-33-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-33-1.png
new file mode 100644
index 0000000..ccfb6d4
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-33-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-35-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-35-1.png
new file mode 100644
index 0000000..b3e984a
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-35-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-36-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-36-1.png
new file mode 100644
index 0000000..a9a8154
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-36-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-39-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-39-1.png
new file mode 100644
index 0000000..1c86bb1
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-39-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-4-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-4-1.png
new file mode 100644
index 0000000..2af9406
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-4-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-41-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-41-1.png
new file mode 100644
index 0000000..4d9a258
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-41-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-44-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-44-1.png
new file mode 100644
index 0000000..e0251f1
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-44-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-46-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-46-1.png
new file mode 100644
index 0000000..c4e88bd
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-46-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-48-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-48-1.png
new file mode 100644
index 0000000..6403054
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-48-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-54-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-54-1.png
new file mode 100644
index 0000000..cfeb3c9
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-54-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-57-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-57-1.png
new file mode 100644
index 0000000..6845184
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-57-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-58-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-58-1.png
new file mode 100644
index 0000000..87c7739
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-58-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-66-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-66-1.png
new file mode 100644
index 0000000..291b9ff
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-66-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-7-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-7-1.png
new file mode 100644
index 0000000..af4ee7e
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-7-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-71-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-71-1.png
new file mode 100644
index 0000000..029b711
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-71-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-72-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-72-1.png
new file mode 100644
index 0000000..896bd8e
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-72-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-77-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-77-1.png
new file mode 100644
index 0000000..cecdab5
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-77-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-77-2.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-77-2.png
new file mode 100644
index 0000000..8da716d
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-77-2.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-78-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-78-1.png
new file mode 100644
index 0000000..3018796
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-78-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-80-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-80-1.png
new file mode 100644
index 0000000..a61302d
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-80-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-81-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-81-1.png
new file mode 100644
index 0000000..b4040f8
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-81-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-85-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-85-1.png
new file mode 100644
index 0000000..fa24509
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-85-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-86-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-86-1.png
new file mode 100644
index 0000000..2cc612d
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-86-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-9-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-9-1.png
new file mode 100644
index 0000000..1a45f62
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-9-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-91-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-91-1.png
new file mode 100644
index 0000000..134fc0b
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-91-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-93-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-93-1.png
new file mode 100644
index 0000000..7e46c5c
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-93-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-96-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-96-1.png
new file mode 100644
index 0000000..af4ee7e
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-96-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-98-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-98-1.png
new file mode 100644
index 0000000..1a45f62
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-98-1.png differ
diff --git a/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-99-1.png b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-99-1.png
new file mode 100644
index 0000000..e827710
Binary files /dev/null and b/docs/posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index_files/figure-html/unnamed-chunk-99-1.png differ
diff --git a/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/L'equite la Diversite et l'Inclusion_Sciences (BIOS2+_2e cycle - 1h30)_English.pdf b/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/L'equite la Diversite et l'Inclusion_Sciences (BIOS2+_2e cycle - 1h30)_English.pdf
new file mode 100644
index 0000000..cc09a98
Binary files /dev/null and b/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/L'equite la Diversite et l'Inclusion_Sciences (BIOS2+_2e cycle - 1h30)_English.pdf differ
diff --git a/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/L'equite la Diversite et l'Inclusion_Sciences (BIOS2+_2e cycle - 1h30)_Francais.pdf b/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/L'equite la Diversite et l'Inclusion_Sciences (BIOS2+_2e cycle - 1h30)_Francais.pdf
new file mode 100644
index 0000000..24a4adc
Binary files /dev/null and b/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/L'equite la Diversite et l'Inclusion_Sciences (BIOS2+_2e cycle - 1h30)_Francais.pdf differ
diff --git a/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/image.jpg b/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/image.jpg
new file mode 100644
index 0000000..4def10c
Binary files /dev/null and b/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/image.jpg differ
diff --git a/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/index.html b/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/index.html
new file mode 100644
index 0000000..d09562c
--- /dev/null
+++ b/docs/posts/2021-01-22-introduction-aux-concepts-edi-en-contexte-scientifique/index.html
@@ -0,0 +1,374 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+BIOS2 Education resources - Introduction to EDI concepts in a scientific context
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Introduction to EDI concepts in a scientific context
+
+
+
A short introduction to EDI concepts in a scientific context.
+
+
+
+
Transversal competencies
+
FR
+
EN
+
+
+
+
+
+
Authors
+
+
+
+ Agathe Riallan
+
+
+
+ Faculté des Sciences à Université de Sherbrooke
+
1 Introduction to EDI concepts in a scientific context
+
In 2021, the BIOS2 training program will be holding a series of training and reflection activities on equity, diversity and inclusion issues. The goal is to develop an EDI action plan for the program in order to consolidate a more inclusive, respectful and open environment.
+
The objectives of this workshop are:
+
+
Define the concepts of equity, diversity and inclusion
+
Identify the benefits and challenges of EDI in the university context
+
Recognize how to become an EDI bearer during one’s university career
+
Raise awareness of intercultural communication (professional competence of tomorrow)
+
+
The workshop is developed by Agathe Riallan, Faculty Coordinator for Equity, Diversity and Inclusion (EDI) at the Faculty of Science, Université de Sherbrooke, in collaboration with Marie-José Naud, Equity, Diversity and Inclusion Advisor and Coordinator at the Centre d’études nordiques (CEN).
+
+
+
+
+
+
2 Introduction aux concepts EDI en contexte scientifique
+
En 2021, nous aurons une série de formations et d’activités de réflexion sur les questions d’équité, diversité et d’inclusion. Notre objectif est de mettre en place un plan d’action EDI pour le programme afin de consolider un environnement plus inclusif, respectueux et ouvert.
+
Les objectifs de cet atelier sont:
+
+
Définir les concepts d’équité, de diversité et d’inclusion
+
Identifier les avantages et les défis de l’ÉDI en contexte universitaire
+
Identifier comment être porteuse ou porteur de l’ÉDI lors de son parcours universitaire
+
Se sensibiliser à la communication interculturelle (compétence professionnelle de demain)
+
+
L’atelier est développé par Agathe Riallan, Coordinatrice facultaire de l’Équité, de la Diversité et de l’Inclusion (ÉDI) de la Faculté des Sciences à Université de Sherbrooke, en collaboration avec Marie-José Naud, Conseillère en équité, diversité et inclusion et coordonnatrice au Centre d’études nordiques (CEN).
Analysis of point-count data in the presence of variable survey methodologies and detection error offered by Peter Solymos to BIOS2 Fellows in March 2021.
+
+
+
+
Technical
+
EN
+
+
+
+
+
+
+
+
+
Author
+
+
Peter Solymos
+
+
+
+
+
Published
+
+
March 25, 2021
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
This course is aimed at researchers analyzing field observations, who are often faced with data heterogeneities because field sampling protocols change from one project to another, or through time over the lifespan of a project, or because ‘legacy’ data sets must be combined with new data collected by recording units.
+
Such heterogeneities can bias analyses when data sets are integrated inadequately, or can lead to information loss when filtered and standardized to common standards. Accounting for these issues is important for better inference regarding status and trend of species and communities.
+
Analysts of such ‘messy’ data sets need to feel comfortable manipulating the data, need a full understanding of the mechanics of the models being used (i.e. critically interpreting the results and acknowledging assumptions and limitations), and should be able to make informed choices when faced with methodological challenges.
+
The course emphasizes critical thinking and active learning through hands-on programming exercises. We will use publicly available data sets to demonstrate the data manipulation and analysis, and we will use freely available, open-source R packages.
+
The expected outcome of the course is a solid foundation for further professional development via increased confidence in applying these methods for field observations.
Follow the instructions at the R website to download and install the most up-to-date base R version suitable for your operating system (the latest R version at the time of writing these instructions is 4.0.4).
Having RStudio is not absolutely necessary, but it will make life easier. RStudio is also available for different operating systems. Pick the open source desktop edition from here (the latest RStudio Desktop version at the time of writing these instructions is 1.4.1106).
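A quick, optional sanity check from the R console before the course (the commented install.packages() call is only a placeholder; install whichever packages are announced for the workshop):
R.version.string                        # confirm which R version is installed
# install.packages(c("pkgA", "pkgB"))   # placeholder names; replace with the course's package list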
+
Prior exposure to R programming is not necessary, but knowledge of basic R object types and their manipulation (arrays, data frames, indexing) is useful for following hands-on exercises. Software Carpentry’s Data types and structures in R is a good resource to brush up your R skills.
Sólymos, P., Toms, J. D., Matsuoka, S. M., Cumming, S. G., Barker, N. K. S., Thogmartin, W. E., Stralberg, D., Crosby, A. D., Dénes, F. V., Haché, S., Mahon, C. L., Schmiegelow, F. K. A., and Bayne, E. M., 2020. Lessons learned from comparing spatially explicit models and the Partners in Flight approach to estimate population sizes of boreal birds in Alberta, Canada. Condor, 122: 1-22. PDF
+
Sólymos, P., Matsuoka, S. M., Cumming, S. G., Stralberg, D., Fontaine, P., Schmiegelow, F. K. A., Song, S. J., and Bayne, E. M., 2018. Evaluating time-removal models for estimating availability of boreal birds during point-count surveys: sample size requirements and model complexity. Condor, 120: 765-786. PDF
+
Sólymos, P., Matsuoka, S. M., Stralberg, D., Barker, N. K. S., and Bayne, E. M., 2018. Phylogeny and species traits predict bird detectability. Ecography, 41: 1595-1603. PDF
+
Van Wilgenburg, S. L., Sólymos, P., Kardynal, K. J. and Frey, M. D., 2017. Paired sampling standardizes point count data from humans and acoustic recorders. Avian Conservation and Ecology, 12(1):13. PDF
+
Yip, D. A., Leston, L., Bayne, E. M., Sólymos, P. and Grover, A., 2017. Experimentally derived detection distances from audio recordings and human observers enable integrated analysis of point count data. Avian Conservation and Ecology, 12(1):11. PDF
+
Sólymos, P., and Lele, S. R., 2016. Revisiting resource selection probability functions and single-visit methods: clarification and extensions. Methods in Ecology and Evolution, 7:196-205. PDF
+
Matsuoka, S. M., Mahon, C. L., Handel, C. M., Sólymos, P., Bayne, E. M., Fontaine, P. C., and Ralph, C. J., 2014. Reviving common standards in point-count surveys for broad inference across studies. Condor 116:599-608. PDF
+
Sólymos, P., Matsuoka, S. M., Bayne, E. M., Lele, S. R., Fontaine, P., Cumming, S. G., Stralberg, D., Schmiegelow, F. K. A. & Song, S. J., 2013. Calibrating indices of avian density from non-standardized survey data: making the most of a messy situation. Methods in Ecology and Evolution 4:1047-1058. PDF
+
Matsuoka, S. M., Bayne, E. M., Sólymos, P., Fontaine, P., Cumming, S. G., Schmiegelow, F. K. A., & Song, S. A., 2012. Using binomial distance-sampling models to estimate the effective detection radius of point-counts surveys across boreal Canada. Auk 129:268-282. PDF
+
+
+
+
\ No newline at end of file
diff --git a/docs/posts/2021-03-25-point-count-data-analysis/qpad-workshop-Final.zip b/docs/posts/2021-03-25-point-count-data-analysis/qpad-workshop-Final.zip
new file mode 100644
index 0000000..1ef8245
Binary files /dev/null and b/docs/posts/2021-03-25-point-count-data-analysis/qpad-workshop-Final.zip differ
diff --git a/docs/posts/2021-03-25-point-count-data-analysis/thumb.jpeg b/docs/posts/2021-03-25-point-count-data-analysis/thumb.jpeg
new file mode 100644
index 0000000..ad324fe
Binary files /dev/null and b/docs/posts/2021-03-25-point-count-data-analysis/thumb.jpeg differ
diff --git a/docs/posts/2021-05-04-building-r-packages/altumcode-PNbDkQ2DDgM-unsplash.jpeg b/docs/posts/2021-05-04-building-r-packages/altumcode-PNbDkQ2DDgM-unsplash.jpeg
new file mode 100644
index 0000000..d118b68
Binary files /dev/null and b/docs/posts/2021-05-04-building-r-packages/altumcode-PNbDkQ2DDgM-unsplash.jpeg differ
diff --git a/docs/posts/2021-05-04-building-r-packages/index.html b/docs/posts/2021-05-04-building-r-packages/index.html
new file mode 100644
index 0000000..2fdb10d
--- /dev/null
+++ b/docs/posts/2021-05-04-building-r-packages/index.html
@@ -0,0 +1,521 @@
+
+
+
+
+
+
+
+
+
+
+
+
+BIOS2 Education resources - Building R packages
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Building R packages
+
+
+
This practical training will cover the basics of modern package development in R with a focus on the following three aspects: (1) how to turn your code into functions, (2) how to write tests and documentation, and (3) how to share your R package on GitHub.
But most of all, cookies are delicious for what they contain: chocolate chunks, candy, oats, cocoa. However, all cookies share some fundamental ingredients and a nearly identical structure: flour, saturated with fat and sugar, hydrated only with an egg, flavoured with vanilla and salt. The basic formula is invariant and admits only slight deviation – otherwise, it becomes something other than a cookie.
+
This workshop is devoted to the study of cookie dough.
+
+
Mise en place : development environment
+
We’ll explore a few useful packages in this workshop; devtools and usethis in particular are very popular tools for modern-day R package development.
Building an R package also requires specific tools for compiling the finished package. Run the following line to make sure you have the development environment:
+
devtools::has_devel()
+
If you do not have the software to build R packages, you should see a message which will help you find the correct links to download what you need!
+
Windows users will need RTools. First do the check above to see if you are already set up. If not, download the software here and install it. After that, open R and run the following:
+
writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con ="~/.Renviron")
+
and restart R. Then run the check above once more to confirm.
+
+
+
+
The structure: flour and sugar
+
+
No cookies without carbs
+
+
An R package is essentially a folder on your computer with a specific structure. We will begin by creating an empty R package and taking a tour!
+
Open your R code editor, and find out where you are:
+
getwd()
+
This is to prepare for the next step, where we will choose a location for our R package folder. Please be intentional about where you place your R package! Do not place it in the same space as another package, RStudio project, or other project. Create a new and isolated location for it.
+
I am working from an existing R project in my typical R Projects folder, so I go up one level:
+
usethis::create_package("../netwerk")
+
+
+
Let’s run R CMD CHECK right away. We will do this MANY TIMES.
+
devtools::check()
+
We should see some warnings! Let’s keep these in mind as we continue our tour.
+
+
The DESCRIPTION file
+
The most important file to notice is the DESCRIPTION. This gives general information about the entire package. It is written in a specific plain-text format:
+
Package: netwerk
+Title: Werks with Networks
+Version: 0.0.0.9000
+Authors@R:
+ person(given = "Andrew",
+ family = "MacDonald",
+ role = c("aut", "cre"),
+ email = "<you@email.com>")
+Description: it does networks.
+License: MIT + file LICENSE
+Encoding: UTF-8
+LazyData: true
+Roxygen: list(markdown = TRUE)
+RoxygenNote: 7.1.1
+Suggests:
+ testthat (>= 3.0.0)
+Config/testthat/edition: 3
+
Here are some things to edit manually in DESCRIPTION:
+
+
package name [tk naming of R packages] – make it short and convenient if you can!
+
Title: write this part In Title Case. Don’t end the title with a period.
+
Description: Describe the package in a short block of text. This should end with a period.
+
Authors: Add your name here and the name of anyone building the package with you. usethis will have done the first step for you and filled in the structure. Only “aut” (author) and “cre” (creator) are essential, but many others are possible (see the sketch below).
+
+
Add your name here.
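For example, a filled-in Authors@R entry might look like this (the name and e-mail are placeholders):
person(
  given  = "Your",
  family = "Name",
  role   = c("aut", "cre"),   # author and creator/maintainer
  email  = "you@example.org"
)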
+
Add a license
+
usethis::use_mit_license(copyright_holder ="")
+
Note the different roles that R package authors can have (some are quite funny), but creator and maintainer are the key ones.
+
Note the R folder. We’ll get much more into that later
+
+
Rbuildignore
+
+
+
+
+
Keeping notes
+
Create an R file:
+
usethis::use_build_ignore("dev.R")
+
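A sketch of what such a dev.R scratch file might hold – it is just a note pad of development commands, kept out of the built package by the use_build_ignore() call above (the file name my_function is a placeholder):
# dev.R: notes and one-off commands used while developing
usethis::use_r("my_function")   # create R/my_function.R
devtools::document()            # regenerate man/ pages and NAMESPACE
devtools::check()               # run R CMD check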
the docs folder
+
Here we have a very minimal version of an R package; we’re going to be adding to it as the course progresses.
+
One thing we can do right away is build and check the R package.
+
What exactly is happening here? (See the slide from the R package tutorial.)
+
Lots of checkpoints and progress confirmations along the way.
+
OK, so what is that all about? We have compiled the R package and it has gone to where the R packages on our computer go.
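A minimal sketch of that install step and of checking where the package ends up (netwerk is the example package created above):
devtools::install()   # build the package and install it into your R library
.libPaths()           # the library folders where installed packages live
library(netwerk)      # the freshly installed package now loads like any other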
+
There is a natural cycle to how the different steps in an R package workflow proceed – see the documentation for this lesson – and we will be following this process. (TK: another picture?)
+
OK, so now that we have the basic structure, let’s talk about some content for the R package. I received the donation of a little R function already that we can use to create this workflow in a nice way.
+
This R function (explain what the function does)
+
OK so let’s focus on just one part of this function.
+
load all – shortcut
+
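The “load all” shortcut mentioned above is, in code (the RStudio keyboard shortcut is Ctrl/Cmd + Shift + L):
devtools::load_all()   # load every function in R/ into the current session for interactive testing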
+
How do we do this in VS Code?
+
+
+
How do we add something to the .Rbuildignore? It would be nice to have a little dev script as a space to create all the other dependencies that are involved in making an R package.
+
+
+
+
✔ Setting active project to '/Users/katherine/Documents/GitHub/bios2.github.io-quarto'
+✔ Adding '^development\\.R$' to 'posts/2021-05-04-building-r-packages/.Rbuildignore'
+
+
+
+
+
Useful links
+
This workshop borrows heavily from some excellent sources:
+
+
+
+
\ No newline at end of file
diff --git a/docs/posts/2021-05-04-building-r-packages/start_pkg.png b/docs/posts/2021-05-04-building-r-packages/start_pkg.png
new file mode 100644
index 0000000..40072d2
Binary files /dev/null and b/docs/posts/2021-05-04-building-r-packages/start_pkg.png differ
diff --git "a/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 05.25.32.png" "b/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 05.25.32.png"
new file mode 100644
index 0000000..742209e
Binary files /dev/null and "b/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 05.25.32.png" differ
diff --git "a/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 05.58.08.png" "b/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 05.58.08.png"
new file mode 100644
index 0000000..b02a6aa
Binary files /dev/null and "b/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 05.58.08.png" differ
diff --git "a/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 06.25.38.png" "b/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 06.25.38.png"
new file mode 100644
index 0000000..1d35049
Binary files /dev/null and "b/docs/posts/2021-06-22-introduction-to-shiny-apps/Capture d\342\200\231\303\251cran, le 2021-06-23 \303\240 06.25.38.png" differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/image.jpg b/docs/posts/2021-06-22-introduction-to-shiny-apps/image.jpg
new file mode 100644
index 0000000..325cf34
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/image.jpg differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/add_elements_diagram.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/add_elements_diagram.png
new file mode 100644
index 0000000..935a311
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/add_elements_diagram.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-ids.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-ids.png
new file mode 100644
index 0000000..b1540a1
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-ids.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-in1.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-in1.png
new file mode 100644
index 0000000..dd93dcb
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-in1.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-in2.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-in2.png
new file mode 100644
index 0000000..2612ed2
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-in2.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-output.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-output.png
new file mode 100644
index 0000000..5cbb723
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-output.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-reactiveid.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-reactiveid.png
new file mode 100644
index 0000000..944e581
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/app-reactiveid.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/darklytheme.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/darklytheme.png
new file mode 100644
index 0000000..4090cc6
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/darklytheme.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/plotOutput.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/plotOutput.png
new file mode 100644
index 0000000..bcba3ab
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/plotOutput.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/populated_shiny.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/populated_shiny.png
new file mode 100644
index 0000000..325cf34
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/populated_shiny.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/shiny_dashboard_layout.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/shiny_dashboard_layout.png
new file mode 100644
index 0000000..53e9894
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/shiny_dashboard_layout.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/thematic.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/thematic.png
new file mode 100644
index 0000000..915f958
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/thematic.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/images/thematic_plot.png b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/thematic_plot.png
new file mode 100644
index 0000000..40e2d22
Binary files /dev/null and b/docs/posts/2021-06-22-introduction-to-shiny-apps/images/thematic_plot.png differ
diff --git a/docs/posts/2021-06-22-introduction-to-shiny-apps/index.html b/docs/posts/2021-06-22-introduction-to-shiny-apps/index.html
new file mode 100644
index 0000000..ec8ae2d
--- /dev/null
+++ b/docs/posts/2021-06-22-introduction-to-shiny-apps/index.html
@@ -0,0 +1,1216 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+BIOS2 Education resources - Introduction to Shiny Apps
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Introduction to Shiny Apps
+
+
+
Introduction to interactive app development with R Shiny.
+
+
+
+
Technical
+
Fellow contributed
+
EN
+
+
+
+
+
+
+
+
+
Authors
+
+
Katherine Hébert
+
Andrew MacDonald
+
Jake Lawlor
+
Vincent Bellevance
+
+
+
+
+
Published
+
+
June 22, 2021
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Why do you want to use Shiny?
+
There are many reasons to consider using Shiny for a project:
+
+
Sharing results from a paper with your readers;
+
Helping you explore a model, mathematics, simulations;
+
Letting non R users use R.
+
+
+
+
Hello Shiny!
+
Here is an example of a Shiny app that RStudio generates when you open a new Shiny Web App file:
+
+
# Define UI for app that draws a histogram ----
+ui <-fluidPage(
+
+# App title ----
+titlePanel("Hello Shiny!"),
+
+# Sidebar layout with input and output definitions ----
+sidebarLayout(
+
+# Sidebar panel for inputs ----
+sidebarPanel(
+
+# Input: Slider for the number of bins ----
+sliderInput(inputId ="bins",
+label ="Number of bins:",
+min =1,
+max =50,
+value =30)
+
+ ),
+
+# Main panel for displaying outputs ----
+mainPanel(
+
+# Output: Histogram ----
+plotOutput(outputId ="distPlot")
+
+ )
+ )
+)
+
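The UI above is only half of the app. Below is a minimal sketch of the matching server and the call that launches the app; it follows the default example, which draws a histogram of waiting times from the built-in faithful dataset (exact labels and colours in RStudio’s template may differ).
# Define server logic required to draw a histogram ----
server <- function(input, output) {
  output$distPlot <- renderPlot({
    x    <- faithful$waiting
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = "steelblue", border = "white",
         xlab = "Waiting time to next eruption (mins)",
         main = "Histogram of waiting times")
  })
}

# Run the application ----
shinyApp(ui = ui, server = server)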
+
+
+
1 How a Shiny app works
+
+
Building blocks
+
We’ve now seen the basic building blocks of a Shiny app:
+
+
The user interface, which determines how the app “looks”. This is how we tell Shiny where to ask for user inputs, and where to put any outputs we create.
+
Reactive values, which are values that change according to user inputs. These are values that affect the outputs we create in the Shiny app, such as tables or plots.
+
The server, where we use reactive values to generate some outputs.
+
+
+
IDs
+
The user interface and server communicate through IDs that we assign to inputs from the user and outputs from the server.
+
+
We use an ID (in orange) to link the user input in the UI to the reactive values used in the server:
+
+
We use another ID (in blue) to link the output created in the server to the output shown in the user interface:
+
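As a compact sketch of that pairing, using the IDs from the Hello Shiny example above:
# In the UI: "bins" names the input, "distPlot" names the output slot
sliderInput(inputId = "bins", label = "Number of bins:", min = 1, max = 50, value = 30)
plotOutput(outputId = "distPlot")

# In the server: the same IDs connect the two sides
# input$bins                               -> the slider's current value (a reactive value)
# output$distPlot <- renderPlot({ ... })   -> fills the plotOutput("distPlot") slot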
+
+
+
Organisation
+
These elements can all be placed in one script named app.R or separately in scripts named ui.R and server.R. The choice is up to you, although it becomes easier to work in separate ui.R and server.R scripts when the Shiny app becomes more complex.
+
Example 1: Everything in app.R
+
Example 2: Split things into ui.R and server.R
+
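A rough sketch of the two layouts (my-app is a placeholder directory name; ui.R and server.R are the file names Shiny expects):
# Example 1: a single file
# my-app/app.R        defines ui and server, then ends with shinyApp(ui, server)

# Example 2: two files in the same directory
# my-app/ui.R         ends with the UI object, e.g. fluidPage(...)
# my-app/server.R     ends with function(input, output) { ... }

# Either layout can be launched the same way:
shiny::runApp("my-app")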
+
+
+
+
Plots
+
Shiny is an excellent tool for visual exploration - it is at its most useful when a user can see something change before their eyes according to some selections. This is a great way to allow users to explore a dataset, explore the results of some analyses according to different parameters, and so on!
+
Let’s now add a plot to our Shiny app, to visualize the distribution of a variable depending on user input. We’ll be adding the ggplot2 and ggridges packages in the set-up step at the top of our app.R to allow us to make a plot.
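For example, the set-up chunk at the top of app.R could look like this (assuming the packages are already installed):
library(shiny)
library(ggplot2)    # for building the plot
library(ggridges)   # for geom_density_ridges()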
To add a plot to our Shiny app, we need to indicate where the plot should appear. We can do this with plotOutput(), a function similar to tableOutput() from the previous section but meant for plot outputs, as the name suggests.
+
+
# Define UI for application that makes a table and plots the Volcano Explosivity
+# Index for the most eruptive volcanoes within a selected range of years
+
+ui <-fluidPage(
+
+# Application title ----
+
+titlePanel("Exploring volcano explosivity"),
+
+# Input interface ----
+
+sidebarLayout(
+sidebarPanel(
+
+# Sidebar with a slider range input
+sliderInput("years", # the id your server needs to use the selected value
+label =h3("Years"),
+min =1900, max =2020, # maximum range that can be selected
+value =c(2010, 2020) # this is the default slider position
+ )
+ )
+ ),
+
+# Show the outputs from the server ---------------
+mainPanel(
+
+# Show a ridgeplot of explosivity index for selected volcanoes
+plotOutput("ridgePlot"),
+
+# then, show the table we made in the previous step
+tableOutput("erupt_table")
+
+ )
+)
+
+
Now our Shiny app knows where we want to place our plot.
+
+
+
Server
+
We now need to create the plot we want to show in our app. This plot will change depending on one or several reactive values that the user can input or select in our UI.
+
We link the UI and server together with IDs that are assigned to each object. Above, we told the UI to expect a plot output with the ID "ridgePlot". In the server, we will create a plot and render it as a plot object using renderPlot(), and we will assign this plot output to the ID we call in the UI (as output$ridgePlot).
+
+
# Define server logic required to make your output(s)
+server <-function(input, output) {
+
+
+# prepare the data
+# ----------------------------------------------------------
+
+# read the dataset
+ eruptions <- readr::read_rds(here::here("data", "eruptions.rds"))
+
+# filter the dataset to avoid overloading the plot
+ eruptions <- eruptions[which(eruptions$volcano_name %in%names(which(table(eruptions$volcano_name) >30))),]
+# this subsets to volcanoes that have erupted more than 30 times
+
+
+# make reactive dataset
+# ----------------------------------------------------------
+
+# subset volcano data with input year range
+ eruptions_filtered <-reactive({
+subset(eruptions, start_year >= input$years[1] & end_year <= input$years[2])
+ })
+
+
+# create and render the outputs
+# ----------------------------------------------------------
+
+# create the table of volcanoes
+ output$erupt_table <-renderTable({
+head(eruptions_filtered())
+ })
+
+# render the plot output
+ output$ridgePlot <-renderPlot({
+
+# create the plot
+ggplot(data =eruptions_filtered(),
+aes(x = vei,
+y = volcano_name,
+fill = volcano_name)) +
+# we are using a ridgeplot geom here, from the ggridges package
+geom_density_ridges( size = .5) +# line width
+
+# label the axes
+labs(x ="Volcano Explosivity Index", y ="") +
+
+# adjust the ggplot theme to make the plot "prettier"
+theme_classic() +
+theme(legend.position ="none",
+axis.text =element_text(size =12, face ="bold"),
+axis.title =element_text(size =14, face ="bold"))
+ })
+}
+
+
+
+
The Shiny app
+
Now, if we run the Shiny app, we have a plot above the table we made previously. They are positioned in this way because the plotOutput() comes before the tableOutput() in the UI.
+
+
# Run the application
+shinyApp(ui = ui, server = server)
+
+
+
+
+
+
Customising the theme
+
If you’d like to go one step further, you can also customize the appearance of your Shiny app using built-in themes, or creating your own themes.
+
+
Using built-in themes
+
There are several built-in themes in Shiny, which allow you to quickly change the appearance of your app. You can browse a gallery of available themes here, or test themes out interactively here.
+
Let’s try the darkly theme on our Shiny app. To do this, we will need the shinythemes package.
+
+
library(shinythemes)
+
+
We can change the theme of our previous app with one line of code:
+
+
# Define UI for application that makes a table and plots the Volcano Explosivity
+# Index for the most eruptive volcanoes within a selected range of years
+
+ui <-fluidPage(
+
+# Application title ----
+
+titlePanel("Exploring volcano explosivity"),
+
+# Input interface ----
+
+sidebarLayout(
+sidebarPanel(
+
+# Sidebar with a slider range input
+sliderInput("years", # the id your server needs to use the selected value
+label =h3("Years"),
+min =1900, max =2020, # maximum range that can be selected
+value =c(2010, 2020) # this is the default slider position
+ )
+ )
+ ),
+
+# Show the outputs from the server ---------------
+mainPanel(
+
+# Show a ridgeplot of explosivity index for selected volcanoes
+plotOutput("ridgePlot"),
+
+# then, show the table we made in the previous step
+tableOutput("erupt_table")
+
+ ),
+
+# Customize the theme ----------------------
+
+# Use the darkly theme
+theme = shinythemes::shinytheme("darkly")
+)
+
+
Now, if we run the app, it looks a little different:
+
+
+
+
Using a custom theme
+
You can also go beyond the built-in themes, and create your own custom theme with the fonts and colours of your choice. You can also apply this theme to the outputs rendered in the app, to bring all the visuals together for a more cohesive look.
+
+
Customizing a theme
+
To create a custom theme, we will be using the bs_theme() function from the bslib package.
+
+
library(bslib)
+
+
+
# Create a custom theme
+cute_theme <- bslib::bs_theme(
+
+bg ="#36393B", # background colour
+fg ="#FFD166", # most of the text on your app
+primary ="#F26430", # buttons, ...
+
+# you can also choose fonts
+base_font =font_google("Open Sans"),
+heading_font =font_google("Open Sans")
+)
+
+
To apply this theme to our Shiny app (and the outputs), we will be using the thematic package.
+
+
library(thematic)
+
+
There are two essential steps to apply a custom theme to a Shiny app:
+
+
Activating thematic.
+
Setting the user interface’s theme to the custom theme (cute_theme).
+
+
+
# Activate thematic
+# so your R outputs will be changed to match up with your chosen styling
+thematic::thematic_shiny()
+
+# Define UI for application that makes a table and plots the Volcano Explosivity
+# Index for the most eruptive volcanoes within a selected range of years
+
+ui <-fluidPage(
+
+# Application title ----
+
+titlePanel("Exploring volcano explosivity"),
+
+# Input interface ----
+
+sidebarLayout(
+sidebarPanel(
+
+# Sidebar with a slider range input
+sliderInput("years", # the id your server needs to use the selected value
+label =h3("Years"),
+min =1900, max =2020, # maximum range that can be selected
+value =c(2010, 2020) # this is the default slider position
+ )
+ )
+ ),
+
+# Show the outputs from the server ---------------
+mainPanel(
+
+# Show a ridgeplot of explosivity index for selected volcanoes
+plotOutput("ridgePlot"),
+
+# then, show the table we made in the previous step
+tableOutput("erupt_table")
+
+ ),
+
+# Customize the theme ----------------------
+
+# Use our custom theme
+theme = cute_theme
+)
+
+
Now, if we run the app, the user interface and plot theme is set to the colours and fonts we set in cute_theme:
+
+
Here, thematic is not changing the colours used to represent a variable in our plot, because this is an informative colour scale (unlike the colour of axis labels, lines, and the plot background). However, if we remove this colour variable in our ridgeplot in the server, thematic will change the plot colours as well. Here is a simplified example of our server to see what these changes would look like:
+
+
# Define server logic required to make your output(s)
+server <-function(input, output) {
+
+#... (all the good stuff we wrote above)
+
+# render the plot output
+ output$ridgePlot <-renderPlot({
+
+# create the plot
+ggplot(data =eruptions_filtered(),
+aes(x = vei,
+y = volcano_name)) +# we are no longer setting
+# the fill argument to a variable
+
+# we are using a ridgeplot geom here, from the ggridges package
+geom_density_ridges(size = .5) +
+
+# label the axes
+labs(x ="Volcano Explosivity Index", y ="") +
+
+# remove the "classic" ggplot2 so it doesn't override thematic's changes
+# theme_classic() +
+theme(legend.position ="none",
+axis.text =element_text(size =12, face ="bold"),
+axis.title =element_text(size =14, face ="bold"))
+ })
+ }
+
+
Now, our plot’s theme follows the app’s custom theme as well:
+
+
+
+
+
+
+
2 Constructing a Shiny app using shinyDashboards
+
+
Taking advantage of good defaults
+
Here, we will use the shiny extensions shinydashboard and leaflet to construct a custom Shiny app to map the volcanoes of the world. First, we need a few additional packages.
+
Note: All Source code for this app can be found here on the BIOS2 Github.
We will create our app using defaults from the shinydashboard package, whose layout always includes three main components: a header, using dashboardHeader(), a sidebar, using dashboardSidebar(), and a body, using dashboardBody(). These are then added together using the dashboardPage() function.
+
Building these elements is less like usual R coding, and more like web design, since we are, in fact, designing a user interface for a web app. Here, we’ll make a basic layout before populating it.
+
+
# create the header of our app
+header <-dashboardHeader(
+title ="Exploring Volcanoes of the World",
+titleWidth =350# since we have a long title, we need to extend width element in pixels
+)
+
+
+# create dashboard body - this is the major UI element
+body <-dashboardBody(
+
+# make first row of elements (actually, this will be the only row)
+fluidRow(
+
+# make first column, 25% of page - width = 3 of 12 columns
+column(width =3,
+
+
+# Box 1: text explaining what this app is
+#-----------------------------------------------
+box( width =NULL,
+status="primary", # this line can change the automatic color of the box.
+title =NULL,
+p("here, we'll include some info about this app")
+
+
+ ), # end box 1
+
+
+# box 2 : input for selecting volcano type
+#-----------------------------------------------
+box(width =NULL, status ="primary",
+title ="Selection Criteria", solidHeader = T,
+
+p("here, we'll add a UI element for selecting volcano types"),
+
+ ), # end box 2
+
+
+
+# box 3: ggplot of selected volcanoes by continent
+#------------------------------------------------
+box(width =NULL, status ="primary",
+solidHeader =TRUE, collapsible = T,
+title ="Volcanoes by Continent",
+p("here, we'll add a bar plot of volcanoes in each continent")
+ ) # end box 3
+
+ ), # end column 1
+
+# second column - 75% of page (9 of 12 columns)
+#--------------------------------------------------
+column(width =9,
+# Box 4: leaflet map
+box(width =NULL, background ="light-blue", height =850,
+p("here, we'll show volcanoes on a map"),
+ ) # end box with map
+ ) # end second column
+
+ ) # end fluidrow
+) # end body
+
+
+# add elements together
+dashboardPage(
+skin ="blue",
+header = header,
+sidebar =dashboardSidebar(disable =TRUE), # here, we only have one tab of our app, so we don't need a sidebar
+body = body
+)
+
+
+
+
+
+
Populating the Layout
+
Now, we are going to fill our app with elements. In this app, we will only have one user input: a selection of the volcano types to show. This input (input$volcano_type) will be used to filter the data in the server (i.e. make a smaller dataset containing only volcanoes of the selected types), and the filtered dataset will then be used to create the output elements (plots and maps).
+
Below, we show the necessary code to include in both the UI and the Server to create each plot element. Notice that after the reactive value selected_volcanoes is created in the selection box, this is the only object that is used to create the other elements in the app.
+
+
+
+
+
+
+
+
+
+
Box 1 – Intro Textbox
UI: Markdown/HTML text code
Server: (none)

Box 2 – Selection Widgets
UI: checkboxGroupButtons(inputId = "volcano_type")
Server: selected_volcanoes <- reactive({volcano_df %>% filter(type %in% input$volcano_type)}) to create a filtered dataset that will react to user input

Box 3 – Bar Graph
UI: plotOutput("continentplot")
Server: output$continentplot <- renderPlot(...), which will plot from the selected_volcanoes reactive object

Box 4 – Leaflet Map
UI: leafletOutput("volcanomap")
Server: output$volcanomap <- renderLeaflet(...), to map points from the selected_volcanoes reactive object
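A condensed sketch of how the pieces in this table connect in the server (volcano_df and its column names follow the table above and are assumptions; dplyr, ggplot2, and leaflet are assumed to be loaded in the set-up):
# Box 2: reactive subset driven by the checkbox input
selected_volcanoes <- reactive({
  volcano_df %>% dplyr::filter(type %in% input$volcano_type)
})

# Box 3: bar plot of the selected volcanoes by continent
output$continentplot <- renderPlot({
  ggplot(selected_volcanoes(), aes(x = continent, fill = type)) + geom_bar()
})

# Box 4: leaflet map of the selected volcanoes (longitude/latitude are assumed column names)
output$volcanomap <- leaflet::renderLeaflet({
  leaflet::leaflet(selected_volcanoes()) %>%
    leaflet::addTiles() %>%
    leaflet::addCircleMarkers(lng = ~longitude, lat = ~latitude)
})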
+
+
+
+
+
+
+
Challenge!
+
Use the code provided to add your own additional user input to the Shiny app. The code (which you can access here) leaves a space for an additional UI input inside box 2. Then, you’ll need to connect your new input element to the reactive value in the server, as noted in the server code.
Golem provides you with some very helpful “workflow” scripts:
+
Edit the Description by filling in this function in 01_dev.R
+
Then, add all the dependencies
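A hedged sketch of what those two steps look like in golem’s dev scripts – golem::fill_desc() fills in the DESCRIPTION fields and attachment::att_amend_desc() scans the code and adds the packages it finds as dependencies (all field values below are placeholders; check ?golem::fill_desc for the exact argument names in your golem version):
# Fill in the DESCRIPTION (placeholder values)
golem::fill_desc(
  pkg_name          = "volcanoapp",
  pkg_title         = "Explore Volcanoes of the World",
  pkg_description   = "A shinydashboard app for exploring volcano data.",
  author_first_name = "Your",
  author_last_name  = "Name",
  author_email      = "you@example.org"
)

# Add every package used in the code as a dependency in DESCRIPTION
attachment::att_amend_desc()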
+
Together, this edits the DESCRIPTION of your R package to look something like this:
+
Now we can start building the app.
+
+
app_ui <-function(request) {
+tagList(
+# Leave this function for adding external resources
+golem_add_external_resources(),
+# Your application UI logic
+ shinydashboard::dashboardPage(
+header = shinydashboard::dashboardHeader(
+title ="Exploring Volcanoes of the World",
+titleWidth =350# since we have a long title, we need to extend width element in pixels
+ ),
+sidebar = shinydashboard::dashboardSidebar(disable =TRUE), # here, we only have one tab, so we don't need a sidebar
+body = shinydashboard::dashboardBody(
+# make first row of elements (actually, this will be the only row)
+fluidRow(
+# make first column, 25% of page - width = 3 of 12 columns
+column(width =3,
+# box 1 : input for selecting volcano type
+#-----------------------------------------------
+ shinydashboard::box(width =NULL, status ="primary",
+title ="Selection Criteria", solidHeader = T
+
+## CHECKBOX HERE
+
+ ), # end box 1
+# box 2: ggplot of selected volcanoes by continent
+#------------------------------------------------
+ shinydashboard::box(width =NULL, status ="primary",
+solidHeader =TRUE, collapsible = T,
+title ="Volcanoes by Continent"
+
+## PLOT HERE
+
+ ) # end box 2
+ ), # end column 1
+
+# second column - 75% of page (9 of 12 columns)
+column(width =9,
+
+# Box 3: leaflet map
+ shinydashboard::box(width =NULL, background ="light-blue"
+
+## MAP HERE
+
+ ) # end box with map
+ ) # end second column
+ ) # end fluidrow
+ ) # end body
+ )
+ )
+}
+
+
This indicates where each of the three components of the app would go.
+
At this point we can run the app to get a very empty-looking UI:
+
+
+
Golem modules
+
We could split this app into three different sections, corresponding to each of the three boxes:
+
+
Filter the type of volcano we see
+
Take the filtered volcanoes and plot a stacked bar chart
+
Take the filtered volcanoes and plot a map
+
+
+
+
Selecting the volcanoes
+
+
#' volcano_select UI Function
+#'
+#' @description A shiny Module.
+#'
+#' @param id,input,output,session Internal parameters for {shiny}.
+#'
+#' @noRd
+#'
+#' @importFrom shiny NS tagList
+mod_volcano_select_ui <-function(id){
+ ns <-NS(id)
+tagList(
+# Widget specifying the species to be included on the plot
+ shinyWidgets::checkboxGroupButtons(
+inputId =ns("volcano_type"),
+label ="Volcano Type",
+choices =c("Stratovolcano" , "Shield" ,"Cone" , "Caldera" , "Volcanic Field",
+"Complex" , "Other", "Lava Dome" , "Submarine" ),
+checkIcon =list(
+yes = tags$i(class ="fa fa-check-square",
+style ="color: steelblue"),
+no = tags$i(class ="fa fa-square-o",
+style ="color: steelblue"))
+ ) # end checkboxGroupButtons
+ )
+}
+
+#' volcano_select Server Functions
+#'
+#' @noRd
+mod_volcano_select_server <-function(id, volcano){
+moduleServer( id, function(input, output, session){
+ ns <- session$ns
+
+# make reactive dataset
+# ------------------------------------------------
+# Make a subset of the data as a reactive value
+# this subset pulls volcano rows only in the selected types of volcano
+ selected_volcanoes <-reactive({
+
+req(input$volcano_type)
+
+ volcano %>%
+
+# select only volcanoes in the selected volcano type (by checkboxes in the UI)
+ dplyr::filter(volcano_type_consolidated %in% input$volcano_type) %>%
+# Space to add your suggested filter here!!
+# --- --- --- --- --- --- --- --- --- --- --- --- ---
+# filter() %>%
+# --- --- --- --- --- --- --- --- --- --- --- --- ---
+# change volcano type into factor (this makes plotting it more consistent)
+ dplyr::mutate(volcano_type_consolidated =factor(volcano_type_consolidated,
+levels =c("Stratovolcano" , "Shield", "Cone", "Caldera", "Volcanic Field",
+"Complex" , "Other" , "Lava Dome" , "Submarine" ) ) )
+ })
+
+ })
+}
+
+## To be copied in the UI
+# mod_volcano_select_ui("volcano_select_1")
+
+## To be copied in the server
+# mod_volcano_select_server("volcano_select_1")
+
+
I also like to test my modules by using them to create a toy Shiny app. The best place to do this is in a testthat directory – another great advantage of using a package workflow. You can set this up easily with usethis::use_test(): just run usethis::use_test() from the R console when you have the module open.
+
Then write a simple test like this
+
+
test_that("volcano selection module works", {
+
+ testthat::skip_if_not(interactive())
+
+
+ ui <-fluidPage(
+## To be copied in the UI
+mod_volcano_select_ui("volcano_select_1"),
+tableOutput("table")
+ )
+
+ server <-function(input, output) {
+## To be copied in the server
+ volcano_data <-readRDS("data/volcanoes.rds")
+ selected_data <-mod_volcano_select_server("volcano_select_1",
+volcano = volcano_data)
+
+ output$table <-renderTable(selected_data())
+ }
+
+shinyApp(ui = ui, server = server)
+
+})
+
+
Which generates the following simple app:
+
+
+
+
Barplot of continents
+
+
#' continentplot UI Function
+#'
+#' @description A shiny Module.
+#'
+#' @param id,input,output,session Internal parameters for {shiny}.
+#'
+#' @noRd
+#'
+#' @importFrom shiny NS tagList
+mod_continentplot_ui <-function(id){
+ ns <-NS(id)
+tagList(
+plotOutput(ns("barplot"), # this calls to object continentplot that is made in the server page
+height =350)
+ )
+}
+
+#' continentplot Server Functions
+#'
+#' @noRd
+mod_continentplot_server <-function(id, volcano, selected_volcanoes){
+
+# kind of helpful
+stopifnot(is.reactive(selected_volcanoes))
+
+moduleServer( id, function(input, output, session){
+ ns <- session$ns
+
+ output$barplot <-renderPlot({
+
+# create basic barplot
+ barplot <- ggplot2::ggplot(data = volcano,
+ ggplot2::aes(x=continent,
+fill = volcano_type_consolidated))+
+# update theme and axis labels:
+ ggplot2::theme_bw()+
+ ggplot2::theme(plot.background = ggplot2::element_rect(color="transparent",fill ="transparent"),
+panel.background = ggplot2::element_rect(color="transparent",fill="transparent"),
+panel.border = ggplot2::element_rect(color="transparent",fill="transparent"))+
+ ggplot2::labs(x=NULL, y=NULL, title =NULL) +
+ ggplot2::theme(axis.text.x = ggplot2::element_text(angle=45,hjust=1))
+
+
+# IF a selected_volcanoes() object exists, update the blank ggplot.
+# basically this makes it not mess up when nothing is selected
+
+ barplot <- barplot +
+ ggplot2::geom_bar(data =selected_volcanoes(), show.legend = F) +
+ ggplot2::scale_fill_manual(values = RColorBrewer::brewer.pal(9,"Set1"), drop=F) +
+ ggplot2::scale_x_discrete(drop=F)
+
+
+# print the plot
+ barplot
+
+ }) # end renderplot command
+
+
+ })
+}
+
+## To be copied in the UI
+# mod_continentplot_ui("continentplot_1")
+
+## To be copied in the server
+# mod_continentplot_server("continentplot_1")
+
+
The test:
+
+
test_that("volano barplot works", {
+
+ testthat::skip_if_not(interactive())
+
+
+ ui <-fluidPage(
+## To be copied in the UI
+mod_continentplot_ui("continentplot_1"),
+tableOutput("table")
+ )
+
+ server <-function(input, output) {
+## To be copied in the server
+ volcano_data <-readRDS("data/volcanoes.rds")
+ volcano_recent <-subset(volcano_data, last_eruption_year >2000)
+mod_continentplot_server("continentplot_1",
+volcano = volcano_data,
+selected_volcanoes =reactive(volcano_recent))
+ }
+
+shinyApp(ui = ui, server = server)
+})
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/posts/2021-07-19-glm-community-ecology/Gaussian.png b/docs/posts/2021-07-19-glm-community-ecology/Gaussian.png
new file mode 100755
index 0000000..8753544
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/Gaussian.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/InflatedApproach4thcorner.png b/docs/posts/2021-07-19-glm-community-ecology/InflatedApproach4thcorner.png
new file mode 100755
index 0000000..5655436
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/InflatedApproach4thcorner.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/UtilityFunctions.R b/docs/posts/2021-07-19-glm-community-ecology/UtilityFunctions.R
new file mode 100644
index 0000000..37d8a5e
--- /dev/null
+++ b/docs/posts/2021-07-19-glm-community-ecology/UtilityFunctions.R
@@ -0,0 +1,115 @@
+#########################################################
+## Utility function to compute the different correlations
+#########################################################
+library(ade4)
+TraitEnvCor <- function(L,E,T, Chessel = TRUE){
+
+ E<-as.matrix(E)
+ T<-as.matrix(T)
+ L<-as.matrix(L)
+ # centering_mat <- function(X,w){ X - rep(1,length(w))%*%t(w)%*% X }
+ standardize_w <- function(X,w){
+ ones <- rep(1,length(w))
+ Xc <- X - ones %*% t(w)%*% X
+ Xc / ones%*%sqrt(t(ones)%*%(Xc*Xc*w))
+ }
+
+ # check_L()
+ rows<-seq_len(nrow(L))
+ cols<-seq_len(ncol(L))
+ rni <-which(rowSums(L)==0)
+ repeat {
+ if (length(rni)) {L <- L[-rni,,drop = FALSE]; rows <-rows[-rni]}
+ ksi <- which(colSums(L)==0)
+ if (length(ksi)) {L <- L[,-ksi, drop = FALSE]; cols <- cols[-ksi]}
+ rni <-which(rowSums(L)==0)
+ if ( length(rni)==0 & length(ksi)==0){break}
+ }
+ E <-E[rows,,drop = FALSE]
+ T <-T[cols,,drop = FALSE]
+ # end check_L()
+
+ L<-L/sum(L)
+ # dimensions
+ #S <- ncol(L) # number of species
+ #n <- nrow(L) # number of communities
+ p <- ncol(E) # number of environmental predictors
+ q <- ncol(T) # number of traits
+
+ # setting up matrices
+ Wn <- rowSums(L)
+ Ws <- colSums(L)
+ # cor matrices are trait by environment
+ CWM <- L%*%T/Wn # weighted means wrt to T
+ CWM.cor <- cor(CWM,E)
+
+ SNC <- t(L)%*%E/Ws # weighted means wrt to E
+ SNC.cor <- cor(T,SNC)
+
+ CWMstd_w <- standardize_w(CWM,Wn)
+ Estd_w <- standardize_w(E,Wn)
+ wCWM.cor <- t(t(Estd_w)%*%(CWMstd_w*Wn))
+
+ SNCstd_w <- standardize_w(SNC,Ws)
+ Tstd_w <- standardize_w(T,Ws)
+ wSNC.cor <- t(Tstd_w)%*%(SNCstd_w*Ws)
+
+ # Fourth corner calculated as W_n weighted covariance between
+ # CWM and standardized T (trait)
+
+ CWM_std_tw <- L%*%Tstd_w/Wn #CWM wrt to standardized T (trait)
+ Fourthcorner <- t(CWM_std_tw)%*%(Estd_w*Wn)
+ if (Chessel){
+ singular_val1 <- sqrt(ade4::dudi.coa(L, scannf = FALSE)$eig[1])
+ Chessel.4thcor<-Fourthcorner/ singular_val1
+ }else { Chessel.4thcor<-NA;singular_val1 <-1}
+
+
+ # variation components
+ # Among communities
+ Among.Variation <- sum(diag(t(CWM_std_tw)%*%(CWM_std_tw* Wn)))
+ # Within communities
+ Within.Variation <- 1 - Among.Variation
+
+ # result specialized to one trait and one environment variables; use array(0, dim(6,k,p)) in the general case
+ # array.result<-matrix(c(CWM.cor,wCWM.cor,SNC.cor,wSNC.cor,Fourthcorner,Chessel.4thcor,Mean.Variation),ncol=1)
+ array.result<-array(0, dim=c(8,q,p))
+ rownames(array.result)<- c("CWM.cor","wCWM.cor","SNC.cor","wSNC.cor","Fourthcorner","Chessel.4thcor","Among Wn-variance (%)", "Within Wn-variance (%)")
+ array.result[1,,]<-CWM.cor
+ array.result[2,,]<-wCWM.cor
+ array.result[3,,]<-SNC.cor
+ array.result[4,,]<-wSNC.cor
+ array.result[5,,]<-Fourthcorner
+ array.result[6,,]<-Chessel.4thcor
+ array.result[7,,]<-Among.Variation * 100
+ array.result[8,,]<-Within.Variation * 100
+ return(array.result[,,])
+}
+
+###################################################################
+## Utility function for the row, column, and row-column permutation schemes
+## for a single trait and a single environmental variable
+## and the five test statistics/approaches of the paper
+## CWM.cor, wCWM.cor, SNC.cor wSNC.cor and Fourthcorner
+###################################################################
+
+CorPermutationTest <- function(L, E, T, nrepet = 999){
+ E<-as.matrix(E)
+ T<-as.matrix(T)
+ L<-as.matrix(L)
+ obs <- TraitEnvCor(L,E,T)[1:5]
+ sim.row <- matrix(0, nrow = nrepet, ncol = ncol(E) * 5)
+ sim.col <- matrix(0, nrow = nrepet, ncol = ncol(E) * 5)
+ for(i in 1:nrepet){
+ per.row <- sample(nrow(L))
+ per.col <- sample(ncol(L))
+ sim.row[i, ] <- c(as.matrix(data.frame(TraitEnvCor(L,E[per.row,,drop= FALSE],T))))[1:5]
+ sim.col[i, ] <- c(as.matrix(data.frame(TraitEnvCor(L,E,T[per.col,,drop= FALSE]))))[1:5]
+ }
+ pval.row <- (rowSums(apply(sim.row^2, 1, function(i) i >= obs^2)) + 1) / (nrepet + 1)
+ pval.col <- (rowSums(apply(sim.col^2, 1, function(i) i >= obs^2)) + 1) / (nrepet + 1)
+
+ result <- cbind(cor = obs, prow = pval.row, pcol = pval.col, pmax = apply(cbind(pval.row, pval.col), 1, max))
+ return(result)
+}
+
diff --git a/docs/posts/2021-07-19-glm-community-ecology/aravo.png b/docs/posts/2021-07-19-glm-community-ecology/aravo.png
new file mode 100755
index 0000000..615814d
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/aravo.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/bilinear.png b/docs/posts/2021-07-19-glm-community-ecology/bilinear.png
new file mode 100755
index 0000000..4e0af74
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/bilinear.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/image.jpg b/docs/posts/2021-07-19-glm-community-ecology/image.jpg
new file mode 100755
index 0000000..615814d
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/image.jpg differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index.html b/docs/posts/2021-07-19-glm-community-ecology/index.html
new file mode 100644
index 0000000..2881e5a
--- /dev/null
+++ b/docs/posts/2021-07-19-glm-community-ecology/index.html
@@ -0,0 +1,3114 @@
+
+
+
+
+
+
+
+
+
+
+
+
+BIOS2 Education resources - Generalized Linear Models for Community Ecology
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Generalized Linear Models for Community Ecology
+
+
+ In this workshop we will explore, discuss, and apply generalized linear models to combine information on species distributions, traits, phylogenies, environmental and landscape variation. We will also discuss inference under spatial and phylogenetic autocorrelation under fixed and random effects implementations. We will discuss technical elements and cover implementations using R.
+
+
+
+
Technical
+
EN
+
+
+
+
+
+
+
+
+
Author
+
+
Pedro Peres-Neto
+
+
+
+
+
Published
+
+
May 17, 2021
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Generalized Linear Model for Community Ecology
+Pedro Peres-Neto, Concordia University
+BIOS2 workshop, May 17 to 21, 2021
+This document was put together for the first time for this workshop.
+Let me know if you have suggestions or find any issues in the document.
+
+
+
+
Tentative schedule
+
+
Day 1:
+Introduction to types of data and approaches using GLMs in community ecology.
+Types of patterns in species distributions involving trait and environmental variation.
+Simulating data as a path to understand GLMs in community ecology.
+The simplest GLM: widely used bivariate correlations.
+The challenges of statistical inference regarding linking different types of information from communities and species.
+Understanding estimators and their properties in GLMs.
+
Day 2:
+From bivariate correlations to a variety of more complex GLMs: the case of Binomial and Poisson.
+The role of latent variables in specifying GLMs for community ecology.
+The issues underlying autocorrelation in ecological data: the cases of spatial and phylogenetic autocorrelation.
+Simple GLMM approaches (Generalized Linear Mixed Models).
+
Day 3:
+More complex GLMM approaches.
+Potential approaches for incorporating intraspecific data on traits.
+Discussion with participants: your research interests, your questions or your data (or anything really).
+
+
+Philosophy: We can’t cover everything in great detail. I’ve chosen a level that should be interesting and that covers many important aspects of GLMs applied to community ecology.
+
Note: I mostly apply here base functions so that participants without strong knowledge of certain packages (e.g., ggplot, dplyr) can follow the code more easily.
+
+Questions: Participants should feel free to ask questions either directly or in the zoom chat. I’ve also set up a Google Doc where participants can put questions during the week when we are not connected. I’ll read them and try to provide an answer or cover the question somehow.
One way to develop good intuition underlying quantitative methods is to be able to simulate data according to certain desired characteristics. We can then apply methods (GLMs here) to see how well they retrieve the data characteristics.
+
Let’s start with a very simple GLM, the logistic regression for one single species. Here, for simplicity, we considered one predictor. In many ecological simulations, this single predictor is considered an “environmental gradient” containing many environmental predictors. We can consider more gradients and we will discuss that later on in the workshop.
+
+
set.seed(100) # so that we all have the same results
+n.sites <-100
+X <-rnorm(n.sites)
+b0 <-0.5# controls the max prob. values
+prob.presence <-1./(1+exp(-(b0+3*X)))
+plot(prob.presence ~ X)
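+The rendered page does not show the chunk that draws presence-absence values from these probabilities and fits the logistic GLM plotted next; a minimal sketch (the object names Distribution and model are assumed from the later code) would be:
+Distribution <- rbinom(n.sites, 1, prob.presence)   # Bernoulli (presence-absence) draws
+model <- glm(Distribution ~ X, family = binomial(link = logit))
+coefficients(model)   # should recover values close to b0 = 0.5 and the slope of 3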
Plotting the predicted versus the observed presence-absence values:
+
+
plot(model$fitted.values ~ Distribution)
+
+
+
+
+
At this point, we won’t cover model diagnostics. Data were simulated according to the model and, as such, assumptions hold well. Plus, this is a single-species model; and this workshop is about community data, i.e., multiple species :).
Species don’t tend to respond linearly to environmental features:
+
+
+
+
+
+
Now that we understand some basics of presence-absence data, let’s concentrate on more realistic species distribution data and multi-species data. There are many ways (found in the ecological literature) in which we can simulate this type of data. Below we will generate presence-absence data using a standard Gaussian model according to a trait and an environmental feature (we will cover abundance data later on). This is a commonly used way to simulate data. Let’s start with a single species and one environmental variable.
+
+
set.seed(100) # so that we all have the same results
+n.sites <-100
+X <-rnorm(n.sites)
+optimum <-0.2
+niche.breadth <-0.5
+b0 <-1# controls the max prob. values
+b1 <--2
+# this is a logistic model:
+prob.presence <-1./(1+exp(-(b0+(b1*(X-optimum)^2)/(2*niche.breadth^2))))
+
+
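+The chunk that draws the presence-absence vector from these probabilities is not shown in the rendered page; presumably it was simply:
+Distribution <- rbinom(n.sites, 1, prob.presence)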
This more “complex” model has the following form:
+
+\[p=\frac{1}{1+e^{-\left(\beta_0+\beta_1\frac{(X-\mu)^2}{2\sigma^2}\right)}}=\frac{1}{1+e^{-\left(1-2\frac{(X-0.2)^2}{2\cdot0.5^2}\right)}}\] where \(\mu\) represents the species optimum and \(\sigma\) its niche breadth.
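+To see where the estimators below come from (a sketch assuming the ter Braak and Looman 1986 parameterisation, in which the coefficient of the squared term is \(-\tfrac{1}{2\sigma^2}\)), expand the quadratic part of the linear predictor:
+\[\beta_0-\frac{(X-\mu)^2}{2\sigma^2}=\left(\beta_0-\frac{\mu^2}{2\sigma^2}\right)+\frac{\mu}{\sigma^2}X-\frac{1}{2\sigma^2}X^2=b_0+b_1X+b_2X^2\]
+so that \(\mu=-b_1/(2b_2)\) and \(\sigma=1/\sqrt{-2b_2}\), which are the expressions used in the glm code below.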
The parameters can be then estimated from the data using a logistic regression:
+
+
predictor <-cbind(X,X^2)
+model <-glm(Distribution ~ predictor,family=binomial(link=logit))
+coeffs <-coefficients(model)
+b0 <- coeffs[1]
+b1 <- coeffs[2]
+b2 <- coeffs[3]
+estimated.optimum <--b1/(2*b2) # as in ter Braak and Looman 1986
+estimated.niche.breadth <-1/sqrt(-2*b2)
+c(estimated.optimum,estimated.niche.breadth) # estimated by the glm
+
+
predictorX predictor
+ 0.3317377 0.3922625
+
+
c(optimum,niche.breadth) # set in our simulations above
+
+
[1] 0.2 0.5
+
+
+
+We can demonstrate computationally that the parameter estimates are unbiased, as they are maximum likelihood estimates obtained via the GLM. Here we will show the sampling variation only for the niche optimum and breadth. The other two parameters, \(\beta_0\) and \(\beta_1\), can be added to the code below as well, demonstrating that they are also unbiased.
+
+
n.samples <-1000
+estimation.matrix <-matrix(0,n.samples,2)
+colnames(estimation.matrix) <-c("optimum","niche.breadth")
+# remember that we already set the parameters for the model above, i.e., optimum and niche breadth
+for (i in 1:n.samples){
+ X <-rnorm(n.sites)
+ prob.presence <-1./(1+exp((-b0+((X-optimum)^2)/(2*niche.breadth^2)))) # this is a logistic model
+ Distribution <-rbinom(n.sites,1,prob.presence)
+ predictor <-cbind(X,X^2)
+ model <-glm(Distribution ~ predictor,family=binomial(link=logit))
+ coeffs <-coefficients(model)
+ intercept <- coeffs[1]
+ b1 <- coeffs[2]
+ b2 <- coeffs[3]
+ estimation.matrix[i,"optimum"] <--b1/(2*b2)
+ estimation.matrix[i,"niche.breadth"] <-1/sqrt(-2*b2)
+}
+
+
There may be warnings “glm.fit: fitted probabilities numerically 0 or 1 occurred”. That is not a problem per se; it just tells us that for some simulated data sets the fitted model separates the presences and absences perfectly. By the way, that also happens with real data.
+
Let’s observe the random variation around the parameter estimates and the average values:
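+The summary chunk is not visible in the rendered page; presumably it was something like:
+colMeans(estimation.matrix)   # average of the estimates across the 1000 samples
+boxplot(estimation.matrix)    # sampling variation around the true values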
c(optimum,niche.breadth) # set in our simulations above
+
+
[1] 0.2 0.5
+
+
+
Note how the mean values are pretty close to the true values used to generate the data. This small simulation helps one understand the principles of sampling variation and unbiased estimation.
+
Simulating multiple species
+
Now, let’s generalize our code to multiple species. We will create a function that allows us to make sure that all sites have at least one species present and all species are present in at least one site; this is a common (but not necessary) characteristic of data used in community ecology.
+
+
generate_communities <-function(tolerance,E,T,n.species,n.communities){
+repeat {
+# generates variation in niche breadth across species
+ niche.breadth <-runif(n.species)*tolerance
+ b0 <-runif(n.species,min=-4,max=4)
+ prob.presence <-matrix(data=0,nrow=n.communities,ncol=n.species)
+ Dist.matrix <-matrix(data=0,nrow=n.communities,ncol=n.species)
+for(j in 1:n.species){
+# species optima are trait values; which makes sense ecologically
+ prob.presence[,j] <-1./(1+exp((-b0[j]+((E-T[j])^2)/(2*niche.breadth[j]^2))))
+ Dist.matrix[,j] <-rbinom(n.communities,1,prob.presence[,j])
+ }
+ n_species_c <-sum(colSums(Dist.matrix)!=0) # _c for check
+ n_communities_c <-sum(rowSums(Dist.matrix)!=0)
+if ((n_species_c == n.species) & (n_communities_c==n.communities)){break}
+ }
+ result <-list(Dist.matrix=Dist.matrix,prob.presence=prob.presence)
+return(result)
+}
+
+
Now let’s generate a community. Note that we are using one environmental gradient and one trait. We could consider more variables (traits or environmental features) by adding terms to the logistic equation above, but for the time being that will suffice. Note, though, that in many ecological simulations this single predictor is considered an “environmental gradient” containing many environmental predictors. We can consider more gradients and we will discuss that later on in the workshop.
+
+
set.seed(12351) # so that we all have the same results
+n.communities <-100
+n.species <-50
+E <-rnorm(n.communities)
+T <-rnorm(n.species)
+Dist <-generate_communities(tolerance =1.5, E, T, n.species, n.communities)
+Probs <- Dist$prob.presence
+Distribution <- Dist$Dist.matrix
+
+
Let’s plot the probability values across environmental values. To do that nicely, we need to order the communities according to their environmental values as follows:
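+The plotting chunk is not shown in the rendered page; a minimal version (assuming base graphics, as elsewhere in the document) would be:
+ord <- order(E)
+matplot(E[ord], Probs[ord, ], type = "l", lty = 1,
+        xlab = "Environment (E)", ylab = "Probability of presence")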
+Classic and simple GLMs for one trait and one environment (bivariate correlations)
+
+
Now that we have a basic understanding of one GLM (the logistic) and of how GLMs can model different types of community ecology data (species distributions, traits and environmental variation), we can start looking into approaches that are used by ecologists to estimate the importance of environmental and trait variation for species distributions.
+
+Let’s start by calculating the simplest and most widely used metric, the community weighted trait mean (CWM):
+
+
CWM <- Distribution %*% T /rowSums(Distribution)
+
+
For data that are based on presence-absence, this is simply the average of species trait values within communities.
+
Let’s now correlate CWM with the environment, i.e., community weighted means correlation. This is a widely used approach by ecologists:
+
+
plot(CWM ~ E)
+
+
+
+
cor(CWM,E)
+
+
[,1]
+[1,] 0.9296338
+
+
+
Again, the community weighted means correlation is likely the most commonly used approach with 1000s of studies having been published with it.
+
+Another approach is to calculate the species weighted environmental means and correlate them with trait values. This approach is less common (but still quite used in the ecological literature) and is sometimes referred to as the species niche centroid (SNC):
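+The SNC chunk is omitted from the rendered page; following the same logic used for CWM (and the TraitEnvCor utility function above), it is presumably:
+SNC <- t(Distribution) %*% E / colSums(Distribution)   # species niche centroids
+cor(T, SNC)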
Note that the two correlations (CWM- and SNC-based) differ. That’s odd, as they were both calculated on exactly the same information: the same species matrix, environment and trait. Peres-Neto, Dray & ter Braak (2017) demonstrated (mathematically) that this issue is related to the fact that, although CWM and SNC are based on weights, they are not standardized and correlated using the proper weights. CWM is based on averages weighted by the community totals (richness or total abundance per community) and SNC is based on averages weighted by the species totals (the number of communities in which a species is present, i.e., prevalence, or its total abundance).
+
+Because of that, we have found (Peres-Neto et al. 2017) a few undesirable properties of these two correlations (CWM- and SNC-based). One is that, when the correlation is expected to be zero, the sampling variation of these correlations is quite large (i.e., low precision). Let’s evaluate this issue: below we generate data with structure, but then use a false trait completely independent of the original one, thus destroying the link between trait and environmental variation.
+
+
set.seed(120) # so that we all have the same results
+n.communities <-100
+n.species <-50
+n.samples <-100# set to larger later
+
+CWM.cor <- numeric(n.samples)
+for (i in 1:n.samples){
+ E <-rnorm(n.communities)
+ T <-rnorm(n.species)
+ Dist <-generate_communities(tolerance =1.5, E, T, n.species, n.communities)
+ T.false <-rnorm(n.species)
+ CWM <- Dist$Dist.matrix %*% T.false /rowSums(Dist$Dist.matrix)
+ CWM.cor[i] <-cor(CWM,E)
+}
Note that the variation is quite large for correlations based on a random trait (i.e., T.false). For example, some correlations were greater than 0.7 and smaller than -0.60. We showed that although the variation is quite large, the expected value is zero; we would need 10000 or more simulations to make mean(CWM.cor) approach zero. A similar issue (i.e., large variation, low precision) happens for correlations based on SNC, but we won’t simulate it here for brevity. One can easily adapt the code above to do so though.
+
+We have shown that precision is much increased when using the 4th corner statistic. The statistic was originally described in matrix form by Legendre et al. (1997); we (Peres-Neto et al. 2017) demonstrated that the 4th corner statistic is a GLM assuming an identity link, i.e., normally distributed residuals.
+
+The basis of the 4th corner correlation is that it starts with a weighted standardization of the trait (weighted by the species sums, i.e., total abundances or the number of sites occupied for presence-absence data) and of the environment (weighted by the community sums). The default standardization (function scale) transforms a variable (trait or environment) so that its mean and standard deviation are 0 and 1, respectively. A weighted standardization makes the weighted mean and weighted standard deviation 0 and 1, respectively.
+
+R doesn’t have a default function for weighted standardization, but this can be done using the following function:
+
+
standardize_w <-function(X,w){
+ ones <-rep(1,length(w))
+ Xc <- X - ones %*%t(w)%*% X
+ Xc / ones%*%sqrt(t(ones)%*%(Xc*Xc*w))
+}
+
+
Let’s get back to the original data used to calculate CWM and SNC based correlations:
+
+
set.seed(12351) # so that we all have the same results
+n.communities <-100
+n.species <-50
+E <-rnorm(n.communities)
+T <-rnorm(n.species)
+Dist <-generate_communities(tolerance =1.5, E, T, n.species, n.communities)
+Distribution <- Dist$Dist.matrix
+
+
We then standardize environment and trait by their respective abundance sums (rows for environment & columns for trait).
+
+
# make distribution matrix relative to its total sum; it makes calculations easier
+Dist.rel <- Distribution/sum(Distribution)
+Wn <-rowSums(Dist.rel)
+Ws <-colSums(Dist.rel)
+E.std_w <-standardize_w(E,Wn)
+T.std_w <-standardize_w(T,Ws)
+
+
Note: In the future, include here the calculation of the weighted mean and standard deviation of E.std_w and T.std_w to show that they are zero and one, respectively (when weighted).
+
We then calculate the community average trait (weighted standardized) or the species niche centroid (weighted standardized):
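+The chunk computing the weighted averages and the 4th corner correlation is not shown in the rendered page; based on the TraitEnvCor utility function above, it is presumably:
+CWM.w <- Dist.rel %*% T.std_w / Wn            # CWM of the weighted-standardized trait
+fourth.CWM <- t(CWM.w) %*% (E.std_w * Wn)     # 4th corner via CWM
+SNC.w <- t(Dist.rel) %*% E.std_w / Ws         # SNC of the weighted-standardized environment
+fourth.SNC <- t(T.std_w) %*% (SNC.w * Ws)     # 4th corner via SNC (same value)
+c(fourth.CWM, fourth.SNC)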
Note that regardless of whether SNC or CWM is used, the 4th corner correlation gives the same result, which makes sense mathematically as both correlations use the exact same information. The reason (again) that the standard CWM and SNC correlation approaches differ is that they don’t use the appropriate weights in their standardization and weighted correlation.
+
Another issue to notice is that the 4th corner values are smaller than their standard CWM values. Whereas the CWM correlation was 0.9296, the 4th corner was 0.2814. The issue here is that the CWM correlation refers only to the trait variation among communities (trait beta-diversity), whereas the 4th corner refers to the total variation in traits (within, i.e., trait alpha diversity, and among communities, i.e., trait beta-diversity). This was demonstrated by algebraic proofs in Peres-Neto et al. (2017) but we won’t get into these details here.
+
+This does bring up an interesting point for the analysis of traits in a community ecology context. The relative trait variation among communities (i.e., trait beta-diversity) and within communities (i.e., trait alpha-diversity) can be estimated as follows:
+
+
# Among communities
+Among.Variation <-sum(diag(t(CWM.w)%*%(CWM.w* Wn))) *100
+# Within communities
+Within.Variation <-100- Among.Variation
+c(Among.Variation,Within.Variation)
+
+
[1] 9.460287 90.539713
+
+
+
The standard CWM correlation is high because it pertains to only 9.46% of the total variation, whereas the 4th corner correlation pertains to all the variation, i.e., both within and among communities. As the among-communities component becomes larger, the two correlations become more similar.
+
Let’s now investigate the sampling properties of the 4th corner correlation as we did above for the CWM correlation, i.e., when the trait-environment correlation is expected to be zero:
+
+
set.seed(120) # so that we all have the same results and the same communities and traits are generated as before
+n.communities <-100
+n.species <-50
+n.samples <-100# set to larger later
+
+CWM.4th.cor <- numeric(n.samples)
+for (i in 1:n.samples){
+ E <-rnorm(n.communities)
+ T <-rnorm(n.species)
+ Dist <-generate_communities(tolerance =1.5, E, T, n.species, n.communities)
+ T.false <-rnorm(n.species) # destroys the original generated relationship
+ Dist.rel <- Dist$Dist.matrix/sum(Dist$Dist.matrix)
+ Wn <-rowSums(Dist.rel)
+ Ws <-colSums(Dist.rel)
+ E.std_w <-standardize_w(E,Wn)
+ T.std_w.false <-standardize_w(T.false,Ws)
+ CWM.w.false <- Dist.rel %*% T.std_w.false / Wn
+ # store the 4th corner correlation for this replicate
+ CWM.4th.cor[i] <- t(CWM.w.false) %*% (E.std_w*Wn)
+}
+
+
Let’s compare the two statistics:
+
+
boxplot(cbind(CWM.cor,CWM.4th.cor))
+
+
+
+
+
Note how the 4th corner correlation is a much more precise predictor around the true value of zero.
+
+
+
+Statistical hypothesis testing
+
+
We have known for a while that the bivariate correlations discussed so far have elevated type I error rates under parametric testing and under certain permutation schemes (Dray and Legendre 2008; Dray et al. 2014). That means that the statistical null hypothesis of no link between trait and environment is rejected more often than the preset alpha level (significance level, e.g., 0.05 or 0.01). More recently, this was also established for more complex models (more on this later). Resolving these issues is challenging and remains a very active field of research.
+
The code so far has helped to build some intuition underlying the different bivariate correlations. We will now use a more complete utility function that allows calculating these different metrics using one single function. This function is part of Peres-Neto et al. (2017).
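+The chunk applying the utility functions is not shown in the rendered page. A minimal sketch of what was presumably run (the object T.false and the permutation scheme below are assumptions, mirroring the earlier simulations) is:
+source("UtilityFunctions.R")
+T.false <- rnorm(n.species)                 # a trait unrelated to the species distributions
+TraitEnvCor(Distribution, E, T.false)       # the six statistics plus the variance components
+# simple row (community) and column (species) permutation tests of the CWM correlation
+obs <- TraitEnvCor(Distribution, E, T.false)["CWM.cor"]
+nrepet <- 999
+sim.row <- replicate(nrepet, TraitEnvCor(Distribution, sample(E), T.false)["CWM.cor"])
+sim.col <- replicate(nrepet, TraitEnvCor(Distribution, E, sample(T.false))["CWM.cor"])
+c(p.row = (sum(sim.row^2 >= obs^2) + 1)/(nrepet + 1),
+  p.col = (sum(sim.col^2 >= obs^2) + 1)/(nrepet + 1))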
As we can see, although only the environmental feature was important (not the trait), the row permutation (across communities) detected the relationship as significant. Note, however, that the permutation across species did not. ter Braak et al. (2012) determined that taking the maximum of the two p-values (row- and column-based) assures an appropriate type I error rate at the expected alpha.
+
The function above allows understanding the permutation procedures. That said, the utility function file has a more complete function:
+
+
set.seed(125)
+CorPermutationTest(Distribution, E, T.false, nrepet =99)
I hope by now you are convinced that the 4th corner is a more robust metric of bivariate correlation (one trait and one environment). Here we will use the Aravo alpine plant community data set (Massif du Grand Galibier, France; Choler 2005) contained in the package ade4. We provide more explanation on the data in Dray et al. (2012), and here we will replicate the analysis in that paper. The data contain abundances for 82 species distributed across 75 sites. Sites are described by 6 environmental variables: mean snowmelt date over the period 1997–1999, slope inclination, aspect, an index of microscale landform, an index of physical disturbance due to cryoturbation and solifluction, and an index of zoogenic disturbance due to trampling and burrowing activities of the Alpine marmot. All variables are quantitative except the landform and zoogenic disturbance indexes, which are categorical variables with five and three categories, respectively. Eight quantitative functional traits (i.e., vegetative height, lateral spread, leaf elevation angle, leaf area, leaf thickness, specific leaf area, mass-based leaf nitrogen content, and seed mass) were measured on the 82 most abundant plant species (out of a total of 132 recorded species).
+
+
+
+
+
+
Load the package and the data:
+
+
# install.packages("ade4") in case you don't have it installed
+library(ade4)
+data(aravo)
+dim(aravo$spe)
+
+
[1] 75 82
+
+
dim(aravo$env)
+
+
[1] 75 6
+
+
dim(aravo$trait)
+
+
[1] 82 8
+
+
+
Let’s estimate the 4th corner correlations between each trait and each environmental variable. nrepet is the number of permutations and should be set to a reasonably high number (say 9999); here we will use 999 to speed up calculations. Note that all permutation tests (not only the ones for the 4th corner) include the observed correlation as part of the null (permuted) distribution; hence nrepet = 999 corresponds to 1000 possible permutations (the observed correlation is one possible permutation if we could run the test an infinite number of times, so we include it by default). modeltype is set to 6, which takes the largest p-value between model 2 permutations (entire communities in the distribution matrix) and model 4 permutations (entire species in the distribution matrix); this p-max procedure is detailed in ter Braak et al. (2012). Finally, p-values are adjusted for multiple testing using the false discovery rate.
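+The fourthcorner call itself is not visible in the rendered page; following the ade4 documentation (Dray et al. 2012), it was presumably along these lines (the object name four.comb.aravo.adj is taken from the plotting call below):
+four.comb.aravo <- fourthcorner(aravo$env, aravo$spe, aravo$traits,
+                                modeltype = 6, nrepet = 999,
+                                p.adjust.method.G = "none", p.adjust.method.D = "none")
+four.comb.aravo.adj <- p.adjust.4thcorner(four.comb.aravo,
+                                          p.adjust.method.G = "fdr",
+                                          p.adjust.method.D = "fdr")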
The ‘classic’ table of results can be produced as follows. D2 indicates that the 4th corner correlation is to be used between the quantitative variable and each category of the qualitative variables. Other bivariate metrics of 4th corner association are also described in Dray and Legendre (2008) for qualitative-quantitative associations. In the default plot, blue cells correspond to negative significant relationships while red cells correspond to positive significant relationships (this can be modified using the argument col in the function fourthcorner).
+
+
plot(four.comb.aravo.adj, alpha =0.05, stat ="D2")
+
+
+
+
+
+
+
+Stacking species information as a way to understand how to build more complex GLMs
+
+
Although widely used (1000s of studies have been published using them), bivariate correlations are the simplest forms of GLMs for community data. That said, the 4th corner correlation can be calculated in a way that (hopefully) helps us understand how more complex GLMs can be produced: it helps us understand species stacking. Perhaps I should have considered this presentation before the calculations based on weights and standardizations (I will invert the order in the next version of the workshop). Let’s build a small data set so that we understand this principle. Consider a very artificial distribution matrix with 4 communities and 4 species; it was made artificial so that we can understand its structure well:
+
+
Distribution <-as.matrix(rbind(c(1,1,0,0),c(1,0,0,0),c(0,0,1,1),c(0,0,1,0)))
+Distribution
Let’s create some traits and environmental features:
+
+
T <-c(1,2,5,8)
+E <-c(10,12,100,112)
+
+
Now let’s calculate its 4th corner correlation:
+
+
TraitEnvCor(Distribution,E,T)["Fourthcorner"]
+
+
Fourthcorner
+ 0.8902989
+
+
+
The 4th corner correlation is pretty high given the highly structured data. Another way to calculate a 4th corner correlation is by using what we refer to as an “inflated approach”. This approach allows understanding the structure of stacked information. This figure demonstrates the process and calculation:
+
+
+
+
+
+
Next we stack species distributions, environment and trait information:
+
+
n.species <-ncol(Distribution)
+n.sites <-nrow(Distribution)
+Dist.stacked <-as.vector(Distribution)
+E.stacked <-rep(1, n.species) %x% E
+T.stacked <- T %x%rep(1, n.sites)
+
+
View(cbind(Dist.stacked,E.stacked,T.stacked))
+
We then eliminate the cells for which the distribution is zero and calculate the correlation:
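+The chunk is not shown in the rendered page; presumably it simply drops the zero cells and correlates the stacked vectors (which, for presence-absence data, reproduces the 4th corner value of 0.89 obtained above):
+present <- Dist.stacked > 0
+cor(E.stacked[present], T.stacked[present])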
Note: perhaps I should have started with this explanation and then moved to the more complicated way of using weights. The inflated approach is in fact a way to see how the weights are assigned.
+
+
+
+Simulating abundance data and understanding link functions in GLMs
+
+
Simple model
+
Although community ecologists commonly work with presence-absence data, abundance data are also commonly used in many approaches. Here we will use a Poisson model to simulate community data involving species distributions, traits and environment. Although link functions are used in all families of GLMs (poisson, binomial, negative binomial, gamma, etc), we will try to provide an explanation here of what they mean using abundance data. But the same rationale will apply to all families.
+
+
set.seed(100) # so that we all have the same results
+n.sites <-100
+X <-rnorm(n.sites)
+b0 <-0.05
+b1 <-2
+Y <-exp(b0 + b1*X)
+Abundance <-rpois(n.sites,Y)
+plot(Abundance ~ X)
+
+
+
+
+
This Poisson model has the following form. Note that multiple predictors can be considered. Here, for simplicity, we considered one predictor. Again, in many ecological simulations, this single predictor is considered an “environmental gradient” containing many environmental predictors. We can consider more gradients and we will discuss that later on in the workshop.
+
+\[Y=e^{\beta_0+\beta_1X_1}=e^{0.05+2X_1}, \qquad \log(Y)=0.05+2X_1\] The model can be estimated as:
+
+
model <-glm(Abundance ~ X, family="poisson")
+coefficients(model)
+
+
(Intercept) X
+ 0.03095719 2.02323099
+
+
plot(model$fitted.values ~ Abundance)
+
+
+
+
+
GLMs are linear models because they use link functions to map a non-linear relationship onto a linear one. The link function connects the predictors in a model with the expected (mean) value of the response variable. In other words, the link function transforms the expected response so that it can be regressed linearly on the predictors (via maximum likelihood rather than simple OLS, ordinary least squares, as in linear regression). As we saw in the first example of the Poisson regression above, the relationship is not linear. The link function for the Poisson distribution is the natural log of the response. Let’s understand this point by plotting log(Y) against X. Note that Y has no error whereas Abundance has error (i.e., it is based on rpois, i.e., Poisson trials).
+
+
# without error:
+plot(log(Y) ~ X)
+
+
+
+
# with error
+plot(log(Abundance) ~ X)
+
+
+
+
+
What does the passage “the link function connects the predictors in a model with the expected (mean) value of the response variable (dependent variable)” mean? As you may have noticed, we used Poisson trials to generate error around the initial Y values. Let’s create 100 possible trials:
+
+
mult.Y <-replicate(n=100,expr=rpois(n.sites,Y))
+
+
View(mult.Y)
+
Each column in mult.Y contains one single trial. This would mimic, for instance, your error in estimating abundances when sampling real populations and assuming that a Poisson GLM would model your abundances across sites well. As such, in real data, obviously, we only have one “trial”. But this small demonstration hopefully helps you understand what the GLM is trying to estimate via the link-function.
+
Let’s repeat the predictor X 100 times so that it becomes compatible in size with the multiple trials; we can then plot them:
+
+
rep.X <-rep(X, times =100)
+plot(as.vector(mult.Y) ~ rep.X,pch=16,cex=0.5,col="firebrick")
+
+
+
+
+
Note that larger values of abundances tend to have more error under the poisson model (which is also ecologically plausible).
+
Now we can understand what the passage “the link function connects the predictors in a model with the expected (mean) value of the response” means. Let’s first increase the number of trials to 10000 and, for each site (for which we have an X value), calculate its mean:
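+The chunk is omitted from the rendered page; a sketch of what was presumably done:
+mult.Y <- replicate(n = 10000, expr = rpois(n.sites, Y))
+mean.by.site <- rowMeans(mult.Y)   # expected (mean) abundance per site across trials
+plot(mean.by.site ~ Y); abline(0, 1)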
As we can observe, the mean across all trials (errors) matches the error-free response variable Y. Hopefully this provides a general understanding of what link functions are. A similar explanation can be given for the logistic regression (i.e., binomial error); there we use a logit link function (transformation) instead.
+
Missing predictors in GLMs as a source of error
+
Obviously the fit is great, particularly because we considered all the important predictors in the model. We don’t usually have all predictors in a model, and this can be simulated as well. Consider the following example, where two environmental predictors were used to generate the abundance data but only one was used in the regression model:
+
+\[Y=e^{\beta_0+\beta_1X_1+\beta_2X_2}\]
+
+
set.seed(100) # so that we all have the same results
+n.sites <-100
+X <-matrix(rnorm(n.sites*2),n.sites,2)
+b0 <-2
+b1 <-0.5
+b2 <-1.2
+Y <-exp(b0 + b1*X[,1] + b2*X[,2]) # there are more direct matrix ways to do this
+Abundance <-rpois(n.sites,Y)
+model <-glm(Abundance ~ X[,1], family="poisson")
+model
Note that the errors around predicted and true values are much greater because only one predictor was included in the GLM even though two predictors were important. Now considering both predictors:
+
+
model <-glm(Abundance ~ X, family="poisson")
+model
This is a critical empirical (ecological) consideration because we can’t measure everything. The error is much smaller and, as a consequence, the model fit is much improved (i.e., smaller AIC values) because it considers the two predictors. It can’t be perfect because of the error related to the poisson trials that will always result in random variation (residual variation).
+
A more realistic Gaussian Poisson model
+
As for the logistic model, we can also consider a more realistic Poisson model based on a Gaussian distribution:
+
\[Y={h\cdot e^{-(\beta_0+\beta_1\frac{(X-\mu)^2}{2\sigma^2})}}\] As before, \(\mu\) represents the species optimum and \(\sigma\) its niche breadth. \(h\) represents the expected abundance in the optimum environmental value. Using code to simulate this model for one species; for simplicity, we will set \(\beta_0=0\) and \(\beta_1=1\):
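+The single-species chunk that produced the estimates printed below is not shown in the rendered page. A sketch is given here; the parameter values are hypothetical (the true values used in the workshop are not visible). On the log scale the Gaussian curve is again a quadratic in X, so the optimum, breadth and height can be recovered from the Poisson GLM coefficients:
+set.seed(100)
+n.sites <- 100
+X <- rnorm(n.sites)
+optimum <- 1.2; niche.breadth <- 0.5; h <- 100   # hypothetical values
+Y <- h*exp(-(X - optimum)^2/(2*niche.breadth^2))
+Abundance <- rpois(n.sites, Y)
+predictor <- cbind(X, X^2)
+model <- glm(Abundance ~ predictor, family = "poisson")
+cf <- coefficients(model)
+estimated.optimum <- -cf[2]/(2*cf[3])
+estimated.niche.breadth <- 1/sqrt(-2*cf[3])
+h.estimated <- exp(cf[1] - cf[2]^2/(4*cf[3]))
+c(estimated.optimum, estimated.niche.breadth, h = h.estimated)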
estimated.optimum estimated.niche.breadth h
+predictorX 1.179753 0.4726181 104.8035
+
+
+
Now we can generalize the code to generate abundance data for multiple species:
+
+
generate_community_abundance <-function(tolerance,E,T,preset_nspecies,preset_ncommunities){
+repeat {
+# one trait, one environmental variable
+ h <-runif(preset_nspecies,min=0.3,max=1)
+ sigma <-runif(preset_nspecies)*tolerance
+ L <-matrix(data=0,nrow=preset_ncommunities,ncol=preset_nspecies)
+for(j in 1:preset_nspecies){
+ L[,j] <-30*h[j]*exp(-(E-T[j])^2/(2*sigma[j]^2))
+#rpois(preset_ncommunities,30*h[j]*exp(-(E-T[j])^2/(2*sigma[j]^2)))
+ }
+ n_species_c <-sum(colSums(L)!=0) # _c for check
+ n_communities_c <-sum(rowSums(L)!=0)
+if ((n_species_c == preset_nspecies) & (n_communities_c==preset_ncommunities)){break}
+ }
+return(L)
+}
+
+
+
set.seed(120) # so that we all have the same results
+n.communities <-100
+n.species <-50
+E <-rnorm(n.communities)
+T <-rnorm(n.species)
+Y <-generate_community_abundance(tolerance =1.5, E, T, n.species, n.communities)
+
+
Let’s plot the expected abundance values across environmental values. To do that nicely, we need to order the communities according to their environmental values as follows:
Y, however, has no error and to create abundance values for each species (i.e., each column of Y; MARGIN = 2) according to a poisson model we can simply:
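+The chunk is not shown; presumably something like the following, where MARGIN = 2 applies the Poisson draw column by column (i.e., species by species), after which the 4th corner correlation is computed on the abundances with error:
+Abundance <- apply(Y, MARGIN = 2, FUN = function(lambda) rpois(length(lambda), lambda))
+TraitEnvCor(Abundance, E, T, Chessel = TRUE)["Fourthcorner"]   # with sampling error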
And now for the data without error, i.e., before the poisson trials:
+
+
TraitEnvCor(Y,E,T, Chessel =TRUE)["Fourthcorner"]
+
+
Fourthcorner
+ 0.5333836
+
+
+
Despite the error we observed once we transformed Y (values without error) into abundances (with error via the poisson trials), the bivariate correlations are pretty similar, indicating that these metrics are robust against sampling error in abundances. Note, however, that we only used one predictor which was the one used to simulate the data to begin with; empirical data are much more complex than that.
+
Finally, note that, as such, bivariate correlations are calculated in the same way regardless if the data are presence-absence or abundance; biomass data could be also considered.
+
+
+
+Moving to more complex GLMs - the bilinear model
+
+
Let’s go back to our stacked model:
+
+
+
+
+
+
And now let’s get back to our stacked approach using again the simplest example we used earlier. Let’s enter the data again to make sure that we have the same data.
There are two ways in which we can code the analysis: using the stacked vectors directly or using a Kronecker product. The Kronecker product stacks the data in the same way but is a bit more “cryptic”, so demonstrating the stacking is easier with simple coding. The stacked GLM below estimates the statistical interaction between environment and trait only. Note that the distribution for our fictional example above is presences and absences, hence we will use the binomial family with a logit link (i.e., logistic regression). This can be referred to as the bilinear model of Gabriel (1998), which is rarely used in ecology but provides a good introduction to what “stacking” means, which is critical for understanding more complex GLMs.
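+The rendered page omits the chunk that re-enters the small data set, stacks it and fits the model whose summary is shown below; presumably it was along these lines:
+Distribution <- as.matrix(rbind(c(1,1,0,0), c(1,0,0,0), c(0,0,1,1), c(0,0,1,0)))
+T <- c(1,2,5,8); E <- c(10,12,100,112)
+n.species <- ncol(Distribution); n.sites <- nrow(Distribution)
+Dist.stacked <- as.vector(Distribution)
+E.stacked <- rep(1, n.species) %x% E
+T.stacked <- T %x% rep(1, n.sites)
+predictor <- E.stacked * T.stacked   # the stacked trait-by-environment interaction
+model <- glm(Dist.stacked ~ predictor, family = binomial(link = logit))
+summary(model)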
+Call:
+glm(formula = Dist.stacked ~ predictor, family = binomial(link = logit))
+
+Deviance Residuals:
+ Min 1Q Median 3Q Max
+-2.1282 -0.5101 -0.2570 0.7417 1.3162
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) -0.9160 0.7681 -1.193 0.2330
+predictor 2.4991 1.1975 2.087 0.0369 *
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for binomial family taken to be 1)
+
+ Null deviance: 21.170 on 15 degrees of freedom
+Residual deviance: 13.597 on 14 degrees of freedom
+AIC: 17.597
+
+Number of Fisher Scoring iterations: 5
+
+
+
The stacking we saw can be easily done using matrix algebra (the early presentation is helpful to understand the principles though). To do it in algebra, we use the kronecker product as follows:
+
+\[TE = T\otimes E\]
+
As such, the bilinear model is testing the statistical interaction between traits and environmental variables.
+
which in R becomes:
+
+
predictor2 <- T %x% E
+model <-glm(Dist.stacked ~ predictor2,family=binomial(link=logit))
+summary(model)
+
+
+Call:
+glm(formula = Dist.stacked ~ predictor2, family = binomial(link = logit))
+
+Deviance Residuals:
+ Min 1Q Median 3Q Max
+-2.1282 -0.5101 -0.2570 0.7417 1.3162
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) -0.9160 0.7681 -1.193 0.2330
+predictor2 2.4991 1.1975 2.087 0.0369 *
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for binomial family taken to be 1)
+
+ Null deviance: 21.170 on 15 degrees of freedom
+Residual deviance: 13.597 on 14 degrees of freedom
+AIC: 17.597
+
+Number of Fisher Scoring iterations: 5
+
+
+
Note that the two ways of running the model result in the exact same estimates. Here we should interpret the slope of the model as the strength of the trait-environment relationship, i.e., \(\beta_1 = 2.4991\). Obviously the bilinear model can be easily extended to any class of GLMs depending on the nature of the response variable (e.g., Poisson, negative binomial, etc.; more on these below).
+
Note also that we could have considered multiple traits and environments (as long as we have enough degrees of freedom).
+
Stacked or bilinear model on the Aravo data set
+
Let’s apply the Aravo data set we saw early here. We will reduce the environmental matrix by removing the qualitative variables. These can be easily accommodated but for the sake of speed, we will reduce it.
+
+
E <- aravo$env[,c("Aspect","Slope","Snow")]
+E <-as.matrix(E)
+T <-as.matrix(aravo$trait)
+
+
Now we apply the Kronecker product using the base function kronecker rather than the %x% operator we saw earlier. This function generates names for each column of the matrix (i.e., interactions between each trait and predictor). The columns contain all possible pairwise combinations of traits and environmental features.
+
+
TE <-kronecker(T, E, make.dimnames =TRUE)
+dim(TE)
Because we have abundance data, let’s run the GLM using the poisson model (log link function) and the negative binomial which may work better when data are overdispersed (e.g., too many zeros, i.e., absences).
+
+
Dist.stacked <-as.vector(as.matrix(aravo$spe))
+ColNames <-colnames(TE)
+TE <-scale(TE) # so that slopes can be compared directly to one another
+colnames(TE) <- ColNames
+model.bilinear.poisson <-glm(Dist.stacked ~ TE,family="poisson")
+library(MASS)
+model.bilinear.negBinom <-glm.nb(Dist.stacked ~ TE)
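+The BIC comparison chunk is not shown in the rendered page; presumably:
+BIC(model.bilinear.poisson)
+BIC(model.bilinear.negBinom)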
The BIC suggests that negative binomial fits the data better (smaller BIC, better fit).
+
Model diagnostics for non-Gaussian models are challenging and often standard tools don’t provide a good way to assess quality of models. Let’s check model residuals using the more classic approach. sppVec below will be used to create a different color for each species.
+
+
# create a vector of species names compatible with the stacked model:
+sppVec =rep(row.names(aravo$traits),each=nrow(aravo$spe))
+plot(log(fitted(model.bilinear.negBinom)),residuals(model.bilinear.negBinom),col=as.numeric(factor(sppVec)),xlab="Fitted values [log scale]",ylab="residuals")
+
+
+
+
+
As we can see, the plot is pretty bad; and that’s a common feature for GLMs.
+
Dunn and Smyth (1996) developed a new class of residuals that allows a much better way to diagnose whether the model fits the data well. The package DHARMa implements the approach. A tutorial for this package and its features can be found at: https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html.
+
+
# install.packages("DHARMa")
+library("DHARMa")
+
+
This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
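+The DHARMa chunk itself is not visible in the rendered page; the messages above and the discussion below suggest something along these lines:
+plot(simulateResiduals(fittedModel = model.bilinear.poisson))
+plot(simulateResiduals(fittedModel = model.bilinear.negBinom))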
+
+
+
+
+
+
The Poisson model does not follow the expected line as well as the negative binomial model. As such, the negative binomial model also fits the residual assumptions better.
+
We can use the package jtools to create model summaries and visualization for GLMs. You can find a good introduction to this package here:
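+The call producing the summary below is not shown; given the jtools reference and the identical call appearing later in the document, it was presumably:
+library(jtools)
+summ(model.bilinear.negBinom)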
Error in glm.control(...) :
+ unused argument (family = list("Negative Binomial(0.5278)", "log", function (mu)
+log(mu), function (eta)
+pmax(exp(eta), .Machine$double.eps), function (mu)
+mu + mu^2/.Theta, function (y, mu, wt)
+2 * wt * (y * log(pmax(1, y)/mu) - (y + .Theta) * log((y + .Theta)/(mu + .Theta))), function (y, n, mu, wt, dev)
+{
+ term <- (y + .Theta) * log(mu + .Theta) - y * log(mu) + lgamma(y + 1) - .Theta * log(.Theta) + lgamma(.Theta) - lgamma(.Theta + y)
+ 2 * sum(term * wt)
+}, function (eta)
+pmax(exp(eta), .Machine$double.eps), expression({
+ if (any(y < 0)) stop("negative values not allowed for the negative binomial family")
+ n <- rep(1, nobs)
+ mustart <- y + (y == 0)/6
+}), function (mu)
+all(mu > 0), function (eta)
+TRUE, function (object, nsim)
+{
+ ftd <- fitted(object)
+ rnegbin(nsim * length(ftd), ftd, .Theta)
+}))
+
+
+
Warning: Something went wrong when calculating the pseudo R-squared. Returning NA
+instead.
+
+
+
+
+
+
+
+Observations: 6150
+Dependent variable: Dist.stacked
+Type: Generalized linear model
+Family: Negative Binomial(0.5278); Link: log
+𝛘²(NA): NA; Pseudo-R² (Cragg-Uhler): NA; Pseudo-R² (McFadden): NA
+AIC: 8609.80; BIC: 8784.63
+
+                      Est.   S.E.   z val.      p
+(Intercept)          -1.27   0.03   -40.67   0.00
+TEHeight:Aspect       0.09   0.11     0.81   0.42
+TEHeight:Slope        0.08   0.05     1.56   0.12
+TEHeight:Snow        -0.07   0.11    -0.69   0.49
+TESpread:Aspect       0.06   0.10     0.56   0.58
+TESpread:Slope        0.07   0.06     1.18   0.24
+TESpread:Snow        -0.38   0.10    -3.82   0.00
+TEAngle:Aspect        0.05   0.11     0.50   0.62
+TEAngle:Slope         0.21   0.07     2.99   0.00
+TEAngle:Snow         -0.35   0.10    -3.52   0.00
+TEArea:Aspect         0.05   0.12     0.36   0.72
+TEArea:Slope          0.01   0.06     0.16   0.87
+TEArea:Snow          -0.24   0.12    -1.90   0.06
+TEThick:Aspect       -0.01   0.09    -0.09   0.93
+TEThick:Slope         0.10   0.05     2.09   0.04
+TEThick:Snow         -0.17   0.09    -1.91   0.06
+TESLA:Aspect          0.45   0.19     2.43   0.02
+TESLA:Slope          -0.31   0.14    -2.19   0.03
+TESLA:Snow           -0.42   0.15    -2.81   0.00
+TEN_mass:Aspect      -0.60   0.19    -3.10   0.00
+TEN_mass:Slope       -0.06   0.15    -0.38   0.70
+TEN_mass:Snow         0.67   0.15     4.54   0.00
+TESeed:Aspect         0.30   0.11     2.80   0.01
+TESeed:Slope          0.11   0.05     2.21   0.03
+TESeed:Snow          -0.50   0.11    -4.39   0.00
+
+Standard errors: MLE
+
+
+
+
The package jtools offers a variety of table and graphical outputs and is worth exploring. Here we will produce a confidence interval plot for each slope. This will take a moment (one minute or less):
+
+
plot_summs(model.bilinear.negBinom, scale =TRUE)
+
+
Error in glm.control(...) :
+ unused argument (family = list("Negative Binomial(0.5278)", "log", function (mu)
+log(mu), function (eta)
+pmax(exp(eta), .Machine$double.eps), function (mu)
+mu + mu^2/.Theta, function (y, mu, wt)
+2 * wt * (y * log(pmax(1, y)/mu) - (y + .Theta) * log((y + .Theta)/(mu + .Theta))), function (y, n, mu, wt, dev)
+{
+ term <- (y + .Theta) * log(mu + .Theta) - y * log(mu) + lgamma(y + 1) - .Theta * log(.Theta) + lgamma(.Theta) - lgamma(.Theta + y)
+ 2 * sum(term * wt)
+}, function (eta)
+pmax(exp(eta), .Machine$double.eps), expression({
+ if (any(y < 0)) stop("negative values not allowed for the negative binomial family")
+ n <- rep(1, nobs)
+ mustart <- y + (y == 0)/6
+}), function (mu)
+all(mu > 0), function (eta)
+TRUE, function (object, nsim)
+{
+ ftd <- fitted(object)
+ rnegbin(nsim * length(ftd), ftd, .Theta)
+}))
+
+
+
Warning: Something went wrong when calculating the pseudo R-squared. Returning NA
+instead.
+
+
+
Registered S3 methods overwritten by 'broom':
+ method from
+ tidy.glht jtools
+ tidy.summary.glht jtools
+
+
+
Loading required namespace: broom.mixed
+
+
+
+
+
+
One needs to be careful when interpreting these confidence intervals: they were produced parametrically and may be biased, as we discussed in the section on statistical hypothesis testing. Again, this is an active area of development among quantitative ecologists, but we haven’t yet reached a unified view on this issue.
+Considering main effects of trait and environment, and their interactions - predicting and explaining species distributions
+
+
The bilinear model estimates and tests for the interactions between traits and environmental features in a single model. That allows partial slopes to be estimated (i.e., the effect of one trait-environment interaction is estimated independently of the others, as in standard linear regression models and GLMs). Understanding partial slopes is essential for inference and is usually covered in introductory statistics for biologists/ecologists under multiple regression. A standard definition is: “The partial slope in multiple regression (or GLM) is the slope of the relationship between a predictor variable that is independent of the other predictor variables and the criterion. It is also the regression coefficient for the predictor variable in question.” (https://onlinestatbook.com/2/glossary/partial_slope.html).
+
Now, we may want to also consider the main effects of each environmental feature and trait in addition to their interactions. This is the model implemented in Jamil et al. (2013) and Brown et al. (2014). The Brown et al. model is:
+
\[\ln(Y_{ij}) = \beta_0 +\beta_1 env_i +\beta_2 env_i^2 +\beta_3 spp_j +\beta_4 (env \times trait)_{ij}\] where \(\beta_0\) is the overall intercept of the model, \(\beta_1 env_i\) contains the slopes for each environmental variable, \(\beta_2 env_i^2\) contains the slopes for the squared environmental terms, \(\beta_3 spp_j\) contains species-specific intercepts which allow predictions for each species separately, and \(\beta_4(env \times trait)_{ij}\) contains the slopes for each trait-by-environment interaction (the 4th corner slopes). Whereas Brown et al. treated \(spp_j\) as fixed, Jamil et al. (2013) treated it as random (more on this later). I’ve followed the notation given in Brown et al. to facilitate understanding their paper, but we could easily switch to the notation used in the mixed model section (following Gelman and Hill 2007).
+
By setting species-specific intercept terms (i.e., \(\beta_3spp_j\)) we can predict species in their appropriate scale of abundance variation. As such, we can use these types of models to predict species distributions. This sort of modelling is becoming the standard for predicting multiple species because it considers multiple types of predictors (trait, environments, non-linearities) and their interactions. Single species models, for instance, can’t consider trait variation and the interactions of traits and environment. As such, stacked models are extremely powerful tools even for modelling single species distributions.
+
The Brown et al. model can be fit using the package mvabund as follows:
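+The mvabund chunk is not shown in the rendered page. A minimal sketch is given here; the exact call is an assumption (the object name glm.trait.res comes from the BIC() comparison further below, and the squared environmental terms would also have to be included, e.g., via the formula argument, to match the hand-built model that follows):
+library(mvabund)
+# traitglm fits environment main effects, species effects and the trait-by-environment
+# (4th corner) interaction terms in one stacked negative binomial model
+glm.trait.res <- traitglm(aravo$spe, as.data.frame(E), as.data.frame(T),
+                          method = "manyglm", family = "negative.binomial")
+glm.trait.res$fourth.corner   # the estimated 4th corner (trait-by-environment) slopes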
The overall intercept is \(\beta_0=-1.9397\); the individual intercepts for each species, \(\beta_3 spp_j\), are the coefficients starting with spp (for instance, the intercept adjustment for Alch.glau, the first species in the list, is \(\beta_3 spp_1=-0.055\)). The slopes for each environmental variable appear under the names we gave; for instance, the slope for aspect is \(\beta_1 env_{aspect}=0.015\). Note that .squ in the coefficient names indicates the slopes for the squared environmental terms; for instance \(\beta_2 env_{snow}^2=-0.274\). Finally we have the 4th corner slopes; for instance, \(\beta_4(env \times trait)_{snow,seed\_mass}=-0.1698\). Remember that snow here refers to the mean snowmelt date. As such, communities with large seed masses (on average) are found in sites with early snowmelt dates, whereas sites with late snowmelt dates have communities that tend to have small seed masses on average.
+
Let’s understand this model by programming it from scratch. That’s really the best way to understand models, and whenever possible I try to demonstrate them from scratch (not always possible, depending on the number of operations involved, e.g., mixed models, and our ability to digest large amounts of code). But this one is simple enough that we can do it:
+
+
n.species <-ncol(aravo$spe)
+n.communities <-nrow(aravo$spe)
+n.env.variables <-ncol(E)
+# repeats each trait to make it compatible (vectorized) with the stacked species distributions
+traitVec <- T[rep(1:n.species,each=n.communities),]
+# repeats each environmental variable to make it compatible (vectorized) with the stacked species distributions:
+envVec <-matrix(rep(t(E),n.species),ncol=NCOL(E),byrow=TRUE)
+# creates an intercept for each species:
+species.intercepts <-rep(1:n.species,each=n.communities)
+species.intercepts <-as.factor(species.intercepts)
+mod <-as.formula("~species.intercepts-1")
+species.intercepts <-model.matrix(mod)[,-1]
+# the interaction terms:
+TE <-kronecker(T, E, make.dimnames =TRUE)
+# combining the predictors in a single matrix:
+preds <-cbind(species.intercepts,envVec,envVec^2,TE)
+# running the model
+model.bilinear.negBinom.Brown <-glm.nb(Dist.stacked ~ preds)
+
+
And as we can see, they are exactly the same model:
+
+
BIC(glm.trait.res)
+
+
l
+8104.818
+
+
BIC(model.bilinear.negBinom.Brown)
+
+
[1] 8104.818
+
+
+
We can easily adapt the graphical and diagnostic outputs for this model as well, but we won’t for simplicity.
+
+
summ(model.bilinear.negBinom)
+
+
Error in glm.control(...) :
+ unused argument (family = list("Negative Binomial(0.5278)", "log", function (mu)
+log(mu), function (eta)
+pmax(exp(eta), .Machine$double.eps), function (mu)
+mu + mu^2/.Theta, function (y, mu, wt)
+2 * wt * (y * log(pmax(1, y)/mu) - (y + .Theta) * log((y + .Theta)/(mu + .Theta))), function (y, n, mu, wt, dev)
+{
+ term <- (y + .Theta) * log(mu + .Theta) - y * log(mu) + lgamma(y + 1) - .Theta * log(.Theta) + lgamma(.Theta) - lgamma(.Theta + y)
+ 2 * sum(term * wt)
+}, function (eta)
+pmax(exp(eta), .Machine$double.eps), expression({
+ if (any(y < 0)) stop("negative values not allowed for the negative binomial family")
+ n <- rep(1, nobs)
+ mustart <- y + (y == 0)/6
+}), function (mu)
+all(mu > 0), function (eta)
+TRUE, function (object, nsim)
+{
+ ftd <- fitted(object)
+ rnegbin(nsim * length(ftd), ftd, .Theta)
+}))
+
+
+
Warning: Something went wrong when calculating the pseudo R-squared. Returning NA
+instead.
+
+
+
+
+
+
+
+Observations: 6150
+Dependent variable: Dist.stacked
+Type: Generalized linear model
+Family: Negative Binomial(0.5278); Link: log
+𝛘²(NA): NA; Pseudo-R² (Cragg-Uhler): NA; Pseudo-R² (McFadden): NA
+AIC: 8609.80; BIC: 8784.63
+
+                      Est.   S.E.   z val.      p
+(Intercept)          -1.27   0.03   -40.67   0.00
+TEHeight:Aspect       0.09   0.11     0.81   0.42
+TEHeight:Slope        0.08   0.05     1.56   0.12
+TEHeight:Snow        -0.07   0.11    -0.69   0.49
+TESpread:Aspect       0.06   0.10     0.56   0.58
+TESpread:Slope        0.07   0.06     1.18   0.24
+TESpread:Snow        -0.38   0.10    -3.82   0.00
+TEAngle:Aspect        0.05   0.11     0.50   0.62
+TEAngle:Slope         0.21   0.07     2.99   0.00
+TEAngle:Snow         -0.35   0.10    -3.52   0.00
+TEArea:Aspect         0.05   0.12     0.36   0.72
+TEArea:Slope          0.01   0.06     0.16   0.87
+TEArea:Snow          -0.24   0.12    -1.90   0.06
+TEThick:Aspect       -0.01   0.09    -0.09   0.93
+TEThick:Slope         0.10   0.05     2.09   0.04
+TEThick:Snow         -0.17   0.09    -1.91   0.06
+TESLA:Aspect          0.45   0.19     2.43   0.02
+TESLA:Slope          -0.31   0.14    -2.19   0.03
+TESLA:Snow           -0.42   0.15    -2.81   0.00
+TEN_mass:Aspect      -0.60   0.19    -3.10   0.00
+TEN_mass:Slope       -0.06   0.15    -0.38   0.70
+TEN_mass:Snow         0.67   0.15     4.54   0.00
+TESeed:Aspect         0.30   0.11     2.80   0.01
+TESeed:Slope          0.11   0.05     2.21   0.03
+TESeed:Snow          -0.50   0.11    -4.39   0.00
+
+Standard errors: MLE
+
+
+
+
As mentioned earlier, we can obtain predicted values per species, making stacked models not only community models but also models for single-species distributions that take advantage of information from multiple sources:
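+The chunk building the per-species predictions is not shown; a sketch (fitted values of the stacked model folded back into a communities-by-species matrix) would be:
+fitted.by.species <- matrix(fitted(model.bilinear.negBinom.Brown),
+                            nrow = n.communities, ncol = n.species)
+colnames(fitted.by.species) <- colnames(aravo$spe)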
Predicted abundance values per species are in columns:
+
View(fitted.by.species)
+
+
+
+A very brief way to understand mixed models: the Simpson’s paradox
+
+
Species and sites are not likely to differ from one another randomly; rather, some sites are more similar to each other than to others (and likewise for species). We call this a hierarchical structure. For instance, sites that are closer to one another may have more similar values than sites farther away, or some species may be more similar in abundance than others, and so on.
+
When using linear models and GLMs, researchers often ignore the hierarchical structure of the data (some sites are more similar than others in the total abundances of species, and some species are more similar than others in their total abundances). As such, standard GLMs can generate biased variance estimates and increase the likelihood of committing type I errors (i.e., rejecting the statistical null hypothesis more often than set by alpha, the significance level).
+
Although this is very interesting ecologically, it does bring some inferential challenges (parameter estimation and statistical hypothesis testing) when fitting statistical models. GLMMs (Generalized Linear Mixed Models) are then used to deal with these issues. This paper by Harrison et al. (2018) provides a great introduction to GLMMs for ecologists: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5970551/.
+
One common feature here is that species may differ in the way they are structured by environmental and/or trait variation. This can be well described by the Simpson’s paradox (Simpson 1951), which is defined “as a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.”
+
Let’s understand this paradox by simulating some data (code not shown here) and graphing it. This sort of demonstration has become somewhat common when explaining the utility of mixed models.
+
+
+
+
+
+
+
+
+
At first glance, the influence of temperature on abundance is positive if we consider the variation across all data points, ignoring species identity. However, within species, the influence of temperature is negative. As such, it is obvious that there is variation among species that can’t be explained by temperature alone, and we should consider a mixed model with temperature as a fixed factor (a measured variable) and species as a random factor. Obviously, one question of interest is why species vary in their responses to temperature, but we don’t have other predictors that could help explain these differences (e.g., physiology). Perhaps considering traits could assist in explaining this variation (more on that later).
+
The data were saved in a data frame called data.Simpson. For simplicity, we will treat these data as normally distributed (they were generated assuming normality anyway); the goal here is just a demonstration.
+
Let’s analyze these data with a fixed model using a simple regression:
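+The chunk is not shown in the rendered page; presumably a simple linear model summarized with jtools:
+lm.fixed <- lm(abundance ~ temperature, data = data.Simpson)
+summ(lm.fixed, scale = TRUE)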
+ Standard errors: OLS; Continuous predictors are mean-centered and scaled by 1 s.d.
+
+
+
+
As we can see, the overall influence of temperature is positive and significant, explaining 12% of the variation in abundance as a function of temperature (i.e., \(R^2=0.12\)).
+
Let’s now consider a mixed effects model in which we estimate the variation in intercepts but still assume a common slope for all species. This is a common procedure in mixed effects modelling, i.e., starting with the simplest model. The fixed effect is coded as usual and the random effect is coded as (1|species), where 1 refers to the intercept (the common way to code the intercept in statistical models); as such, one intercept per species is estimated. The scale=TRUE in the function summ below reports the analysis with standardized predictors.
+
+
# install.packages("lme4")
+library(lme4)
+
+
Loading required package: Matrix
+
+
+
+Attaching package: 'Matrix'
+
+
+
The following objects are masked from 'package:tidyr':
+
+ expand, pack, unpack
+
+
lm.mod.intercept <- lmer(abundance ~ temperature + (1|species), data = data.Simpson)
+summ(lm.mod.intercept, scale = TRUE)
+
+
+
+
+
+
Observations                100
Dependent variable          abundance
Type                        Mixed effects linear regression

AIC                         343.74
BIC                         354.16
Pseudo-R² (fixed effects)   0.30
Pseudo-R² (total)           0.95

Fixed Effects
                Est.    S.E.   t val.    d.f.      p
(Intercept)    -0.08    1.88    -0.04    3.94   0.97
temperature    -2.84    0.35    -8.10   97.70   0.00
p values calculated using Kenward-Roger standard errors and d.f.; Continuous predictors are mean-centered and scaled by 1 s.d.

Random Effects
Group      Parameter     Std. Dev.
species    (Intercept)   4.20
Residual                 1.16

Grouping Variables
Group      # groups   ICC
species    5          0.93
+
+
+
+
+
+
Wow, what a change in interpretation! By allowing intercepts to vary across species, the fixed effect of temperature is now negative (as we should expect). Note, however, that we had information on a categorical factor, i.e., species, that could be used to estimate random effects describing the variation among them. The variation (standard deviation) of intercepts among species is quite large relative to the residual standard deviation: 4.20 versus 1.16, respectively. Accounting for the variation among intercepts increases the \(R^2\) dramatically relative to the fixed-effects model (\(R^2=0.95\)), demonstrating that the species random effect greatly improves the model’s predictive power. We also obtain the ICC (Intra-Class Correlation), which measures how similar abundance values are within groups, i.e., within species. The ICC is 0.93, indicating that abundance values are much more similar within than among species.
+The variation in intercepts can be plotted as follows:
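The plotting code is hidden; a sketch that would produce a similar figure from the objects defined above (names assumed) is:

library(ggplot2)
data.Simpson$fit <- predict(lm.mod.intercept)   # per-species fitted values (random intercepts)
ggplot(data.Simpson, aes(x = temperature, y = abundance, colour = species)) +
  geom_point() +
  geom_line(aes(y = fit)) +
  # overall fixed-effect line in black; using size (rather than linewidth) triggers the
  # deprecation warning shown below on ggplot2 >= 3.4.0
  geom_abline(intercept = fixef(lm.mod.intercept)[1],
              slope = fixef(lm.mod.intercept)[2], colour = "black", size = 1) +
  theme_classic()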
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+
+
+
+
+
+
The black line represents the common (fixed-effect) model. Since only the intercepts were allowed to vary, the per-species slopes are all equal to the fixed-effect slope (which became negative once the variation across species was accounted for).
+Finally, let’s estimate the random intercept and slope model. Here we estimate variation due to differences in both intercepts and slopes across species:
+
+
lm.mod.interceptSlope <- lmer(abundance ~ temperature + (1 + temperature|species), data = data.Simpson)
+summ(lm.mod.interceptSlope, scale = TRUE)
+
+
+
+
+
+
Observations                100
Dependent variable          abundance
Type                        Mixed effects linear regression

AIC                         336.10
BIC                         351.73
Pseudo-R² (fixed effects)   0.31
Pseudo-R² (total)           0.96

Fixed Effects
                Est.    S.E.   t val.   d.f.      p
(Intercept)     0.11    1.75     0.06   3.99   0.95
temperature    -2.91    0.84    -3.45   3.98   0.03
p values calculated using Kenward-Roger standard errors and d.f.; Continuous predictors are mean-centered and scaled by 1 s.d.

Random Effects
Group      Parameter     Std. Dev.
species    (Intercept)   3.84
species    temperature   1.73
Residual                 1.05

Grouping Variables
Group      # groups   ICC
species    5          0.93
+
+
+
+
+
+
The variation (standard deviation) in intercepts across species (3.84) is much larger than the variation in slopes (1.73). Is the difference in predictive power between the two models significant? In other words, does a model that estimates an independent slope for each species explain more variation than one that only considers variation in intercepts? We can simply compare the BIC of both models:
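The comparison code is hidden; it presumably amounts to something like:

BIC(lm.mod.intercept, lm.mod.interceptSlope)   # lower BIC indicates more support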
There is more support for the mixed model that considers variation in both intercepts and slopes. One can also estimate a p-value for whether one model fits better than the other:
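A standard way to obtain such a p-value (a sketch; the original code is hidden) is a likelihood-ratio test, which anova() performs after refitting both models with maximum likelihood:

anova(lm.mod.intercept, lm.mod.interceptSlope)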
Finally, we can plot the intercept and slope variation. The common fixed-effect slope has now been estimated by pooling the variation among species, which is why it becomes negative in contrast to the original fixed-effect slope.
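One way to draw the per-species lines together with the pooled fixed-effect line (a sketch using the objects defined above) is:

species.lines <- coef(lm.mod.interceptSlope)$species   # one fitted intercept and slope per species
ggplot(data.Simpson, aes(x = temperature, y = abundance, colour = species)) +
  geom_point() +
  geom_abline(data = species.lines,
              aes(intercept = `(Intercept)`, slope = temperature), colour = "grey40") +
  geom_abline(intercept = fixef(lm.mod.interceptSlope)[1],
              slope = fixef(lm.mod.interceptSlope)[2], colour = "black") +
  theme_classic()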
Let’s consider (just visually) an even more extreme case: the overall (pooled) trend is very strong but the within-species effect is almost zero. Hopefully these “Simpson’s paradox” examples provide good intuition for the importance of mixed models.
+
+
+
+
+
+
+
+
+
+Our first GLMM applied to the fourth corner problem, treating species as a random effect - the MLM1 model
+
+
There are different ways that we can account for the potential random effects in community ecology data. Here we will review a few of the latest developments.
+
MLM stands for multilevel (i.e., hierarchical) model. The simplest of these models is the one introduced by Pollock et al. (2012), often referred to in the ecological literature as MLM1 (see Miller et al. 2018). The model has the following form (following the notation of Gelman and Hill 2007, as in Miller et al. 2018). It treats species as a random effect, estimating variation in both intercepts and slopes, like the last model in the previous section (Simpson’s paradox):
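In equation form (a reconstruction based on the definitions that follow; the exact rendering in the original post may differ):

\[ Y_i \sim \text{Poisson}(\mu_i), \qquad \log(\mu_i) = \alpha + a_{spp[i]} + \left(\beta_1 + c_{spp[i]}\right) env_{site[i]} + \beta_{12}\, env_{site[i]}\, trait_{spp[i]} + e_i \]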
Note that we changed the notation of Brown et al., whose \(ij\) indexed the \(i^{th}\) site and \(j^{th}\) species, to a single index \(i\), since we are using a stacked model anyway, i.e., \(Y_{i}\) has only rows and one column (stacked species distributions). The functions \(spp[i]\) and \(site[i]\) therefore map row \(i\) onto the corresponding species and site. \(\beta_{12}\) contains the slopes for the interactions between environment and trait. The fixed effect \(\alpha\) gives the overall average abundance of species among sites (one overall intercept), and the fixed effect \(\beta_{1}\) gives the mean response of species to the environmental variables. The random effect \(a_{spp[i]}\) allows different species to have different overall abundances (i.e., sums of abundances across sites; random intercepts across species), and the random effect \(c_{spp[i]}\) allows different species to have different responses to the environmental variables (random slopes across species). \(a_{spp[i]}\) and \(c_{spp[i]}\) have means of zero and variances \(\sigma_a^2\) and \(\sigma_c^2\) (referred to as hyperparameters in mixed-model lingo), with \(\rho_{ac}\) denoting the correlation between them (i.e., species intercepts and species slopes for the environment can be correlated). Finally, the random effect \(e_i\) gives the observation-level variance; this is necessary here to allow for overdispersion (Harrison, 2014). We could also use the negative binomial, as we saw earlier. Note that we are assuming hierarchical variation in variances only, not covariances (e.g., phylogenetic or spatial autocorrelation). One step at a time.
+
Let’s start by setting up an appropriate data structure to run the mixed model. To keep the code manageable, we will consider here only one trait and one environmental variable; the code can easily be generalized to multiple traits and environments. We found earlier a strong interaction between seed mass and snow, in which species with greater seed mass tended to be found in sites with lower snow cover (i.e., a negative correlation between snow and seed mass).
# to code for observation-level variance:
+obs <- 1:(n.species * n.communities)
+# to code for species (as we saw earlier to plot residuals):
+species <- rep(row.names(aravo$traits), each = nrow(aravo$spe))
+# sites cycle within each species block of the stacked data,
+# matching the construction of snow.melt.days below:
+sites <- rep(row.names(aravo$spe), times = n.species)
+# standardizing the data:
+seed.mass <- scale(T[rep(1:n.species, each = n.communities), "Seed"])
+snow.melt.days <- scale(matrix(rep(t(E[, "Snow"]), n.species), ncol = 1, byrow = TRUE))
+data.df <- data.frame(abundance = Dist.stacked, snow.melt.days, seed.mass, species, sites, obs)
+
+
Let’s see the data frame:
+
View(data.df)
+
Let’s start by fitting a glm:
+
+
glm.mod <- glm(abundance ~ snow.melt.days + snow.melt.days:seed.mass, data = data.df, family = "poisson")
+summ(glm.mod, scale = TRUE)
+
+
+
+
+
+
Observations         6150
Dependent variable   abundance
Type                 Generalized linear model
Family               poisson
Link                 log

𝛘²(2)                     56.18
Pseudo-R² (Cragg-Uhler)   0.01
Pseudo-R² (McFadden)      0.01
AIC                       9421.91
BIC                       9442.08

                             Est.    S.E.   z val.      p
(Intercept)                 -1.17    0.02   -50.74   0.00
snow.melt.days              -0.10    0.02    -4.20   0.00
snow.melt.days:seed.mass    -0.14    0.02    -6.39   0.00
Standard errors: MLE; Continuous predictors are mean-centered and scaled by 1 s.d.
+
+
+
+
Despite the very low \(R^2=0.01\), the coefficients are all significant and negative.
+
Let’s run the model. Below, both intercepts (coded as 1) and slopes for snow.melt.days are allowed to vary among species (i.e., (1 + snow.melt.days|species)), and we also allow an observation-level random intercept, (1|obs), to account for overdispersion.
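The model-fitting chunk is hidden in the rendered page; a sketch of the call is given below (the object name mlm1 is an assumption, and we use the snow.melt.days column directly rather than a renamed env variable):

mlm1 <- glmer(abundance ~ snow.melt.days + snow.melt.days:seed.mass +
                (1 + snow.melt.days | species) + (1 | obs),
              data = data.df, family = "poisson")
summ(mlm1, scale = TRUE)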
Continuous predictors are mean-centered and scaled by 1 s.d.

Random Effects
Group      Parameter        Std. Dev.
obs        (Intercept)      0.47
species    (Intercept)      1.46
species    snow.melt.days   1.08

Grouping Variables
Group      # groups   ICC
obs        6150       0.07
species    82         0.63
+
+
+
+
+
+
The variation among observations (residuals) is not of direct interest here; we included it to allow for potential overdispersion in abundance (i.e., a variance of abundance much greater than the mean). Note that the intercepts and slopes contribute roughly similar amounts of variation (intercept sd = 1.46, slope sd = 1.08), indicating that a model with both random intercepts and random slopes is the most appropriate. We could have fit the two models and compared them as we did earlier in the Simpson’s paradox section. Note the huge increase in \(R^2=0.65\), demonstrating that accounting for the random structure gives a much better model. The coefficient of snow.melt.days remains negative, indicating that most species respond negatively to it, despite the potential for some species to increase in abundance at large values of snow.melt.days. We will see this below.
+
The plot of residuals against predicted values provides a good indication that the model is appropriate:
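The code producing the Dunn-Smyth residuals is hidden; with the DHARMa package (used for the Q-Q plot below) they could be obtained roughly as follows (the model object name mlm1 is assumed):

library(DHARMa)
# randomized quantile (Dunn-Smyth) residuals simulated from the fitted GLMM
DunnSmyth.res <- simulateResiduals(mlm1)
plotResiduals(DunnSmyth.res)   # residuals against predicted values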
And now the Q-Q plot to assess residual normality:
+
+
plotQQunif(DunnSmyth.res)
+
+
DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
+
+
+
+
+
+
Note that the Kolmogorov-Smirnov (KS) test indicates that the residuals cannot be assumed to follow the expected distribution. That said, the distribution of residuals looks reasonably close to what is expected; the KS test becomes significant largely because of the large number of data points. Finally, GLMs and GLMMs tend to be quite robust even when the residuals deviate from their expected distribution.
+
Intercepts in a Poisson model are the log of species abundances not explained by the predictors in the model, i.e., when we set the predictors “manually” to zero. Perhaps some sites are more productive than others and we did not measure trait or environmental variables that could account for the slope and intercept variation; the random effects are telling us that there is something we are potentially missing to explain that variation.
+
Moreover, the random effects indicate that species slopes for the environment are strongly and positively correlated with the species intercepts (\(\rho_{ac}=0.73\)). We can plot the intercepts against the slopes:
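The plotting code is hidden; the species-level random effects can be extracted with ranef() and plotted, for example (model object name assumed):

re.species <- ranef(mlm1)$species   # species-level deviations: intercepts and snow.melt.days slopes
plot(re.species[, "(Intercept)"], re.species[, "snow.melt.days"],
     xlab = "Species intercepts", ylab = "Species slopes for snow.melt.days")
cor(re.species[, "(Intercept)"], re.species[, "snow.melt.days"])   # roughly reflects the estimated rho_ac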
Although the fixed-effects-only model run earlier (glm.mod) indicated a significant interaction between snow melt days and seed mass in driving species distributions, this effect is no longer relevant once the random effects are considered:
+
+
summary(glm.mod)
+
+
+Call:
+glm(formula = abundance ~ snow.melt.days + snow.melt.days:seed.mass,
+ family = "poisson", data = data.df)
+
+Deviance Residuals:
+ Min 1Q Median 3Q Max
+-1.2804 -0.7946 -0.7868 -0.6822 4.3659
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) -1.16767 0.02301 -50.744 < 2e-16 ***
+snow.melt.days -0.09714 0.02313 -4.199 2.68e-05 ***
+snow.melt.days:seed.mass -0.14286 0.02237 -6.387 1.70e-10 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for poisson family taken to be 1)
+
+ Null deviance: 6552.0 on 6149 degrees of freedom
+Residual deviance: 6495.8 on 6147 degrees of freedom
+AIC: 9421.9
+
+Number of Fisher Scoring iterations: 6
+
+
+
This indicates that the random variation across species can account for the initial relationship between snow.melt.days and seed mass. Note that snow.melt.days itself remains relevant in driving species abundances across sites, but (again) not its interaction with seed mass.
Finally, as we discussed during the workshop, one needs to be careful when interpreting the significance of interactions between traits and the environment. There are bootstrap-based developments to do this for the MLM1 model, but they have been shown to have inflated type I error rates (Miller et al. 2019).
+
Function glmer.nb could have been used to fit the model using the negative binomial family instead.
+Our second GLMM applied to the fourth corner problem, treating species and sites as random effects - the MLM2 model
+
+
Jamil et al. (2013) implemented a mixed-model version in which, in contrast to MLM1, a fixed-effect term for the trait is included in the model (as in Brown et al. 2014), along with an additional random effect (intercepts) for sites:
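The MLM2 code is hidden; under the description above, a plausible call (object and column names assumed) is:

mlm2 <- glmer(abundance ~ snow.melt.days + seed.mass + snow.melt.days:seed.mass +
                (1 + snow.melt.days | species) + (1 | sites) + (1 | obs),
              data = data.df, family = "poisson")
summ(mlm2, scale = TRUE)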
We won’t run diagnostics here, for brevity; the code shown for MLM1 can easily be adapted.
+
Although the fixed effects for the trait (seed mass) and the interaction (seed mass and snow.melt.days) were not significant, the random-effect standard deviation for site is relatively large (0.66) and improves the predictive power for abundances across species.
+
+Our last GLMM applied to the fourth corner problem, treating species and sites as random effects - the MLM3 model
+
+
ter Braak (2019) proposed yet another version that seems to work better than the previous ones.
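The code is again hidden. As we read ter Braak (2019), MLM3 additionally lets the trait effect vary across sites (a random slope for seed mass by site); a sketch under that assumption, with hypothetical object names, is:

mlm3 <- glmer(abundance ~ snow.melt.days + seed.mass + snow.melt.days:seed.mass +
                (1 + snow.melt.days | species) + (1 + seed.mass | sites) + (1 | obs),
              data = data.df, family = "poisson")
BIC(mlm2, mlm3)   # compare with the MLM2 fit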
+For these data, MLM3 does not improve on MLM2.
+
This is not the end - we will keep updating this page after the workshop
+References (still being filled):
+
Brown, AM, Warton, DI, Andrew, NR, Binns, M, Cassis, G & Gibb, H (2014). The fourth-corner solution - using predictive models to understand how species traits interact with the environment. Methods in Ecology and Evolution, 5, 344-352.
+
Choler, P. (2005) Consistent shifts in Alpine plant traits along a mesotopographical gradient. Arctic, Antarctic, and Alpine Research, 37,444–453.
+
Dray, S., & Legendre, P. (2008) Testing the species traits-environment relationships : The fourth-corner problem revisited. Ecology, 89, 3400–3412.
+
Dunn, KP & Smyth GK (1996) Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5, 1-10.
+
Gabriel, KR. (1998) Generalised bilinear regression. Biometrika, 85, 689-700.
+
Gelman, A & Hill, J (2007). Data analysis using regression and multi-level/hierarchical models. New York, NY: Cambridge University Press.
+
Harrison, XA (2014). Using observation-level random effects to model overdispersion in count data in ecology and evolution. PeerJ, 2, e616.
+
Jamil, T., Ozinga, W. A., Kleyer, M., & ter Braak, C. J. F. (2013). Selecting traits that explain species-environment relationships: A generalized linear mixed model approach. Journal of Vegetation Science, 24, 988–1000.
+
Peres-Neto, P. R., Dray, S., & ter Braak, C. J. F. (2017). Linking trait variation to the environment: Critical issues with community-weighted mean correlation resolved by the fourth-corner approach. Ecography, 40, 806–816.
+
Pollock, L. J., Morris, W. K., & Vesk, P. A. (2012). The role of functional traits in species distributions revealed through a hierarchical model. Ecography, 35, 716–725.
+
Simpson, EH (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society, Series B., 13, 238–241.
+
ter Braak, CJF, Cormont, A. & Dray, S. (2012). Improved testing of species traits–environment relationships in the fourth‐corner problem. Ecology, 93, 1525-1526.
+
ter Braak, CJF & Looman, CWN. (1986). Weighted averaging, logistic regression and the Gaussian response model. Plant Ecology, 65, 3-11.
+
Useful sites for GLMs (we will also expand on this later):
+
+
+
+
\ No newline at end of file
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-1-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-1-1.png
new file mode 100644
index 0000000..c896584
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-1-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-102-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-102-1.png
new file mode 100644
index 0000000..ec1e014
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-102-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-103-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-103-1.png
new file mode 100644
index 0000000..8a03b04
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-103-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-104-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-104-1.png
new file mode 100644
index 0000000..8ec002f
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-104-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-107-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-107-1.png
new file mode 100644
index 0000000..e87d28a
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-107-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-108-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-108-1.png
new file mode 100644
index 0000000..4204bc6
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-108-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-12-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-12-1.png
new file mode 100644
index 0000000..45e8a1d
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-12-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-15-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-15-1.png
new file mode 100644
index 0000000..3a231fe
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-15-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-16-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-16-1.png
new file mode 100644
index 0000000..1982b81
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-16-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-18-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-18-1.png
new file mode 100644
index 0000000..39bddb4
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-18-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-20-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-20-1.png
new file mode 100644
index 0000000..731405a
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-20-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-22-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-22-1.png
new file mode 100644
index 0000000..535360c
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-22-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-31-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-31-1.png
new file mode 100644
index 0000000..a32f970
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-31-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-4-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-4-1.png
new file mode 100644
index 0000000..43ebe60
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-4-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-41-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-41-1.png
new file mode 100644
index 0000000..a40942d
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-41-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-48-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-48-1.png
new file mode 100644
index 0000000..eb4438f
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-48-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-49-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-49-1.png
new file mode 100644
index 0000000..c9fb999
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-49-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-50-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-50-1.png
new file mode 100644
index 0000000..d9ee641
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-50-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-50-2.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-50-2.png
new file mode 100644
index 0000000..3e457d3
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-50-2.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-52-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-52-1.png
new file mode 100644
index 0000000..27e7ae7
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-52-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-53-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-53-1.png
new file mode 100644
index 0000000..bc8050f
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-53-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-54-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-54-1.png
new file mode 100644
index 0000000..e81176c
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-54-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-55-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-55-1.png
new file mode 100644
index 0000000..a929b6d
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-55-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-56-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-56-1.png
new file mode 100644
index 0000000..a9ef2ad
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-56-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-57-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-57-1.png
new file mode 100644
index 0000000..2ed95d8
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-57-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-61-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-61-1.png
new file mode 100644
index 0000000..9cbf54c
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-61-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-63-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-63-1.png
new file mode 100644
index 0000000..fa7dbbd
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-63-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-7-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-7-1.png
new file mode 100644
index 0000000..4f43dcf
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-7-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-74-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-74-1.png
new file mode 100644
index 0000000..e4bcacb
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-74-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-75-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-75-1.png
new file mode 100644
index 0000000..84f058e
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-75-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-76-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-76-1.png
new file mode 100644
index 0000000..ddc96ba
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-76-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-77-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-77-1.png
new file mode 100644
index 0000000..73f0277
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-77-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-79-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-79-1.png
new file mode 100644
index 0000000..0fad82e
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-79-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-80-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-80-1.png
new file mode 100644
index 0000000..73117a4
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-80-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-88-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-88-1.png
new file mode 100644
index 0000000..989f921
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-88-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-9-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-9-1.png
new file mode 100644
index 0000000..1a9f5b1
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-9-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-91-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-91-1.png
new file mode 100644
index 0000000..4844031
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-91-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-95-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-95-1.png
new file mode 100644
index 0000000..9fb04ba
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-95-1.png differ
diff --git a/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-97-1.png b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-97-1.png
new file mode 100644
index 0000000..4eaedb1
Binary files /dev/null and b/docs/posts/2021-07-19-glm-community-ecology/index_files/figure-html/unnamed-chunk-97-1.png differ
diff --git a/docs/search.json b/docs/search.json
new file mode 100644
index 0000000..77db564
--- /dev/null
+++ b/docs/search.json
@@ -0,0 +1,842 @@
+[
+ {
+ "objectID": "posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html",
+ "href": "posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html",
+ "title": "Sensibilisation aux réalités autochtones et recherche collaborative",
+ "section": "",
+ "text": "Améliorer notre compréhension du passé et de ses impacts sur nos relations entre le avec les Peuples Autochtones.\nDévelopper des notions et compétences afin d’agir contre les préjugés et le racisme.\n\n\n\n\n\nFaire un survol des événements historiques importants et de leurs impacts à ce jour (Loi sur les Indiens, politiques d’assimilation, les pensionnats, etc.). \nAcquérir des connaissances sur la terminologie autochtone.\nFaire un survol de certains procès et contextes légaux et voir comment ils affectent notre travail en territoire autochtone.\nDans une optique de réconciliation, faire une prise de conscience des préjugés persistants et discuter de stratégies pour améliorer nos relations avec les communautés."
+ },
+ {
+ "objectID": "posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html#objectifs-de-la-formation-1",
+ "href": "posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html#objectifs-de-la-formation-1",
+ "title": "Sensibilisation aux réalités autochtones et recherche collaborative",
+ "section": "Objectifs de la formation :",
+ "text": "Objectifs de la formation :\n\nEntamer une réflexion collective envers nos pratiques de recherche et comment s’engager de manière significative avec les communautés autochtones.\nDévelopper une meilleure compréhension des perceptions et attentes des communautés envers la recherche et les chercheurs."
+ },
+ {
+ "objectID": "posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html#durant-ce-webminaire-nous-allons-1",
+ "href": "posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html#durant-ce-webminaire-nous-allons-1",
+ "title": "Sensibilisation aux réalités autochtones et recherche collaborative",
+ "section": "Durant ce webminaire, nous allons: ",
+ "text": "Durant ce webminaire, nous allons: \n\nMieux comprendre la nécessité de prendre en compte les connaissances autochtones dans divers aspects de la gestion environnementale au Canada; \nDiscuter du désir des communauté d’avoir une présence accrue dans le milieu de la recherche : comment faire?\nAborder et débattre des différentes approches méthodologiques pour établir des ponts en les connaissances autochtones et scientifiques."
+ },
+ {
+ "objectID": "posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html#formatrice",
+ "href": "posts/2020-04-28-sensibilisation-aux-ralits-autochtones-et-recherche-collaborative/index.html#formatrice",
+ "title": "Sensibilisation aux réalités autochtones et recherche collaborative",
+ "section": "Formatrice :",
+ "text": "Formatrice :\nCatherine-Alexandra Gagnon possède une expertise dans le travail collaboratif en milieux autochtones. Elle s’intéresse particulièrement à la mise en commun des savoirs locaux, autochtones et scientifiques. Elle détient un doctorat en Sciences de l’environnement et une maîtrise en Gestion de la faune de l’Université du Québec à Rimouski, un baccalauréat en biologie faunique de l’université McGill ainsi qu’un certificat en Études autochtones de l’université de Montréal. Durant ses études, elle a travaillé sur les connaissances locales et ancestrales des Aîné(e)s et chasseurs Inuit, Inuvialuit et Gwich’in du Nunavut, des Territoires du Nord-Ouest et du Yukon."
+ },
+ {
+ "objectID": "posts/2020-09-21-data-visualization/index.html",
+ "href": "posts/2020-09-21-data-visualization/index.html",
+ "title": "Data Visualization",
+ "section": "",
+ "text": "Welcome!\nThis training covers the general principles of visualization and graphic design, and techniques of tailored visualization. More specifically, the objectives of the training are:"
+ },
+ {
+ "objectID": "posts/2020-09-21-data-visualization/index.html#training-material",
+ "href": "posts/2020-09-21-data-visualization/index.html#training-material",
+ "title": "Data Visualization",
+ "section": "Training material",
+ "text": "Training material\nClick on “Show code” to learn how to do each plot!\n\nInteractive examples\n\n\n\n\nStreamgraph\n\n\nShow code\n# Script to make a streamgraph of the top 10 most popular dog breeds in \n# New York City from 1999 to 2015\n\n# load libraries\nlibrary(lubridate) # dealing with dates\nlibrary(dplyr) # data manipulation\nlibrary(streamgraph) #devtools::install_github(\"hrbrmstr/streamgraph\")\nlibrary(htmlwidgets) # to save the widget!\n\n# load the dataset\n# more information about this dataset can be found here:\n# https://www.kaggle.com/smithaachar/nyc-dog-licensing-clean\nnyc_dogs <- read.csv(\"data/nyc_dogs.csv\")\n\n# convert birth year to date format (and keep only the year)\nnyc_dogs$AnimalBirthYear <- mdy_hms(nyc_dogs$AnimalBirthMonth) %>% year()\n\n# identify 10 most common dogs\ntopdogs <- nyc_dogs %>% count(BreedName) \ntopdogs <- topdogs[order(topdogs$n, decreasing = TRUE),]\n# keep 10 most common breeds (and remove last year of data which is incomplete)\ndf <- filter(nyc_dogs, BreedName %in% topdogs$BreedName[2:11] & AnimalBirthYear < 2016) %>% \n group_by(AnimalBirthYear) %>% \n count(BreedName) %>% ungroup()\n\n# get some nice colours from viridis (magma)\ncols <- viridis::viridis_pal(option = \"magma\")(length(unique(df$BreedName)))\n\n# make streamgraph!\npp <- streamgraph(df, \n key = BreedName, value = n, date = AnimalBirthYear, \n height=\"600px\", width=\"1000px\") %>%\n sg_legend(show=TRUE, label=\"names: \") %>%\n sg_fill_manual(values = cols) \n# saveWidget(pp, file=paste0(getwd(), \"/figures/dogs_streamgraph.html\"))\n\n# plot\npp\n\n\n\n\n\n\n\n\n\n\nInteractive plot\n\n\nShow code\n# Script to generate plots to demonstrate how combinations of information dimensions\n# can become overwhelming and difficult to interpret.\n\n# set-up & data manipulation ---------------------------------------------------\n\n# load packages\nlibrary(ggplot2) # for plots, built layer by layer\nlibrary(dplyr) # for data manipulation\nlibrary(magrittr) # for piping\nlibrary(plotly) # interactive plots\n\n# set ggplot theme\ntheme_set(theme_classic() +\n theme(axis.title = element_text(size = 11, face = \"bold\"),\n axis.text = element_text(size = 11),\n plot.title = element_text(size = 13, face = \"bold\"),\n legend.title = element_text(size = 11, face = \"bold\"),\n legend.text = element_text(size = 10)))\n\n# import data\n# more info on this dataset: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-28/readme.md\npenguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv') \n\n# get some nice colours from viridis (magma)\nsp_cols <- viridis::viridis_pal(option = \"magma\")(5)[2:4]\n\n\n#### Day 1 ####\n\n# 1. Similarity\n\nggplot(penguins) +\n geom_point(aes(y = bill_length_mm, x = bill_depth_mm, col = species), size = 2.5) +\n labs(x = \"Bill depth (mm)\", y = \"Bill length (mm)\", col = \"Species\") + # labels\n scale_color_manual(values = sp_cols) # sets the colour scale we created above \n\n\n\n\n\nShow code\nggsave(\"figures/penguins_similarity.png\", width = 6, height = 3, units = \"in\")\n\n# 2. 
Proximity\n\ndf <- penguins %>% group_by(sex, species) %>% \n summarise(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>% na.omit() \nggplot(df) +\n geom_bar(aes(y = mean_mass, x = species, fill = sex), \n position = \"dodge\", stat = \"identity\") +\n labs(x = \"Species\", y = \"Mean body mass (g)\", col = \"Sex\") + # labels\n scale_fill_manual(values = sp_cols) # sets the colour scale we created above\n\n\n\n\n\nShow code\nggsave(\"figures/penguins_proximity.png\", width = 6, height = 3, units = \"in\")\n\n# 3. Enclosure (Ellipses over a fake PCA)\nggplot(data = penguins, \n aes(y = bill_length_mm, x = bill_depth_mm)) +\n geom_point(size = 2.1, col = \"grey30\") +\n stat_ellipse(aes(col = species), lwd = .7) +\n labs(x = \"PCA1\", y = \"PCA2\", col = \"Species\") + # labels\n scale_color_manual(values = sp_cols) + # sets the colour scale we created above\n theme(axis.text = element_blank(), axis.ticks = element_blank())\n\n\n\n\n\nShow code\nggsave(\"figures/penguins_enclosure.png\", width = 6, height = 3, units = \"in\")\n\n# 4. Mismatched combination of principles\ntemp_palette <- rev(c(sp_cols, \"#1f78b4\", \"#33a02c\"))\nggplot(data = penguins, \n aes(y = bill_length_mm, x = bill_depth_mm)) +\n geom_point(aes(col = sex), size = 2.1) +\n stat_ellipse(aes(col = species), lwd = .7) +\n labs(x = \"Bill depth (mm)\", y = \"Bill length (mm)\", col = \"?\") + # labels\n scale_color_manual(values = temp_palette) # sets the colour scale we created above\n\n\n\n\n\nShow code\nggsave(\"figures/penguins_mismatchedgestalt.png\", width = 6, height = 3, units = \"in\")\n\n\n\n#### Day 2 ####\n\n# 1. Ineffective combinations: Luminance & shading -----------------------------\n\n# create the plot\nggplot(penguins) +\n geom_point(aes(y = bill_length_mm, x = bill_depth_mm, \n col = species, # hue\n alpha = log(body_mass_g)), # luminance\n size = 2.5) +\n labs(x = \"Bill depth (mm)\", y = \"Bill length (mm)\", \n col = \"Species\", alpha = \"Body mass (g)\") +\n scale_color_manual(values = sp_cols)\n\n\n\n\n\nShow code\nggsave(\"figures/penguins_incompatible1.png\", width = 6, height = 3, units = \"in\")\n\n# 2. Ineffective combinations: Sizes and shapes --------------------------------\n\nggplot(penguins) +\n geom_point(aes(y = bill_length_mm, x = bill_depth_mm, \n shape = species, # shape\n size = log(body_mass_g)), alpha = .7) + # size\n scale_size(range = c(.1, 5)) + # make sure the sizes are scaled by area and not by radius\n labs(x = \"Bill depth (mm)\", y = \"Bill length (mm)\", \n shape = \"Species\", size = \"Body mass (g)\") \n\n\n\n\n\nShow code\nggsave(\"figures/penguins_incompatible2.png\", width = 6, height = 3, units = \"in\")\n\n# 3. Cognitive overload --------------------------------------------------------\n\n# get some nice colours from viridis (magma)\nsex_cols <- viridis::viridis_pal(option = \"magma\")(8)[c(3,6)]\n\nggplot(na.omit(penguins)) +\n geom_point(aes(y = bill_length_mm, # dimension 1: position along y scale\n x = bill_depth_mm, # dimension 2: position along x scale\n shape = species, # dimension 3: shape\n size = log(body_mass_g), # dimension 4: size\n col = sex), # dimension 5: hue\n alpha = .7) + # size\n scale_size(range = c(.1, 5)) + # make sure the sizes are scaled by area and not by radius\n labs(x = \"Bill depth (mm)\", y = \"Bill length (mm)\", \n shape = \"Species\", size = \"Body mass (g)\", col = \"Sex\") +\n scale_color_manual(values = sex_cols)\n\n\n\n\n\nShow code\nggsave(\"figures/penguins_5dimensions.png\", width = 7, height = 4, units = \"in\")\n\n\n# 4. 
Panels -------------------------------------------------------------------\n\nggplot(na.omit(penguins)) +\n geom_point(aes(y = bill_length_mm, # dimension 1: position along y scale\n x = bill_depth_mm, # dimension 2: position along x scale\n col = log(body_mass_g)), # dimension 3: hue\n alpha = .7, size = 2) + \n facet_wrap(~ species) + # dimension 4: species!\n # this will create a separate panel for each species\n # note: this also automatically uses the same axes for all panels! If you want \n # axes to vary between panels, use the argument scales = \"free\"\n labs(x = \"Bill depth (mm)\", y = \"Bill length (mm)\", col = \"Body mass (g)\") +\n scale_color_viridis_c(option = \"magma\", end = .9, direction = -1) +\n theme_linedraw() + theme(panel.grid = element_blank()) # making the panels prettier\n\n\n\n\n\nShow code\nggsave(\"figures/penguins_dimensions_facets.png\", width = 7, height = 4, units = \"in\")\n\n\n# 5. Interactive ---------------------------------------------------------------\n\np <- na.omit(penguins) %>%\n ggplot(aes(y = bill_length_mm, \n x = bill_depth_mm, \n col = log(body_mass_g))) +\n geom_point(size = 2, alpha = .7) + \n facet_wrap(~ species) +\n labs(x = \"Bill depth (mm)\", y = \"Bill length (mm)\", col = \"Body mass (g)\") +\n scale_color_viridis_c(option = \"magma\", end = .9, direction = -1) +\n theme_linedraw() + theme(panel.grid = element_blank()) # making the panels prettier\np <- ggplotly(p)\n#setwd(\"figures\")\nhtmlwidgets::saveWidget(as_widget(p), \"figures/penguins_interactive.html\")\np\n\n\n\n\n\n\n\n\n\nExample figures\n\n\nShow code\n# Script to make animated plot of volcano eruptions over time\n\n# Load libraries:\nlibrary(dplyr) # data manipulation\nlibrary(ggplot2) # plotting\nlibrary(gganimate) # animation\nlibrary(gifski) # creating gifs\n\n# set ggplot theme\ntheme_set(theme_classic() +\n theme(axis.title = element_text(size = 11, face = \"bold\"),\n axis.text = element_text(size = 11),\n plot.title = element_text(size = 13, face = \"bold\"),\n legend.title = element_text(size = 11, face = \"bold\"),\n legend.text = element_text(size = 10)))\n\n# function to floor a year to the decade\nfloor_decade = function(value){return(value - value %% 10)}\n\n# read data \neruptions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/eruptions.csv')\n\n# select top 5 most frequently exploding volcanoes\ntemp <- group_by(eruptions, volcano_name) %>% tally() \ntemp <- temp[order(temp$n, decreasing = TRUE),]\n\n# make a time series dataset (number of eruptions per year)\neruptions$start_decade = floor_decade(eruptions$start_year)\n\n# filter dataset to subset we want to visualize\ndf <- eruptions %>% \n filter(between(start_decade, 1900, 2019)) %>%\n filter(volcano_name %in% temp$volcano_name[1:5]) %>%\n group_by(start_decade) %>%\n count(volcano_name) %>% ungroup()\n\n# plot!\np <- ggplot(df, aes(x = start_decade, y = n, fill = volcano_name)) +\n geom_area() +\n geom_vline(aes(xintercept = start_decade)) + # line that follows the current decade\n scale_fill_viridis_d(option = \"magma\", end = .8) +\n labs(x = \"\", y = \"Number of eruptions\", fill = \"Volcano\",\n title = 'Eruptions of the top 5 most frequently erupting volcanos worldwide') +\n # gganimate part: reveals each decade\n transition_reveal(start_decade) \nanimate(p, duration = 5, fps = 20, width = 800, height = 300, renderer = gifski_renderer())\n\n\n\n\n\nShow code\n#anim_save(\"figures/volcano_eruptions.gif\")\n\n\n\n\nShow code\n# 
Script to generate plots with various ways of representing uncertainty, based \n# Coffee & Code dataset from https://www.kaggle.com/devready/coffee-and-code/data\n\n# set-up & data manipulation ---------------------------------------------------\n\n# load packages\nlibrary(ggplot2) # for plots, built layer by layer\nlibrary(dplyr) # for data manipulation\nlibrary(magrittr) # for piping\nlibrary(ggridges) # for density ridge plots\nlibrary(patchwork) # great package for \"patching\" plots together!\n\n# set ggplot theme\ntheme_set(theme_classic() +\n theme(axis.title = element_text(size = 11, face = \"bold\"),\n axis.text = element_text(size = 11),\n plot.title = element_text(size = 13, face = \"bold\"),\n legend.title = element_text(size = 11, face = \"bold\"),\n legend.text = element_text(size = 10)))\n\n# import data\ndf <- read.csv(\"data/coffee_code.csv\")\n\n# set labels to be used in all plots\ncoffee_labels <- labs(title = \"Does coffee help programmers code?\",\n x = \"Coffee while coding\", \n y = \"Time spent coding \\n(hours/day)\") \n\n# the variable CodingWithoutCoffee is negative, which is harder to understand\n# (i.e. \"No\" means they drink coffee...). So, let's transform it into a more \n# intuitive variable!\ndf$CodingWithCoffee <- gsub(\"No\", \"Usually\", df$CodingWithoutCoffee)\ndf$CodingWithCoffee <- gsub(\"Yes\", \"Rarely\\n or never\", df$CodingWithCoffee)\n# convert to factor and set levels so they show up in a logical order\ndf$CodingWithCoffee <- factor(df$CodingWithCoffee,\n levels = c(\"Rarely\\n or never\", \n \"Sometimes\", \n \"Usually\"))\n\n# calculate summary statistics for the variable of choice\ndf_summary <- group_by(df, CodingWithCoffee) %>%\n summarise(\n # mean\n mean_codinghours = mean(CodingHours), \n # standard deviation\n sd_codinghours = sd(CodingHours), \n # standard error\n se_codinghours = sd(CodingHours)/sqrt(length(CodingHours)))\n\n\n# 1. Error bars (standard error) -----------------------------------------------\n\nggplot(df_summary) +\n geom_errorbar(aes(x = CodingWithCoffee, \n ymin = mean_codinghours - se_codinghours,\n ymax = mean_codinghours + se_codinghours), \n width = .2) +\n geom_point(aes(x = CodingWithCoffee, y = mean_codinghours), \n size = 3) +\n coffee_labels + ylim(0,10)\n\n\n\n\n\nShow code\nggsave(\"figures/coffee_errorbars.png\", width = 5, height = 3, units = \"in\")\n\n# 2. Boxplot -------------------------------------------------------------------\n\nggplot(df) +\n geom_boxplot(aes(x = CodingWithCoffee, y = CodingHours)) +\n coffee_labels\n\n\n\n\n\nShow code\nggsave(\"figures/coffee_boxplot.png\", width = 5, height = 3, units = \"in\")\n\n\n# 3. 
Error bar demonstration ---------------------------------------------------\n\n# get some nice colours from viridis (magma)\nerror_cols <- viridis::viridis_pal(option = \"magma\")(5)[2:4]\n# set labels to be used in the palette\nerror_labels = c(\"standard deviation\",\"95% confidence interval\",\"standard error\")\n\nggplot(df_summary) +\n # show the raw data\n geom_jitter(data = df, aes(x = CodingWithCoffee, \n y = CodingHours),\n alpha = .5, width = .05, col = \"grey\") +\n # standard deviation\n geom_errorbar(aes(x = CodingWithCoffee, \n ymin = mean_codinghours - sd_codinghours,\n ymax = mean_codinghours + sd_codinghours,\n col = \"SD\"), width = .2, lwd = 1) +\n # 95% confidence interval\n geom_errorbar(aes(x = CodingWithCoffee, \n ymin = mean_codinghours - 1.96*se_codinghours,\n ymax = mean_codinghours + 1.96*se_codinghours, \n col = \"CI\"), width = .2, lwd = 1) +\n # standard error\n geom_errorbar(aes(x = CodingWithCoffee, \n ymin = mean_codinghours - se_codinghours,\n ymax = mean_codinghours + se_codinghours, \n col = \"SE\"), width = .2, lwd = 1) +\n geom_point(aes(x = CodingWithCoffee, y = mean_codinghours), \n size = 3) +\n coffee_labels + ylim(c(0,11)) +\n # manual palette/legend set-up!\n scale_colour_manual(name = \"Uncertainty metric\", \n values = c(SD = error_cols[1], \n CI = error_cols[2], \n SE = error_cols[3]),\n labels = error_labels) +\n theme(legend.position = \"top\")\n\n\n\n\n\nShow code\nggsave(\"figures/coffee_bars_demo.png\", width = 7, height = 5, units = \"in\")\n\n\n# 4. Jitter plot with violin ---------------------------------------------------\n\nggplot() +\n geom_jitter(data = df, aes(x = CodingWithCoffee, \n y = CodingHours),\n alpha = .5, width = .05, col = \"grey\") +\n geom_violin(data = df, aes(x = CodingWithCoffee, \n y = CodingHours), alpha = 0) +\n geom_linerange(data = df_summary,\n aes(x = CodingWithCoffee, \n ymin = mean_codinghours - se_codinghours,\n ymax = mean_codinghours + se_codinghours)) +\n geom_point(data = df_summary, \n aes(x = CodingWithCoffee, \n y = mean_codinghours), size = 3) +\n coffee_labels\n\n\n\n\n\nShow code\nggsave(\"figures/coffee_violin_jitter.png\", width = 5, height = 3, units = \"in\")\n\n\n# 5. Density ridge plot --------------------------------------------------------\n\nggplot(df) + \n aes(y = CodingWithCoffee, x = CodingHours, fill = stat(x)) +\n geom_density_ridges_gradient(scale = 1.9, size = .2, rel_min_height = 0.005) +\n # colour palette (gradient according to CodingHours)\n scale_fill_viridis_c(option = \"magma\", direction = -1) +\n # remove legend - it's not necessary here!\n theme(legend.position = \"none\") +\n labs(title = coffee_labels$title, \n x = coffee_labels$y, \n y = \"Coffee \\nwhile coding\") + \n theme(axis.title.y = element_text(angle=0, hjust = 1, vjust = .9, \n margin = margin(t = 0, r = -50, b = 0, l = 0)))\n\n\n\n\n\nShow code\nggsave(\"figures/coffee_density_ridges.png\", width = 5, height = 3, units = \"in\")\n\n# 6. Jitter vs. 
Rug plot ------------------------------------------------------------------\n\njitterplot <- ggplot(df, aes(x = CoffeeCupsPerDay, y = CodingHours)) +\n geom_jitter(alpha = .2) +\n geom_smooth(fill = error_cols[1], col = \"black\", method = lm, lwd = .7) +\n coffee_labels + ylim(c(0,11)) + labs(x = \"Cups of coffee (per day)\")\n\nrugplot <- ggplot(df, aes(x = CoffeeCupsPerDay, y = CodingHours)) +\n geom_smooth(fill = error_cols[1], col = \"black\", method = lm, lwd = .7) +\n geom_rug(position=\"jitter\", alpha = .7) + ylim(c(0,11)) +\n coffee_labels + labs(x = \"Cups of coffee (per day)\")\n\n# patch the two plots together\njitterplot + rugplot\n\n\n\n\n\nShow code\n#ggsave(\"figures/coffee_jitter_vs_rugplot.png\", width = 10, height = 4, units = \"in\")\n\n\n\n\nShow code\n# Script to generate 95% confidence intervals of a generated random normal distribution\n# as an example in Day 2: Visualizing uncertainty.\n\n# load library\nlibrary(ggplot2)\nlibrary(magrittr)\nlibrary(dplyr)\n\n# set ggplot theme\ntheme_set(theme_classic() +\n theme(axis.title = element_text(size = 11, face = \"bold\"),\n axis.text = element_text(size = 11),\n plot.title = element_text(size = 13, face = \"bold\"),\n legend.title = element_text(size = 11, face = \"bold\"),\n legend.text = element_text(size = 10)))\n\n# set random seed\nset.seed(22)\n\n# generate population (random normal distribution)\ndf <- data.frame(\"value\" = rnorm(50, mean = 0, sd = 1))\n\n# descriptive stats for each distribution\ndesc_stats = df %>% \n summarise(mean_val = mean(value, na.rm = TRUE),\n se_val = sqrt(var(value)/length(value)))\n\n# build density plot!\np <- ggplot(data = df, aes(x = value, y = ..count..)) +\n geom_density(alpha = .2, lwd = .3) +\n xlim(c(min(df$value-1), max(df$value+1))) \n# extract plotted values\nbase_p <- ggplot_build(p)$data[[1]]\n# shade the 95% confidence interval\np + \n geom_area(data = subset(base_p, \n between(x, \n left = (desc_stats$mean_val - 1.96*desc_stats$se_val),\n right = (desc_stats$mean_val + 1.96*desc_stats$se_val))),\n aes(x = x, y = y), fill = \"cadetblue3\", alpha = .6) +\n # add vertical line to show population mean\n geom_vline(aes(xintercept = 0), lty = 2) +\n annotate(\"text\", x = 0.9, y = 19, label = \"True mean\", fontface = \"italic\") +\n # label axis!\n labs(x = \"Variable of interest\", y = \"\") \n\n\n\n\n\nShow code\n#ggsave(\"figures/confidenceinterval_example.png\", width = 5, height = 3.5, units = \"in\")\n\n\n\n\nAnnotated resource library\nThis is an annotated library of data visualization resources we used to build the BIOS² Data Visualization Training, as well as some bonus resources we didn’t have the time to include. Feel free to save this page as a reference for your data visualization adventures!\n\n\nBooks & articles\nFundamentals of Data Visualization A primer on making informative and compelling figures. This is the website for the book “Fundamentals of Data Visualization” by Claus O. Wilke, published by O’Reilly Media, Inc.\nData Visualization: A practical introduction An accessible primer on how to create effective graphics from data using R (mainly ggplot). 
This book provides a hands-on introduction to the principles and practice of data visualization, explaining what makes some graphs succeed while others fail, how to make high-quality figures from data using powerful and reproducible methods, and how to think about data visualization in an honest and effective way.\nData Science Design (Chapter 6: Visualizing Data) Covers the principles that make standard plot designs work, show how they can be misleading if not properly used, and develop a sense of when graphs might be lying, and how to construct better ones.\nGraphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods Cleveland, William S., and Robert McGill. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association, vol. 79, no. 387, 1984, pp. 531–554. JSTOR, www.jstor.org/stable/2288400. Accessed 9 Oct. 2020.\nGraphical Perception and Graphical Methods for Analyzing Scientific Data Cleveland, William S., and Robert McGill. “Graphical perception and graphical methods for analyzing scientific data.” Science 229.4716 (1985): 828-833.\nFrom Static to Interactive: Transforming Data Visualization to Improve Transparency Weissgerber TL, Garovic VD, Savic M, Winham SJ, Milic NM (2016) designed an interactive line graph that demonstrates how dynamic alternatives to static graphics for small sample size studies allow for additional exploration of empirical datasets. This simple, free, web-based tool demonstrates the overall concept and may promote widespread use of interactive graphics.\nData visualization: ambiguity as a fellow traveler Research that is being done about how to visualize uncertainty in data visualizations. Marx, V. Nat Methods 10, 613–615 (2013). https://doi.org/10.1038/nmeth.2530\nData visualization standards Collection of guidance and resources to help create better data visualizations with less effort.\n\n\n\nDesign principles\nGestalt Principles for Data Visualization: Similarity, Proximity & Enclosure Short visual guide to the Gestalt Principles.\nWhy scientists need to be better at data visualization A great overview of principles that could improve how we visualize scientific data and results.\nA collection of graphic pitfalls A collection of short articles about common issues with data visualizations that can mislead or obscure your message.\n\n\n\nChoosing a visualization\nData Viz Project This is a great place to get inspiration and guidance about how to choose an appropriate visualization. There are many visualizations we are not used to seeing in ecology!\nFrom data to Viz | Find the graphic you need Interactive tool to choose an appropriate visualization type for your data.\n\n\n\nColour\nWhat to consider when choosing colors for data visualization A short, visual guide on things to keep in mind when using colour, such as when and how to use colour gradients, the colour grey, etc.\nColorBrewer: Color Advice for Maps Tool to generate colour palettes for visualizations with colorblind-friendly options. 
You can also use these palettes in R using the RColorBrewer package, and the scale_*_brewer() (for discrete palettes) or scale_*_distiller() (for continuous palettes) functions in ggplot2.\nColor.review Tool to pick or verify colour palettes with high relative contrast between colours, to ensure your information is readable for everyone.\nCoblis — Color Blindness Simulator Tool to upload an image and view it as they would appear to a colorblind person, with the option to simulate several color-vision deficiencies.\n500+ Named Colours with rgb and hex values List of named colours along with their hex values.\nCartoDB/CartoColor CARTOColors are a set of custom color palettes built on top of well-known standards for color use on maps, with next generation enhancements for the web and CARTO basemaps. Choose from a selection of sequential, diverging, or qualitative schemes for your next CARTO powered visualization using their online module.\n\n\n\nTools\n\nR\nThe R Graph Gallery A collection of charts made with the R programming language. Hundreds of charts are displayed in several sections, always with their reproducible code available. The gallery makes a focus on the tidyverse and ggplot2.\n\nBase R\nCheatsheet: Margins in base R Edit your margins in base R to accommodate axis titles, legends, captions, etc.!\nCustomizing tick marks in base R Seems like a simple thing, but it can be so frustrating! This is a great post about customizing tick marks with base plot in R.\nAnimations in R (for time series) If you want to use animations but don’t want to use ggplot2, this demo might help you!\n\n\nggplot2\nCheatsheet: ggplot2 Cheatsheet for ggplot2 in R - anything you want to do is probably covered here!\nCoding Club tutorial: Data Viz Part 1 - Beautiful and informative data visualization Great tutorial demonstrating how to customize titles, subtitles, captions, labels, colour palettes, and themes in ggplot2.\nCoding Club tutorial: Data Viz Part 2 - Customizing your figures Great tutorial demonstrating how to customize titles, subtitles, captions, labels, colour palettes, and themes in ggplot2.\nggplot flipbook A flipbook-style demonstration that builds and customizes plots line by line using ggplot in R.\ngganimate: A Grammar of Animated Graphics Package to create animated graphics in R (with ggplot2).\n\n\n\nPython\nThe Python Graph Gallery This website displays hundreds of charts, always providing the reproducible python code.\nPython Tutorial: Intro to Matplotlib Introduction to basic functionalities of the Python’s library Matplotlib covering basic plots, plot attributes, subplots and plotting the iris dataset.\nThe Art of Effective Visualization of Multi-dimensional Data Covers both univariate (one-dimension) and multivariate (multi-dimensional) data visualization strategies using the Python machine learning ecosystem.\n\n\nJulia\nJulia Plots Gallery Display of various plots with reproducible code in Julia.\nPlots in Julia Documentation for the Plots package in Julia, including demonstrations for animated plots, and links to tutorials.\nAnimations in Julia How to start making animated plots in Julia.\n\n\n\n\nCustomization\nChart Studio Web editor to create interactive plots with plotly. You can download the image as .html, or static images, without coding the figure yourself.\nPhyloPic Vector images of living organisms. 
This is great for ecologists who want to add silhouettes of their organisms onto their plots - search anything, and you will likely find it!\nAdd icons on your R plot Add special icons to your plot as a great way to customize it, and save space with labels!\n\n\n\nInspiration (pretty things!)\nInformation is Beautiful Collection of beautiful original visualizations about a variety of topics!\nTidyTuesday A weekly data project aimed at the R ecosystem, where people wrangle and visualize data in loads of creative ways. Browse what people have created (#TidyTuesday on Twitter is great too!), and the visualizations that have inspired each week’s theme.\nWind currents on Earth Dynamic and interactive map of wind currents on Earth.\nA Day in the Life of Americans Dynamic visualisation of how Americans spend their time in an average day.\n2019: The Year in Visual Stories and Graphics Collection of the most popular visualizations by the New York Times in 2019."
+ },
+ {
+ "objectID": "posts/2020-01-14-mathematical-modeling-in-ecology-and-evolution/index.html",
+ "href": "posts/2020-01-14-mathematical-modeling-in-ecology-and-evolution/index.html",
+ "title": "Mathematical Modeling in Ecology and Evolution",
+ "section": "",
+ "text": "In this workshop, I introduce various modelling techniques, using mostly ecological and evolutionary examples, with a focus on how computer software programs can help biologists analyze such models.\n\nContent\nPart 1: Classic one-variable models in ecology and evolution\nPart 2: Equilibria and their stability\nPart 3: Beyond equilibria\nPart 4: Example of building a model from scratch\nPart 5: Extending to models with more than one variable\nPart 6: Another example of building a model from scratch\n\n\nSoftware\nIn my research, I primarily use Mathematica, which is a powerful software package to organize and conduct analytical modelling, but it is not free (at UBC, we have some licenses available). I will also show some example code and provide translation of most of what I present in a free software package called Maxima.\n\nMathematica installation\nThere is a free trial version that you can use for 15 days, if you don’t have a copy (click here to access), or you can buy a student version online. If you want to make sure that all is working, copy the code below, put your cursor over each of the following lines and press enter (on some computers, “enter” is a separate button, on others, press “shift” and “return” at the same time):\nD[x^3,x]\nListPlot[Table[x, {x,1,10}],Joined->True]\nRSolve[{x[t+1]\\[Equal]A x[t],x[0]\\[Equal]x0},x[t],t]\nPDF[NormalDistribution[0,1],x]\nYou should see (a) \\(3x^2\\), (b) a plot of a line, (c) \\({{x[t]->A^t x0}}\\), and (d) \\(\\frac{e^\\frac{-x^2}{2}}{\\sqrt{2\\pi }}\\).\n\n\nMaxima installation:\nOn a Mac, install using the instructions here. For other file systems, download here.\n\n\nMaxima testing\nWhen you first open Maxima, it will give you a choice of GUIs, chose wxMaxima. Once wxMaxima is launched type this command and hit return to see if it answers 4:\n2+2;\nIf it doesn’t, then scan the installation document for the error that you run into.\nIf it does return 4, then type in and enter these commands:\ndiff(x^3, x);\n\nwxplot2d (3*x, [x, 0, 2*%pi]);\n\nload(\"solve_rec\")$\nsolve_rec(x[t+1] = A*x[t], x[t], x[0]=x0);\n\nload(\"distrib\")$\npdf_normal(x,0,1);\nYou should see (a) \\(3x^2\\), (b) a plot of a line, (c) \\({{x[t]->A^t x0}}\\), and (d) \\(\\frac{e^\\frac{-x^2}{2}}{\\sqrt{2\\pi }}\\).\n\n\n\nMaterial\n\n\n\nMathematica\nMaxima\nPDF\n\n\n\n\nNotebook\nNotebook\nEmbeded below\n\n\nHints and solutions\nHints and solutions\n\n\n\n\n\nHomework\n\n\nHomework answers\n\nHomework answers\n\n\nGuide\nGuide\n\n\n\n\n\nFollow along PDF\nThis PDF was generated from the Mathematica notebook linked above. It doesn’t include dynamic plots, but it’s a good alternative if you want to print out or have a quick reference at hand.\n\n\n \n\n\n\nStability analysis of a recursion equation in a discrete-time model.\n\n\n\n\n\nOther resources\n\nAn Introduction to Mathematical Modeling in Ecology and Evolution (Otto and Day 2007).\nBiomathematical modeling lecture notes.\nMathematica labs UBC.\n\n\n\nThanks\nNiki Love and Gil Henriques did a great job of translating the code into wxMaxima, with limited help from me. Thanks, Niki and Gil!!\n\n\n\n\n\nReferences\n\nOtto, Sarah P, and Troy Day. 2007. A Biologist’s Guide to Mathematical Modeling in Ecology and Evolution. Vol. 13. Princeton University Press."
+ },
+ {
+ "objectID": "posts/2020-06-15-science-communication/index.html",
+ "href": "posts/2020-06-15-science-communication/index.html",
+ "title": "Science Communication",
+ "section": "",
+ "text": "The objective of this training is to share and discuss the concepts and tools that contribute to effective science communication. The training is split into two sessions, which cover the basic concepts of effective science communication and how social media tools can be used to boost the signal of your research and extend your research network. Each training takes the form of a presentation interspersed with several short activity modules, where participants are invited to use the tools we will be discussing to kickstart their own science communication.\nThis training was given on June 1 and 2, 2020. You can view recordings of each session here:"
+ },
+ {
+ "objectID": "posts/2020-06-15-science-communication/index.html#session-1-the-basics-of-science-communication",
+ "href": "posts/2020-06-15-science-communication/index.html#session-1-the-basics-of-science-communication",
+ "title": "Science Communication",
+ "section": "Session 1: The basics of science communication",
+ "text": "Session 1: The basics of science communication\n\nObjectives:\n\nDiscuss what science communication (or SciComm) can be, and its potential role in boosting the signal of your research\nMake an overview of basic concepts and tools that you can use in any medium (blog posts, presentations, conversations, twitter, etc.) to do effective science communication\n\nDuring this session, we:\n\nDiscuss the potential pitfalls of science communication (notably, diversity and inclusivity problems).\nCover the basic concepts of science communication, including the Golden Circle method, the creation of personas, and storytelling techniques.\nHave short activities where participants can try to use some of the techniques we will be covering, such as filling in their own Golden Circle and explaining a blog post as a storyboard."
+ },
+ {
+ "objectID": "posts/2020-06-15-science-communication/index.html#session-2-social-media-as-a-science-communication-tool",
+ "href": "posts/2020-06-15-science-communication/index.html#session-2-social-media-as-a-science-communication-tool",
+ "title": "Science Communication",
+ "section": "Session 2: Social media as a science communication tool",
+ "text": "Session 2: Social media as a science communication tool\n\nObjectives:\n\nRethink the way we write about science by exploring the world of blog posts\nClarify the mechanics of Twitter and how it can be used effectively for science communication\n\nDuring this session, we:\n\nDiscuss how to create a story structure using titles and the flow of ideas in blog posts, especially when we are used to writing scientific articles\nCover the basics of how Twitter works (retweets, threads, replies, hashtags, photo captions, etc.) and how to find helpful connections\nHave short activities where participants will be invited to write their own Twitter biographies and to create a Twitter thread explaining a project of their choice."
+ },
+ {
+ "objectID": "posts/2021-06-22-introduction-to-shiny-apps/index.html",
+ "href": "posts/2021-06-22-introduction-to-shiny-apps/index.html",
+ "title": "Introduction to Shiny Apps",
+ "section": "",
+ "text": "There are many reasons to consider using Shiny for a project:\n\nSharing results from a paper with your readers;\nHelping you explore a model, mathematics, simulations;\nLetting non R users use R."
+ },
+ {
+ "objectID": "posts/2021-06-22-introduction-to-shiny-apps/index.html#hello-shiny",
+ "href": "posts/2021-06-22-introduction-to-shiny-apps/index.html#hello-shiny",
+ "title": "Introduction to Shiny Apps",
+ "section": "Hello Shiny!",
+ "text": "Hello Shiny!\nHere is an example of a Shiny app that RStudio generates when you open a new Shiny Web App file:\n\n# Define UI for app that draws a histogram ----\nui <- fluidPage(\n\n # App title ----\n titlePanel(\"Hello Shiny!\"),\n\n # Sidebar layout with input and output definitions ----\n sidebarLayout(\n\n # Sidebar panel for inputs ----\n sidebarPanel(\n\n # Input: Slider for the number of bins ----\n sliderInput(inputId = \"bins\",\n label = \"Number of bins:\",\n min = 1,\n max = 50,\n value = 30)\n\n ),\n\n # Main panel for displaying outputs ----\n mainPanel(\n\n # Output: Histogram ----\n plotOutput(outputId = \"distPlot\")\n\n )\n )\n)"
+ },
+ {
+ "objectID": "posts/2021-06-22-introduction-to-shiny-apps/index.html#building-blocks",
+ "href": "posts/2021-06-22-introduction-to-shiny-apps/index.html#building-blocks",
+ "title": "Introduction to Shiny Apps",
+ "section": "Building blocks",
+ "text": "Building blocks\nWe’ve now seen the basic building blocks of a Shiny app:\n\nThe user interface, which determines how the app “looks”. This is how we tell Shiny where to ask for user inputs, and where to put any outputs we create.\nReactive values, which are values that change according to user inputs. These are values that affect the outputs we create in the Shiny app, such as tables or plots.\nThe server, where we use reactive values to generate some outputs.\n\n\nIDs\nThe user interface and server communicate through IDs that we assign to inputs from the user and outputs from the server.\n\nWe use an ID (in orange) to link the user input in the UI to the reactive values used in the server:\n\nWe use another ID (in blue) to link the output created in the server to the output shown in the user interface:\n\n\n\nOrganisation\nThese elements can all be placed in one script named app.R or separately in scripts named ui.R and server.R. The choice is up to you, although it becomes easier to work in separate ui.R and server.R scripts when the Shiny app becomes more complex.\nExample 1: Everything in app.R\n Example 2: Split things into ui.R and server.R"
+ },
+ {
+ "objectID": "posts/2021-06-22-introduction-to-shiny-apps/index.html#plots",
+ "href": "posts/2021-06-22-introduction-to-shiny-apps/index.html#plots",
+ "title": "Introduction to Shiny Apps",
+ "section": "Plots",
+ "text": "Plots\nShiny is an excellent tool for visual exploration - it is at its most useful when a user can see something change before their eyes according to some selections. This is a great way to allow users to explore a dataset, explore the results of some analyses according to different parameters, and so on!\nLet’s now add a plot to our Shiny app, to visualize the distribution of a variable depending on user input. We’ll be adding the ggplot2 and ggridges packages in the set-up step at the top of our app.R to allow us to make a plot.\n\n# load packages\nlibrary(shiny)\nlibrary(ggridges)\nlibrary(ggplot2)\nlibrary(here)\nlibrary(readr)\n\n\nUser interface\nTo add a plot in our Shiny, we need to indicate where the plot should appear in the app. We can do this with plotOutput(), a similar function to tableOutput() in the previous section that is meant for plot outputs, as the name suggests.\n\n# Define UI for application that makes a table andplots the Volcano Explosivity \n# Index for the most eruptive volcanoes within a selected range of years\n\nui <- fluidPage(\n \n # Application title ----\n \n titlePanel(\"Exploring volcano explosivity\"),\n \n # Input interface ----\n \n sidebarLayout(\n sidebarPanel(\n \n # Sidebar with a slider range input\n sliderInput(\"years\", # the id your server needs to use the selected value\n label = h3(\"Years\"),\n min = 1900, max = 2020, # maximum range that can be selected\n value = c(2010, 2020) # this is the default slider position\n )\n )\n ),\n \n # Show the outputs from the server ---------------\n mainPanel(\n \n # Show a ridgeplot of explosivity index for selected volcanoes\n plotOutput(\"ridgePlot\"),\n \n # then, show the table we made in the previous step\n tableOutput(\"erupt_table\")\n \n )\n)\n\nNow our Shiny app knows where we want to place our plot.\n\n\nServer\nWe now need to create the plot we want to show in our app. This plot will change depending on one or several reactive values that the user can input or select in our UI.\nWe link the UI and server together with IDs that are assigned to each object. Above, we told the UI to expect a plot output with the ID \"ridgePlot\". 
In the server, we will create a plot and render it as a plot object using renderPlot(), and we will assign this plot output to the ID we call in the UI (as output$ridgePlot).\n\n# Define server logic required to make your output(s)\nserver <- function(input, output) {\n\n \n # prepare the data\n # ----------------------------------------------------------\n \n # read the dataset\n eruptions <- readr::read_rds(here::here(\"data\", \"eruptions.rds\"))\n \n # filter the dataset to avoid overloading the plot \n eruptions <- eruptions[which(eruptions$volcano_name %in% names(which(table(eruptions$volcano_name) > 30))),]\n # this subsets to volcanoes that have erupted more than 30 times\n \n \n # make reactive dataset\n # ----------------------------------------------------------\n \n # subset volcano data with input year range\n eruptions_filtered <- reactive({\n subset(eruptions, start_year >= input$years[1] & end_year <= input$years[2])\n })\n \n \n # create and render the outputs\n # ----------------------------------------------------------\n \n # create the table of volcanoes\n output$erupt_table <- renderTable({\n head(eruptions_filtered())\n })\n \n # render the plot output\n output$ridgePlot <- renderPlot({\n \n # create the plot\n ggplot(data = eruptions_filtered(),\n aes(x = vei,\n y = volcano_name,\n fill = volcano_name)) +\n # we are using a ridgeplot geom here, from the ggridges package\n geom_density_ridges( size = .5) + # line width\n \n # label the axes\n labs(x = \"Volcano Explosivity Index\", y = \"\") +\n \n # adjust the ggplot theme to make the plot \"prettier\"\n theme_classic() + \n theme(legend.position = \"none\",\n axis.text = element_text(size = 12, face = \"bold\"),\n axis.title = element_text(size = 14, face = \"bold\"))\n })\n}\n\n\n\nThe Shiny app\nNow, if we run the Shiny app, we have a plot above the table we made previously. They are positioned in this way because the plotOutput() comes before the tableOutput() in the UI.\n\n# Run the application\nshinyApp(ui = ui, server = server)"
+ },
+ {
+ "objectID": "posts/2021-06-22-introduction-to-shiny-apps/index.html#customising-the-theme",
+ "href": "posts/2021-06-22-introduction-to-shiny-apps/index.html#customising-the-theme",
+ "title": "Introduction to Shiny Apps",
+ "section": "Customising the theme",
+ "text": "Customising the theme\nIf you’d like to go one step further, you can also customize the appearance of your Shiny app using built-in themes, or creating your own themes.\n\nUsing built-in themes\nThere are several built-in themes in Shiny, which allow you to quickly change the appearance of your app. You can browse a gallery of available themes here here, or test themes out interactively here.\nLet’s try the darkly theme on our Shiny app. To do this, we will need the shinythemes package.\n\nlibrary(shinythemes)\n\nWe can change the theme of our previous app with one line of code:\n\n# Define UI for application that makes a table andplots the Volcano Explosivity \n# Index for the most eruptive volcanoes within a selected range of years\n\nui <- fluidPage(\n \n # Application title ----\n \n titlePanel(\"Exploring volcano explosivity\"),\n \n # Input interface ----\n \n sidebarLayout(\n sidebarPanel(\n \n # Sidebar with a slider range input\n sliderInput(\"years\", # the id your server needs to use the selected value\n label = h3(\"Years\"),\n min = 1900, max = 2020, # maximum range that can be selected\n value = c(2010, 2020) # this is the default slider position\n )\n )\n ),\n \n # Show the outputs from the server ---------------\n mainPanel(\n \n # Show a ridgeplot of explosivity index for selected volcanoes\n plotOutput(\"ridgePlot\"),\n \n # then, show the table we made in the previous step\n tableOutput(\"erupt_table\")\n \n ),\n \n # Customize the theme ----------------------\n \n # Use the darkly theme\n theme = shinythemes::shinytheme(\"darkly\")\n)\n\nNow, if we run the app, it looks a little different:\n\n\n\nUsing a custom theme\nYou can also go beyond the built-in themes, and create your own custom theme with the fonts and colours of your choice. 
You can also apply this theme to the outputs rendered in the app, to bring all the visuals together for a more cohesive look.\n\nCustomizing a theme\nTo create a custom theme, we will be using the bs_theme() function from the bslib package.\n\nlibrary(bslib)\n\n\n# Create a custom theme \ncute_theme <- bslib::bs_theme(\n \n bg = \"#36393B\", # background colour\n fg = \"#FFD166\", # most of the text on your app\n primary = \"#F26430\", # buttons, ...\n \n # you can also choose fonts\n base_font = font_google(\"Open Sans\"),\n heading_font = font_google(\"Open Sans\")\n)\n\nTo apply this theme to our Shiny app (and the outputs), we will be using the thematic package.\n\nlibrary(thematic)\n\nThere are two essential steps to apply a custom theme to a Shiny app:\n\nActivating thematic.\nSetting the user interface’s theme to the custom theme (cute_theme).\n\n\n# Activate thematic\n# so your R outputs will be changed to match up with your chosen styling\nthematic::thematic_shiny()\n\n# Define UI for application that makes a table andplots the Volcano Explosivity \n# Index for the most eruptive volcanoes within a selected range of years\n\nui <- fluidPage(\n \n # Application title ----\n \n titlePanel(\"Exploring volcano explosivity\"),\n \n # Input interface ----\n \n sidebarLayout(\n sidebarPanel(\n \n # Sidebar with a slider range input\n sliderInput(\"years\", # the id your server needs to use the selected value\n label = h3(\"Years\"),\n min = 1900, max = 2020, # maximum range that can be selected\n value = c(2010, 2020) # this is the default slider position\n )\n )\n ),\n \n # Show the outputs from the server ---------------\n mainPanel(\n \n # Show a ridgeplot of explosivity index for selected volcanoes\n plotOutput(\"ridgePlot\"),\n \n # then, show the table we made in the previous step\n tableOutput(\"erupt_table\")\n \n ),\n \n # Customize the theme ----------------------\n \n # Use our custom theme\n theme = cute_theme\n)\n\nNow, if we run the app, the user interface and plot theme is set to the colours and fonts we set in cute_theme:\n\nHere, thematic is not changing the colours used to represent a variable in our plot, because this is an informative colour scale (unlike the colour of axis labels, lines, and the plot background). However, if we remove this colour variable in our ridgeplot in the server, thematic will change the plot colours as well. Here is a simplified example of our server to see what these changes would look like:\n\n# Define server logic required to make your output(s)\nserver <- function(input, output) {\n \n #... (all the good stuff we wrote above)\n \n # render the plot output\n output$ridgePlot <- renderPlot({\n \n # create the plot\n ggplot(data = eruptions_filtered(),\n aes(x = vei,\n y = volcano_name)) + # we are no longer setting \n # the fill argument to a variable\n \n # we are using a ridgeplot geom here, from the ggridges package\n geom_density_ridges(size = .5) + \n \n # label the axes\n labs(x = \"Volcano Explosivity Index\", y = \"\") +\n \n # remove the \"classic\" ggplot2 so it doesn't override thematic's changes\n # theme_classic() + \n theme(legend.position = \"none\",\n axis.text = element_text(size = 12, face = \"bold\"),\n axis.title = element_text(size = 14, face = \"bold\"))\n })\n }\n\nNow, our plot’s theme follows the app’s custom theme as well:"
+ },
+ {
+ "objectID": "posts/2021-06-22-introduction-to-shiny-apps/index.html#taking-advantage-of-good-defaults",
+ "href": "posts/2021-06-22-introduction-to-shiny-apps/index.html#taking-advantage-of-good-defaults",
+ "title": "Introduction to Shiny Apps",
+ "section": "Taking advantage of good defaults",
+ "text": "Taking advantage of good defaults\nHere, we will use shiny extension shinyDashboards and leaflet to construct a custom Shiny App to map volcanoes of the world. First, we need a few additional packages.\nNote: All Source code for this app can be found here on the BIOS2 Github.\n\n# load packages\nlibrary(shiny)\nlibrary(shinydashboard) # dashboard layout package\nlibrary(shinyWidgets) # fancy widgets package\nlibrary(leaflet) # interactive maps package\nlibrary(dplyr)\nlibrary(ggplot2)\n\n\nUsing ShinyDashboard\nWe will create our app using defaults from the ShinyDashboard package, which always includes three main components: a header, using dashboardHeader(), a sidebar, using dashboardSidebar(), and a body, using dashboardBody(). These are then added together using the dashboardPage() function.\nBuilding these elements is less like usual R coding, and more like web design, since we are, in fact, designing a unser interface for a web app. Here, we’ll make a basic layout before populating it.\n\n# create the header of our app\nheader <- dashboardHeader(\n title = \"Exploring Volcanoes of the World\",\n titleWidth = 350 # since we have a long title, we need to extend width element in pixels\n)\n\n\n# create dashboard body - this is the major UI element\nbody <- dashboardBody(\n\n # make first row of elements (actually, this will be the only row)\n fluidRow(\n \n # make first column, 25% of page - width = 3 of 12 columns\n column(width = 3,\n \n \n # Box 1: text explaining what this app is\n #-----------------------------------------------\n box( width = NULL,\n status=\"primary\", # this line can change the automatic color of the box.\n title = NULL,\n p(\"here, we'll include some info about this app\")\n\n \n ), # end box 1\n \n \n # box 2 : input for selecting volcano type\n #-----------------------------------------------\n box(width = NULL, status = \"primary\",\n title = \"Selection Criteria\", solidHeader = T, \n \n p(\"here, we'll add a UI element for selecting volcano types\"),\n\n ), # end box 2\n \n \n \n # box 3: ggplot of selected volcanoes by continent\n #------------------------------------------------\n box(width = NULL, status = \"primary\",\n solidHeader = TRUE, collapsible = T,\n title = \"Volcanoes by Continent\",\n p(\"here, we'll add a bar plot of volcanoes in each continent\")\n ) # end box 3\n \n ), # end column 1\n \n # second column - 75% of page (9 of 12 columns)\n #--------------------------------------------------\n column(width = 9,\n # Box 4: leaflet map\n box(width = NULL, background = \"light-blue\", height = 850,\n p(\"here, we'll show volcanoes on a map\"),\n ) # end box with map\n ) # end second column\n \n ) # end fluidrow\n) # end body\n\n\n# add elements together\ndashboardPage(\n skin = \"blue\",\n header = header,\n sidebar = dashboardSidebar(disable = TRUE), # here, we only have one tab of our app, so we don't need a sidebar\n body = body\n)"
+ },
+ {
+ "objectID": "posts/2021-06-22-introduction-to-shiny-apps/index.html#populating-the-layout",
+ "href": "posts/2021-06-22-introduction-to-shiny-apps/index.html#populating-the-layout",
+ "title": "Introduction to Shiny Apps",
+ "section": "Populating the Layout",
+ "text": "Populating the Layout\nNow, we are going to fill out app with elements. In this app, we will only have one user input: a selection of the volcano type to show. We will use this input (input$volcano_type), which will be used to filter data in the server (i.e. make a smaller dataset using only volcanoes of the selected types), then use this filtered dataset to create output elements (plots and maps).\nBelow, we show the necessary code to include in both the UI and the Server to create each plot element. Notice that after the reactive value selected_volcanoes is created in the selection box, this is the only object that is used to create the other elements in the app.\n\n\n\n\n\n\n\n\n\nLocation\nElement\nUI\nServer\n\n\n\n\nBox 1\nIntro Textbox\nMarkdown/HTML text code\n\n\n\nBox 2\nSelection Wigets\ncheckboxGroupButtons( inputID = \"volcano_type\")\nselected_volcanoes <- reactive({ volcano_df %>% filter(type %in% input$volcano_type)}) to create a filtered dataset that will react to user input\n\n\nBox 3\nBar Graph\nplotOutput(\"continentplot\")\noutput$continentplot <- renderPlot(...)) which will plot from the selectied_volcanoes reactive object\n\n\nBox 4\nLeaflet Map\nleafletOutput(\"volcanomap\")\noutput$volcanomap <- renderLeaflet(...) to map points from the selectied_volcanoes reactive object"
+ },
+ {
+ "objectID": "posts/2021-06-22-introduction-to-shiny-apps/index.html#challenge",
+ "href": "posts/2021-06-22-introduction-to-shiny-apps/index.html#challenge",
+ "title": "Introduction to Shiny Apps",
+ "section": "Challenge!",
+ "text": "Challenge!\nUse the code provided to add your own additional user input to the Shiny App. The code (which you can access here leaves a space for an additional UI input inside box 2). Then, you’ll need to use your new input element to the reactive value in the Server, as noted in the server code.\nUse the Default Shiny Widgets or shinyWidgets extended package galleries to explore the types of elements you can add.\n\n\nSee the completed app\nSee our completed app HERE"
+ },
+ {
+ "objectID": "posts/2021-05-04-building-r-packages/index.html",
+ "href": "posts/2021-05-04-building-r-packages/index.html",
+ "title": "Building R packages",
+ "section": "",
+ "text": "via GIPHY\nR packages! they are kind of like cookies:\nBut most of all: cookies are delicious for what they contain: chocolate chunks, candy, oats, cocoa. However, all cookies share some fundamental ingredients and nearly identical structure. Flour, saturated with fat and sugar hydrated only with an egg, flavoured with vanilla and salt. The basic formula is invariant and admits only slight deviation – otherwise, it becomes something other than a cookie.\nThis workshop is devoted to the study of cookie dough."
+ },
+ {
+ "objectID": "posts/2021-05-04-building-r-packages/index.html#the-structure-flour-and-sugar",
+ "href": "posts/2021-05-04-building-r-packages/index.html#the-structure-flour-and-sugar",
+ "title": "Building R packages",
+ "section": "The structure: flour and sugar",
+ "text": "The structure: flour and sugar\n\nNo cookies without carbs\n\nAn R package is essentially a folder on your computer with specific structure. We will begin by creating an empty R package and taking a tour!\nOpen your R code editor, and find out where you are:\ngetwd()\nThis is to prepare for the next step, where we will choose a location for our R package folder. Please be intentional about where you place your R package! Do not place it in the same space as another package, Rstudio project or other project. Create a new and isolated location for it.\nI am working from an existing R project in my typical R Projects folder, so I go up one level:\nusethis::create_package(\"../netwerk\")\n\nwe are sticking with usethis because we want to keep this general. All of these steps can be manual, and indeed for many years they were!\n\n\nLet’s run R CMD CHECK right away. We will do this MANY TIMES.\ndevtools::check()\nWe should see some warnings! let’s keep these in mind as we continue our tour.\n\nThe DESCRIPTION file\nThe most important file to notice is the DESCRIPTION. This gives general information about the entire package. It is written in a specific file format\nPackage: netwerk\nTitle: Werks with Networks\nVersion: 0.0.0.9000\nAuthors@R: \n person(given = \"Andrew\",\n family = \"MacDonald\",\n role = c(\"aut\", \"cre\"),\n email = \"\")\nDescription: it does networks.\nLicense: MIT + file LICENSE\nEncoding: UTF-8\nLazyData: true\nRoxygen: list(markdown = TRUE)\nRoxygenNote: 7.1.1\nSuggests: \n testthat (>= 3.0.0)\nConfig/testthat/edition: 3\nHere are some things to edit manually in DESCRIPTION:\n\npackage name [tk naming of R packages] – make it short and convenient if you can!\nTitle: write this part In Title Case. Don’t end the title with a period.\nDescription: Describe the package in a short block of text. This should end with a period.\nAuthors: Add your name here and the name of anyone building the package with you. usethis will have done the first step for you, and filled in the structure. Only “aut” (author) and “cre” (creator) are essential. but many others are possible\n\nAdd your name here.\nAdd a license\nusethis::use_mit_license(copyright_holder = \"\")\nnote about the different roles taht R package authors can have. Funny ones. but creator and maintainer are the key ones.\nNote the R folder. We’ll get much more into that later\n\nRbuildignore"
+ },
+ {
+ "objectID": "posts/2021-05-04-building-r-packages/index.html#keeping-notes",
+ "href": "posts/2021-05-04-building-r-packages/index.html#keeping-notes",
+ "title": "Building R packages",
+ "section": "Keeping notes",
+ "text": "Keeping notes\ncreate an R file\nusethis::use_build_ignore(\"dev.R\")\nthe docs folder\nhere we have a very minimal version of an R packages we’re going to be adding to it as the course progresses.\nOne thing we can do right away is build and check the R package\nWhat exactly is happining here? slide from R package tutorial.\nLots of checkpoints and progress confrimations along the way.\nOK so what is that all about? we have compiled the R package and it has gone to where the R packages on our computer go.\nThere is a natural cycle to how the different steps in an R package workflow proceed – see the documentation for this lesson – we will be following this process (TK another pictures?\nOk so now that we ahve the basic structure, let’s talk about some content for the R package. I received the donation of a little R function already that we can use to create this workflow in a nice way\nThis R function (explain what the function does)\nOK so let’s focus on just one part of this function.\nload all – shortcut\n\nhow do we do this in VScode?\n\n\nhow to add something to the .Rbuildignore? it would be nice to have a little .dev script as a space to create all the ohter dependencies that are involved in making an R package.\n\n\n\n✔ Setting active project to '/Users/katherine/Documents/GitHub/bios2.github.io-quarto'\n✔ Adding '^development\\\\.R$' to 'posts/2021-05-04-building-r-packages/.Rbuildignore'"
+ },
+ {
+ "objectID": "posts/2021-05-04-building-r-packages/index.html#useful-links",
+ "href": "posts/2021-05-04-building-r-packages/index.html#useful-links",
+ "title": "Building R packages",
+ "section": "Useful links",
+ "text": "Useful links\nThis workshop borrows heavily from some excellent sources:\n\nthe R packages book especially the “Whole Game” chapter!\nrOpenSci Packages: Development, Maintenance, and Peer Review\n\nhttps://builder.r-hub.io/about.html"
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "",
+ "text": "Version en français à la suite."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#course-outline",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#course-outline",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Course outline",
+ "text": "Course outline\n\n\n\nDay\nTopics (EN)\n\n\n\n\n1\n• Introduction to spatial statistics • Point pattern analysis\n\n\n2\n• Spatial correlation • Geostatistical models\n\n\n3\n• Areal data • Moran’s I • Spatial autoregression models • Analysis of areal data in R\n\n\n4\n• GLMM with spatial Gaussian process • GLMM with spatial autoregression"
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#types-of-spatial-analyses",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#types-of-spatial-analyses",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Types of spatial analyses",
+ "text": "Types of spatial analyses\nIn this training, we will discuss three types of spatial analyses: point pattern analysis, geostatistical models and models for areal data.\nIn point pattern analysis, we have point data representing the position of individuals or events in a study area and we assume that all individuals or events have been identified in that area. That analysis focuses on the distribution of the positions of the points themselves. Here are some typical questions for the analysis of point patterns:\n\nAre the points randomly arranged or clustered?\nAre two types of points arranged independently?\n\nGeostatistical models represent the spatial distribution of continuous variables that are measured at certain sampling points. They assume that measurements of those variables at different points are correlated as a function of the distance between the points. Applications of geostatistical models include the smoothing of spatial data (e.g., producing a map of a variable over an entire region based on point measurements) and the prediction of those variables for non-sampled points.\nAreal data are measurements taken not at points, but for regions of space represented by polygons (e.g. administrative divisions, grid cells). Models representing these types of data define a network linking each region to its neighbours and include correlations in the variable of interest between neighbouring regions."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#stationarity-and-isotropy",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#stationarity-and-isotropy",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Stationarity and isotropy",
+ "text": "Stationarity and isotropy\nSeveral spatial analyses assume that the variables are stationary in space. As with stationarity in the time domain, this property means that summary statistics (mean, variance and correlations between measures of a variable) do not vary with translation in space. For example, the spatial correlation between two points may depend on the distance between them, but not on their absolute position.\nIn particular, there cannot be a large-scale trend (often called gradient in a spatial context), or this trend must be taken into account before modelling the spatial correlation of residuals.\nIn the case of point pattern analysis, stationarity (also called homogeneity) means that point density does not follow a large-scale trend.\nIn a isotropic statistical model, the spatial correlations between measurements at two points depend only on the distance between the points, not on the direction. In this case, the summary statistics do not change under a spatial rotation of the data."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#georeferenced-data",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#georeferenced-data",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Georeferenced data",
+ "text": "Georeferenced data\nEnvironmental studies increasingly use data from geospatial data sources, i.e. variables measured over a large part of the globe (e.g. climate, remote sensing). The processing of these data requires concepts related to Geographic Information Systems (GIS), which are not covered in this workshop, where we focus on the statistical aspects of spatially varying data.\nThe use of geospatial data does not necessarily mean that spatial statistics are required. For example, we will often extract values of geographic variables at study points to explain a biological response observed in the field. In this case, the use of spatial statistics is only necessary when there is a spatial correlation in the residuals, after controlling for the effect of the predictors."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#point-pattern-and-point-process",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#point-pattern-and-point-process",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Point pattern and point process",
+ "text": "Point pattern and point process\nA point pattern describes the spatial position (most often in 2D) of individuals or events, represented by points, in a given study area, often called the observation “window”.\nIt is assumed that each point has a negligible spatial extent relative to the distances between the points. More complex methods exist to deal with spatial patterns of objects that have a non-negligible width, but this topic is beyond the scope of this workshop.\nA point process is a statistical model that can be used to simulate point patterns or explain an observed point pattern."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#complete-spatial-randomness",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#complete-spatial-randomness",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Complete spatial randomness",
+ "text": "Complete spatial randomness\nComplete spatial randomness (CSR) is one of the simplest point patterns, which serves as a null model for evaluating the characteristics of real point patterns. In this pattern, the presence of a point at a given position is independent of the presence of points in a neighbourhood.\nThe process creating this pattern is a homogeneous Poisson process. According to this model, the number of points in any area \\(A\\) follows a Poisson distribution: \\(N(A) \\sim \\text{Pois}(\\lambda A)\\), where \\(\\lambda\\) is the intensity of the process (i.e. the density of points per unit area). \\(N\\) is independent between two disjoint regions, no matter how those regions are defined.\nIn the graph below, only the pattern on the right is completely random. The pattern on the left shows point aggregation (higher probability of observing a point close to another point), while the pattern in the center shows repulsion (low probability of observing a point very close to another)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exploratory-or-inferential-analysis-for-a-point-pattern",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exploratory-or-inferential-analysis-for-a-point-pattern",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Exploratory or inferential analysis for a point pattern",
+ "text": "Exploratory or inferential analysis for a point pattern\nSeveral summary statistics are used to describe the characteristics of a point pattern. The simplest is the intensity \\(\\lambda\\), which as mentioned above represents the density of points per unit area. If the point pattern is heterogeneous, the intensity is not constant, but depends on the position: \\(\\lambda(x, y)\\).\nCompared to intensity, which is a first-order statistic, second-order statistics describe how the probability of the presence of a point in a region depends on the presence of other points. The Ripley’s \\(K\\) function presented in the next section is an example of a second-order summary statistic.\nStatistical inferences on point patterns usually consist of testing the hypothesis that the point pattern corresponds to a given null model, such as CSR or a more complex null model. Even for the simplest null models, we rarely know the theoretical distribution for a summary statistic of the point pattern under the null model. Hypothesis tests on point patterns are therefore performed by simulation: a large number of point patterns are simulated from the null model and the distribution of the summary statistics of interest for these simulations is compared to their values for the observed point pattern."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#ripleys-k-function",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#ripleys-k-function",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Ripley’s K function",
+ "text": "Ripley’s K function\nRipley’s K function \\(K(r)\\) is defined as the mean number of points within a circle of radius \\(r\\) around a point in the pattern, standardized by the intensity \\(\\lambda\\).\nUnder the CSR null model, the mean number of points in any circle of radius \\(r\\) is \\(\\lambda \\pi r^2\\), thus in theory \\(K(r) = \\pi r^2\\) for that model. A higher value of \\(K(r)\\) means that there is an aggregation of points at the scale \\(r\\), whereas a lower value means that there is repulsion.\nIn practice, \\(K(r)\\) is estimated for a specific point pattern by the equation:\n\\[ K(r) = \\frac{A}{n(n-1)} \\sum_i \\sum_{j > i} I \\left( d_{ij} \\le r \\right) w_{ij}\\]\nwhere \\(A\\) is the area of the observation window and \\(n\\) is the number of points in the pattern, so \\(n(n-1)\\) is the number of distinct pairs of points. We take the sum for all pairs of points of the indicator function \\(I\\), which takes a value of 1 if the distance between points \\(i\\) and \\(j\\) is less than or equal to \\(r\\). Finally, the term \\(w_{ij}\\) is used to give extra weight to certain pairs of points to account for edge effects, as discussed in the next section.\nFor example, the graphs below show the estimated \\(K(r)\\) function for the patterns shown above, for values of \\(r\\) up to 1/4 of the window width. The red dashed curve shows the theoretical value for CSR and the gray area is an “envelope” produced by 99 simulations of that null pattern. The aggregated pattern shows an excess of neighbours up to \\(r = 0.25\\) and the pattern with repulsion shows a significant deficit of neighbours for small values of \\(r\\).\n\n\n\n\n\nIn addition to \\(K\\), there are other statistics to describe the second-order properties of point patterns, such as the mean distance between a point and its nearest \\(N\\) neighbours. You can refer to the Wiegand and Moloney (2013) textbook in the references to learn more about different summary statistics for point patterns."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#edge-effects",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#edge-effects",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Edge effects",
+ "text": "Edge effects\nIn the context of point pattern analysis, edge effects are due to the fact that we have incomplete knowledge of the neighbourhood of points near the edge of the observation window, which can induce a bias in the calculation of statistics such as Ripley’s \\(K\\).\nDifferent methods have been developed to correct the bias due to edge effects. In Ripley’s edge correction method, the contribution of a neighbour \\(j\\) located at a distance \\(r\\) from a point \\(i\\) receives a weight \\(w_{ij} = 1/\\phi_i(r)\\), where \\(\\phi_i(r)\\) is the fraction of the circle of radius \\(r\\) around \\(i\\) contained in the observation window. For example, if 2/3 of the circle is in the window, this neighbour counts as 3/2 neighbours in the calculation of a statistic like \\(K\\).\n\nRipley’s method is one of the simplest to correct for edge effects, but is not necessarily the most efficient; in particular, larger weights given to certain pairs of points tend to increase the variance of the calculated statistic. Other correction methods are presented in specialized textbooks, such as Wiegand and Moloney (2013)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#example",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#example",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Example",
+ "text": "Example\nFor this example, we use the dataset semis_xy.csv, which represents the \\((x, y)\\) coordinates for seedlings of two species (sp, B = birch and P = poplar) in a 15 x 15 m plot.\n\nsemis <- read.csv(\"data/semis_xy.csv\")\nhead(semis)\n\n x y sp\n1 14.73 0.05 P\n2 14.72 1.71 P\n3 14.31 2.06 P\n4 14.16 2.64 P\n5 14.12 4.15 B\n6 9.88 4.08 B\n\n\nThe spatstat package provides tools for point pattern analysis in R. The first step consists in transforming our data frame into a ppp object (point pattern) with the function of the same name. In this function, we specify which columns contain the coordinates x and y as well as the marks, which here will be the species codes. We also need to specify an observation window (window) using the owin function, where we provide the plot limits in x and y.\n\nlibrary(spatstat)\n\nsemis <- ppp(x = semis$x, y = semis$y, marks = as.factor(semis$sp),\n window = owin(xrange = c(0, 15), yrange = c(0, 15)))\nsemis\n\nMarked planar point pattern: 281 points\nMultitype, with levels = B, P \nwindow: rectangle = [0, 15] x [0, 15] units\n\n\nMarks can be numeric or categorical. Note that for categorical marks as is the case here, the variable must be explicitly converted to a factor.\nThe plot function applied to a point pattern shows a diagram of the pattern.\n\nplot(semis)\n\n\n\n\nThe intensity function calculates the density of points of each species by unit area (here, by \\(m^2\\)).\n\nintensity(semis)\n\n B P \n0.6666667 0.5822222 \n\n\nTo first analyze the distribution of each species separately, we split the pattern with split. Since the pattern contains categorical marks, it is automatically split according to the values of those marks. The result is a list of two point patterns.\n\nsemis_split <- split(semis)\nplot(semis_split)\n\n\n\n\nThe Kest function calculates Ripley’s \\(K\\) for a series of distances up to (by default) 1/4 of the width of the window. Here we apply it to the first pattern (birch) by choosing semis_split[[1]]. Note that double square brackets are necessary to choose an item from a list in R.\nThe argument correction = \"iso\" tells the function to apply Ripley’s correction for edge effects.\n\nk <- Kest(semis_split[[1]], correction = \"iso\")\nplot(k)\n\n\n\n\nAccording to this graph, there seems to be an excess of neighbours for distances of 1 m and above. To check if this is a significant difference, we produce a simulation envelope with the envelope function. The first argument of envelope is a point pattern to which the simulations will be compared, the second one is a function to be computed (here, Kest) for each simulated pattern, then we add the arguments of the Kest function (here, only correction).\n\nplot(envelope(semis_split[[1]], Kest, correction = \"iso\"))\n\nGenerating 99 simulations of CSR ...\n1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,\n41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,\n81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.\n\nDone.\n\n\n\n\n\nAs indicated by the message, by default the function performs 99 simulations of the null model corresponding to complete spatial randomness (CSR).\nThe observed curve falls outside the envelope of the 99 simulations near \\(r = 2\\). 
We must be careful not to interpret too quickly a result that is outside the envelope. Although there is about a 1% probability of obtaining a more extreme result under the null hypothesis at a given distance, the envelope is calculated for a large number of values of \\(r\\) and is not corrected for multiple comparisons. Thus, a significant difference for a very small range of values of \\(r\\) may be simply due to chance.\n\nExercise 1\nLooking at the graph of the second point pattern (poplar seedlings), can you predict where Ripley’s \\(K\\) will be in relation to the null hypothesis of complete spatial randomness? Verify your prediction by calculating Ripley’s \\(K\\) for this point pattern in R."
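For reference, one way to carry out the check asked for in Exercise 1, reusing the semis_split object created above (try your own prediction before running it):

# Ripley's K with a CSR envelope for the poplar pattern (second element of the list)
plot(envelope(semis_split[[2]], Kest, correction = "iso"))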
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#effect-of-heterogeneity",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#effect-of-heterogeneity",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Effect of heterogeneity",
+ "text": "Effect of heterogeneity\nThe graph below illustrates a heterogeneous point pattern, i.e. it shows an density gradient (more points on the left than on the right).\n\n\n\n\n\nA density gradient can be confused with an aggregation of points, as can be seen on the graph of the corresponding Ripley’s \\(K\\). In theory, these are two different processes:\n\nHeterogeneity: The density of points varies in the study area, for example due to the fact that certain local conditions are more favorable to the presence of the species of interest.\nAggregation: The mean density of points is homogeneous, but the presence of one point increases the presence of other points in its vicinity, for example due to positive interactions between individuals.\n\nHowever, it may be difficult to differentiate between the two in practice, especially since some patterns may be both heterogeneous and aggregated.\nLet’s take the example of the poplar seedlings from the previous exercise. The density function applied to a point pattern performs a kernel density estimation of the density of the seedlings across the plot. By default, this function uses a Gaussian kernel with a standard deviation sigma specified in the function, which determines the scale at which density fluctuations are “smoothed”. Here, we use a value of 2 m for sigma and we first represent the estimated density with plot, before overlaying the points (add = TRUE means that the points are added to the existing plot rather than creating a new plot).\n\ndens_p <- density(semis_split[[2]], sigma = 2)\nplot(dens_p)\nplot(semis_split[[2]], add = TRUE)\n\n\n\n\nTo measure the aggregation or repulsion of points in a heterogeneous pattern, we must use the inhomogeneous version of the \\(K\\) statistic (Kinhom in spatstat). This statistic is still equal to the mean number of neighbours within a radius \\(r\\) of a point in the pattern, but rather than standardizing this number by the overall intensity of the pattern, it is standardized by the local estimated density. As above, we specify sigma = 2 to control the level of smoothing for the varying density estimate.\n\nplot(Kinhom(semis_split[[2]], sigma = 2, correction = \"iso\"))\n\n\n\n\nTaking into account the heterogeneity of the pattern at a scale sigma of 2 m, there seems to be a deficit of neighbours starting at a radius of about 1.5 m. We can now check whether this deviation is significant.\nAs before, we use envelope to simulate the Kinhom statistic under the null model. However, the null model here is not a homogeneous Poisson process (CSR). It is instead a heterogeneous Poisson process simulated by the function rpoispp(dens_p), i.e. the points are independent of each other, but their density is heterogeneous and given by dens_p. The simulate argument of the envelope function specifies the function used for simulations under the null model; this function must have one argument, here x, even if it is not used.\nFinally, in addition to the arguments needed for Kinhom, i.e. sigma and correction, we also specify nsim = 199 to perform 199 simulations and nrank = 5 to eliminate the 5 most extreme results on each side of the envelope, i.e. 
the 10 most extreme results out of 199, to achieve an interval containing about 95% of the probability under the null hypothesis.\n\nkhet_p <- envelope(semis_split[[2]], Kinhom, sigma = 2, correction = \"iso\",\n nsim = 199, nrank = 5, simulate = function(x) rpoispp(dens_p))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\nplot(khet_p)\n\n\n\n\nNote: For a hypothesis test based on simulations of a null hypothesis, the \\(p\\)-value is estimated by \\((m + 1)/(n + 1)\\), where \\(n\\) is the number of simulations and \\(m\\) is the number of simulations where the value of the statistic is more extreme than that of the observed data. This is why the number of simulations is often chosen to be 99, 199, etc.\n\nExercise 2\nRepeat the heterogeneous density estimation and Kinhom calculation with a standard deviation sigma of 5 rather than 2. How does the smoothing level for the density estimation influence the conclusions?\nTo differentiate between a variation in the density of points from an interaction (aggregation or repulsion) between these points with this type of analysis, it is generally assumed that the two processes operate at different scales. Typically, we can test whether the points are aggregated at a small scale after accounting for a variation in density at a larger scale."
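A sketch of the mechanics for Exercise 2, reusing the objects defined above and simply swapping the smoothing parameter (the interpretation is left as the exercise):

# density estimate and inhomogeneous K with a wider kernel (sigma = 5 m)
dens_p5 <- density(semis_split[[2]], sigma = 5)
plot(dens_p5)
plot(semis_split[[2]], add = TRUE)

plot(Kinhom(semis_split[[2]], sigma = 5, correction = "iso"))

# same heterogeneous Poisson null model as above, but using the smoother density surface
khet_p5 <- envelope(semis_split[[2]], Kinhom, sigma = 5, correction = "iso",
                    nsim = 199, nrank = 5,
                    simulate = function(x) rpoispp(dens_p5))
plot(khet_p5)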
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#relationship-between-two-point-patterns",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#relationship-between-two-point-patterns",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Relationship between two point patterns",
+ "text": "Relationship between two point patterns\nLet’s consider a case where we have two point patterns, for example the position of trees of two species in a plot (orange and green points in the graph below). Each of the two patterns may or may not present an aggregation of points.\n\n\n\n\n\nRegardless of whether points are aggregated at the species level, we want to determine whether the two species are arranged independently. In other words, does the probability of observing a tree of one species depend on the presence of a tree of the other species at a given distance?\nThe bivariate version of Ripley’s \\(K\\) allows us to answer this question. For two patterns noted 1 and 2, the function \\(K_{12}(r)\\) calculates the mean number of points in pattern 2 within a radius \\(r\\) from a point in pattern 1, standardized by the density of pattern 2.\nIn theory, this function is symmetrical, so \\(K_{12}(r) = K_{21}(r)\\) and the result would be the same whether the points of pattern 1 or 2 are chosen as “focal” points for the analysis. However, the estimation of the two quantities for an observed pattern may differ, in particular because of edge effects. The variance of \\(K_{12}\\) and \\(K_{21}\\) between simulations of a null model may also differ, so the null hypothesis test may have more or less power depending on the choice of the focal species.\nThe choice of an appropriate null model is important here. In order to determine whether there is a significant attraction or repulsion between the two patterns, the position of one of the patterns must be randomly moved relative to that of the other pattern, while keeping the spatial structure of each pattern taken in isolation.\nOne way to do this randomization is to shift one of the two patterns horizontally and/or vertically by a random distance. The part of the pattern that “comes out” on one side of the window is attached to the other side. This method is called a toroidal shift, because by connecting the top and bottom as well as the left and right of a rectangular surface, we obtain the shape of a torus (a three-dimensional “donut”).\n\n\n\n\n\nThe graph above shows a translation of the green pattern to the right, while the orange pattern remains in the same place. The green points in the shaded area are brought back on the other side. Note that while this method generally preserves the structure of each pattern while randomizing their relative position, it can have some drawbacks, such as dividing point clusters that are near the cutoff point.\nLet’s now check whether the position of the two species (birch and poplar) is independent in our plot. The function Kcross calculates the bivariate \\(K_{ij}\\), we must specify which type of point (mark) is considered as the focal species \\(i\\) and the neighbouring species \\(j\\).\n\nplot(Kcross(semis, i = \"P\", j = \"B\", correction = \"iso\"))\n\n\n\n\nHere, the observed \\(K\\) is lower than the theoretical value, indicating a possible repulsion between the two patterns.\nTo determine the envelope of the \\(K\\) under the null hypothesis of independence of the two patterns, we must specify that the simulations are based on a translation of the patterns. We indicate that the simulations use the function rshift (random translation) with the argument simulate = function(x) rshift(x, which = \"B\"); here, the x argument in simulate corresponds to the original point pattern and the which argument indicates which of the patterns is translated. 
As in the previous case, the arguments needed for Kcross, i.e. i, j and correction, must be repeated in the envelope function.\n\nplot(envelope(semis, Kcross, i = \"P\", j = \"B\", correction = \"iso\", \n nsim = 199, nrank = 5, simulate = function(x) rshift(x, which = \"B\")))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\n\n\n\n\nHere, the observed curve is totally within the envelope, so we do not reject the null hypothesis of independence of the two patterns.\n\nQuestions\n\nWhat would be one reason for our choice to translate the points of the birch rather than poplar?\nWould the simulations generated by random translation be a good null model if the two patterns were heterogeneous?"
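+ As a complement to the symmetry point above, here is a hedged sketch (reusing the semis pattern) that repeats the analysis with birch as the focal species; in theory K12 = K21, but the estimates and the simulation envelope may differ somewhat because of edge effects.
+ 
+ plot(Kcross(semis, i = "B", j = "P", correction = "iso"))
+ 
+ plot(envelope(semis, Kcross, i = "B", j = "P", correction = "iso",
+               nsim = 199, nrank = 5, simulate = function(x) rshift(x, which = "B")))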
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#marked-point-patterns",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#marked-point-patterns",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Marked point patterns",
+ "text": "Marked point patterns\nThe fir.csv dataset contains the \\((x, y)\\) coordinates of 822 fir trees in a 1 hectare plot and their status (A = alive, D = dead) following a spruce budworm outbreak.\n\nfir <- read.csv(\"data/fir.csv\")\nhead(fir)\n\n x y status\n1 31.50 1.00 A\n2 85.25 30.75 D\n3 83.50 38.50 A\n4 84.00 37.75 A\n5 83.00 33.25 A\n6 33.25 0.25 A\n\n\n\nfir <- ppp(x = fir$x, y = fir$y, marks = as.factor(fir$status),\n window = owin(xrange = c(0, 100), yrange = c(0, 100)))\nplot(fir)\n\n\n\n\nSuppose that we want to check whether fir mortality is independent or correlated between neighbouring trees. How does this question differ from the previous example, where we wanted to know if the position of the points of two species was independent?\nIn the previous example, the independence or interaction between the species referred to the formation of the pattern itself (whether or not seedlings of one species establish near those of the other species). Here, the characteristic of interest (survival) occurs after the establishment of the pattern, assuming that all those trees were alive at first and that some died as a result of the outbreak. So we take the position of the trees as fixed and we want to know whether the distribution of status (dead, alive) among those trees is random or shows a spatial pattern.\nIn Wiegand and Moloney’s textbook, the first situation (establishment of seedlings of two species) is called a bivariate pattern, so it is really two interacting patterns, while the second is a single pattern with a qualitative mark. The spatstat package in R does not differentiate between the two in terms of pattern definition (types of points are always represented by the marks argument), but the analysis methods applied to the two questions differ.\nIn the case of a pattern with a qualitative mark, we can define a mark connection function \\(p_{ij}(r)\\). For two points separated by a distance \\(r\\), this function gives the probability that the first point has the mark \\(i\\) and the second the mark \\(j\\). Under the null hypothesis where the marks are independent, this probability is equal to the product of the proportions of each mark in the entire pattern, \\(p_{ij}(r) = p_i p_j\\) independently of \\(r\\).\nIn spatstat, the mark connection function is computed with the markconnect function, where the marks \\(i\\) and \\(j\\) and the type of edge correction must be specified. 
In our example, we see that two closely spaced points are less likely to have a different status (A and D) than expected under the assumption of random and independent distribution of marks (red dotted line).\n\nplot(markconnect(fir, i = \"A\", j = \"D\", correction = \"iso\"))\n\n\n\n\nIn this graph, the fluctuations in the function are due to the estimation error of a continuous \\(r\\) function from a limited number of discrete point pairs.\nTo simulate the null model in this case, we use the rlabel function, which randomly reassigns the marks among the points of the pattern, keeping the points’ positions fixed.\n\nplot(envelope(fir, markconnect, i = \"A\", j = \"D\", correction = \"iso\", \n nsim = 199, nrank = 5, simulate = rlabel))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\n\n\n\n\nNote that since the rlabel function has only one required argument corresponding to the original point pattern, it was not necessary to specify: simulate = function(x) rlabel(x).\nHere are the results for tree pairs of the same status A or D:\n\npar(mfrow = c(1, 2))\nplot(envelope(fir, markconnect, i = \"A\", j = \"A\", correction = \"iso\", \n nsim = 199, nrank = 5, simulate = rlabel))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\nplot(envelope(fir, markconnect, i = \"D\", j = \"D\", correction = \"iso\", \n nsim = 199, nrank = 5, simulate = rlabel))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\n\n\n\n\nIt therefore appears that fir mortality due to this outbreak is spatially aggregated, since trees located in close proximity to each other have a greater probability of sharing the same status than predicted by the null hypothesis."
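+ Under the null hypothesis of independent marks, the expected mark connection value is the product of the overall mark proportions; a short sketch (using the fir pattern defined above) to compute that reference value for the pair (A, D):
+ 
+ # Proportions of each status in the whole pattern
+ p_marks <- prop.table(table(marks(fir)))
+ p_marks
+ 
+ # Expected p_AD under independence, i.e. p_A * p_D
+ p_marks["A"] * p_marks["D"]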
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#references",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#references",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "References",
+ "text": "References\nFortin, M.-J. and Dale, M.R.T. (2005) Spatial Analysis: A Guide for Ecologists. Cambridge University Press: Cambridge, UK.\nWiegand, T. and Moloney, K.A. (2013) Handbook of Spatial Point-Pattern Analysis in Ecology, CRC Press.\nThe dataset in the last example is a subet of the Lake Duparquet Research and Teaching Forest (LDRTF) data, available on Dryad here."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#intrinsic-or-induced-dependence",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#intrinsic-or-induced-dependence",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Intrinsic or induced dependence",
+ "text": "Intrinsic or induced dependence\nThere are two basic types of spatial dependence on a measured variable \\(y\\): an intrinsic dependence on \\(y\\), or a dependence induced by external variables influencing \\(y\\), which are themselves spatially correlated.\nFor example, suppose that the abundance of a species is correlated between two sites located near each other:\n\nthis spatial dependence can be induced if it is due to a spatial correlation of habitat factors that are favorable or unfavorable to the species;\nor it can be intrinsic if it is due to the dispersion of individuals to nearby sites.\n\nIn many cases, both types of dependence affect a given variable.\nIf the dependence is simply induced and the external variables that cause it are included in the model explaining \\(y\\), then the model residuals will be independent and we can use all the methods already seen that ignore spatial correlation.\nHowever, if the dependence is intrinsic or due to unmeasured external factors, then the spatial correlation of the residuals in the model will have to be taken into account."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#different-ways-to-model-spatial-effects",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#different-ways-to-model-spatial-effects",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Different ways to model spatial effects",
+ "text": "Different ways to model spatial effects\nIn this training, we will directly model the spatial correlations of our data. It is useful to compare this approach to other ways of including spatial aspects in a statistical model.\nFirst, we could include predictors in the model that represent position (e.g., longitude, latitude). Such predictors may be useful for detecting a systematic large-scale trend or gradient, whether or not the trend is linear (e.g., with a generalized additive model).\nIn contrast to this approach, the models we will see now serve to model a spatial correlation in the random fluctuations of a variable (i.e., in the residuals after removing any systematic effect).\nMixed models use random effects to represent the non-independence of data on the basis of their grouping, i.e., after accounting for systematic fixed effects, data from the same group are more similar (their residual variation is correlated) than data from different groups. These groups were sometimes defined according to spatial criteria (observations grouped into sites).\nHowever, in the context of a random group effect, all groups are as different from each other, e.g., two sites within 100 km of each other are no more or less similar than two sites 2 km apart.\nThe methods we will see here and in the next parts of the training therefore allow us to model non-independence on a continuous scale (closer = more correlated) rather than just discrete (hierarchy of groups)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#variogram",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#variogram",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Variogram",
+ "text": "Variogram\nA central aspect of geostatistics is the estimation of the variogram \\(\\gamma_z\\) . The variogram is equal to half the mean square difference between the values of \\(z\\) for two points \\((x_i, y_i)\\) and \\((x_j, y_j)\\) separated by a distance \\(h\\).\n\\[\\gamma_z(h) = \\frac{1}{2} \\text{E} \\left[ \\left( z(x_i, y_i) - z(x_j, y_j) \\right)^2 \\right]_{d_{ij} = h}\\]\nIn this equation, the \\(\\text{E}\\) function with the index \\(d_{ij}=h\\) designates the statistical expectation (i.e., the mean) of the squared deviation between the values of \\(z\\) for points separated by a distance \\(h\\).\nIf we want instead to express the autocorrelation \\(\\rho_z(h)\\) between measures of \\(z\\) separated by a distance \\(h\\), it is related to the variogram by the equation:\n\\[\\gamma_z = \\sigma_z^2(1 - \\rho_z)\\] ,\nwhere \\(\\sigma_z^2\\) is the global variance of \\(z\\).\nNote that \\(\\gamma_z = \\sigma_z^2\\) when we reach a distance where the measurements of \\(z\\) are independent, so \\(\\rho_z = 0\\). In this case, we can see that \\(\\gamma_z\\) is similar to a variance, although it is sometimes called “semivariogram” or “semivariance” because of the 1/2 factor in the above equation."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#theoretical-models-for-the-variogram",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#theoretical-models-for-the-variogram",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Theoretical models for the variogram",
+ "text": "Theoretical models for the variogram\nSeveral parametric models have been proposed to represent the spatial correlation as a function of the distance between sampling points. Let us first consider a correlation that decreases exponentially:\n\\[\\rho_z(h) = e^{-h/r}\\]\nHere, \\(\\rho_z = 1\\) for \\(h = 0\\) and the correlation is multiplied by \\(1/e \\approx 0.37\\) each time the distance increases by \\(r\\). In this context, \\(r\\) is called the range of the correlation.\nFrom the above equation, we can calculate the corresponding variogram.\n\\[\\gamma_z(h) = \\sigma_z^2 (1 - e^{-h/r})\\]\nHere is a graphical representation of this variogram.\n\n\n\n\n\nBecause of the exponential function, the value of \\(\\gamma\\) at large distances approaches the global variance \\(\\sigma_z^2\\) without exactly reaching it. This asymptote is called a sill in the geostatistical context and is represented by the symbol \\(s\\).\nFinally, it is sometimes unrealistic to assume a perfect correlation when the distance tends towards 0, because of a possible variation of \\(z\\) at a very small scale. A nugget effect, denoted \\(n\\), can be added to the model so that \\(\\gamma\\) approaches \\(n\\) (rather than 0) if \\(h\\) tends towards 0. The term nugget comes from the mining origin of these techniques, where a nugget could be the source of a sudden small-scale variation in the concentration of a mineral.\nBy adding the nugget effect, the remainder of the variogram is “compressed” to keep the same sill, resulting in the following equation.\n\\[\\gamma_z(h) = n + (s - n) (1 - e^{-h/r})\\]\nIn the gstat package that we use below, the term \\((s-n)\\) is called a partial sill or psill for the exponential portion of the variogram.\n\n\n\n\n\nIn addition to the exponential model, two other common theoretical models for the variogram are the Gaussian model (where the correlation follows a half-normal curve), and the spherical model (where the variogram increases linearly at the start and then curves and reaches the plateau at a distance equal to its range \\(r\\)). The spherical model thus allows the correlation to be exactly 0 at large distances, rather than gradually approaching zero in the case of the other models.\n\n\n\n\n\n\n\n\nModel\n\\(\\rho(h)\\)\n\\(\\gamma(h)\\)\n\n\n\n\nExponential\n\\(\\exp\\left(-\\frac{h}{r}\\right)\\)\n\\(s \\left(1 - \\exp\\left(-\\frac{h}{r}\\right)\\right)\\)\n\n\nGaussian\n\\(\\exp\\left(-\\frac{h^2}{r^2}\\right)\\)\n\\(s \\left(1 - \\exp\\left(-\\frac{h^2}{r^2}\\right)\\right)\\)\n\n\nSpherical \\((h < r)\\) *\n\\(1 - \\frac{3}{2}\\frac{h}{r} + \\frac{1}{2}\\frac{h^3}{r^3}\\)\n\\(s \\left(\\frac{3}{2}\\frac{h}{r} - \\frac{1}{2}\\frac{h^3}{r^3} \\right)\\)\n\n\n\n* For the spherical model, \\(\\rho = 0\\) and \\(\\gamma = s\\) if \\(h \\ge r\\)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#empirical-variogram",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#empirical-variogram",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Empirical variogram",
+ "text": "Empirical variogram\nTo estimate \\(\\gamma_z(h)\\) from empirical data, we need to define distance classes, thus grouping different distances within a margin of \\(\\pm \\delta\\) around a distance \\(h\\), then calculating the mean square deviation for the pairs of points in that distance class.\n\\[\\hat{\\gamma_z}(h) = \\frac{1}{2 N_{\\text{paires}}} \\sum \\left[ \\left( z(x_i, y_i) - z(x_j, y_j) \\right)^2 \\right]_{d_{ij} = h \\pm \\delta}\\]\nWe will see in the next section how to estimate a variogram in R."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#regression-model-with-spatial-correlation",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#regression-model-with-spatial-correlation",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Regression model with spatial correlation",
+ "text": "Regression model with spatial correlation\nThe following equation represents a multiple linear regression including residual spatial correlation:\n\\[v = \\beta_0 + \\sum_i \\beta_i u_i + z + \\epsilon\\]\nHere, \\(v\\) designates the response variable and \\(u\\) the predictors, to avoid confusion with the spatial coordinates \\(x\\) and \\(y\\).\nIn addition to the residual \\(\\epsilon\\) that is independent between observations, the model includes a term \\(z\\) that represents the spatially correlated portion of the residual variance.\nHere are suggested steps to apply this type of model:\n\nFit the regression model without spatial correlation.\nVerify the presence of spatial correlation from the empirical variogram of the residuals.\nFit one or more regression models with spatial correlation and select the one that shows the best fit to the data."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#regression-with-spatial-correlation",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#regression-with-spatial-correlation",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Regression with spatial correlation",
+ "text": "Regression with spatial correlation\nWe have seen above that the gstat package allows us to estimate the variogram of the residuals of a linear model. In our example, the magnesium concentration was modeled as a function of pH, with spatially correlated residuals.\nAnother tool to fit this same type of model is the gls function of the nlme package, which is included with the installation of R.\nThis function applies the generalized least squares method to fit linear regression models when the residuals are not independent or when the residual variance is not the same for all observations. Since the estimates of the coefficients depend on the estimated correlations between the residuals and the residuals themselves depend on the coefficients, the model is fitted by an iterative algorithm:\n\nA classical linear regression model (without correlation) is fitted to obtain residuals.\nThe spatial correlation model (variogram) is fitted with those residuals.\nThe regression coefficients are re-estimated, now taking into account the correlations.\n\nSteps 2 and 3 are repeated until the estimates are stable at a desired precision.\nHere is the application of this method to the same model for the magnesium concentration in the oxford dataset. In the correlation argument of gls, we specify an exponential correlation model as a function of our spatial coordinates and we include a possible nugget effect.\nIn addition to the exponential correlation corExp, the gls function can also estimate a Gaussian (corGaus) or spherical (corSpher) model.\n\nlibrary(nlme)\ngls_mg <- gls(MG1 ~ PH1, oxford, \n correlation = corExp(form = ~ XCOORD + YCOORD, nugget = TRUE))\nsummary(gls_mg)\n\nGeneralized least squares fit by REML\n Model: MG1 ~ PH1 \n Data: oxford \n AIC BIC logLik\n 1278.65 1292.751 -634.325\n\nCorrelation Structure: Exponential spatial correlation\n Formula: ~XCOORD + YCOORD \n Parameter estimate(s):\n range nugget \n478.0322964 0.2944753 \n\nCoefficients:\n Value Std.Error t-value p-value\n(Intercept) 391.1387 50.42343 7.757084 0\nPH1 -41.0836 6.15662 -6.673079 0\n\n Correlation: \n (Intr)\nPH1 -0.891\n\nStandardized residuals:\n Min Q1 Med Q3 Max \n-2.1846957 -0.6684520 -0.3687813 0.4627580 3.1918604 \n\nResidual standard error: 53.8233 \nDegrees of freedom: 126 total; 124 residual\n\n\nTo compare this result with the adjusted variogram above, the parameters given by gls must be transformed. The range has the same meaning in both cases and corresponds to 478 m for the result of gls. The global variance of the residuals is the square of Residual standard error. The nugget effect here (0.294) is expressed as a fraction of that variance. Finally, to obtain the partial sill of the exponential part, the nugget effect must be subtracted from the total variance.\nAfter performing these calculations, we can give these parameters to the vgm function of gstat to superimpose this variogram estimated by gls on our variogram of the residuals of the classical linear model.\n\ngls_range <- 478\ngls_var <- 53.823^2\ngls_nugget <- 0.294 * gls_var\ngls_psill <- gls_var - gls_nugget\n\ngls_vgm <- vgm(\"Exp\", psill = gls_psill, range = gls_range, nugget = gls_nugget)\n\nplot(var_mg, gls_vgm, col = \"black\", ylim = c(0, 4000))\n\n\n\n\nDoes the model fit the data less well here? In fact, this empirical variogram represented by the points was obtained from the residuals of the linear model ignoring the spatial correlation, so it is a biased estimate of the actual spatial correlations. 
The method is still adequate to quickly check if spatial correlations are present. However, to simultaneously fit the regression coefficients and the spatial correlation parameters, the generalized least squares (GLS) approach is preferable and will produce more accurate estimates.\nFinally, note that the result of the gls model also gives the AIC, which we can use to compare the fit of different models (with different predictors or different forms of spatial correlation)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exercise",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exercise",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Exercise",
+ "text": "Exercise\nThe bryo_belg.csv dataset is adapted from the data of this study:\n\nNeyens, T., Diggle, P.J., Faes, C., Beenaerts, N., Artois, T. et Giorgi, E. (2019) Mapping species richness using opportunistic samples: a case study on ground-floor bryophyte species richness in the Belgian province of Limburg. Scientific Reports 9, 19122. https://doi.org/10.1038/s41598-019-55593-x\n\nThis data frame shows the specific richness of ground bryophytes (richness) for different sampling points in the Belgian province of Limburg, with their position (x, y) in km, in addition to information on the proportion of forest (forest) and wetlands (wetland) in a 1 km^2$ cell containing the sampling point.\n\nbryo_belg <- read.csv(\"data/bryo_belg.csv\")\nhead(bryo_belg)\n\n richness forest wetland x y\n1 9 0.2556721 0.5036614 228.9516 220.8869\n2 6 0.6449114 0.1172068 227.6714 219.8613\n3 5 0.5039905 0.6327003 228.8252 220.1073\n4 3 0.5987329 0.2432942 229.2775 218.9035\n5 2 0.7600775 0.1163538 209.2435 215.2414\n6 10 0.6865434 0.0000000 210.4142 216.5579\n\n\nFor this exercise, we will use the square root of the specific richness as the response variable. The square root transformation often allows to homogenize the variance of the count data in order to apply a linear regression.\n\nFit a linear model of the transformed species richness to the proportion of forest and wetlands, without taking into account spatial correlations. What is the effect of the two predictors in this model?\nCalculate the empirical variogram of the model residuals in (a). Does there appear to be a spatial correlation between the points?\n\nNote: The cutoff argument to the variogram function specifies the maximum distance at which the variogram is calculated. You can manually adjust this value to get a good view of the sill.\n\nRe-fit the linear model in (a) with the gls function in the nlme package, trying different types of spatial correlations (exponential, Gaussian, spherical). Compare the models (including the one without spatial correlation) with the AIC.\nWhat is the effect of the proportion of forests and wetlands according to the model in (c)? Explain the differences between the conclusions of this model and the model in (a)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#conditional-autoregressive-car-model",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#conditional-autoregressive-car-model",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Conditional autoregressive (CAR) model",
+ "text": "Conditional autoregressive (CAR) model\nIn the conditional autoregressive model, the value of \\(z_i\\) for the region \\(i\\) follows a normal distribution: its mean depends on the value \\(z_j\\) of neighbouring regions, multiplied by the weight \\(w_{ij}\\) and a correlation coefficient \\(\\rho\\); its standard deviation \\(\\sigma_{z_i}\\) may vary from one region to another.\n\\[z_i \\sim \\text{N}\\left(\\sum_j \\rho w_{ij} z_j,\\sigma_{z_i} \\right)\\]\nIn this model, if \\(w_{ij}\\) is a binary matrix (0 for non-neighbours, 1 for neighbours), then \\(\\rho\\) is the coefficient of partial correlation between neighbouring regions. This is similar to a first-order autoregressive model in the context of time series, where the autoregression coefficient indicates the partial correlation."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#simultaneous-autoregressive-sar-model",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#simultaneous-autoregressive-sar-model",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Simultaneous autoregressive (SAR) model",
+ "text": "Simultaneous autoregressive (SAR) model\nIn the simultaneous autoregressive model, the value of \\(z_i\\) is given directly by the sum of contributions from neighbouring values \\(z_j\\), multiplied by \\(\\rho w_{ij}\\), with an independent residual \\(\\nu_i\\) of standard deviation \\(\\sigma_z\\).\n\\[z_i = \\sum_j \\rho w_{ij} z_j + \\nu_i\\]\nAt first glance, this looks like a temporal autoregressive model. However, there is an important conceptual difference. For temporal models, the causal influence is directed in only one direction: \\(v(t-2)\\) affects \\(v(t-1)\\) which then affects \\(v(t)\\). For a spatial model, each \\(z_j\\) that affects \\(z_i\\) depends in turn on \\(z_i\\). Thus, to determine the joint distribution of \\(z\\), a system of equations must be solved simultaneously (hence the name of the model).\nFor this reason, although this model resembles the formula of CAR model, the solutions of the two models differ and in the case of SAR, the coefficient \\(\\rho\\) is not directly equal to the partial correlation due to each neighbouring region.\nFor more details on the mathematical aspects of these models, see the article by Ver Hoef et al. (2018) suggested in reference.\nFor the moment, we will consider SAR and CAR as two types of possible models to represent a spatial correlation on a network. We can always fit several models and compare them with the AIC to choose the best form of correlation or the best weight matrix.\nThe CAR and SAR models share an advantage over geostatistical models in terms of efficiency. In a geostatistical model, spatial correlations are defined between each pair of points, although they become negligible as distance increases. For a CAR or SAR model, only neighbouring regions contribute and most weights are equal to 0, making these models faster to fit than a geostatistical model when the data are massive."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#definition-of-the-neighbourhood-network",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#definition-of-the-neighbourhood-network",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Definition of the neighbourhood network",
+ "text": "Definition of the neighbourhood network\nThe poly2nb function of the spdep package defines a neighbourhood network from polygons. The result vois is a list of 125 elements where each element contains the indices of the neighbouring (bordering) polygons of a given polygon.\n\nvois <- poly2nb(elect2018)\nvois[[1]]\n\n[1] 2 37 63 88 101 117\n\n\nThus, the first riding (Abitibi-Est) has 6 neighbouring ridings, for which the names can be found as follows:\n\nelect2018$circ[vois[[1]]]\n\n[1] \"Abitibi-Ouest\" \"Gatineau\" \n[3] \"Laviolette-Saint-Maurice\" \"Pontiac\" \n[5] \"Rouyn-Noranda-Témiscamingue\" \"Ungava\" \n\n\nWe can illustrate this network by extracting the coordinates of the center of each district, creating a blank map with plot(elect2018[\"geometry\"]), then adding the network as an additional layer with plot(vois, add = TRUE, coords = coords).\n\ncoords <- st_centroid(elect2018) %>%\n st_coordinates()\nplot(elect2018[\"geometry\"])\nplot(vois, add = TRUE, col = \"red\", coords = coords)\n\n\n\n\nWe can “zoom” on southern Québec by choosing the limits xlim and ylim.\n\nplot(elect2018[\"geometry\"], \n xlim = c(400000, 800000), ylim = c(100000, 500000))\nplot(vois, add = TRUE, col = \"red\", coords = coords)\n\n\n\n\nWe still have to add weights to each network link with the nb2listw function. The style of weights “B” corresponds to binary weights, i.e. 1 for the presence of link and 0 for the absence of link between two ridings.\nOnce these weights are defined, we can verify with Moran’s test whether there is a significant autocorrelation of votes obtained by the CAQ between neighbouring ridings.\n\npoids <- nb2listw(vois, style = \"B\")\n\nmoran.test(elect2018$propCAQ, poids)\n\n\n Moran I test under randomisation\n\ndata: elect2018$propCAQ \nweights: poids \n\nMoran I statistic standard deviate = 13.148, p-value < 2.2e-16\nalternative hypothesis: greater\nsample estimates:\nMoran I statistic Expectation Variance \n 0.680607768 -0.008064516 0.002743472 \n\n\nThe value \\(I = 0.68\\) is very significant judging by the \\(p\\)-value of the test.\nLet’s verify if the spatial correlation persists after taking into account the four characteristics of the population, therefore by inspecting the residuals of a linear model including these four predictors.\n\nelect_lm <- lm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, data = elect2018)\nsummary(elect_lm)\n\n\nCall:\nlm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018)\n\nResiduals:\n Min 1Q Median 3Q Max \n-30.9890 -4.4878 0.0562 6.2653 25.8146 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.354e+01 1.836e+01 0.737 0.463 \nage_moy -9.170e-01 3.855e-01 -2.378 0.019 * \npct_frn 4.588e+01 5.202e+00 8.820 1.09e-14 ***\npct_prp 3.582e+01 6.527e+00 5.488 2.31e-07 ***\nrev_med -2.624e-05 2.465e-04 -0.106 0.915 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 9.409 on 120 degrees of freedom\nMultiple R-squared: 0.6096, Adjusted R-squared: 0.5965 \nF-statistic: 46.84 on 4 and 120 DF, p-value: < 2.2e-16\n\nmoran.test(residuals(elect_lm), poids)\n\n\n Moran I test under randomisation\n\ndata: residuals(elect_lm) \nweights: poids \n\nMoran I statistic standard deviate = 6.7047, p-value = 1.009e-11\nalternative hypothesis: greater\nsample estimates:\nMoran I statistic Expectation Variance \n 0.340083290 -0.008064516 0.002696300 \n\n\nMoran’s \\(I\\) has decreased but remains significant, so some of the previous correlation was induced by these predictors, but there remains a spatial correlation due to other factors."
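+ As an aside, nb2listw also supports row-standardized weights (style "W"), where each link from a riding gets a weight of 1 divided by that riding's number of neighbours; a quick sketch to compare Moran's I under that alternative weighting:
+ 
+ poids_w <- nb2listw(vois, style = "W")
+ moran.test(elect2018$propCAQ, poids_w)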
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#spatial-autoregression-models",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#spatial-autoregression-models",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Spatial autoregression models",
+ "text": "Spatial autoregression models\nFinally, we fit SAR and CAR models to these data with the spautolm (spatial autoregressive linear model) function of spatialreg. Here is the code for a SAR model including the effect of the same four predictors.\n\nelect_sar <- spautolm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018, listw = poids)\nsummary(elect_sar)\n\n\nCall: spautolm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018, listw = poids)\n\nResiduals:\n Min 1Q Median 3Q Max \n-23.08342 -4.10573 0.24274 4.29941 23.08245 \n\nCoefficients: \n Estimate Std. Error z value Pr(>|z|)\n(Intercept) 15.09421119 16.52357745 0.9135 0.36098\nage_moy -0.70481703 0.32204139 -2.1886 0.02863\npct_frn 39.09375061 5.43653962 7.1909 6.435e-13\npct_prp 14.32329345 6.96492611 2.0565 0.03974\nrev_med 0.00016730 0.00023209 0.7208 0.47101\n\nLambda: 0.12887 LR test value: 42.274 p-value: 7.9339e-11 \nNumerical Hessian standard error of lambda: 0.012069 \n\nLog likelihood: -433.8862 \nML residual variance (sigma squared): 53.028, (sigma: 7.282)\nNumber of observations: 125 \nNumber of parameters estimated: 7 \nAIC: 881.77\n\n\nThe value given by Lambda in the summary corresponds to the coefficient \\(\\rho\\) in our description of the model. The likelihood-ratio test (LR test) confirms that this residual spatial correlation (after controlling for the effect of predictors) is significant.\nThe estimated effects for the predictors are similar to those of the linear model without spatial correlation. The effects of mean age, fraction of francophones and fraction of homeowners remain significant, although their magnitude has decreased somewhat.\nTo fit a CAR rather than SAR model, we must specify family = \"CAR\".\n\nelect_car <- spautolm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018, listw = poids, family = \"CAR\")\nsummary(elect_car)\n\n\nCall: spautolm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018, listw = poids, family = \"CAR\")\n\nResiduals:\n Min 1Q Median 3Q Max \n-21.73315 -4.24623 -0.24369 3.44228 23.43749 \n\nCoefficients: \n Estimate Std. Error z value Pr(>|z|)\n(Intercept) 16.57164696 16.84155327 0.9840 0.325128\nage_moy -0.79072151 0.32972225 -2.3981 0.016478\npct_frn 38.99116707 5.43667482 7.1719 7.399e-13\npct_prp 17.98557474 6.80333470 2.6436 0.008202\nrev_med 0.00012639 0.00023106 0.5470 0.584364\n\nLambda: 0.15517 LR test value: 40.532 p-value: 1.9344e-10 \nNumerical Hessian standard error of lambda: 0.0026868 \n\nLog likelihood: -434.7573 \nML residual variance (sigma squared): 53.9, (sigma: 7.3416)\nNumber of observations: 125 \nNumber of parameters estimated: 7 \nAIC: 883.51\n\n\nFor a CAR model with binary weights, the value of Lambda (which we called \\(\\rho\\)) directly gives the partial correlation coefficient between neighbouring districts. Note that the AIC here is slightly higher than the SAR model, so the latter gave a better fit."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exercise-3",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exercise-3",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Exercise",
+ "text": "Exercise\nThe rls_covid dataset, in shapefile format, contains data on detected COVID-19 cases (cas), number of cases per 1000 people (taux_1k) and the population density (dens_pop) in each of Quebec’s local health service networks (RLS) (Source: Data downloaded from the Institut national de santé publique du Québec as of January 17, 2021).\n\nrls_covid <- read_sf(\"data/rls_covid.shp\")\nhead(rls_covid)\n\nSimple feature collection with 6 features and 5 fields\nGeometry type: MULTIPOLYGON\nDimension: XY\nBounding box: xmin: 785111.2 ymin: 341057.8 xmax: 979941.5 ymax: 541112.7\nProjected CRS: Conique_conforme_de_Lambert_du_MTQ_utilis_e_pour_Adresse_Qu_be\n# A tibble: 6 × 6\n RLS_code RLS_nom cas taux_1k dens_…¹ geometry\n \n1 0111 RLS de Kamouraska 152 7.34 6.76 (((827028.3 412772.4, 82…\n2 0112 RLS de Rivière-du-Lo… 256 7.34 19.6 (((855905 452116.9, 8557…\n3 0113 RLS de Témiscouata 81 4.26 4.69 (((911829.4 441311.2, 91…\n4 0114 RLS des Basques 28 3.3 5.35 (((879249.6 471975.6, 87…\n5 0115 RLS de Rimouski 576 9.96 15.5 (((917748.1 503148.7, 91…\n6 0116 RLS de La Mitis 76 4.24 5.53 (((951316 523499.3, 9525…\n# … with abbreviated variable name ¹dens_pop\n\n\nFit a linear model of the number of cases per 1000 as a function of population density (it is suggested to apply a logarithmic transform to the latter). Check whether the model residuals are correlated between bordering RLS with a Moran’s test and then model the same data with a conditional autoregressive model."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#reference",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#reference",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Reference",
+ "text": "Reference\nVer Hoef, J.M., Peterson, E.E., Hooten, M.B., Hanks, E.M. and Fortin, M.-J. (2018) Spatial autoregressive models for statistical inference from ecological data. Ecological Monographs 88: 36-59."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#data",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#data",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Data",
+ "text": "Data\nThe gambia dataset found in the geoR package presents the results of a study of malaria prevalence among children of 65 villages in The Gambia. We will use a slightly transformed version of the data found in the file gambia.csv.\n\nlibrary(geoR)\n\ngambia <- read.csv(\"data/gambia.csv\")\nhead(gambia)\n\n id_village x y pos age netuse treated green phc\n1 1 349.6313 1458.055 1 1783 0 0 40.85 1\n2 1 349.6313 1458.055 0 404 1 0 40.85 1\n3 1 349.6313 1458.055 0 452 1 0 40.85 1\n4 1 349.6313 1458.055 1 566 1 0 40.85 1\n5 1 349.6313 1458.055 0 598 1 0 40.85 1\n6 1 349.6313 1458.055 1 590 1 0 40.85 1\n\n\nHere are the fields in that dataset:\n\nid_village: Identifier of the village.\nx and y: Spatial coordinates of the village (in kilometers, based on UTM coordinates).\npos: Binary response, whether the child tested positive for malaria.\nage: Age of the child in days.\nnetuse: Whether or not the child sleeps under a bed net.\ntreated: Whether or not the bed net is treated.\ngreen: Remote sensing based measure of greenness of vegetation (measured at the village level).\nphc: Presence or absence of a public health centre for the village.\n\nWe can count the number of positive cases and total children tested by village to map the fraction of positive cases (or prevalence, prev).\n\n# Create village-level dataset\ngambia_agg <- group_by(gambia, id_village, x, y, green, phc) %>%\n summarize(pos = sum(pos), total = n()) %>%\n mutate(prev = pos / total) %>%\n ungroup()\n\n`summarise()` has grouped output by 'id_village', 'x', 'y', 'green'. You can\noverride using the `.groups` argument.\n\nhead(gambia_agg)\n\n# A tibble: 6 × 8\n id_village x y green phc pos total prev\n \n1 1 350. 1458. 40.8 1 17 33 0.515\n2 2 359. 1460. 40.8 1 19 63 0.302\n3 3 360. 1460. 40.1 0 7 17 0.412\n4 4 364. 1497. 40.8 0 8 24 0.333\n5 5 366. 1460. 40.8 0 10 26 0.385\n6 6 367. 1463. 40.8 0 7 18 0.389\n\n\n\nggplot(gambia_agg, aes(x = x, y = y)) +\n geom_point(aes(color = prev)) +\n geom_path(data = gambia.borders, aes(x = x / 1000, y = y / 1000)) +\n coord_fixed() +\n theme_minimal() +\n scale_color_viridis_c()\n\n\n\n\nWe use the gambia.borders dataset from the geoR package to trace the country boundaries with geom_path. Since those boundaries are in meters, we divide by 1000 to get the same scale as our points. We also use coord_fixed to ensure a 1:1 aspect ratio between the axes and use the viridis color scale, which makes it easier to visualize a continuous variable compared with the default gradient scale in ggplot2.\nBased on this map, there seems to be spatial correlation in malaria prevalence, with the eastern cluster of villages showing more high prevalence values (yellow-green) and the middle cluster showing more low prevalence values (purple)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#non-spatial-glmm",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#non-spatial-glmm",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Non-spatial GLMM",
+ "text": "Non-spatial GLMM\nFor this first example, we will ignore the spatial aspect of the data and model the presence of malaria (pos) as a function of the use of a bed net (netuse) and the presence of a public health centre (phc). Since we have a binary response, we need to use a logistic regression model (a GLM). Since we have predictors at both the individual and village level, and we expect that children of the same village have more similar probabilities of having malaria even after accounting for those predictors, we need to add a random effect of the village. The result is a GLMM that we fit using the glmer function in the lme4 package.\n\nlibrary(lme4)\n\nmod_glmm <- glmer(pos ~ netuse + phc + (1 | id_village), \n data = gambia, family = binomial)\nsummary(mod_glmm)\n\nGeneralized linear mixed model fit by maximum likelihood (Laplace\n Approximation) [glmerMod]\n Family: binomial ( logit )\nFormula: pos ~ netuse + phc + (1 | id_village)\n Data: gambia\n\n AIC BIC logLik deviance df.resid \n 2428.0 2450.5 -1210.0 2420.0 2031 \n\nScaled residuals: \n Min 1Q Median 3Q Max \n-2.1286 -0.7120 -0.4142 0.8474 3.3434 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n id_village (Intercept) 0.8149 0.9027 \nNumber of obs: 2035, groups: id_village, 65\n\nFixed effects:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 0.1491 0.2297 0.649 0.5164 \nnetuse -0.6044 0.1442 -4.190 2.79e-05 ***\nphc -0.4985 0.2604 -1.914 0.0556 . \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr) netuse\nnetuse -0.422 \nphc -0.715 -0.025\n\n\nAccording to these results, both netuse and phc result in a decrease of malaria prevalence, although the effect of phc is not significant at a threshold \\(\\alpha = 0.05\\). The intercept (0.149) is the logit of the probability of malaria presence for a child with no bednet and no public health centre, but it is the mean intercept across all villages, and there is a lot of variation between villages, based on the random effect standard deviation of 0.90. We can get the estimated intercept for each village with the function coef:\n\nhead(coef(mod_glmm)$id_village)\n\n (Intercept) netuse phc\n1 0.93727515 -0.6043602 -0.4984835\n2 0.09204843 -0.6043602 -0.4984835\n3 0.22500620 -0.6043602 -0.4984835\n4 -0.46271089 -0.6043602 -0.4984835\n5 0.13680037 -0.6043602 -0.4984835\n6 -0.03723346 -0.6043602 -0.4984835\n\n\nSo for example, the intercept for village 1 is around 0.94, equivalent to a probability of 72%:\n\nplogis(0.937)\n\n[1] 0.7184933\n\n\nwhile the intercept in village 2 is equivalent to a probability of 52%:\n\nplogis(0.092)\n\n[1] 0.5229838\n\n\nThe DHARMa package provides a general method for checking whether the residuals of a GLMM are distributed according to the specified model and whether there is any residual trend. The package works by simulating replicates of each observation according to the fitted model and then determining a “standardized residual”, which is the relative position of the observed value with respect to the simulated values, e.g. 0 if the observation is smaller than all the simulations, 0.5 if it is in the middle, etc. 
If the model represents the data well, each value of the standardized residual between 0 and 1 should be equally likely, so the standardized residuals should produce a uniform distribution between 0 and 1.\nThe simulateResiduals function performs the calculation of the standardized residuals, then the plot function plots the diagnostic graphs with the results of certain tests.\n\nlibrary(DHARMa)\nres_glmm <- simulateResiduals(mod_glmm)\nplot(res_glmm)\n\n\n\n\nThe graph on the left is a quantile-quantile plot of standardized residuals. The results of three statistical tests also also shown: a Kolmogorov-Smirnov (KS) test which checks whether there is a deviation from the theoretical distribution, a dispersion test that checks whether there is underdispersion or overdispersion, and an outlier test based on the number of residuals that are more extreme than all the simulations. Here, we get a significant result for the outliers, though the message indicates that this result might have an inflated type I error rate in this case.\nOn the right, we generally get a graph of standardized residuals (in y) as a function of the rank of the predicted values, in order to check for any leftover trend in the residual. Here, the predictions are binned by quartile, so it might be better to instead aggregate the predictions and residuals by village, which we can do with the recalculateResiduals function.\n\nplot(recalculateResiduals(res_glmm, group = gambia$id_village))\n\nDHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details\n\n\n\n\n\nThe plot to the right now shows individual points, along with a quantile regression for the 1st quartile, the median and the 3rd quartile. In theory, these three curves should be horizontal straight lines (no leftover trend in the residuals vs. predictions). The curve for the 3rd quartile (in red) is significantly different from a horizontal line, which could indicate some systematic effect that is missing from the model."
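+ To illustrate what a "standardized residual" means here, a tiny by-hand sketch with purely hypothetical values (DHARMa additionally randomizes ties for discrete responses, which this midpoint rule only approximates):
+ 
+ sims <- c(0, 1, 0, 0, 1, 1, 0, 1, 0, 0)  # hypothetical simulated replicates for one observation
+ obs <- 1                                 # observed value
+ # Relative position of the observation among the simulations (0 = below all, 1 = above all)
+ mean(sims < obs) + 0.5 * mean(sims == obs)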
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#spatial-glmm-with-spamm",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#spatial-glmm-with-spamm",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Spatial GLMM with spaMM",
+ "text": "Spatial GLMM with spaMM\nThe spaMM (spatial mixed models) package is a relatively new R package that can perform approximate maximum likelihood estimation of parameters for GLMM with spatial dependence, modelled either as a Gaussian process or with a CAR (we will see the latter in the last section). The package implements different algorithms, but there is a single fitme function that chooses the appropriate algorithm for each model type. For example, here is the same (non-spatial) model as above fit with spaMM.\n\nlibrary(spaMM)\n\nmod_spamm_glmm <- fitme(pos ~ netuse + phc + (1 | id_village),\n data = gambia, family = binomial)\nsummary(mod_spamm_glmm)\n\nformula: pos ~ netuse + phc + (1 | id_village)\nEstimation of lambda by ML (p_v approximation of logL).\nEstimation of fixed effects by ML (p_v approximation of logL).\nfamily: binomial( link = logit ) \n ------------ Fixed effects (beta) ------------\n Estimate Cond. SE t-value\n(Intercept) 0.1491 0.2287 0.6519\nnetuse -0.6045 0.1420 -4.2567\nphc -0.4986 0.2593 -1.9231\n --------------- Random effects ---------------\nFamily: gaussian( link = identity ) \n --- Variance parameters ('lambda'):\nlambda = var(u) for u ~ Gaussian; \n id_village : 0.8151 \n --- Coefficients for log(lambda):\n Group Term Estimate Cond.SE\n id_village (Intercept) -0.2045 0.2008\n# of obs: 2035; # of groups: id_village, 65 \n ------------- Likelihood values -------------\n logLik\nlogL (p_v(h)): -1210.016\n\n\nNote that the estimates of the fixed effects as well as the variance of random effects are nearly identical to those obtained by glmer above.\nWe can now use spaMM to fit the same model with the addition of spatial correlations between villages. In the formula of the model, this is represented as a random effect Matern(1 | x + y), which means that the intercepts are spatially correlated between villages following a Matérn correlation function of coordinates (x, y). The Matérn function is a flexible function for spatial correlation that includes a shape parameter \\(\\nu\\) (nu), so that when \\(\\nu = 0.5\\) it is equivalent to the exponential correlation but as \\(\\nu\\) grows to large values, it approaches a Gaussian correlation. We could let the function estimate \\(\\nu\\), but here we will fix it to 0.5 with the fixed argument of fitme.\n\nmod_spamm <- fitme(pos ~ netuse + phc + Matern(1 | x + y) + (1 | id_village),\n data = gambia, family = binomial, fixed = list(nu = 0.5))\n\nIncrease spaMM.options(separation_max=<.>) to at least 21 if you want to check separation (see 'help(separation)').\n\nsummary(mod_spamm)\n\nformula: pos ~ netuse + phc + Matern(1 | x + y) + (1 | id_village)\nEstimation of corrPars and lambda by ML (p_v approximation of logL).\nEstimation of fixed effects by ML (p_v approximation of logL).\nEstimation of lambda by 'outer' ML, maximizing logL.\nfamily: binomial( link = logit ) \n ------------ Fixed effects (beta) ------------\n Estimate Cond. SE t-value\n(Intercept) 0.06861 0.3352 0.2047\nnetuse -0.51719 0.1407 -3.6757\nphc -0.44416 0.2052 -2.1648\n --------------- Random effects ---------------\nFamily: gaussian( link = identity ) \n --- Correlation parameters:\n 1.nu 1.rho \n0.50000000 0.05128692 \n --- Variance parameters ('lambda'):\nlambda = var(u) for u ~ Gaussian; \n x + y : 0.6421 \n id_village : 0.1978 \n# of obs: 2035; # of groups: x + y, 65; id_village, 65 \n ------------- Likelihood values -------------\n logLik\nlogL (p_v(h)): -1197.968\n\n\nLet’s first check the random effects of the model. 
The spatial correlation function has a parameter rho equal to 0.0513. This parameter in spaMM is the inverse of the range, so here the range of exponential correlation is 1/0.0513 or around 19.5 km. There are now two variance parameters: the one identified as x + y is the long-range variance (i.e. sill) for the exponential correlation model whereas the one identified as id_village shows the non-spatially correlated portion of the variation between villages.\nIn fact, while we left the random effects (1 | id_village) in the formula to represent the non-spatial portion of variation between villages, we could also represent this with a nugget effect in the geostatistical model. In both cases, it would represent the idea that even two villages very close to each other would have different baseline prevalences in the model.\nBy default, the Matern function has no nugget effect, but we can add one by specifying a non-zero Nugget in the initial parameter list init.\n\nmod_spamm2 <- fitme(pos ~ netuse + phc + Matern(1 | x + y),\n data = gambia, family = binomial, fixed = list(nu = 0.5),\n init = list(Nugget = 0.1))\n\nIncrease spaMM.options(separation_max=<.>) to at least 21 if you want to check separation (see 'help(separation)').\n\nsummary(mod_spamm2)\n\nformula: pos ~ netuse + phc + Matern(1 | x + y)\nEstimation of corrPars and lambda by ML (p_v approximation of logL).\nEstimation of fixed effects by ML (p_v approximation of logL).\nEstimation of lambda by 'outer' ML, maximizing logL.\nfamily: binomial( link = logit ) \n ------------ Fixed effects (beta) ------------\n Estimate Cond. SE t-value\n(Intercept) 0.06861 0.3352 0.2047\nnetuse -0.51719 0.1407 -3.6757\nphc -0.44416 0.2052 -2.1648\n --------------- Random effects ---------------\nFamily: gaussian( link = identity ) \n --- Correlation parameters:\n 1.nu 1.Nugget 1.rho \n0.50000000 0.23551027 0.05128692 \n --- Variance parameters ('lambda'):\nlambda = var(u) for u ~ Gaussian; \n x + y : 0.8399 \n# of obs: 2035; # of groups: x + y, 65 \n ------------- Likelihood values -------------\n logLik\nlogL (p_v(h)): -1197.968\n\n\nAs you can see, all estimates are the same, except that the variance of the spatial portion (sill) is now 0.84 and the nugget is equal to a fraction 0.235 of that sill, so a variance of 0.197, which is essentially the same as the id_village random effect in the version above. Thus the two formulations are equivalent.\nNow, recall the coefficients we obtained for the non-spatial GLMM:\n\nsummary(mod_glmm)$coefficients\n\n Estimate Std. Error z value Pr(>|z|)\n(Intercept) 0.1490596 0.2296971 0.6489399 5.163772e-01\nnetuse -0.6043602 0.1442448 -4.1898243 2.791706e-05\nphc -0.4984835 0.2604083 -1.9142382 5.558973e-02\n\n\nIn the spatial version, both fixed effects have moved slightly towards zero, but the standard error of the effect of phc has decreased. It is interesting that the inclusion of spatial dependence has allowed us to estimate more precisely the effect of having a public health centre in the village. This would not always be the case: for a predictor that is also strongly correlated in space, spatial correlation in the response makes it harder to estimate the effect of this predictor, since it is confounded with the spatial effect. However, for a predictor that is not correlated in space, including the spatial effect reduces the residual (non-spatial) variance and may thus increase the precision of the predictor’s effect.\nThe spaMM package is also compatible with DHARMa for residual diagnostics. 
(You can in fact ignore the warning that it is not in the class of supported models; this is due to using the fitme function rather than a specific algorithm function in spaMM.)\n\nres_spamm <- simulateResiduals(mod_spamm2)\nplot(res_spamm)\n\nDHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details\n\n\n\n\nplot(recalculateResiduals(res_spamm, group = gambia$id_village))\n\nDHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details\n\n\n\n\n\nFinally, while we will show how to make and visualize spatial predictions below, we can produce a quick map of the estimated spatial effects in a spaMM model with the filled.mapMM function.\n\nfilled.mapMM(mod_spamm2)"
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#gaussian-process-models-vs.-smoothing-splines",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#gaussian-process-models-vs.-smoothing-splines",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Gaussian process models vs. smoothing splines",
+ "text": "Gaussian process models vs. smoothing splines\nIf you are familiar with generalized additive models (GAM), you might think that the spatial variation in malaria prevalence (as shown in the map above) could be represented by a 2D smoothing spline (as a function of \\(x\\) and \\(y\\)) within a GAM.\nThe code below fits the GAM equivalent of our Gaussian process GLMM above with the gam function in the mgcv package. The spatial effect is represented by the 2D spline s(x, y) whereas the non-spatial random effect of village is represented by s(id_village, bs = \"re\"), which is the same as (1 | id_village) in the previous models. Note that for the gam function, categorical variables must be explicitly converted to factors.\n\nlibrary(mgcv)\ngambia$id_village <- as.factor(gambia$id_village)\nmod_gam <- gam(pos ~ netuse + phc + s(id_village, bs = \"re\") + s(x, y), \n data = gambia, family = binomial)\n\nTo visualize the 2D spline, we will use the gratia package.\n\nlibrary(gratia)\ndraw(mod_gam)\n\n\n\n\nNote that the plot of the spline s(x, y) (top right) does not extend too far from the locations of the data (other areas are blank). In this graph, we can also see that the village random effects follow the expected Gaussian distribution (top left).\nNext, we will use both the spatial GLMM from the previous section and this GAMM to predict the mean prevalence on a spatial grid of points contained in the file gambia_pred.csv. The graph below adds those prediction points (in black) on the previous map of the data points.\n\ngambia_pred <- read.csv(\"data/gambia_pred.csv\")\n\nggplot(gambia_agg, aes(x = x, y = y)) +\n geom_point(data = gambia_pred) +\n geom_point(aes(color = prev)) +\n geom_path(data = gambia.borders, aes(x = x / 1000, y = y / 1000)) +\n coord_fixed() +\n theme_minimal() +\n scale_color_viridis_c()\n\n\n\n\nTo make predictions from the GAMM model at those points, the code below goes through the following steps:\n\nAll predictors in the model must be in the prediction data frame, so we add constant values of netuse and phc (both equal to 1) for all points. Thus, we will make predictions of malaria prevalence in the case where a net is used and a public health centre is present. We also add a constant id_village, although it will not be used in predictions (see below).\nWe call the predict function on the output of gam to produce predictions at the new data points (argument newdata), including standard errors (se.fit = TRUE) and excluding the village random effects, so the prediction is made for an “average village”. The resulting object gam_pred will have columns fit (mean prediction) and se.fit (standard error). Those predictions and standard errors are on the link (logit) scale.\nWe add the original prediction data frame to gam_pred with cbind.\nWe add columns for the mean prediction and 50% confidence interval boundaries (mean \\(\\pm\\) 0.674 standard error), converted from the logit scale to the probability scale with plogis. 
We choose a 50% interval since a 95% interval may be too wide here to contrast the different predictions on the map at the end of this section.\n\n\ngambia_pred <- mutate(gambia_pred, netuse = 1, phc = 1, id_village = 1)\n\ngam_pred <- predict(mod_gam, newdata = gambia_pred, se.fit = TRUE, \n exclude = \"s(id_village)\")\ngam_pred <- cbind(gambia_pred, as.data.frame(gam_pred))\ngam_pred <- mutate(gam_pred, pred = plogis(fit), \n lo = plogis(fit - 0.674 * se.fit), # 50% CI\n hi = plogis(fit + 0.674 * se.fit))\n\nNote: The reason we do not make predictions directly on the probability (response) scale is that the normal formula for confidence intervals applies more accurately on the logit scale. Adding a certain number of standard errors around the mean on the probability scale would lead to less accurate intervals and maybe even confidence intervals outside the possible range (0, 1) for a probability.\nWe apply the same strategy to make predictions from the spaMM spatial GLMM model. There are a few differences in the predict method compared with the GAMM case.\n\nThe argument binding = \"fit\" means that mean predictions (fit column) will be attached to the prediction dataset and returned as spamm_pred.\nThe variances = list(linPred = TRUE) tells predict to calculate the variance of the linear predictor (so the square of the standard error). However, it appears as an attribute predVar in the output data frame rather than a se.fit column, so we move it to a column on the next line.\n\n\nspamm_pred <- predict(mod_spamm, newdata = gambia_pred, type = \"link\",\n binding = \"fit\", variances = list(linPred = TRUE))\nspamm_pred$se.fit <- sqrt(attr(spamm_pred, \"predVar\"))\nspamm_pred <- mutate(spamm_pred, pred = plogis(fit), \n lo = plogis(fit - 0.674 * se.fit),\n hi = plogis(fit + 0.674 * se.fit))\n\nFinally, we combine both sets of predictions as different rows of a pred_all dataset with bind_rows. The name of the dataset each prediction originates from (gam or spamm) will appear in the “model” column (argument .id). To simplify production of the next plot, we then use pivot_longer in the tidyr package to change the three columns “pred”, “lo” and “hi” to two columns, “stat” and “value” (pred_tall has thus three rows for every row in pred_all).\n\npred_all <- bind_rows(gam = gam_pred, spamm = spamm_pred, .id = \"model\")\n\nlibrary(tidyr)\npred_tall <- pivot_longer(pred_all, c(pred, lo, hi), names_to = \"stat\",\n values_to = \"value\")\n\nHaving done these steps, we can finally look at the prediction maps (mean, lower and upper bounds of the 50% confidence interval) with ggplot. The original data points are shown in red.\n\nggplot(pred_tall, aes(x = x, y = y)) +\n geom_point(aes(color = value)) +\n geom_point(data = gambia_agg, color = \"red\", size = 0) +\n coord_fixed() +\n facet_grid(stat~model) +\n scale_color_viridis_c() +\n theme_minimal()\n\n\n\n\nWhile both models agree that there is a higher prevalence near the eastern cluster of villages, the GAMM also estimates a higher prevalence at a few points (western edge and around the center) where there is no data. This is an artifact of the shape of the spline fit around the data points, since a spline is meant to fit a global, although nonlinear, trend. In contrast, the geostatistical model represents the spatial effect as local correlations and reverts to the overall mean prevalence when far from any data points, which is a safer assumption. This is one reason to choose a geostatistical / Gaussian process model in this case."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#bayesian-methods-for-glmms-with-gaussian-processes",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#bayesian-methods-for-glmms-with-gaussian-processes",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Bayesian methods for GLMMs with Gaussian processes",
+ "text": "Bayesian methods for GLMMs with Gaussian processes\nBayesian models provide a flexible framework to express models with complex dependence structure among the data, including spatial dependence. However, fitting a Gaussian process model with a fully Bayesian approach can be slow, due the need to compute a spatial covariance matrix between all point pairs at each iteration.\nThe INLA (integrated nested Laplace approximation) method performs an approximate calculation of the Bayesian posterior distribution, which makes it suitable for spatial regression problems. We do not cover it in this course, but I recommend the textbook by Paula Moraga (in the references section below) that provides worked examples of using INLA for various geostatistical and areal data models, in the context of epidemiology, including models with both space and time dependence. The book presents the same Gambia malaria data as an example of a geostatistical dataset, which inspired its use in this course."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#reference-1",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#reference-1",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Reference",
+ "text": "Reference\nMoraga, Paula (2019) Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. Chapman & Hall/CRC Biostatistics Series. Available online at https://www.paulamoraga.com/book-geospatial/."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#plan-du-cours",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#plan-du-cours",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Plan du cours",
+ "text": "Plan du cours\n\n\n\nJour\nSujets\n\n\n\n\n1\n• Introduction aux statistiques spatiales • Analyse des patrons de points \n\n\n2\n• Corrélation spatiale d’une variable • Modèles géostatistiques\n\n\n3\n• Données aréales • Indice de Moran • Modèles d’autorégression spatiale • Analyse des données aréales dans R\n\n\n4\n• GLMM avec processus spatial gaussien • GLMM avec autorégression spatiale"
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#types-danalyses-spatiales",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#types-danalyses-spatiales",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Types d’analyses spatiales",
+ "text": "Types d’analyses spatiales\nDans le cadre de cette formation, nous discuterons de trois types d’analyses spatiales: l’analyse des patrons de points, les modèles géostatistiques et les modèles de données aréales.\nDans l’analyse des patrons de points, nous avons des données ponctuelles représentant la position d’individus ou d’événements dans une région d’étude et nous supposons que tous les individus ou événements ont été recensés dans cette région. Cette analyse s’intéresse à la distribution des positions des points eux-mêmes. Voici quelques questions typiques de l’analyse des patrons de points:\n\nLes points sont-ils disposés aléatoirement ou agglomérés?\nDeux types de points sont-ils disposés indépendamment?\n\nLes modèles géostatistiques visent à représenter la distribution spatiale de variables continues qui sont mesurés à certains points d’échantillonnage. Ils supposent que les mesures de ces variables à différents points sont corrélées en fonction de la distance entre ces points. Parmi les applications des modèles géostatistiques, notons le lissage des données spatiales (ex.: produire une carte d’une variable sur l’ensemble d’une région en fonction des mesures ponctuelles) et la prédiction de ces variables pour des points non-échantillonnés.\nLes données aréales sont des mesures prises non pas à des points, mais pour des régions de l’espace représentées par des polygones (ex.: divisions du territoire, cellules d’une grille). Les modèles représentant ces types de données définissent un réseau de voisinage reliant les régions et incluent une corrélation spatiale entre régions voisines."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#stationnarité-et-isotropie",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#stationnarité-et-isotropie",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Stationnarité et isotropie",
+ "text": "Stationnarité et isotropie\nPlusieurs analyses spatiales supposent que les variables sont stationnaires dans l’espace. Comme pour la stationnarité dans le domaine temporel, cette propriété signifie que les statistiques sommaires (moyenne, variance et corrélations entre mesures d’une variable) ne varient pas avec une translation dans l’espace. Par exemple, la corrélation spatiale entre deux points peut dépendre de la distance les séparant, mais pas de leur position absolue.\nEn particulier, il ne peut pas y avoir de tendance à grande échelle (souvent appelée gradient dans un contexte spatial), ou bien cette tendance doit être prise en compte afin de modéliser la corrélation spatiale des résidus.\nDans le cas de l’analyse des patrons de points, la stationnarité (aussi appelée homogénéité dans ce contexte) signifie que la densité des points ne suit pas de tendance à grande échelle.\nDans un modèle statistique isotropique, les corrélations spatiales entre les mesures à deux points dépendent seulement de la distance entre ces points, pas de la direction. Dans ce cas, les statistiques sommaires ne varient pas si on effectue une rotation dans l’espace."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#données-géoréférencées",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#données-géoréférencées",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Données géoréférencées",
+ "text": "Données géoréférencées\nLes études environnementales utilisent de plus en plus de données provenant de sources de données géospatiales, c’est-à-dire des variables mesurées sur une grande partie du globe (ex.: climat, télédétection). Le traitement de ces données requiert des concepts liés aux systèmes d’information géographique (SIG), qui ne sont pas couverts dans cet atelier, alors que nous nous concentrons sur les aspects statistiques de données variant dans l’espace.\nL’utilisation de données géospatiales ne signifie pas nécessairement qu’il faut avoir recours à des statistiques spatiales. Par exemple, il est courant d’extraire les valeurs de ces variables géographiques à des points d’étude pour expliquer une réponse biologique observée sur le terrain. Dans ce cas, l’utilisation de statistiques spatiales est seulement nécessaire en présence d’une corrélation spatiale dans les résidus, après avoir tenu compte de l’effet des prédicteurs."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#patron-de-points-et-processus-ponctuel",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#patron-de-points-et-processus-ponctuel",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Patron de points et processus ponctuel",
+ "text": "Patron de points et processus ponctuel\nUn patron de points (point pattern) décrit la position spatiale (le plus souvent en 2D) d’individus ou d’événements, représentés par des points, dans une aire d’étude donnée, souvent appelée la fenêtre d’observation.\nOn suppose que chaque point a une étendue spatiale négligeable par rapport aux distances entre les points. Des méthodes plus complexes existent pour traiter des patrons spatiaux d’objets qui ont une largeur non-néligeable, mais ce sujet dépasse la portée de cet atelier.\nUn processus ponctuel (point process) est un modèle statistique qui peut être utilisé pour simuler des patrons de points ou expliquer un patron de points observé."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#structure-spatiale-totalement-aléatoire",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#structure-spatiale-totalement-aléatoire",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Structure spatiale totalement aléatoire",
+ "text": "Structure spatiale totalement aléatoire\nUne structure spatiale totalement aléatoire (complete spatial randomness) est un des patrons les plus simples, qui sert de modèle nul pour évaluer les caractéristiques de patrons de points réels. Dans ce patron, la présence d’un point à une position donnée est indépendante de la présence de points dans un voisinage.\nLe processus créant ce patron est un processus de Poisson homogène. Selon ce modèle, le nombre de points dans toute région de superficie \\(A\\) suit une distribution de Poisson: \\(N(A) \\sim \\text{Pois}(\\lambda A)\\), où \\(\\lambda\\) est l’intensité du processus (i.e. la densité de points). \\(N\\) est indépendant entre deux régions disjointes, peu importe comment ces régions sont définies.\nDans le graphique ci-dessous, seul le patron à droite est totalement aléatoire. Le patron à gauche montre une agrégation des points (probabilité plus grande d’observer un point si on est à proximité d’un autre point), tandis que le patron du centre montre une répulsion (faible probabilité d’observer un point très près d’un autre)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#analyse-exploratoire-ou-inférentielle-pour-un-patron-de-points",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#analyse-exploratoire-ou-inférentielle-pour-un-patron-de-points",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Analyse exploratoire ou inférentielle pour un patron de points",
+ "text": "Analyse exploratoire ou inférentielle pour un patron de points\nPlusieurs statistiques sommaires sont utilisées pour décrire les caractéristiques un patron de points. La plus simple est l’intensité \\(\\lambda\\), qui comme mentionné plus haut représente la densité de points par unité de surface. Si le patron de points est hétérogène, l’intensité n’est pas constante, mais dépend de la position: \\(\\lambda(x, y)\\).\nPar rapport à l’intensité qui est une statistique dite de premier ordre, les statistiques de second ordre décrivent comment la probabilité de présence d’un point dans une région dépend de la présence d’autres points. L’indice \\(K\\) de Ripley présenté dans la prochaine section est un exemple de statistique sommaire de second ordre.\nLes inférences statistiques réalisées sur des patrons de points consistent habituellement à tester l’hypothèse que le patron de points correspond à un modèle nul donné, par exemple une structure spatiale totalement aléatoire, ou un modèle nul plus complexe. Même pour les modèles nuls les plus simples, nous connaissons rarement la distribution théorique pour une statistique sommaire du patron de points sous le modèle nul. Les tests d’hypothèses sur les patrons de points sont donc réalisés par simulation: on simule un grand nombre de patrons de points à partir du modèle nul et on compare la distribution des statistiques sommaires qui nous intéressent pour ces simulations à la valeur des statistiques pour le patron de points observé."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#indice-k-de-ripley",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#indice-k-de-ripley",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Indice \\(K\\) de Ripley",
+ "text": "Indice \\(K\\) de Ripley\nL’indice de Ripley \\(K(r)\\) est défini comme le nombre moyen de points se trouvant dans un cercle de rayon \\(r\\) donné autour d’un point du patron, normalisé par l’intensité \\(\\lambda\\).\nPour un patron totalement aléatoire, le nombre moyen de points dans un cercle de rayon \\(r\\) est \\(\\lambda \\pi r^2\\), donc en théorie \\(K(r) = \\pi r^2\\) pour ce modèle nul. Une valeur de \\(K(r)\\) supérieure signifie qu’il y a agrégation des points à l’échelle \\(r\\), tandis qu’une valeur inférieure signifie qu’il y a une répulsion.\nEn pratique, \\(K(r)\\) est estimé pour un patron de points donné par l’équation:\n\\[ K(r) = \\frac{A}{n(n-1)} \\sum_i \\sum_{j > i} I \\left( d_{ij} \\le r \\right) w_{ij}\\]\noù \\(A\\) est l’aire de la fenêtre d’observation et \\(n\\) est le nombre de points du patron, donc \\(n(n-1)\\) est le nombre de paires de points distinctes. On fait la somme pour toutes les paires de points de la fonction indicatrice \\(I\\), qui prend une valeur de 1 si la distance entre les points \\(i\\) et \\(j\\) est inférieure ou égale à \\(r\\). Finalement, le terme \\(w_{ij}\\) permet de donner un poids supplémentaire à certaines paires de points pour tenir compte des effets de bordure, tel que discuté dans la section suivante.\nPar exemple, les graphiques ci-dessous présentent la fonction estimée \\(K(r)\\) pour les patrons illustrés ci-dessus, pour des valeurs de \\(r\\) allant jusqu’à 1/4 de la largeur de la fenêtre. La courbe pointillée rouge indique la valeur théorique pour une structure spatiale totalement aléatoire et la zone grise est une “enveloppe” produite par 99 simulations de ce modèle nul. Le patron agrégé montre un excès de voisins jusqu’à \\(r = 0.25\\) et le patron avec répulsion montre un déficit significatif de voisins pour les petites valeurs de \\(r\\).\n\n\n\n\n\nOutre le \\(K\\), il existe d’autres statistiques pour décrire les propriétés de second ordre du patron, par exemple la distance moyenne entre un point et ses \\(N\\) plus proches voisins. Vous pouvez consulter le manuel de Wiegand et Moloney (2013) suggéré en référence pour en apprendre plus sur différentes statistiques sommaires des patrons de points."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#effets-de-bordure",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#effets-de-bordure",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Effets de bordure",
+ "text": "Effets de bordure\nDans le contexte de l’analyse de patrons de points, l’effet de bordure (“edge effect”) est dû au fait que nous avons une connaissance incomplète du voisinage des points près du bord de la fenêtre d’observation, ce qui peut induire un biais dans le calcul des statistiques comme le \\(K\\) de Ripley.\nDifférentes méthodes ont été développées pour corriger le biais dû aux effets de bordure. Selon la méthode de Ripley, la contribution d’un voisin \\(j\\) situé à une distance \\(r\\) d’un point \\(i\\) reçoit un poids \\(w_{ij} = 1/\\phi_i(r)\\), où \\(\\phi_i(r)\\) est la fraction du cercle de rayon \\(r\\) autour de \\(i\\) contenu dans la fenêtre d’observation. Par exemple, si 2/3 du cercle se trouve dans la fenêtre, ce voisin compte pour 3/2 voisins dans le calcul d’une statistique comme \\(K\\).\n\nLa méthode de Ripley est une des plus simples pour corriger les effets de bordure, mais n’est pas nécessairement la plus efficace; notamment, les poids plus grands donnés à certaines paires de points tend à accroître la variance du calcul de la statistique. D’autres méthodes de correction sont présentées dans les manuels spécialisés, comme celui de Wiegand et Moloney (2013) en référence."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exemple",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exemple",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Exemple",
+ "text": "Exemple\nPour cet exemple, nous utilisons le jeu de données semis_xy.csv, qui représente les coordonnées \\((x, y)\\) de semis de deux espèces (sp, B = bouleau et P = peuplier) dans une placette de 15 x 15 m.\n\nsemis <- read.csv(\"data/semis_xy.csv\")\nhead(semis)\n\n x y sp\n1 14.73 0.05 P\n2 14.72 1.71 P\n3 14.31 2.06 P\n4 14.16 2.64 P\n5 14.12 4.15 B\n6 9.88 4.08 B\n\n\nLe package spatstat permet d’effectuer des analyses de patrons de point dans R. La première étape consiste à transformer notre tableau de données en objet ppp (patron de points) avec la fonction du même nom. Dans cette fonction, nous spécifions quelles colonnes contiennent les coordonnées x et y ainsi que les marques (marks), qui seront ici les codes d’espèce. Il faut aussi spécifier une fenêtre d’observation (window) à l’aide de la fonction owin, à laquelle nous indiquons les limites de la placette en x et y.\n\nlibrary(spatstat)\n\nsemis <- ppp(x = semis$x, y = semis$y, marks = as.factor(semis$sp),\n window = owin(xrange = c(0, 15), yrange = c(0, 15)))\nsemis\n\nMarked planar point pattern: 281 points\nMultitype, with levels = B, P \nwindow: rectangle = [0, 15] x [0, 15] units\n\n\nLes marques peuvent être numériques ou catégorielles. Notez que pour des marques catégorielles comme c’est le cas ici, il faut convertir explicitement la variable en facteur.\nLa fonction plot appliquée à un patron de points montre un diagramme du patron.\n\nplot(semis)\n\n\n\n\nLa fonction intensity calcule la densité des points de chaque espèce par unité de surface, ici en \\(m^2\\).\n\nintensity(semis)\n\n B P \n0.6666667 0.5822222 \n\n\nPour analyser d’abord séparément la distribution de chaque espèce, nous séparons le patron avec split. Puisque le patron contient des marques catégorielles, la séparation se fait automatiquement en fonction de la valeur des marques. Le résultat est une liste de deux patrons de points.\n\nsemis_split <- split(semis)\nplot(semis_split)\n\n\n\n\nLa fonction Kest calcule le \\(K\\) de Ripley pour une série de distances allant (par défaut) jusqu’à 1/4 de la largeur de la fenêtre. Ici, nous l’appliquons au premier patron (bouleau) en choisissant semis_split[[1]]. Notez que les doubles crochets sont nécessaires pour choisir un élément d’une liste dans R.\nL’argument correction = \"iso\" indique d’appliquer la méthode de Ripley pour corriger les effets de bordure.\n\nk <- Kest(semis_split[[1]], correction = \"iso\")\nplot(k)\n\n\n\n\nSelon ce graphique, il semble y avoir une excès de voisins à partir d’un rayon de 1 m. Pour vérifier s’il s’agit d’un écart significatif, nous produisons une enveloppe de simulation avec la fonction envelope. 
Le permier argument d’envelope est un patron de point auquel les simulations seront comparées, le deuxième une fonction à calculer (ici, Kest) pour chaque patron simulé, puis on y ajoute les arguments de la fonction Kest (ici, seulement correction).\n\nplot(envelope(semis_split[[1]], Kest, correction = \"iso\"))\n\nGenerating 99 simulations of CSR ...\n1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,\n41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,\n81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.\n\nDone.\n\n\n\n\n\nTel qu’indiqué par le message, cette fonction effectue par défaut 99 simulations de l’hypothèse nulle correspondant à une structure spatiale totalement aléatoire (CSR, pour complete spatial randomness).\nLa courbe observée sort de l’enveloppe des 99 simulations près de \\(r = 2\\). Il faut être prudent de ne pas interpréter trop rapidement un résultat sortant de l’enveloppe. Même s’il y a environ une probabilité de 1% d’obtenir un résultat plus extrême selon l’hypothèse nulle à une distance donnée, l’enveloppe est calculée pour un grand nombre de valeurs de la distance et nous n’effectuons pas de correction pour les comparaisons multiples. Ainsi, un écart significatif pour une très petite plage de valeurs de \\(r\\) peut être simplement dû au hasard.\n\nExercice 1\nEn regardant le graphique du deuxième patron de points (semis de peuplier), pouvez-vous prédire où se situera le \\(K\\) de Ripley par rapport à l’hypothèse nulle d’une structure spatiale totalement aléatoire? Vérifiez votre prédiction en calculant le \\(K\\) de Ripley pour ce patron de points dans R."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#effet-de-lhétérogénéité",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#effet-de-lhétérogénéité",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Effet de l’hétérogénéité",
+ "text": "Effet de l’hétérogénéité\nLe graphique ci-dessous illustre un patron de points hétérogène, c’est-à-dire qu’il présente un gradient d’intensité (plus de points à gauche qu’à droite).\n\n\n\n\n\nUn gradient de densité peut être confondu avec une agrégation des points, comme on peut voir sur le graphique du \\(K\\) de Ripley correspondant. En théorie, il s’agit de deux processus différents:\n\nHétérogénéité: La densité de points varie dans la région d’étude, par exemple dû au fait que certaines conditions locales sont plus propices à la présence de l’espèce étudiée.\nAgrégation: La densité moyenne des points est homogène, mais la présence d’un point augmente la présence d’autre points dans son voisinage, par exemple en raison d’interactions positives entre les individus.\n\nCependant, il peut être difficile de différencier les deux en pratique, surtout que certains patrons peuvent être à la fois hétérogènes et agrégés.\nPrenons l’exemple des semis de peuplier de l’exercice précédent. La fonction density appliquée à un patron de points effectue une estimation par noyau (kernel density estimation) de la densité des semis à travers la placette. Par défaut, cette fonction utilise un noyau gaussien avec un écart-type sigma spécifié dans la fonction, qui détermine l’échelle à laquelle les fluctuations de densité sont “lissées”. Ici, nous utilisons une valeur de 2 m pour sigma et nous représentons d’abord la densité estimée avec plot, avant d’y superposer les points (add = TRUE signifie que les points sont ajoutés au graphique existant plutôt que de créer un nouveau graphique).\n\ndens_p <- density(semis_split[[2]], sigma = 2)\nplot(dens_p)\nplot(semis_split[[2]], add = TRUE)\n\n\n\n\nPour mesurer l’agrégation ou la répulsion des points d’un patron hétérogène, nous devons utilisé la version non-homogène de la statistique \\(K\\) (Kinhom dans spatstat). Cette statistique est toujours égale au nombre moyen de voisins dans un rayon \\(r\\) d’un point du patron, mais plutôt que de normaliser ce nombre par l’intensité globale du patron, il est normalisé par l’estimation locale de la densité de points. Comme ci-dessus, nous spécifions sigma = 2 pour contrôler le niveau de lissage de l’estimation de la densité variable.\n\nplot(Kinhom(semis_split[[2]], sigma = 2, correction = \"iso\"))\n\n\n\n\nEn tenant compte de l’hétérogénéité du patron à une échelle sigma de 2 m, il semble donc y avoir un déficit de voisins à partir d’environ 1.5 m des points du patron. Il reste à voir si cette déviation est significative.\nComme précédemment, nous utilisons envelope pour simuler la statistique Kinhom sous le modèle nul. Cependant, ici le modèle nul n’est pas un processus de Poisson homogène (structure spatiale totalement aléatoire). Il s’agit plutôt d’un processus de Poisson hétérogène simulé par la fonction rpoispp(dens_p), c’est-à-dire que les points sont indépendants les uns des autres, mais leur densité est hétérogène et donnée par dens_p. 
L’argument simulate de la fonction envelope permet de spécifier une fonction utilisée pour les simulations sous le modèle nul; cette fonction doit avoir un argument, ici x, même s’il n’est pas utilisé.\nFinalement, en plus des arguments nécessaires pour Kinhom, soit sigma et correction, nous spécifions aussi nsim = 199 pour réaliser 199 simulations et nrank = 5 pour éliminer les 5 résultats les plus extrêmes de chaque côté de l’enveloppe, donc les 10 plus extrêmes sur 199, pour réaliser un intervalle contenant environ 95% de la probabilité sous l’hypothèse nulle.\n\nkhet_p <- envelope(semis_split[[2]], Kinhom, sigma = 2, correction = \"iso\",\n nsim = 199, nrank = 5, simulate = function(x) rpoispp(dens_p))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\nplot(khet_p)\n\n\n\n\nNote: Pour un test d’hypothèse basé sur des simulations d’une hypothèse nulle, la valeur \\(p\\) est estimée par \\((m + 1)/(n + 1)\\), où \\(n\\) est le nombre de simulations et \\(m\\) est le nombre de simulations où la valeur de la statistique est plus extrême que celle des données observées. C’est pour cette raison qu’on choisit un nombre de simulations comme 99, 199, etc.\n\nExercice 2\nRépétez l’estimation de la densité hétérogène et le calcul de Kinhom avec un écart-type sigma de 5 plutôt que 2. Comment le niveau de lissage pour la densité influence-t-il les conclusions?\nPour différencier une variation de densité des points et d’une interaction (agrégation ou répulsion) entre ces points avec ce type d’analyse, il faut généralement supposer que les deux processus opèrent à différentes échelles. Typiquement, nous pouvons tester si les points sont agrégés à petite échelle après avoir tenu compte d’une variation de la densité à une échelle plus grande."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#relation-entre-deux-patrons-de-points",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#relation-entre-deux-patrons-de-points",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Relation entre deux patrons de points",
+ "text": "Relation entre deux patrons de points\nConsidérons un cas où nous avons deux patrons de points, par exemple la position des arbres de deux espèces dans une parcelle (points oranges et verts dans le graphique ci-dessous). Chacun des deux patrons peut présenter ou non des agrégations de points.\n\n\n\n\n\nSans égard à cette agrégation au niveau de l’espèce, nous voulons déterminer si les deux espèces sont disposées indépendamment. Autrement dit, la probabilité d’observer un arbre d’une espèce dépend-elle de la présence d’un arbre de l’autre espèce à une distance donnée?\nLa version bivariée du \\(K\\) de Ripley permet de répondre à cette question. Pour deux patrons désignés 1 et 2, l’indice \\(K_{12}(r)\\) calcule le nombre moyen de points du patron 2 dans un rayon \\(r\\) autour d’un point du patron 1, normalisé par la densité du patron 2.\nEn théorie, cet indice est symétrique, donc \\(K_{12}(r) = K_{21}(r)\\) et le résultat serait le même si on choisit les points du patron 1 ou 2 comme points “focaux” pour l’analyse. Cependant, l’estimation des deux quantités pour un patron observé peut différer, notamment en raison des effets de bord. La variabilité peut aussi être différente pour \\(K_{12}\\) et \\(K_{21}\\) entre les simulations d’un modèle nul, donc le test de l’hypothèse nulle peut avoir une puissance différente selon le choix de l’espèce focale.\nLe choix d’un modèle nul approprié est important ici. Afin de déterminer s’il existe une attraction ou une répulsion significative entre les deux patrons, il faut déplacer aléatoirement la position d’un des patrons relative à celle de l’autre patron, tout en conservant la structure spatiale de chaque patron pris isolément.\nUne des façons d’effectuer cette randomisation consiste à décaler l’un des deux patrons horizontalement et/ou verticalement d’une distance aléatoire. La partie du patron qui “sort” d’un côté de la fenêtre est rattachée de l’autre côté. Cette méthode s’appelle une translation toroïdale (toroidal shift), car en connectant le haut et le bas ainsi que la gauche et la droite d’une surface rectangulaire, on obtient la forme d’un tore (un “beigne” en trois dimensions).\n\n\n\n\n\nLe graphique ci-dessus illustre une translation du patron vert vers la droite, tandis que le patron orange reste au même endroit. Les points verts dans la zone ombragée sont ramenés de l’autre côté. Notez que si cette méthode préserve de façon générale la structure de chaque patron tout en randomisant leur position relative, elle peut comporter certains inconvénients, comme de diviser des amas de points qui se trouvent près du point de coupure.\nVérifions maintenant s’il y a une dépendance entre la position des deux espèces (bouleau et peuplier) dans notre placette. La fonction Kcross calcule l’indice bivarié \\(K_{ij}\\), il faut spécifier quel type de point est considéré comme l’espèce focale \\(i\\) et l’espèce voisine \\(j\\).\n\nplot(Kcross(semis, i = \"P\", j = \"B\", correction = \"iso\"))\n\n\n\n\nIci, le \\(K\\) observé est inférieur à la valeur théorique, indiquant une répulsion possible des deux patrons.\nPour déterminer l’enveloppe du \\(K\\) selon l’hypothèse nulle d’indépendance des deux patrons, nous devons spécifier que les simulations doivent être basées sur une translation des patrons. 
Nous indiquons que les simulations doivent utiliser la fonction rshift (translation aléatoire) avec l’argument simulate = function(x) rshift(x, which = \"B\"); ici, l’argument x de simulate correspond au patron de points original et l’argument which indique quel type de points subit la translation. Comme pour le cas précédent, il faut répéter dans la fonction envelope les arguments nécessaires pour Kcross, soit i, j et correction.\n\nplot(envelope(semis, Kcross, i = \"P\", j = \"B\", correction = \"iso\", \n nsim = 199, nrank = 5, simulate = function(x) rshift(x, which = \"B\")))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\n\n\n\n\nIci, la courbe observée se situe totalement dans l’enveloppe, donc nous ne rejetons pas l’hypothèse nulle d’indépendance des deux patrons.\n\nQuestions\n\nQuelle raison pourrait justifier ici notre choix d’effectuer la translation des points du bouleau plutôt que du peuplier?\nEst-ce que les simulations générées par translation aléatoire constitueraient un bon modèle nul si les deux patrons étaient hétérogènes?"
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#patrons-de-points-marqués",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#patrons-de-points-marqués",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Patrons de points marqués",
+ "text": "Patrons de points marqués\nLe jeu de données fir.csv contient les coordonnées \\((x, y)\\) de 822 sapins dans une placette d’un hectare et leur statut (A = vivant, D = mort) suivant une épidémie de tordeuse des bourgeons de l’épinette.\n\nfir <- read.csv(\"data/fir.csv\")\nhead(fir)\n\n x y status\n1 31.50 1.00 A\n2 85.25 30.75 D\n3 83.50 38.50 A\n4 84.00 37.75 A\n5 83.00 33.25 A\n6 33.25 0.25 A\n\n\n\nfir <- ppp(x = fir$x, y = fir$y, marks = as.factor(fir$status),\n window = owin(xrange = c(0, 100), yrange = c(0, 100)))\nplot(fir)\n\n\n\n\nSupposons que nous voulons vérifier si la mortalité des sapins est indépendante ou corrélée entre arbres rapprochés. En quoi cette question diffère-t-elle de l’exemple précédent où nous voulions savoir si la position des points de deux espèces était indépendante?\nDans l’exemple précédent, l’indépendance ou l’interaction entre les espèces référait à la formation du patron lui-même (que des semis d’une espèce s’établissent ou non à proximité de ceux de l’autre espèce). Ici, la caractéristique qui nous intéresse (survie des sapins) est postérieure à l’établissement du patron, en supposant que tous ces arbres étaient vivants d’abord et que certains sont morts suite à l’épidémie. Donc nous prenons la position des arbres comme fixe et nous voulons savoir si la distribution des statuts (mort, vivant) entre ces arbres est aléatoire ou présente un patron spatial.\nDans le manuel de Wiegand et Moloney, la première situation (établissement de semis de deux espèces) est appelé patron bivarié, donc il s’agit vraiment de deux patrons qui interagissent, tandis que la deuxième est un seul patron avec une marque qualitative. Le package spatstat dans R ne fait pas de différences entre les deux au niveau de la définition du patron (les types de points sont toujours représentés par l’argument marks), mais les méthodes d’analyse appliquées aux deux questions diffèrent.\nDans le cas d’un patron avec une marque qualitative, nous pouvons définir une fonction de connexion de marques (mark connection function) \\(p_{ij}(r)\\). Pour deux points séparés par une distance \\(r\\), cette fonction donne la probabilité que le premier point porte la marque \\(i\\) et le deuxième la marque \\(j\\). Selon l’hypothèse nulle où les marques sont indépendantes, cette probabilité est égale au produit des proportions de chaque marque dans le patron entier, \\(p_{ij}(r) = p_i p_j\\) indépendamment de \\(r\\).\nDans spatstat, la fonction de connexion de marques est calculée avec la fonction markconnect, où il faut spécifier les marques \\(i\\) et \\(j\\) ainsi que le type de correction des effets de bord. 
Dans notre exemple, nous voyons que deux points rapprochés ont moins de chance d’avoir une statut différent (A et D) que prévu selon l’hypothèse de distribution aléatoire et indépendante des marques (ligne rouge pointillée).\n\nplot(markconnect(fir, i = \"A\", j = \"D\", correction = \"iso\"))\n\n\n\n\nDans ce graphique, les ondulations dans la fonction sont dues à l’erreur d’estimation d’une fonction continue de \\(r\\) à partir d’un nombre limité de paires de points discrètes.\nPour simuler le modèle nul dans ce cas-ci, nous utilisons la fonction rlabel qui réassigne aléatoirement les marques parmi les points du patron, en maintenant la position des points.\n\nplot(envelope(fir, markconnect, i = \"A\", j = \"D\", correction = \"iso\", \n nsim = 199, nrank = 5, simulate = rlabel))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\n\n\n\n\nNotez que puisque la fonction rlabel a un seul argument obligatoire correspondant au patron de points original, il n’était pas nécessaire de spécifier au long: simulate = function(x) rlabel(x).\nVoici les résultats pour les paires d’arbres du même statut A ou D:\n\npar(mfrow = c(1, 2))\nplot(envelope(fir, markconnect, i = \"A\", j = \"A\", correction = \"iso\", \n nsim = 199, nrank = 5, simulate = rlabel))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\nplot(envelope(fir, markconnect, i = \"D\", j = \"D\", correction = \"iso\", \n nsim = 199, nrank = 5, simulate = rlabel))\n\nGenerating 199 simulations by evaluating function ...\n1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40\n.42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80\n.82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120\n.122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160\n.162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198 199.\n\nDone.\n\n\n\n\n\nIl semble donc que la mortalité des sapins due à cette épidémie est agrégée spatialement, puisque les arbres situés à proximité l’un de l’autre ont une plus grande probabilité de partager le même statut que prévu par l’hypothèse nulle."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#références",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#références",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Références",
+ "text": "Références\nFortin, M.-J. et Dale, M.R.T. (2005) Spatial Analysis: A Guide for Ecologists. Cambridge University Press: Cambridge, UK.\nWiegand, T. et Moloney, K.A. (2013) Handbook of Spatial Point-Pattern Analysis in Ecology, CRC Press.\nLe jeu de données du dernier exemple est tiré des données de la Forêt d’enseignement et de recherche du Lac Duparquet (FERLD), disponibles sur Dryad en suivant ce lien."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#dépendance-intrinsèque-ou-induite",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#dépendance-intrinsèque-ou-induite",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Dépendance intrinsèque ou induite",
+ "text": "Dépendance intrinsèque ou induite\nIl existe deux types fondamentaux de dépendance spatiale sur une variable mesurée \\(y\\): une dépendance intrinsèque à \\(y\\), ou une dépendance induite par des variables externes influençant \\(y\\), qui sont elles-mêmes corrélées dans l’espace.\nPar exemple, supposons que l’abondance d’une espèce soit corrélée entre deux sites rapprochés:\n\ncette dépendance spatiale peut être induite si elle est due à une corrélation spatiale des facteurs d’habitat qui favorisent ou défavorisent l’espèce;\nou elle peut être intrinsèque si elle est due à la dispersion d’individus entre sites rapprochés.\n\nDans plusieurs cas, les deux types de dépendance affectent une variable donnée.\nSi la dépendance est simplement induite et que les variables externes qui en sont la cause sont incluses dans le modèle expliquant \\(y\\), alors les résidus du modèle seront indépendants et nous pouvons utiliser toutes les méthodes déjà vues qui ignorent la dépendance spatiale.\nCependant, si la dépendance est intrinsèque ou due à des influences externes non-mesurées, alors il faudra tenir compte de la dépendance spatiale des résidus dans le modèle."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#différentes-façons-de-modéliser-les-effets-spatiaux",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#différentes-façons-de-modéliser-les-effets-spatiaux",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Différentes façons de modéliser les effets spatiaux",
+ "text": "Différentes façons de modéliser les effets spatiaux\nDans cette formation, nous modéliserons directement les corrélations spatiales de nos données. Il est utile de comparer cette approche à d’autres façons d’inclure des aspects spatiaux dans un modèle statistique.\nD’abord, nous pourrions inclure des prédicteurs dans le modèle qui représentent la position (ex.: longitude, latitude). De tels prédicteurs peuvent être utiles pour détecter une tendance ou un gradient systématique à grande échelle, que cette tendance soit linéaire ou non (par exemple, avec un modèle additif généralisé).\nEn contraste à cette approche, les modèles que nous verrons maintenant servent à modéliser une corrélation spatiale dans les fluctuations aléatoires d’une variable (i.e., dans les résidus après avoir enlevé tout effet systématique).\nLes modèles mixtes utilisent des effets aléatoires pour représenter la non-indépendance de données sur la base de leur groupement, c’est-à-dire qu’après avoir tenu compte des effets fixes systématiques, les données d’un même groupe sont plus semblables (leur variation résiduelle est corrélée) par rapport aux données de groupes différents. Ces groupes étaient parfois définis selon des critères spatiaux (observations regroupées en sites).\nCependant, dans un contexte d’effet aléatoire de groupe, tous les groupes sont aussi différents les uns des autres, ex.: deux sites à 100 km l’un de l’autre ne sont pas plus ou moins semblables que deux sites distants de 2 km.\nLes méthodes que nous verrons ici et dans les prochains parties de la formation nous permettent donc ce modéliser la non-indépendance sur une échelle continue (plus proche = plus corrélé) plutôt que seulement discrète (hiérarchie de groupements)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#variogramme",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#variogramme",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Variogramme",
+ "text": "Variogramme\nUn aspect central de la géostatistique est l’estimation du variogramme \\(\\gamma_z\\) de la variable \\(z\\). Le variogramme est égal à la moitié de l’écart carré moyen entre les valeurs de \\(z\\) pour deux points \\((x_i, y_i)\\) et \\((x_j, y_j)\\) séparés par une distance \\(h\\).\n\\[\\gamma_z(h) = \\frac{1}{2} \\text{E} \\left[ \\left( z(x_i, y_i) - z(x_j, y_j) \\right)^2 \\right]_{d_{ij} = h}\\]\nDans cette équation, la fonction \\(\\text{E}\\) avec l’indice \\(d_{ij}=h\\) désigne l’espérance statistique (autrement dit, la moyenne) de l’écart au carré entre les valeurs de \\(z\\) pour les points séparés par une distance \\(h\\).\nSi on préfère exprimer l’autocorrélation \\(\\rho_z(h)\\) entre mesures de \\(z\\) séparées par une distance \\(h\\), celle-ci est reliée au variogramme par l’équation:\n\\[\\gamma_z = \\sigma_z^2(1 - \\rho_z)\\] ,\noù \\(\\sigma_z^2\\) est la variance globale de \\(z\\).\nNotez que \\(\\gamma_z = \\sigma_z^2\\) si nous sommes à une distance où les mesures de \\(z\\) sont indépendantes, donc \\(\\rho_z = 0\\). Dans ce cas, on voit bien que \\(\\gamma_z\\) s’apparente à une variance, même s’il est parfois appelé “semivariogramme” ou “semivariance” en raison du facteur 1/2 dans l’équation ci-dessus."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#modèles-théoriques-du-variogramme",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#modèles-théoriques-du-variogramme",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Modèles théoriques du variogramme",
+ "text": "Modèles théoriques du variogramme\nPlusieurs modèles paramétriques ont été proposés pour représenter la corrélation spatiale en fonction de la distance entre points d’échantillonnage. Considérons d’abord une corrélation qui diminue de façon exponentielle:\n\\[\\rho_z(h) = e^{-h/r}\\]\nIci, \\(\\rho_z = 1\\) pour \\(h = 0\\) et la corréaltion est multipliée par \\(1/e \\approx 0.37\\) pour chaque augmentation de \\(r\\) de la distance. Dans ce contexte, \\(r\\) se nomme la portée (range) de la corrélation.\nÀ partir de l’équation ci-dessus, nous pouvons calculer le variogramme correspondant.\n\\[\\gamma_z(h) = \\sigma_z^2 (1 - e^{-h/r})\\]\nVoici une représentation graphique de ce variogramme.\n\n\n\n\n\nEn raison de la fonction exponentielle, la valeur de \\(\\gamma\\) à des grandes distances s’approche de la variance globale \\(\\sigma_z^2\\) sans exactement l’atteindre. Cette asymptote est appelée palier (sill) dans le contexte géostatistique et représentée par le symbole \\(s\\).\nFinalement, il n’est parfois pas réaliste de supposer une corrélation parfaite lorsque la distance tend vers 0, en raison d’une variation possible de \\(z\\) à très petite échelle. On peut ajouter au modèle un effet de pépite (nugget), noté \\(n\\), pour que \\(\\gamma\\) s’approche de \\(n\\) (plutôt que 0) si \\(h\\) tend vers 0. Le terme pépite provient de l’origine minière de ces techniques, où une pépite d’un minerai pourrait être la source d’une variation abrupte de la concentration à petite échelle.\nEn ajoutant l’effet de pépite, le reste du variogramme est “compressé” pour conserver le même palier, ce qui résulte en l’équation suivante.\n\\[\\gamma_z(h) = n + (s - n) (1 - e^{-h/r})\\]\nDans le package gstat que nous utiliserons ci-dessous, le terme \\((s - n)\\) est le palier partiel (partial sill, ou psill) pour la partie exponentielle.\n\n\n\n\n\nEn plus du modèle exponentiel, deux autres modèles théoriques courants pour le variogramme sont le modèle gaussien (où la corrélation suit une courbe demi-normale), ainsi que le modèle sphérique (où le variogramme augmente de façon linéaire au départ pour ensuite courber et atteindre le palier à une distance égale à sa portée \\(r\\)). Le modèle sphérique permet donc à la corrélation d’être exactement 0 à grande distance, plutôt que de s’approcher graduellement de zéro dans le cas des autres modèles.\n\n\n\n\n\n\n\n\nModèle\n\\(\\rho(h)\\)\n\\(\\gamma(h)\\)\n\n\n\n\nExponentiel\n\\(\\exp\\left(-\\frac{h}{r}\\right)\\)\n\\(s \\left(1 - \\exp\\left(-\\frac{h}{r}\\right)\\right)\\)\n\n\nGaussien\n\\(\\exp\\left(-\\frac{h^2}{r^2}\\right)\\)\n\\(s \\left(1 - \\exp\\left(-\\frac{h^2}{r^2}\\right)\\right)\\)\n\n\nSphérique \\((h < r)\\) *\n\\(1 - \\frac{3}{2}\\frac{h}{r} + \\frac{1}{2}\\frac{h^3}{r^3}\\)\n\\(s \\left(\\frac{3}{2}\\frac{h}{r} - \\frac{1}{2}\\frac{h^3}{r^3} \\right)\\)\n\n\n\n* Pour le modèle sphérique, \\(\\rho = 0\\) et \\(\\gamma = s\\) si \\(h \\ge r\\)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#variogramme-empirique",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#variogramme-empirique",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Variogramme empirique",
+ "text": "Variogramme empirique\nPour estimer \\(\\gamma_z(h)\\) à partir de données empiriques, nous devons définir des classes de distance, donc grouper différentes distances dans une marge \\(\\pm \\delta\\) autour d’une distance \\(h\\), puis calculer l’écart-carré moyen pour les paires de points dans cette classe de distance.\n\\[\\hat{\\gamma_z}(h) = \\frac{1}{2 N_{\\text{paires}}} \\sum \\left[ \\left( z(x_i, y_i) - z(x_j, y_j) \\right)^2 \\right]_{d_{ij} = h \\pm \\delta}\\]\nNous verrons dans la partie suivante comment estimer un variogramme dans R."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#modèle-de-régression-avec-corrélation-spatiale",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#modèle-de-régression-avec-corrélation-spatiale",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Modèle de régression avec corrélation spatiale",
+ "text": "Modèle de régression avec corrélation spatiale\nL’équation suivante représente une régression linéaire multiple incluant une corrélation spatiale résiduelle:\n\\[v = \\beta_0 + \\sum_i \\beta_i u_i + z + \\epsilon\\]\nIci, \\(v\\) désigne la variable réponse et \\(u\\) les prédicteurs, pour ne pas confondre avec les coordonnées spatiales \\(x\\) et \\(y\\).\nEn plus du résidu \\(\\epsilon\\) qui est indépendant entre les observations, le modèle inclut un terme \\(z\\) qui représente la portion spatialement corrélée de la variance résiduelle.\nVoici une suggestions d’étapes à suivre pour appliquer ce type de modèle:\n\nAjuster le modèle de régression sans corrélation spatiale.\nVérifier la présence de corrélation spatiale à partir du variogramme empirique des résidus.\nAjuster un ou plusieurs modèles de régression avec corrélation spatiale et choisir celui qui montre le meilleur ajustement aux données."
+ },
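As a rough skeleton of those three steps (the variable and data names below are placeholders, not taken from the course data), the workflow could look like this in R, using gstat for step 2 and nlme::gls for step 3; the full worked example with the oxford data follows in the next sections.

```r
library(gstat)
library(nlme)
library(sf)

# Hypothetical data frame `dat` with response v, predictor u and coordinates x, y
# 1. Regression without spatial correlation
mod_lm <- lm(v ~ u, data = dat)

# 2. Empirical variogram of the residuals
dat_sf <- st_as_sf(dat, coords = c("x", "y"))
dat_sf$res <- residuals(mod_lm)
plot(variogram(res ~ 1, dat_sf))

# 3. Regression with (here) an exponential spatial correlation, compared by AIC
mod_gls0 <- gls(v ~ u, data = dat)
mod_gls  <- gls(v ~ u, data = dat,
                correlation = corExp(form = ~ x + y, nugget = TRUE))
AIC(mod_gls0, mod_gls)
```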
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#régression-avec-corrélation-spatiale",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#régression-avec-corrélation-spatiale",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Régression avec corrélation spatiale",
+ "text": "Régression avec corrélation spatiale\nNous avons vu ci-dessus que le package gstat permet d’estimer le variogramme des résidus d’un modèle linéaire. Dans notre exemple, la concentration de magnésium était modélisée en fonction du pH, avec des résidus spatialement corrélés.\nUn autre outil pour ajuster ce même type de modèle est la fonction gls du package nlme, qui est inclus avec l’installation de R.\nCette fonction applique la méthode des moindres carrés généralisés (generalized least squares) pour ajuster des modèles de régression linéaire lorsque les résidus ne sont pas indépendants ou lorsque la variance résiduelle n’est pas la même pour toutes les observations. Comme les estimés des coefficients dépendent de l’estimé des corrélations entre les résidus et que ces derniers dépendent eux-mêmes des coefficients, le modèle est ajusté par un algorithme itératif:\n\nOn ajuste un modèle de régression linéaire classique (sans corrélation) pour obtenir des résidus.\nOn ajuste le modèle de corrélation spatiale (variogramme) avec ses résidus.\nOn ré-estime les coefficients de la régression en tenant compte maintenant des corrélations.\n\nLes étapes 2 et 3 sont répétées jusqu’à ce que les estimés soient stables à une précision voulue.\nVoici l’application de cette méthode au même modèle pour la concentration de magnésium dans le jeu de données oxford. Dans l’argument correlation de gls, nous spécifions un modèle de corrélation exponentielle en fonction de nos coordonnées spatiales et indiquons que nous voulons aussi estimer un effet de pépite.\nEn plus de la corrélation exponentielle corExp, la fonction gls peut aussi estimer un modèle gaussien (corGaus) ou sphérique (corSpher).\n\nlibrary(nlme)\ngls_mg <- gls(MG1 ~ PH1, oxford, \n correlation = corExp(form = ~ XCOORD + YCOORD, nugget = TRUE))\nsummary(gls_mg)\n\nGeneralized least squares fit by REML\n Model: MG1 ~ PH1 \n Data: oxford \n AIC BIC logLik\n 1278.65 1292.751 -634.325\n\nCorrelation Structure: Exponential spatial correlation\n Formula: ~XCOORD + YCOORD \n Parameter estimate(s):\n range nugget \n478.0322964 0.2944753 \n\nCoefficients:\n Value Std.Error t-value p-value\n(Intercept) 391.1387 50.42343 7.757084 0\nPH1 -41.0836 6.15662 -6.673079 0\n\n Correlation: \n (Intr)\nPH1 -0.891\n\nStandardized residuals:\n Min Q1 Med Q3 Max \n-2.1846957 -0.6684520 -0.3687813 0.4627580 3.1918604 \n\nResidual standard error: 53.8233 \nDegrees of freedom: 126 total; 124 residual\n\n\nPour comparer ce résultat au variogramme ajusté ci-dessus, il faut transformer les paramètres donnés par gls. La portée (range) a le même sens dans les deux cas et correspond à 478 m pour le résultat de gls. La variance globale des résidus est le carré de Residual standard error. L’effet de pépite ici (0.294) est exprimé comme fraction de cette variance. Finalement, pour obtenir le palier partiel de la partie exponentielle, il faut soustraire l’effet de pépite de la variance totale.\nAprès avoir réalisé ces calculs, nous pouvons donner ces paramètres à la fonction vgm de gstat pour superposer ce variogramme estimé par gls à notre variogramme des résidus du modèle linéaire classique.\n\ngls_range <- 478\ngls_var <- 53.823^2\ngls_nugget <- 0.294 * gls_var\ngls_psill <- gls_var - gls_nugget\n\ngls_vgm <- vgm(\"Exp\", psill = gls_psill, range = gls_range, nugget = gls_nugget)\n\nplot(var_mg, gls_vgm, col = \"black\", ylim = c(0, 4000))\n\n\n\n\nEst-ce que le modèle est moins bien ajusté aux données ici? 
En fait, ce variogramme empirique représenté par les points avait été obtenu à partir des résidus du modèle linéaire ignorant la corrélation spatiale, donc c’est un estimé biaisé des corrélations spatiales réelles. La méthode est quand même adéquate pour vérifier rapidement s’il y a présence de corrélations spatiales. Toutefois, pour ajuster simultanément les coefficients de la régression et les paramètres de corrélation spatiale, l’approche des moindres carrés généralisés (GLS) est préférable et produira des estimés plus justes.\nFinalement, notez que le résultat du modèle gls donne aussi l’AIC, que nous pouvons utiliser pour comparer l’ajustement de différents modèles (avec différents prédicteurs ou différentes formes de corrélation spatiale)."
+ },
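Following up on the last remark, here is a minimal sketch (same oxford data and model as above; which correlation structure fits best is not assumed here) comparing the exponential fit with Gaussian and spherical alternatives through their AIC.

```r
# Same model refitted with three different spatial correlation structures
gls_exp <- gls(MG1 ~ PH1, oxford,
               correlation = corExp(form = ~ XCOORD + YCOORD, nugget = TRUE))
gls_gau <- gls(MG1 ~ PH1, oxford,
               correlation = corGaus(form = ~ XCOORD + YCOORD, nugget = TRUE))
gls_sph <- gls(MG1 ~ PH1, oxford,
               correlation = corSpher(form = ~ XCOORD + YCOORD, nugget = TRUE))

AIC(gls_exp, gls_gau, gls_sph)  # lower AIC indicates a better fit
```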
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exercice",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exercice",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Exercice",
+ "text": "Exercice\nLe fichier bryo_belg.csv est adapté des données de l’étude:\n\nNeyens, T., Diggle, P.J., Faes, C., Beenaerts, N., Artois, T. et Giorgi, E. (2019) Mapping species richness using opportunistic samples: a case study on ground-floor bryophyte species richness in the Belgian province of Limburg. Scientific Reports 9, 19122. https://doi.org/10.1038/s41598-019-55593-x\n\nCe tableau de données indique la richesse spécifique des bryophytes au sol (richness) pour différents points d’échantillonnage de la province belge de Limbourg, avec leur position (x, y) en km, en plus de l’information sur la proportion de forêts (forest) et de milieux humides (wetland) dans une cellule de 1 km\\(^2\\) contenant le point d’échantillonnage.\n\nbryo_belg <- read.csv(\"data/bryo_belg.csv\")\nhead(bryo_belg)\n\n richness forest wetland x y\n1 9 0.2556721 0.5036614 228.9516 220.8869\n2 6 0.6449114 0.1172068 227.6714 219.8613\n3 5 0.5039905 0.6327003 228.8252 220.1073\n4 3 0.5987329 0.2432942 229.2775 218.9035\n5 2 0.7600775 0.1163538 209.2435 215.2414\n6 10 0.6865434 0.0000000 210.4142 216.5579\n\n\nPour cet exercice, nous utiliserons la racine carrée de la richesse spécifique comme variable réponse. La transformation racine carrée permet souvent d’homogénéiser la variance des données de comptage afin d’y appliquer une régression linéaire.\n\nAjustez un modèle linéaire de la richesse spécifique transformée en fonction de la fraction de forêt et de milieux humides, sans tenir compte des corrélations spatiales. Quel est l’effet des deux prédicteurs selon ce modèle?\nCalculez le variogramme empirique des résidus du modèle en (a). Semble-t-il y avoir une corrélation spatiale entre les points?\n\nNote: L’argument cutoff de la fonction variogram spécifie la distance maximale à laquelle le variogramme est calculé. Vous pouvez ajuster manuellement cette valeur pour bien voir le palier.\n\nRé-ajustez le modèle linéaire en (a) avec la fonction gls du package nlme, en essayant différents types de corrélations spatiales (exponentielle, gaussienne, sphérique). Comparez les modèles (incluant celui sans corrélation spatiale) avec l’AIC.\nQuel est l’effet de la fraction de forêts et de milieux humides selon le modèle en (c)? Expliquez les différences entre les conclusions de ce modèle et du modèle en (a)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#autorégression-conditionnelle-car",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#autorégression-conditionnelle-car",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Autorégression conditionnelle (CAR)",
+ "text": "Autorégression conditionnelle (CAR)\nDans le modèle d’autorégression conditionnelle, la valeur de \\(z_i\\) pour la région \\(i\\) suit une distribution normale: sa moyenne dépend de la valeur \\(z_j\\) des régions voisines, multipliée par le poids \\(w_{ij}\\) et un coefficient de corrélation \\(\\rho\\); son écart-type \\(\\sigma_{z_i}\\) peut varier d’une région à l’autre.\n\\[z_i \\sim \\text{N}\\left(\\sum_j \\rho w_{ij} z_j,\\sigma_{z_i} \\right)\\]\nDans ce modèle, si \\(w_{ij}\\) est une matrice binaire (0 pour les non-voisins, 1 pour les voisins), alors \\(\\rho\\) est le coefficient de corrélation partielle entre régions voisines. Cela est semblable à un modèle autorégressif d’ordre 1 dans le contexte de séries temporelles, où le coefficient d’autorégression indique la corrélation partielle."
+ },
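To connect the conditional definition to a joint distribution, here is a small sketch that is not part of the original notes: it simulates one realization of a proper CAR field on a hypothetical 5 x 5 grid with binary weights and constant variance, using the standard result that the joint covariance is then sigma^2 * (I - rho W)^(-1).

```r
library(MASS)  # for mvrnorm

# Hypothetical example: binary neighbour weights on a 5 x 5 grid (rook adjacency)
coords <- expand.grid(x = 1:5, y = 1:5)
W <- as.matrix(dist(coords)) == 1   # neighbours are exactly one unit apart
mode(W) <- "numeric"
n <- nrow(coords)

# With symmetric binary weights and constant variance sigma2, the joint
# distribution implied by the conditional definition is
# N(0, sigma2 * solve(I - rho * W)), valid while (I - rho * W) stays
# positive definite (here roughly |rho| < 0.29).
rho <- 0.15
sigma2 <- 1
Sigma <- sigma2 * solve(diag(n) - rho * W)

set.seed(42)
z <- mvrnorm(1, mu = rep(0, n), Sigma = Sigma)
```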
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#autorégression-simultanée-sar",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#autorégression-simultanée-sar",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Autorégression simultanée (SAR)",
+ "text": "Autorégression simultanée (SAR)\nDans le modèle d’autorégression simultanée, la valeur de \\(z_i\\) est donnée directement par la somme de contributions des valeurs voisines \\(z_j\\), multipliées par \\(\\rho w_{ij}\\), avec un résidu indépendant \\(\\nu_i\\) d’écart-type \\(\\sigma_z\\).\n\\[z_i = \\sum_j \\rho w_{ij} z_j + \\nu_i\\]\nÀ première vue, cela ressemble à un modèle autorégressif temporel. Il existe cependant une différence conceptuelle importante. Pour les modèles temporels, l’influence causale est dirigée dans une seule direction: \\(v(t-2)\\) affecte \\(v(t-1)\\) qui affecte ensuite \\(v(t)\\). Pour un modèle spatial, chaque \\(z_j\\) qui affecte \\(z_i\\) dépend à son tour de \\(z_i\\). Ainsi, pour déterminer la distribution conjointe des \\(z\\), il faut résoudre simultanément (d’où le nom du modèle) un système d’équations.\nPour cette raison, même si ce modèle ressemble à la formule du modèle conditionnel (CAR), les solutions des deux modèles diffèrent et dans le cas du SAR, le coefficient \\(\\rho\\) n’est pas directement égal à la corrélation partielle due à chaque région voisine.\nPour plus de détails sur les aspects mathématiques de ces modèles, vous pouvez consulter l’article de Ver Hoef et al. (2018) suggéré en référence.\nPour l’instant, nous considérerons les SAR et les CAR comme deux types de modèles possibles pour représenter une corrélation spatiale sur un réseau. Nous pouvons toujours ajuster plusieurs modèles et les comparer avec l’AIC pour choisir la meilleure forme de la corrélation ou la meilleure matrice de poids.\nLes modèles CAR et SAR partagent un avantage sur les modèles géostatistiques au niveau de l’efficacité. Dans un modèle géostatistique, les corrélations spatiales sont définies entre chaque paire de points, même si elles deviennent négligeables lorsque la distance augmente. Pour un modèle CAR ou SAR, seules les régions voisines contribuent et la plupart des poids sont égaux à 0, ce qui rend ces modèles plus rapides à ajuster qu’un modèle géostatistique lorsque les données sont massives."
+ },
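As a companion sketch for the simultaneous model (reusing the hypothetical W, rho, sigma2 and n from the CAR sketch above), the whole field is obtained by solving the linear system (I - rho W) z = nu at once, and its implied covariance differs from the CAR one.

```r
# Independent residuals nu, then solve (I - rho*W) z = nu for the whole field
set.seed(43)
nu <- rnorm(n, sd = sqrt(sigma2))
z_sar <- solve(diag(n) - rho * W, nu)

# Implied covariance: sigma2 * A %*% t(A) with A = solve(I - rho*W),
# in contrast with sigma2 * solve(I - rho*W) for the CAR model
A <- solve(diag(n) - rho * W)
Sigma_sar <- sigma2 * A %*% t(A)
```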
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#définition-du-réseau-de-voisinage",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#définition-du-réseau-de-voisinage",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Définition du réseau de voisinage",
+ "text": "Définition du réseau de voisinage\nLa fonction poly2nb du package spdep définit un réseau de voisinage à partir de polygones. Le résultat vois est une liste de 125 éléments où chaque élément contient les indices des polygones voisins (limitrophes) d’un polygone donné.\n\nvois <- poly2nb(elect2018)\nvois[[1]]\n\n[1] 2 37 63 88 101 117\n\n\nAinsi, la première circonscription (Abitibi-Est) a 6 circonscriptions voisines, dont on peut trouver les noms ainsi:\n\nelect2018$circ[vois[[1]]]\n\n[1] \"Abitibi-Ouest\" \"Gatineau\" \n[3] \"Laviolette-Saint-Maurice\" \"Pontiac\" \n[5] \"Rouyn-Noranda-Témiscamingue\" \"Ungava\" \n\n\nNous pouvons illustrer ce réseau en faisant l’extraction des coordonnées du centre de chaque circonscription, en créant une carte muette avec plot(elect2018[\"geometry\"]), puis en ajoutant le réseau comme couche additionnelle avec plot(vois, add = TRUE, coords = coords).\n\ncoords <- st_centroid(elect2018) %>%\n st_coordinates()\nplot(elect2018[\"geometry\"])\nplot(vois, add = TRUE, col = \"red\", coords = coords)\n\n\n\n\nOn peut faire un “zoom” sur le sud du Québec en choisissant les limites xlim et ylim appropriées.\n\nplot(elect2018[\"geometry\"], \n xlim = c(400000, 800000), ylim = c(100000, 500000))\nplot(vois, add = TRUE, col = \"red\", coords = coords)\n\n\n\n\nIl nous reste à ajouter des poids à chaque lien du réseau avec la fonction nb2listw. Le style de poids “B” correspond aux poids binaires, soit 1 pour la présence de lien et 0 pour l’absence de lien entre deux circonscriptions.\nUne fois ces poids définis, nous pouvons vérifier avec le test de Moran s’il y a une autocorrélation significative des votes obtenus par la CAQ entre circonscriptions voisines.\n\npoids <- nb2listw(vois, style = \"B\")\n\nmoran.test(elect2018$propCAQ, poids)\n\n\n Moran I test under randomisation\n\ndata: elect2018$propCAQ \nweights: poids \n\nMoran I statistic standard deviate = 13.148, p-value < 2.2e-16\nalternative hypothesis: greater\nsample estimates:\nMoran I statistic Expectation Variance \n 0.680607768 -0.008064516 0.002743472 \n\n\nLa valeur de \\(I = 0.68\\) est très significative à en juger par la valeur \\(p\\) du test.\nVérifions si la corrélation spatiale persiste après avoir tenu compte des quatre caractéristiques de la population, donc en inspectant les résidus d’un modèle linéaire incluant ces quatre prédicteurs.\n\nelect_lm <- lm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, data = elect2018)\nsummary(elect_lm)\n\n\nCall:\nlm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018)\n\nResiduals:\n Min 1Q Median 3Q Max \n-30.9890 -4.4878 0.0562 6.2653 25.8146 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.354e+01 1.836e+01 0.737 0.463 \nage_moy -9.170e-01 3.855e-01 -2.378 0.019 * \npct_frn 4.588e+01 5.202e+00 8.820 1.09e-14 ***\npct_prp 3.582e+01 6.527e+00 5.488 2.31e-07 ***\nrev_med -2.624e-05 2.465e-04 -0.106 0.915 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 9.409 on 120 degrees of freedom\nMultiple R-squared: 0.6096, Adjusted R-squared: 0.5965 \nF-statistic: 46.84 on 4 and 120 DF, p-value: < 2.2e-16\n\nmoran.test(residuals(elect_lm), poids)\n\n\n Moran I test under randomisation\n\ndata: residuals(elect_lm) \nweights: poids \n\nMoran I statistic standard deviate = 6.7047, p-value = 1.009e-11\nalternative hypothesis: greater\nsample estimates:\nMoran I statistic Expectation Variance \n 0.340083290 -0.008064516 0.002696300 \n\n\nL’indice de Moran a diminué mais demeure significatif, donc une partie de la corrélation précédente était induite par ces prédicteurs, mais il reste une corrélation spatiale due à d’autres facteurs."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#modèles-dautorégression-spatiale",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#modèles-dautorégression-spatiale",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Modèles d’autorégression spatiale",
+ "text": "Modèles d’autorégression spatiale\nFinalement, nous ajustons des modèles SAR et CAR à ces données avec la fonction spautolm (spatial autoregressive linear model) de spatialreg. Voici le code pour un modèle SAR incluant l’effet des même quatre prédicteurs.\n\nelect_sar <- spautolm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018, listw = poids)\nsummary(elect_sar)\n\n\nCall: spautolm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018, listw = poids)\n\nResiduals:\n Min 1Q Median 3Q Max \n-23.08342 -4.10573 0.24274 4.29941 23.08245 \n\nCoefficients: \n Estimate Std. Error z value Pr(>|z|)\n(Intercept) 15.09421119 16.52357745 0.9135 0.36098\nage_moy -0.70481703 0.32204139 -2.1886 0.02863\npct_frn 39.09375061 5.43653962 7.1909 6.435e-13\npct_prp 14.32329345 6.96492611 2.0565 0.03974\nrev_med 0.00016730 0.00023209 0.7208 0.47101\n\nLambda: 0.12887 LR test value: 42.274 p-value: 7.9339e-11 \nNumerical Hessian standard error of lambda: 0.012069 \n\nLog likelihood: -433.8862 \nML residual variance (sigma squared): 53.028, (sigma: 7.282)\nNumber of observations: 125 \nNumber of parameters estimated: 7 \nAIC: 881.77\n\n\nLa valeur donnée par Lambda dans le sommaire correspond au coefficient \\(\\rho\\) dans notre description du modèle. Le test du rapport de vraisemblance (LR test) confirme que cette corrélation spatiale résiduelle (après avoir tenu compte de l’effet des prédicteurs) est significative.\nLes effets estimés pour les prédicteurs sont semblables à ceux du modèle linéaire sans corrélation spatiale. Les effets de l’âge moyen, de la fraction de francophones et la fraction de propriétaires demeurent significatifs, bien que leur magnitude ait un peu diminué.\nPour évaluer un modèle CAR plutôt que SAR, nous devons spécifier family = \"CAR\".\n\nelect_car <- spautolm(propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018, listw = poids, family = \"CAR\")\nsummary(elect_car)\n\n\nCall: spautolm(formula = propCAQ ~ age_moy + pct_frn + pct_prp + rev_med, \n data = elect2018, listw = poids, family = \"CAR\")\n\nResiduals:\n Min 1Q Median 3Q Max \n-21.73315 -4.24623 -0.24369 3.44228 23.43749 \n\nCoefficients: \n Estimate Std. Error z value Pr(>|z|)\n(Intercept) 16.57164696 16.84155327 0.9840 0.325128\nage_moy -0.79072151 0.32972225 -2.3981 0.016478\npct_frn 38.99116707 5.43667482 7.1719 7.399e-13\npct_prp 17.98557474 6.80333470 2.6436 0.008202\nrev_med 0.00012639 0.00023106 0.5470 0.584364\n\nLambda: 0.15517 LR test value: 40.532 p-value: 1.9344e-10 \nNumerical Hessian standard error of lambda: 0.0026868 \n\nLog likelihood: -434.7573 \nML residual variance (sigma squared): 53.9, (sigma: 7.3416)\nNumber of observations: 125 \nNumber of parameters estimated: 7 \nAIC: 883.51\n\n\nPour un modèle CAR avec des poids binaires, la valeur de Lambda (que nous avions appelé \\(\\rho\\)) donne directement le coefficient de corrélation partielle entre circonscriptions voisines. Notez que l’AIC ici est légèrement supérieur au modèle SAR, donc ce dernier donnait un meilleur ajustement."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exercice-3",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#exercice-3",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Exercice",
+ "text": "Exercice\nLe jeu de données rls_covid, en format shapefile, contient des données sur les cas de COVID-19 détectés, le nombre de cas par 1000 personnes (taux_1k) et la densité de population (dens_pop) dans chacun des réseaux locaux de service de santé (RLS) du Québec. (Source: Données téléchargées de l’Institut national de santé publique du Québec en date du 17 janvier 2021.)\n\nrls_covid <- read_sf(\"data/rls_covid.shp\")\nhead(rls_covid)\n\nSimple feature collection with 6 features and 5 fields\nGeometry type: MULTIPOLYGON\nDimension: XY\nBounding box: xmin: 785111.2 ymin: 341057.8 xmax: 979941.5 ymax: 541112.7\nProjected CRS: Conique_conforme_de_Lambert_du_MTQ_utilis_e_pour_Adresse_Qu_be\n# A tibble: 6 × 6\n RLS_code RLS_nom cas taux_1k dens_…¹ geometry\n \n1 0111 RLS de Kamouraska 152 7.34 6.76 (((827028.3 412772.4, 82…\n2 0112 RLS de Rivière-du-Lo… 256 7.34 19.6 (((855905 452116.9, 8557…\n3 0113 RLS de Témiscouata 81 4.26 4.69 (((911829.4 441311.2, 91…\n4 0114 RLS des Basques 28 3.3 5.35 (((879249.6 471975.6, 87…\n5 0115 RLS de Rimouski 576 9.96 15.5 (((917748.1 503148.7, 91…\n6 0116 RLS de La Mitis 76 4.24 5.53 (((951316 523499.3, 9525…\n# … with abbreviated variable name ¹dens_pop\n\n\nAjustez un modèle linéaire du nombre de cas par 1000 en fonction de la densité de population (il est suggéré d’appliquer une transformation logarithmique à cette dernière). Vérifiez si les résidus du modèle sont corrélés entre RLS limitrophes avec un test de Moran, puis modélisez les mêmes données avec un modèle autorégressif conditionnel."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#référence",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#référence",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Référence",
+ "text": "Référence\nVer Hoef, J.M., Peterson, E.E., Hooten, M.B., Hanks, E.M. et Fortin, M.-J. (2018) Spatial autoregressive models for statistical inference from ecological data. Ecological Monographs 88: 36-59."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#données",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#données",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Données",
+ "text": "Données\nLe jeu de données gambia inclus avec le package geoR présente les résultats d’une étude sur la prévalence du paludisme chez les enfants de 65 villages en Gambie. Nous utiliserons une version légèrement transformée des données contenues dans le fichier gambia.csv.\n\nlibrary(geoR)\n\ngambia <- read.csv(\"data/gambia.csv\")\nhead(gambia)\n\n id_village x y pos age netuse treated green phc\n1 1 349.6313 1458.055 1 1783 0 0 40.85 1\n2 1 349.6313 1458.055 0 404 1 0 40.85 1\n3 1 349.6313 1458.055 0 452 1 0 40.85 1\n4 1 349.6313 1458.055 1 566 1 0 40.85 1\n5 1 349.6313 1458.055 0 598 1 0 40.85 1\n6 1 349.6313 1458.055 1 590 1 0 40.85 1\n\n\nVoici les champs de ce jeu de données:\n\nid_village: Identifiant du village.\nx and y: Coordonnées spatiales du village (en km, basé sur les coordonnées UTM).\npos: Réponse binaire, si l’enfant a eu un test positif du paludisme.\nage: Âge de l’enfant en jours.\nnetuse: Si l’enfant dort sous un moustiquaire ou non.\ntreated: Si le moustiquaire est traité ou non.\ngreen: Mesure de la végétation basée sur les données de télédétection (disponible à l’échelle du village).\nphc: Présence ou absence d’un centre de santé publique pour le village.\n\nNous pouvons compter le nombre de cas positifs et le nombre total d’enfants testés par village pour cartographier la fraction des cas positifs (ou prévalence, prev).\n\n# Jeu de données à l'échelle du village\ngambia_agg <- group_by(gambia, id_village, x, y, green, phc) %>%\n summarize(pos = sum(pos), total = n()) %>%\n mutate(prev = pos / total) %>%\n ungroup()\n\n`summarise()` has grouped output by 'id_village', 'x', 'y', 'green'. You can\noverride using the `.groups` argument.\n\nhead(gambia_agg)\n\n# A tibble: 6 × 8\n id_village x y green phc pos total prev\n \n1 1 350. 1458. 40.8 1 17 33 0.515\n2 2 359. 1460. 40.8 1 19 63 0.302\n3 3 360. 1460. 40.1 0 7 17 0.412\n4 4 364. 1497. 40.8 0 8 24 0.333\n5 5 366. 1460. 40.8 0 10 26 0.385\n6 6 367. 1463. 40.8 0 7 18 0.389\n\n\n\nggplot(gambia_agg, aes(x = x, y = y)) +\n geom_point(aes(color = prev)) +\n geom_path(data = gambia.borders, aes(x = x / 1000, y = y / 1000)) +\n coord_fixed() +\n theme_minimal() +\n scale_color_viridis_c()\n\n\n\n\nNous utilisons le jeu de données gambia.borders du package geoR pour tracer les frontières des pays avec geom_path. Comme ces frontières sont en mètres, nous les divisons par 1000 pour obtenir la même échelle que nos points. Nous utilisons également coord_fixed pour assurer un rapport d’aspect de 1:1 entre les axes et utilisons la palette de couleur viridis, qui permet de visualiser plus facilement une variable continue par rapport à la palette par défaut dans ggplot2.\nSur la base de cette carte, il semble y avoir une corrélation spatiale dans la prévalence du paludisme, le groupe de villages de l’est montrant des valeurs de prévalence plus élevées (jaune-vert) et le groupe du milieu montrant des valeurs de prévalence plus faibles (violet)."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#glmm-non-spatial",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#glmm-non-spatial",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "GLMM non spatial",
+ "text": "GLMM non spatial\nPour ce premier exemple, nous allons ignorer l’aspect spatial des données et modéliser la présence du paludisme (pos) en fonction de l’utilisation d’une moustiquaire (netuse) et de la présence d’un centre de santé publique (phc). Comme nous avons une réponse binaire, nous devons utiliser un modèle de régression logistique (un GLM). Comme nous avons des prédicteurs au niveau individuel et au niveau du village et que nous nous attendons à ce que les enfants d’un même village aient une probabilité plus similaire d’avoir le paludisme même après avoir pris en compte ces prédicteurs, nous devons ajouter un effet aléatoire du village. Le résultat est un GLMM que nous ajustons en utilisant la fonction glmer du package lme4.\n\nlibrary(lme4)\n\nmod_glmm <- glmer(pos ~ netuse + phc + (1 | id_village), \n data = gambia, family = binomial)\nsummary(mod_glmm)\n\nGeneralized linear mixed model fit by maximum likelihood (Laplace\n Approximation) [glmerMod]\n Family: binomial ( logit )\nFormula: pos ~ netuse + phc + (1 | id_village)\n Data: gambia\n\n AIC BIC logLik deviance df.resid \n 2428.0 2450.5 -1210.0 2420.0 2031 \n\nScaled residuals: \n Min 1Q Median 3Q Max \n-2.1286 -0.7120 -0.4142 0.8474 3.3434 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n id_village (Intercept) 0.8149 0.9027 \nNumber of obs: 2035, groups: id_village, 65\n\nFixed effects:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 0.1491 0.2297 0.649 0.5164 \nnetuse -0.6044 0.1442 -4.190 2.79e-05 ***\nphc -0.4985 0.2604 -1.914 0.0556 . \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr) netuse\nnetuse -0.422 \nphc -0.715 -0.025\n\n\nD’après ces résultats, les variables netuse et phc sont toutes deux associées à une diminution de la prévalence du paludisme, bien que l’effet de phc ne soit pas significatif à un seuil \\(\\alpha = 0.05\\). L’ordonnée à l’origine (0.149) est le logit de la probabilité de présence du paludisme pour un enfant sans moustiquaire et sans centre de santé publique, mais c’est l’ordonnée à l’origine moyenne pour tous les villages. Il y a beaucoup de variation entre les villages selon l’écart-type de l’effet aléatoire (0.90). Nous pouvons obtenir l’ordonnée à l’origine estimée pour chaque village avec la fonction coef:\n\nhead(coef(mod_glmm)$id_village)\n\n (Intercept) netuse phc\n1 0.93727515 -0.6043602 -0.4984835\n2 0.09204843 -0.6043602 -0.4984835\n3 0.22500620 -0.6043602 -0.4984835\n4 -0.46271089 -0.6043602 -0.4984835\n5 0.13680037 -0.6043602 -0.4984835\n6 -0.03723346 -0.6043602 -0.4984835\n\n\nPar exemple, l’ordonnée à l’origine pour le village 1 est environ 0.94, équivalente à une probabilité de 72%:\n\nplogis(0.937)\n\n[1] 0.7184933\n\n\ntandis que celle pour le village 2 est équivalente à une probabilité de 52%:\n\nplogis(0.092)\n\n[1] 0.5229838\n\n\nLe package DHARMa fournit une méthode générale pour vérifier si les résidus d’un GLMM sont distribués selon le modèle spécifié et s’il existe une tendance résiduelle. Il simule des réplicats de chaque observation selon le modèle ajusté et détermine ensuite un “résidu standardisé”, qui est la position relative de la valeur observée par rapport aux valeurs simulées, par exemple 0 si l’observation est plus petite que toutes les simulations, 0.5 si elle se trouve au milieu, etc. 
Si le modèle représente bien les données, chaque valeur du résidu standardisé entre 0 et 1 doit avoir la même probabilité, de sorte que les résidus standardisés doivent produire une distribution uniforme entre 0 et 1.\nLa fonction simulateResiduals effectue le calcul des résidus standardisés, puis la fonction plot trace les graphiques de diagnostic avec les résultats de certains tests.\n\nlibrary(DHARMa)\nres_glmm <- simulateResiduals(mod_glmm)\nplot(res_glmm)\n\n\n\n\nLe graphique de gauche est un graphique quantile-quantile des résidus standardisés. Les résultats de trois tests statistiques sont également présentés: un test de Kolmogorov-Smirnov (KS) qui vérifie s’il y a un écart par rapport à la distribution théorique, un test de dispersion qui vérifie s’il y a une sous-dispersion ou une surdispersion et un test de valeurs aberrantes (outlier) basé sur le nombre de résidus qui sont plus extrêmes que toutes les simulations. Ici, nous obtenons un résultat significatif pour les valeurs aberrantes, bien que le message indique que ce résultat pourrait avoir un taux d’erreur de type I plus grand que prévu dans ce cas.\nÀ droite, nous obtenons généralement un graphique des résidus standardisés (en y) en fonction du rang des valeurs prédites, afin de vérifier l’absence de tendance résiduelle. Ici, les prédictions sont regroupées par quartile, il serait donc préférable d’agréger les prédictions et les résidus par village, ce que nous pouvons faire avec la fonction recalculateResiduals.\n\nplot(recalculateResiduals(res_glmm, group = gambia$id_village))\n\nDHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details\n\n\n\n\n\nLe graphique de droite montre les points individuels, ainsi qu’une régression quantile pour le 1er quartile, la médiane et le 3e quartile. En théorie, ces trois courbes devraient être des lignes droites horizontales (pas de tendance des résidus par rapport aux prévisions). La courbe pour le 3e quartile (en rouge) est significativement différente d’une ligne horizontale, ce qui pourrait indiquer un effet systématique manquant dans le modèle."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#glmm-spatial-avec-spamm",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#glmm-spatial-avec-spamm",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "GLMM spatial avec spaMM",
+ "text": "GLMM spatial avec spaMM\nLe package spaMM (modèles mixtes spatiaux) est un package R relativement récent qui permet d’effectuer une estimation approximative du maximum de vraisemblance des paramètres pour les GLM avec dépendance spatiale, modélisés soit comme un processus gaussien, soit avec un CAR (nous verrons ce dernier dans la dernière section). Le package implémente différents algorithmes, mais il existe une fonction unique fitme qui choisit l’algorithme approprié pour chaque type de modèle. Par exemple, voici le même modèle (non spatial) que nous avons vu ci-dessus, ajusté avec spaMM.\n\nlibrary(spaMM)\n\nmod_spamm_glmm <- fitme(pos ~ netuse + phc + (1 | id_village),\n data = gambia, family = binomial)\nsummary(mod_spamm_glmm)\n\nformula: pos ~ netuse + phc + (1 | id_village)\nEstimation of lambda by ML (p_v approximation of logL).\nEstimation of fixed effects by ML (p_v approximation of logL).\nfamily: binomial( link = logit ) \n ------------ Fixed effects (beta) ------------\n Estimate Cond. SE t-value\n(Intercept) 0.1491 0.2287 0.6519\nnetuse -0.6045 0.1420 -4.2567\nphc -0.4986 0.2593 -1.9231\n --------------- Random effects ---------------\nFamily: gaussian( link = identity ) \n --- Variance parameters ('lambda'):\nlambda = var(u) for u ~ Gaussian; \n id_village : 0.8151 \n --- Coefficients for log(lambda):\n Group Term Estimate Cond.SE\n id_village (Intercept) -0.2045 0.2008\n# of obs: 2035; # of groups: id_village, 65 \n ------------- Likelihood values -------------\n logLik\nlogL (p_v(h)): -1210.016\n\n\nNotez que les estimés des effets fixes ainsi que la variance des effets aléatoires sont presque identiques à ceeux obtenues par glmer ci-dessus.\nNous pouvons maintenant utiliser spaMM pour ajuster le même modèle avec l’ajout de corrélations spatiales entre les villages. Dans la formule du modèle, ceci est représenté comme un effet aléatoire Matern(1 | x + y), ce qui signifie que les ordonnées à l’origine sont spatialement corrélées entre les villages suivant une fonction de corrélation de Matérn des coordonnées (x, y). La fonction de Matérn est une fonction flexible de corrélation spatiale qui comprend un paramètre de forme \\(\\nu\\) (nu), de sorte que lorsque \\(\\nu = 0,5\\), elle est équivalente à la corrélation exponentielle, mais quand \\(\\nu\\) prend de grandes valeurs, elle se rapproche d’une corrélation gaussienne. Nous pourrions laisser la fonction estimer \\(\\nu\\), mais ici nous le fixons à 0.5 avec l’argument fixed de fitme.\n\nmod_spamm <- fitme(pos ~ netuse + phc + Matern(1 | x + y) + (1 | id_village),\n data = gambia, family = binomial, fixed = list(nu = 0.5))\n\nIncrease spaMM.options(separation_max=<.>) to at least 21 if you want to check separation (see 'help(separation)').\n\nsummary(mod_spamm)\n\nformula: pos ~ netuse + phc + Matern(1 | x + y) + (1 | id_village)\nEstimation of corrPars and lambda by ML (p_v approximation of logL).\nEstimation of fixed effects by ML (p_v approximation of logL).\nEstimation of lambda by 'outer' ML, maximizing logL.\nfamily: binomial( link = logit ) \n ------------ Fixed effects (beta) ------------\n Estimate Cond. 
SE t-value\n(Intercept) 0.06861 0.3352 0.2047\nnetuse -0.51719 0.1407 -3.6757\nphc -0.44416 0.2052 -2.1648\n --------------- Random effects ---------------\nFamily: gaussian( link = identity ) \n --- Correlation parameters:\n 1.nu 1.rho \n0.50000000 0.05128692 \n --- Variance parameters ('lambda'):\nlambda = var(u) for u ~ Gaussian; \n x + y : 0.6421 \n id_village : 0.1978 \n# of obs: 2035; # of groups: x + y, 65; id_village, 65 \n ------------- Likelihood values -------------\n logLik\nlogL (p_v(h)): -1197.968\n\n\nCommençons par vérifier les effets aléatoires du modèle. La fonction de corrélation spatiale a un paramètre rho égal à 0.0513. Ce paramètre dans spaMM est l’inverse de la portée, donc ici la portée de la corrélation exponentielle est de 1/0.0513 ou environ 19.5 km. Il y a maintenant deux pramètres de variance, celui identifié comme x + y est la variance à longue distance (i.e. le palier) pour le modèle de corrélation exponentielle alors que celui identifié comme id_village montre la portion non corrélée de la variation entre les villages.\nSi nous avions ici laissé les effets aléatoires (1 | id_village) dans la formule pour représenter la partie non spatiale de la variation entre les villages, nous pourrions également représenter ceci avec un effet de pépite dans le modèle géostatistique. Dans les deux cas, cela représenterait l’idée que même deux villages très proches l’un de l’autre auraient des prévalences de base différentes dans le modèle.\nPar défaut, la fonction Matern n’a pas d’effet de pépite, mais nous pouvons en ajouter un en spécifiant une pépite non nulle dans la liste initiale des paramètres init.\n\nmod_spamm2 <- fitme(pos ~ netuse + phc + Matern(1 | x + y),\n data = gambia, family = binomial, fixed = list(nu = 0.5),\n init = list(Nugget = 0.1))\n\nIncrease spaMM.options(separation_max=<.>) to at least 21 if you want to check separation (see 'help(separation)').\n\nsummary(mod_spamm2)\n\nformula: pos ~ netuse + phc + Matern(1 | x + y)\nEstimation of corrPars and lambda by ML (p_v approximation of logL).\nEstimation of fixed effects by ML (p_v approximation of logL).\nEstimation of lambda by 'outer' ML, maximizing logL.\nfamily: binomial( link = logit ) \n ------------ Fixed effects (beta) ------------\n Estimate Cond. SE t-value\n(Intercept) 0.06861 0.3352 0.2047\nnetuse -0.51719 0.1407 -3.6757\nphc -0.44416 0.2052 -2.1648\n --------------- Random effects ---------------\nFamily: gaussian( link = identity ) \n --- Correlation parameters:\n 1.nu 1.Nugget 1.rho \n0.50000000 0.23551027 0.05128692 \n --- Variance parameters ('lambda'):\nlambda = var(u) for u ~ Gaussian; \n x + y : 0.8399 \n# of obs: 2035; # of groups: x + y, 65 \n ------------- Likelihood values -------------\n logLik\nlogL (p_v(h)): -1197.968\n\n\nComme vous pouvez le voir, toutes les estimations sont les mêmes, sauf que la variance de la portion spatiale (palier) est maintenant de 0.84 et que la pépite est égale à une fraction 0.235 de ce palier, soit une variance de 0.197, ce qui est identique à l’effet aléatoire id_village dans la version ci-dessus. Les deux formulations sont donc équivalentes.\nMaintenant, rappelons les coefficients que nous avions obtenus pour le GLMM non spatial :\n\nsummary(mod_glmm)$coefficients\n\n Estimate Std. 
Error z value Pr(>|z|)\n(Intercept) 0.1490596 0.2296971 0.6489399 5.163772e-01\nnetuse -0.6043602 0.1442448 -4.1898243 2.791706e-05\nphc -0.4984835 0.2604083 -1.9142382 5.558973e-02\n\n\nDans la version spatiale, les deux effets fixes se sont légèrement rapprochés de zéro, mais l’erreur-type de l’effet de phc a diminué. Il est intéressant de noter que l’inclusion de la dépendance spatiale nous a permis d’estimer plus précisément l’effet de la présence d’un centre de santé publique dans le village. Ce ne serait pas toujours le cas: pour un prédicteur qui est également fortement corrélé dans l’espace, la corrélation spatiale dans la réponse rend plus difficile l’estimation de l’effet de ce prédicteur, puisqu’il est confondu avec l’effet spatial. Cependant, pour un prédicteur qui n’est pas corrélé dans l’espace, l’inclusion de l’effet spatial réduit la variance résiduelle (non spatiale) et peut donc augmenter la précision de l’effet du prédicteur.\nLe package spaMM est également compatible avec DHARMa pour les diagnostics résiduels. (Vous pouvez ignorer l’avertissement selon lequel il ne fait pas partie de la classe des modèles pris en charge, cela est dû à l’utilisation de la fonction fitme plutôt que d’une fonction d’algorithme spécifique dans spaMM).\n\nres_spamm <- simulateResiduals(mod_spamm2)\nplot(res_spamm)\n\nDHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details\n\n\n\n\nplot(recalculateResiduals(res_spamm, group = gambia$id_village))\n\nDHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details\n\n\n\n\n\nEnfin, bien que nous allons montrer comment calculer et visualiser des prédictions spatiales ci-dessous, nous pouvons produire une carte rapide des effets spatiaux estimés dans un modèle spaMM avec la fonction filled.mapMM.\n\nfilled.mapMM(mod_spamm2)"
+ },
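As a quick numerical check of the parameter conversions described above (the values are taken directly from the two spaMM summaries; nothing new is estimated here):

```r
rho <- 0.05128692        # spaMM's rho is the inverse of the range
1 / rho                  # range of the exponential correlation, about 19.5 km

sill   <- 0.8399         # 'x + y' variance in the model with a nugget
nugget <- 0.23551027     # nugget expressed as a fraction of that sill
nugget * sill            # about 0.198, matching the id_village variance (0.1978)
```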
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#processus-gaussiens-vs.-splines-de-lissage",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#processus-gaussiens-vs.-splines-de-lissage",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Processus gaussiens vs. splines de lissage",
+ "text": "Processus gaussiens vs. splines de lissage\nSi vous connaissez bien les modèles additifs généralisés (GAM), vous avez peut-être pensé à représenter la variation spatiale de la prévalence du paludisme (comme le montre la carte ci-dessus) par une spline de lissage en 2D (en fonction de \\(x\\) et \\(y\\)) dans un GAM.\nLe code ci-dessous correspond à l’équivalent GAM de notre GLMM avec processus gaussien ci-dessus, ajusté avec la fonction gam du package mgcv. L’effet spatial est représenté par la spline 2D s(x, y) alors que l’effet aléatoire non spatial de village est représenté par s(id_village, bs = \"re\"), qui est équivalent à (1 | id_village) dans les modèles précédents. Notez que pour la fonction gam, les variables catégorielles doivent être explicitement converties en facteurs.\n\nlibrary(mgcv)\ngambia$id_village <- as.factor(gambia$id_village)\nmod_gam <- gam(pos ~ netuse + phc + s(id_village, bs = \"re\") + s(x, y), \n data = gambia, family = binomial)\n\nPour visualiser la spline en 2D, nous utiliserons le package gratia.\n\nlibrary(gratia)\ndraw(mod_gam)\n\n\n\n\nNotez que le graphique de la spline s(x, y) (en haut à droite) ne s’étend pas trop loin des emplacements des données (les autres zones sont vides). Dans ce graphique, on peut également voir que les effets aléatoires des villages suivent la distribution gaussienne attendue (en haut à gauche).\nEnsuite, nous utiliserons à la fois le GLMM spatial de la section précédente et ce GAMM pour prédire la prévalence moyenne sur une grille spatiale de points contenue dans le fichier gambia_pred.csv. Le graphique ci-dessous ajoute ces points de prédiction (en noir) sur la carte précédente des points de données.\n\ngambia_pred <- read.csv(\"data/gambia_pred.csv\")\n\nggplot(gambia_agg, aes(x = x, y = y)) +\n geom_point(data = gambia_pred) +\n geom_point(aes(color = prev)) +\n geom_path(data = gambia.borders, aes(x = x / 1000, y = y / 1000)) +\n coord_fixed() +\n theme_minimal() +\n scale_color_viridis_c()\n\n\n\n\nPour faire des prédictions à partir du modèle GAMM à ces endroits, le code ci-dessous effectue les étapes suivantes:\n\nTous les prédicteurs du modèle doivent se trouver dans le tableau de données de prédiction, nous ajoutons donc des valeurs constantes de netuse et phc (toutes deux égales à 1) pour tous les points. Ainsi, nous ferons des prédictions sur la prévalence du paludisme dans le cas où un moustiquaire est utilisée et où un centre de santé publique est présent. Nous ajoutons également un id_village constant, bien qu’il ne soit pas utilisé dans les prédictions (voir ci-dessous).\nNous appelons la fonction predict à la sortie de gam pour produire des prédictions aux nouveaux points de données (argument newdata), en incluant les erreurs-types (se.fit = TRUE) et en excluant les effets aléatoires du village, donc la prédiction est faite pour un “village moyen”. L’objet résultant gam_pred aura des colonnes fit (prédiction moyenne) et se.fit (erreur-type). Ces prédictions et erreurs-types sont sur l’échelle du lien (logit).\nNous rattachons le jeu de données de prédiction original à gam_pred avec cbind.\nNous ajoutons des colonnes pour la prédiction moyenne et les limites de l’intervalle de confiance à 50% (moyenne \\(\\pm\\) 0.674 erreur-type), converties de l’échelle logit à l’échelle de probabilité avec plogis. 
Nous choisissons un intervalle de 50% car un intervalle de 95% peut être trop large ici pour contraster les différentes prédictions sur la carte à la fin de cette section.\n\n\ngambia_pred <- mutate(gambia_pred, netuse = 1, phc = 1, id_village = 1)\n\ngam_pred <- predict(mod_gam, newdata = gambia_pred, se.fit = TRUE, \n exclude = \"s(id_village)\")\ngam_pred <- cbind(gambia_pred, as.data.frame(gam_pred))\ngam_pred <- mutate(gam_pred, pred = plogis(fit), \n lo = plogis(fit - 0.674 * se.fit), # 50% CI\n hi = plogis(fit + 0.674 * se.fit))\n\nNote : La raison pour laquelle nous ne faisons pas de prédictions directement sur l’échelle de probabilité (réponse) est que la formule normale des intervalles de confiance s’applique plus précisément sur l’échelle logit. L’ajout d’un certain nombre d’erreurs-types autour de la moyenne sur l’échelle de probabilité conduirait à des intervalles moins précis et peut-être même à des intervalles de confiance en dehors de la plage de valeurs possible (0, 1) pour une probabilité.\nNous appliquons la même stratégie pour faire des prédictions à partir du GLMM spatial avec spaMM. Il y a quelques différences dans la méthode predict par rapport au cas du GAMM.\n\nL’argument binding = \"fit\" signifie que les prédictions moyennes (colonne fit) seront attachées à l’ensemble de données de prédiction et retournées sous forme de tableau de données spamm_pred.\nL’argument variances = list(linPred = TRUE) indique à predict de calculer la variance du prédicteur linéaire (donc le carré de l’erreur-type). Cependant, il apparaît comme un attribut predVar dans le tableau de données de sortie plutôt que dans une colonne se.fit, donc nous le déplaçons vers une colonne sur la ligne suivante.\n\n\nspamm_pred <- predict(mod_spamm, newdata = gambia_pred, type = \"link\",\n binding = \"fit\", variances = list(linPred = TRUE))\nspamm_pred$se.fit <- sqrt(attr(spamm_pred, \"predVar\"))\nspamm_pred <- mutate(spamm_pred, pred = plogis(fit), \n lo = plogis(fit - 0.674 * se.fit),\n hi = plogis(fit + 0.674 * se.fit))\n\nEnfin, nous combinons les deux ensembles de prédictions sous la forme de différentes rangées d’un tableau de données pred_all avec bind_rows. Le nom du tableau de données d’où provient chaque prédiction (gam ou spamm) apparaîtra dans la colonne “model” (argument .id). Pour simplifier la production du prochain graphique, nous utilisons ensuite pivot_longer dans le package tidyr pour changer les trois colonnes “pred”, “lo” et “hi” en deux colonnes, “stat” et “value” (pred_tall a donc trois rangées pour chaque rangée dans pred_all).\n\npred_all <- bind_rows(gam = gam_pred, spamm = spamm_pred, .id = \"model\")\n\nlibrary(tidyr)\npred_tall <- pivot_longer(pred_all, c(pred, lo, hi), names_to = \"stat\",\n values_to = \"value\")\n\nUne fois ces étapes franchies, nous pouvons enfin examiner les cartes de prédiction (moyenne, limites inférieure et supérieure de l’intervalle de confiance à 50 %) à l’aide d’un graphique ggplot. Les points de données originaux sont indiqués en rouge.\n\nggplot(pred_tall, aes(x = x, y = y)) +\n geom_point(aes(color = value)) +\n geom_point(data = gambia_agg, color = \"red\", size = 0) +\n coord_fixed() +\n facet_grid(stat~model) +\n scale_color_viridis_c() +\n theme_minimal()\n\n\n\n\nBien que les deux modèles s’accordent à dire que la prévalence est plus élevée près du groupe de villages de l’est, le GAMM estime également une prévalence plus élevée en quelques points (bord ouest et autour du centre) où il n’y a pas de données. 
Il s’agit d’un artefact de la forme de la spline autour des points de données, puisqu’une spline est censée correspondre à une tendance globale, bien que non linéaire. En revanche, le modèle géostatistique représente l’effet spatial sous forme de corrélations locales et revient à la prévalence moyenne globale lorsqu’il est éloigné de tout point de données, ce qui est une supposition plus sûre. C’est l’une des raisons pour lesquelles il est préférable de choisir un modèle géostatistique / processus gaussien dans ce cas."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#méthodes-bayésiennes-pour-les-glmm-avec-processus-gaussiens",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#méthodes-bayésiennes-pour-les-glmm-avec-processus-gaussiens",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Méthodes bayésiennes pour les GLMM avec processus gaussiens",
+ "text": "Méthodes bayésiennes pour les GLMM avec processus gaussiens\nLes modèles bayésiens fournissent un cadre flexible pour exprimer des modèles avec une structure de dépendance complexe entre les données, y compris la dépendance spatiale. Cependant, l’ajustement d’un modèle de processus gaussien avec une approche entièrement bayésienne peut être lent, en raison de la nécessité de calculer une matrice de covariance spatiale entre toutes les paires de points à chaque itération.\nLa méthode INLA (pour integrated nested Laplace approximation) effectue un calcul approximatif de la distribution postérieure bayésienne, ce qui la rend adaptée aux problèmes de régression spatiale. Nous ne l’abordons pas dans ce cours, mais je recommande le manuel de Paula Moraga (dans la section des références ci-dessous) qui fournit des exemples concrets d’utilisation de la méthode INLA pour divers modèles de données géostatistiques et aréales, dans le contexte de l’épidémiologie, y compris des modèles avec une dépendance à la fois spatiale et temporelle. Le livre présente les mêmes données sur le paludisme en Gambie comme exemple d’un ensemble de données géostatistiques, ce qui a inspiré son utilisation dans ce cours."
+ },
+ {
+ "objectID": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#référence-1",
+ "href": "posts/2021-01-12-4-day-training-in-spatial-statistics-with-philippe-marchand/index.html#référence-1",
+ "title": "4-Day Training in Spatial Statistics with Philippe Marchand",
+ "section": "Référence",
+ "text": "Référence\nMoraga, Paula (2019) Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. Chapman & Hall/CRC Biostatistics Series. Disponible en ligne: https://www.paulamoraga.com/book-geospatial/."
+ },
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html",
+ "title": "Making websites with HUGO",
+ "section": "",
+ "text": "I am only 10 hours of a crash course in web development ahead of you. As part of a major research project on setting a biodiversity observation network, I had to develop a prototype of a portal for the project, for biodiversity information and bunch of dashboards on biodiversity trends. Never made a website before. I know how to code in a few langages, and I know that I hate playing with boxes, menus, importing images manually, and most of all, dealing with a crash of the system and having to redo the whole thing because I made a mistake somewhere. Not that a bug when I try to compile is better, but at least it is more tractable.\nHugo made it very easily because of its fundamental feature (which is the same reason I edit papers with LaTeX): the distinction between the view and the content. Once you have set up the rules defining the visual aspects of the pages, then you can focus on the content and let the software automatically constructing the html code for you. It’s fast, accessible, scriptable and could be version-controlled. All qualities for an open and reproducible science.\nTook me a few hours to learn the basics (much harder to get the higher level skills, especially to write your own Go scripts), I took some tricks here and there in different templates and at looking what others do, and that was it I had my website. Realized that it could be a good entry level course to BIOS2 fellows and decided to turn that experience into a training workshop.\nYou will find below basic instructions to install and run a template. The following is not a full tutorial, for that I recommend simply to take time looking at the documentation provided on the Hugo page (https://gohugo.io/). I also consulted the online book Hugo in action (https://www.manning.com/books/hugo-in-action). There are many other references, all of them with goods and bads. But it’s nice to have multiple ones because sometimes the description of a concept may be obscure in one reference but better in the other and it’s by comparing and switching between them that you can make progress."
+ },
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#exercise-edit-the-toml-file-to-include-your-own-information.",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#exercise-edit-the-toml-file-to-include-your-own-information.",
+ "title": "Making websites with HUGO",
+ "section": "Exercise : Edit the toml file to include your own information.",
+ "text": "Exercise : Edit the toml file to include your own information.\nYou may want to change the section People to Collaborators and also provide a proper reference to your on github page. You can also add or remove sections, this will affect the menu at the top of the page. For instance, you can add a blog section."
+ },
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#build-for-local-development",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#build-for-local-development",
+ "title": "Making websites with HUGO",
+ "section": "Build for local development",
+ "text": "Build for local development\nHugo will use all of the material to generate static html files that will be displayed on your browser. The command is really easy to use to run it on your own computer, you simply have to type the following in the main folder :\nhugo server\nAnd that’s it, it compiles and you can simply open it in your browser by clicking on the adress indicated in the terminal. Congratulations for your first Hugo webste !\nThere are useful information in the terminal about the building process."
+ },
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#build-for-publishing-your-website",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#build-for-publishing-your-website",
+ "title": "Making websites with HUGO",
+ "section": "Build for publishing your website",
+ "text": "Build for publishing your website\nThe command hugo server is very fast and useful to test your website while you develop it. But once you’ll be ready to distribute it, you’ll need all of the html files and related material to distribute the website. This is easily done with the even simpler command\nhugo\nYou will find in the directory that a new folder named public appeared, with all of the material needed to deploy the website. If you click on the index.html file, you’ll get to the home page of the website. It is interesting to open this file in your text editor, you’ll get a sense of the html code that hugo generated automatically for you. You can also take a look at other files."
+ },
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#exercise",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#exercise",
+ "title": "Making websites with HUGO",
+ "section": "Exercise",
+ "text": "Exercise\nTake 15 minutes to remove Tim’s material and replace it by the three chapters of your thesis."
+ },
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#github-user-or-organization-pages",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#github-user-or-organization-pages",
+ "title": "Making websites with HUGO",
+ "section": "GitHub User or Organization Pages",
+ "text": "GitHub User or Organization Pages\n\nStep-by-step Instructions\n\nCreate a (e.g. blog) repository on GitHub. This repository will contain Hugo’s content and other source files.\nCreate a .github.io GitHub repository. This is the repository that will contain the fully rendered version of your Hugo website.\ngit clone && cd \nPaste your existing Hugo project into the new local repository. Make sure your website works locally (hugo server or hugo server -t ) and open your browser to http://localhost:1313.\nOnce you are happy with the results: Press Ctrl+C to kill the server Before proceeding run rm -rf public to completely remove the public directory\ngit submodule add -b main https://github.com//.github.io.git public. This creates a git submodule. Now when you run the hugo command to build your site to public, the created public directory will have a different remote origin (i.e. hosted GitHub repository).\nMake sure the baseURL in your config file is updated with: .github.io\n\n\n\nPut it Into a Script\nYou’re almost done. In order to automate next steps create a deploy.sh script. You can also make it executable with chmod +x deploy.sh.\nThe following are the contents of the deploy.sh script:\n #!/bin/sh\n\n # If a command fails then the deploy stops\n set -e\n\n printf \"\\033[0;32mDeploying updates to GitHub...\\033[0m\\n\"\n\n # Build the project.\n hugo # if using a theme, replace with `hugo -t `\n\n # Go To Public folder\n cd public\n\n # Add changes to git.\n git add .\n\n # Commit changes.\n msg=\"rebuilding site $(date)\"\n if [ -n \"$*\" ]; then\n msg=\"$*\"\n fi\n git commit -m \"$msg\""
+ },
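The deploy.sh shown above stops at the commit step, so the rendered site still has to be pushed to GitHub. A sketch of a complete deploy run, assuming the public submodule was added with -b main as in the steps above; the commit messages are only examples:

```sh
chmod +x deploy.sh                 # one-off: make the script executable
./deploy.sh "add new blog post"    # builds with hugo, commits the rebuilt public/ folder
cd public
git push origin main               # publish the rendered site to <USERNAME>.github.io
cd ..
git add . && git commit -m "update site sources" && git push   # optionally push the Hugo sources too
```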
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#using-a-theme",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#using-a-theme",
+ "title": "Making websites with HUGO",
+ "section": "Using a theme",
+ "text": "Using a theme\nIt is usually a good idea to not modify a template directly, but to have the template and the site in a separate folder. The basic concept when doing this is that the config.toml file of the site has to link to the proper folder of the theme.\nFor example\ntheme = \"template-site\"\nthemesDir = \"../..\"\nThis means that the template site is in a folder named template-site which is a parent folder of the site folder. Other options are possible.\nUsually, all the content should go in the site folder, not in the theme folder.\n\nExercise 1\n\nStart modifying the theme to make it look like a website for a Zoo. Choose your preferred color scheme by changing the style= parameter in the config.toml file.\nFeel free to download some images from unsplash and save them in the static/img folder. You can then use these images in the carrousel, as “testimonial” photos or as background images for some of the sections. You can add or remove sections from the home page by editing the config.toml file and changing the enable= parameter in the params. segment at the bottom.\nYou can also try to create a new blog entry by adding a new file in the content/blog folder. This file will have a .md extension and will be written in markdown format."
+ },
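A sketch of the folder layout that the theme/themesDir snippet above assumes; the directory names follow the workshop material, but check them against your own checkout:

```sh
# template-site/          <- the theme: layouts/, static/, etc. (leave as-is)
# └── my-site/            <- your site: config.toml, content/, static/, data/
cd template-site/my-site
grep -E '^(theme|themesDir)' config.toml
#   theme = "template-site"    -> folder name of the theme
#   themesDir = "../.."        -> themes are looked up two levels above the site folder
hugo server                    # run from the site folder, not from the theme folder
```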
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#customizing-a-theme",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#customizing-a-theme",
+ "title": "Making websites with HUGO",
+ "section": "Customizing a theme",
+ "text": "Customizing a theme"
+ },
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#basics-of-html",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#basics-of-html",
+ "title": "Making websites with HUGO",
+ "section": "Basics of HTML",
+ "text": "Basics of HTML\nCore structure of an HTML page\n\n\n\nThis is my great website\n\n\n\n
Main title
\n
Main content goes here
\n\n\n\nA divider, used to organize content into blocks\n\n\n\nA span, used to organize content or text into sections with different styles. Usually on the same line.\n\n\n\nA paragraph\n\n\n\nHeadings at different levels\n
Main title
\n
Second level
\n
Third level
\n\n\nAn image\n\n\n\nA link\nGreat website here!"
+ },
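To experiment with the tags listed above outside of Hugo, you can write a throwaway page and open it in a browser; the file name, image path and link target here are placeholders:

```sh
cat > /tmp/html-demo.html <<'EOF'
<html>
  <head>
    <title>This is my great website</title>
  </head>
  <body>
    <h1>Main title</h1>
    <div>
      <p>Main content goes here, with a <span>highlighted span</span> inline.</p>
      <img src="img/frog.jpg" alt="placeholder image">
      <a href="https://gohugo.io">Great website here!</a>
    </div>
  </body>
</html>
EOF
# Open /tmp/html-demo.html in your browser to see how each tag renders.
```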
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#link-between-html-and-css",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#link-between-html-and-css",
+ "title": "Making websites with HUGO",
+ "section": "Link between HTML and CSS",
+ "text": "Link between HTML and CSS\n\nIn html\nid is always unique. Class is not.\n
\nOne great div!\n
\n\n\nIn CSS\n“#” is applied to id and “.” is applied to class. When nothing is specified, applies to tag.\n#this-div-only{\n font-size:24px;\n}\n\n.this-type-of-div{\n color: #bb0000;\n}\n\ndiv{\n display:block;\n}"
+ },
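One detail the notes take for granted is how a page finds its stylesheet: a link tag in the head. A minimal pairing of the id/class example above with a CSS file, written to a scratch folder (paths are illustrative; in a Hugo site the theme's head template normally adds the link for you):

```sh
mkdir -p /tmp/css-demo && cd /tmp/css-demo
cat > style.css <<'EOF'
#this-div-only    { font-size: 24px; }   /* "#" targets the unique id                 */
.this-type-of-div { color: #bb0000; }    /* "." targets every element with the class  */
div               { display: block; }    /* a bare tag name targets all divs          */
EOF
cat > index.html <<'EOF'
<html>
  <head><link rel="stylesheet" href="style.css"></head>
  <body>
    <div id="this-div-only" class="this-type-of-div">One great div!</div>
  </body>
</html>
EOF
```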
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#basics-of-css",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#basics-of-css",
+ "title": "Making websites with HUGO",
+ "section": "Basics of CSS",
+ "text": "Basics of CSS\nW3 Schools CSS reference\n\n\n\n\n\n\n\n\nProperty\nDescription\nExample\n\n\n\n\nwidth, height\nwidth of item\n200px, 200pt, 100%, 100vw/vh\n\n\nmin-width, min-height\nminimum size of item\n200px, 200pt, 100%, 100vw\n\n\ncolor\nfont color\n#aa0000, red or rgb(255,0,0)\n\n\nbackground-color\ncolor of background\n#aa0000, red or rgb(255,0,0)\n\n\nborder-color\ncolor of border\n#aa0000, red or rgb(255,0,0)\n\n\nborder\nsize, type and color of border\n1px solid black\n\n\nmargin\nmargin around item (top right bottom left)\n1px, or 1px 2px 2px 1px\n\n\npadding\npadding within item, inside div for example\n10px\n\n\nfont-family\nname of font\nVerdana, Arial\n\n\nfont-size\nsize of text\n14px, 2em\n\n\ndisplay\nshould item be on the same line, or in a separate block?\ninline, block, inline-block, flex, …\n\n\n\n\nExercise 2\n\nCreate a file named custom.css under template-site/my-site/static/css/.\nRight-click on elements on the web page that you want to modify, then click on Inspect element and try to find CSS properties that you could modify to improve the look of the page. Then, choosing the proper class, add entries in the custom.css file that start with a dot (.) followed by the proper class names.\n\n.this-class {\n font-size:28px;\n}"
+ },
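A sketch of Exercise 2's setup as shell commands; the .navbar-brand selector is only an example of a class name you might find with Inspect element, not one taken from the workshop theme:

```sh
mkdir -p template-site/my-site/static/css
cat > template-site/my-site/static/css/custom.css <<'EOF'
/* Replace .navbar-brand with the class you found via "Inspect element". */
.navbar-brand {
  font-size: 28px;
  color: #1d1f20;
}
EOF
# Restart `hugo server` if the new stylesheet is not picked up automatically.
```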
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#partials",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#partials",
+ "title": "Making websites with HUGO",
+ "section": "Partials",
+ "text": "Partials\nPartials are snippets of HTML code that could be reused on different places on the website. For example, you will see that the layouts/index.html file in the template-site folder lists all the partials that create the home page.\nAn important point to remember is that Hugo will look for files first in the site’s folders, and if it doesn’t find the files there, it will look for them in the theme’s folder. So site folder layouts and CSS take priority over the theme folder.\n\nExercise 3\n\nCreate a new folder template-site/my-site/layouts. In this folder, create a new file named index.html and copy the content of the template-site/layouts/index.html file into it. Remove the testimonials section from the newly created file.\nCreate a new folder template-site/my-site/layouts/partials. In this folder, create a new file named featured-species.html put the following content into it, replacing the information with the species you selected.\n\n
\n\n
\n
Red-Eyed Tree Frog
\n
This frog can be found in the tropical rain forests of Costa Rica.
\n
\n
\n\nThen, add this section to the index.html file created above.\n\n\n{{ partial \"featured_species.html\" . }}\n\nYou will probably need to restart the Hugo server to see the changes appear on the site.\nNow, you need to edit the CSS! In your custom.css file, add the following lines.\n\n\n.featured-species{\n height:300px;\n background-color: #1d1f20;\n color:white;\n}\n\n.species-image{\n height:300px;\n float:left;\n}\n\n.featured-species h3{\n color:white;\n font-size:1.5em;\n}\n\n.species-description{\n float:left;\n padding:20px;\n font-size:2em;\n}\nModify this as you see fit!"
+ },
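A condensed sketch of Exercise 3 as shell commands; the image path is a placeholder and the partial markup mirrors the snippet quoted above:

```sh
mkdir -p template-site/my-site/layouts/partials
cat > template-site/my-site/layouts/partials/featured-species.html <<'EOF'
<div class="featured-species">
  <img class="species-image" src="img/frog.jpg">   <!-- placeholder: any image in static/img -->
  <div class="species-description">
    <h3>Red-Eyed Tree Frog</h3>
    <p>This frog can be found in the tropical rain forests of Costa Rica.</p>
  </div>
</div>
EOF
# Reference the partial from my-site/layouts/index.html:
#   {{ partial "featured-species.html" . }}
# Site-level layouts override the theme's, so the theme copy of index.html stays untouched.
```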
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#now-a-bit-of-go-lang-to-make-the-featured-species-different.",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#now-a-bit-of-go-lang-to-make-the-featured-species-different.",
+ "title": "Making websites with HUGO",
+ "section": "Now a bit of GO lang to make the featured species different.",
+ "text": "Now a bit of GO lang to make the featured species different.\nIntroduction to Hugo templating\n\nExercise 4\n\nReplace your partial featured-species.html content with this one\n\n{{ range .Site.Data.species }}\n {{ if eq (.enable) true }}\n
\n \n
\n
{{ .name }}
\n
{{ .description }}
\n
\n
\n {{end}}\n{{end}}\n\nNow, create a new folder /template-site/my-site/data/species.\nIn this folder, create new file named frog.yaml with the following content.\n\nenable: true\nname: \"Red-eyed tree frog\"\ndescription: \"This frog can be found in the forests of Costa Rica\"\nimage: \"frog.jpg\"\n\nFind other species photos and add them to the img folder. Then you can add new .yaml files in the data/species folder for each species."
+ },
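The data-driven loop above means adding a species is just adding a YAML file; a sketch with a hypothetical second species (the name, description and image file are invented examples):

```sh
mkdir -p template-site/my-site/data/species
cat > template-site/my-site/data/species/toucan.yaml <<'EOF'
enable: true                        # set to false to hide the entry without deleting the file
name: "Keel-billed toucan"          # example values: replace with a species of your choice
description: "Another rainforest resident, easy to spot thanks to its rainbow bill."
image: "toucan.jpg"                 # expects a matching photo in static/img
EOF
# Restart the Hugo server; the range over .Site.Data.species will now render two blocks.
```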
+ {
+ "objectID": "posts/2020-12-07-making-websites-with-hugo/index.html#iframes",
+ "href": "posts/2020-12-07-making-websites-with-hugo/index.html#iframes",
+ "title": "Making websites with HUGO",
+ "section": "iFrames",
+ "text": "iFrames\nAn iFrame is a HTML tag that essentially allows you to embed another web page inside of your site.\n\nExercise 5\nFind a Youtube video and click on the share option below the video. Find the Embed option and copy the code that starts with