-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathexercise_questions.Rmd
237 lines (160 loc) · 8.54 KB
/
exercise_questions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
---
title: "At-home Statistical Workout 💪💪💪💪"
author: "Andrew MacDonald"
output:
distill::distill_article:
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Here are some suggested ideas for how to work out your likelihood skills!
They are divided into three sections: one, two and three parameters.
**NOTE** remember to consult the slides above for a step-by-step pseudocode approach
**To do an exercise: ** copy and paste the R code into an R source file. Run it, then work with the data.frame it creates!
## 1. One parameter
### 1.1 The patient baby birds
You are watching a female bird visiting her nest, bringing back food for nestlings.
Every time she comes back you write down the number of minutes since the last visit. You watch for 20 consecutive visits.
This species brings food once every 10 minutes, so six times an hour, and visits are perfectly random.
This implies an Exponential distribution with a rate of $6\text{visits} / 60\text{minutes}$ so in other words $0.1$ visits per minute.
Use maximum likelihood to confirm this number.
```{r bird-visits}
visit_rate <- 6/60
set.seed(1859)
time_between_visit <- rexp(20, rate = visit_rate)
bird_visits <- data.frame(times = time_between_visit)
hist(time_between_visit)
```
Questions / thoughts:
* **1.1.1** Begin by trying only positive numbers in your algorithm.
* **1.1.2** Run the the following `rexp(1, -1)`. What happens? Why?
* **1.1.3** How would you constrain your algorithm to use only positive numbers?
### 1.2 Monarch Butterfly numbers
You are concerned about Monarchs, so you've decided to count their eggs!
You look on 50 milkweed plants in your neighbourhood to find them.
The eggs follow a perfect Poisson distribution with a mean of 120 eggs per plant.
```{r}
mean_eggs <- 120
set.seed(1859)
counted_eggs <- rpois(50, mean_eggs)
egg_data <- data.frame(counted_eggs)
hist(counted_eggs)
```
You want to use Maximum Likelihood to look at this distribution. You want to be extra careful, so you want to do it two ways.
First, you try to fit a Poisson distribution like this:
$$
\text{eggs} \sim \text{Poisson}(\lambda)
$$
you do this by searching over values of $\lambda$ between 1 and 200.
**Questions**:
* **1.2.1** Write the expression for the maximum likelihood estimate of $\lambda$.
* **1.2.2** Describe how you will search between the values of 1 and 200 for the maximum likelihood estimate of $\lambda$.
* **1.2.3** Maximize the likelihood and recover the true value of 120.
#### Monarch Butterflies Part 2
Lambda can't be 0 or negative. You decide to replace the value of $\lambda$ with a function that is always greater than 0:
$$
f(x) = e^x
$$
```{r}
curve(exp(x), xlim = c(-3, 3))
```
Since this function is always positive, and $\lambda$ is always positive, the function can take the place of $\lambda$ in your model.
$$
\text{eggs} \sim \text{Poisson}(e^x)
$$
**Questions**:
* **1.2.4** Now what values should you use for $x$ ?
* **1.2.5** Find the Maximum Likelihood value of $x$.
* **1.2.6** Are your answers from the two methods the same? How can you tell?
### 1.3 Difference between two normal distributions (part I)
You add some fertilizer to plants, because experiments.
This makes the plants grow more on average, because physiology.
At the end of your experiment, you cut all the plants down and measure their dry biomass, because tradition.
As a skilled botanist, you know that the standard deviation of the control plots will be exactly 5g, and the standard deviation of the treatment plants will be ***exactly the same**. You know that plants are on average 200g, and the fertilizer should increase this by 30g (a 15% increase).
```{r normal-diff}
real_difference <- 30
set.seed(1859)
control_plants <- rnorm(300, mean = 200, sd = 5)
treated_plants <- rnorm(300, mean = 200 + real_difference, sd = 5)
plant_growth <- rbind(
data.frame(treatment = 0, mass = control_plants),
data.frame(treatment = 1, mass = treated_plants)
)
par(mfrow = c(2,1))
hist(control_plants, breaks = 15)
hist(treated_plants, breaks = 15)
par(mfrow = c(1,1))
```
**Questions**
* **1.3.1** What is the likelihood for the observations? Consider that you know the control mean ($\mu$), the standard deviation of both treatment and control ($\sigma$). What is missing? Call this missing quantity $\alpha$.
* **1.3.2** Use maximum likelihood to confirm that the difference is really 30g.
### Difference between two normal distributions part (II)
The fertilizer makes plants grow, but affects different plants differently, because genetics.
As a VERY skilled geneticist, you know that the standard deviation of the control plots will be exactly 5g, but now you also know the standard deviation of the treatment plants will be exactly 9g. You know that plants are on average 200g, and the fertilizer should increase this by 30g (a 15% increase).
```{r normal-diff2}
real_difference <- 30
set.seed(1859)
control_plants2 <- rnorm(300, mean = 200, sd = 5)
treated_plants2 <- rnorm(300, mean = 200 + real_difference, sd = 9)
plant_growth2 <- rbind(
data.frame(treatment = 0, mass = control_plants),
data.frame(treatment = 1, mass = treated_plants)
)
par(mfrow = c(2,1))
hist(control_plants2, breaks = 15)
hist(treated_plants2, breaks = 15)
par(mfrow = c(1,1))
```
**Questions**
* **1.3.3** What is the likelihood for the observations? Consider that you know the control mean ($\mu$), the control standard deviation ($\sigma_c$) and the treatment standard deviation ($\sigma_t$). How is this different from the first equation?
* **1.3.4** Use maximum likelihood to confirm that the difference is really 30g.
* **1.3.5** Is the likelihood profile different from the first exercise? that is, does the plot of likelihood values (y axis) against parameter values (x axis) look the same?
## Two parameters
_The following two examples are lifted from Bolker , page 182_
Consider the tadpoles: they live a difficult life. They are in an experiment where they are kept at different densities. The experimenter introduces a predator to the container with the tadpoles, and observes as the predator kills tadpoles.
The experiment has six treatments: 5, 10, 15, 20, 25, 30 tadpoles per tank, and 5 tanks per treatment.
Predators can feed more when prey density is higher, but sometimes slow down when the prey get to very high densities.
In a classic Type II functional response, this is modelled with two numbers.
First, a predator species has some constant attack rate $a$ -- they only need to eat so much per unit time.
Second, they must "handle" each prey item (kill, digest, etc) before attacking another.
This handling time creates the slowdown in attack: no matter how many prey there are, each still takes the same effort to catch, kill, and digest.
The Type II functional response is written as:
$$
f(N) = \frac{a}{a + ahN}
$$
* $a$ is the attack rate
* $h$ is handling time
* $N$ is the number of prey
This equation produces a per capita rate of predation: the *proportion* of prey killed by one predator in a unit of time.
```{r}
a <- 0.8
h <- .3
curve(a/(1 + a*h*x), xlim = c(0,30), ylim = c(0,1))
```
* **2.1.1** Play with this equation in R, using `curve()`, and modify values of `a` and `h`. Convince yourself that the output is always between 0 and 1.
* **2.1.2** Are there any constraints on the parameters? Think about how you will handle that in your Maximum Likelihood search.
At the end of the experiment, the scientist comes back and measures how many tadpoles were killed in each treatment, out of the total number originally placed in that tank.
* **2.1.3** What probability distribution would you use for these data? What are the parameters of this distribution?
* **2.1.4** The functional response equation explored above replaces one of these parameters. Which one, and why?
#### Part 2
Simulate some data with known parameters for $a = 0.8$ and $h = 0.3$
```{r}
a <- 0.8
h <- .3
how_many_tanks <- 10
rbinom(how_many_tanks, prob = a/(1 + a*h*10), size = 10)
size_of_tank <- c(5, 10, 15, 20, 25, 30)
set.seed(1859)
dead_tadpoles <- lapply(size_of_tank, function(SIZE) rbinom(how_many_tanks, prob = a/(1 + a*h* SIZE), size = SIZE))
dead_tadpole_df <- data.frame(
density = rep(size_of_tank, times = how_many_tanks),
mortality = do.call(c, dead_tadpoles))
with(dead_tadpole_df,
plot(density, jitter(mortality, amount = .2),
pch = 21, bg = "darkgreen",
cex = 1.2, col = "grey")
)
```
* **2.1.5** Use the likelihood you wrote above to find the values of $a$ and $h$ that maximize the likelihood. Can you recover the values you used in the simulation?