---
title: VectorByte Methods Training
subtitle: Introduction to Bayesian Statistics
author: |
| The VectorByte Team
| (Leah R. Johnson, Virginia Tech)
title-slide-attributes:
data-background-image: VectorByte-logo_lg.png
data-background-size: contain
data-background-opacity: "0.2"
format: revealjs
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(cache = FALSE,
echo = FALSE,
message = FALSE,
warning = FALSE,
#fig.height=6,
#fig.width = 1.777777*6,
tidy = FALSE,
highlight = TRUE,
prompt = FALSE,
crop = TRUE,
comment = "#>",
collapse = TRUE)
library(knitr)
library(kableExtra)
library(xtable)
library(viridis)
options(stringsAsFactors=FALSE)
knit_hooks$set(no.main = function(before, options, envir) {
if (before) par(mar = c(4.1, 4.1, 1.1, 1.1)) # smaller margin on top
})
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_knit$set(width = 70)
source("my_knitter.R")
#library(tidyverse)
#library(reshape2)
#theme_set(theme_light(base_size = 16))
make_latex_decorator <- function(output, otherwise) {
function() {
if (knitr:::is_latex_output()) output else otherwise
}
}
insert_pause <- make_latex_decorator(". . .", "\n")
insert_slide_break <- make_latex_decorator("----", "\n")
insert_inc_bullet <- make_latex_decorator("> *", "*")
insert_html_math <- make_latex_decorator("", "$$")
## classoption: aspectratio=169
```
## Learning Objectives
1. Understand the basic principles underlying Bayesian modeling methodology
2. Introduce how to use Bayesian inference for real-world problems
3. Introduce computational tools to perform inference for simple models in R (how to turn the Bayesian crank)
## What is Bayesian Inference?
In the Bayesian approach our probabilities numerically represent rational beliefs.
`r sk1()`
`r myblue("Bayes rule")` provides a rational method for updating those beliefs in light of new information and incorporating/quantifying uncertainty in those beliefs.
`r sk1()`
Thus, ***Bayesian inference is an approach for understanding data inductively***.
## Recall: Bayes Theorem
Bayes Theorem allows us to relate the conditional probabilities of two events $A$ and $B$: $$
\text{Pr}(A|B) = \frac{\text{Pr}(B|A)\text{Pr}(A)}{\text{Pr}(B)}
$$
## What is Bayesian Inference?
We can re-write Bayes rule in terms of our parameters, $\theta$, and our data, $Y$:
\begin{align*}
\text{Pr}(\theta|Y) & = \frac{\text{Pr}(Y|\theta)\text{Pr}(\theta)}{\text{Pr}(Y)}
\end{align*}
The LHS is the main quantity of interest in a Bayesian analysis, the `r myblue("posterior")`, denoted $f(\theta|Y)$: $$
\overbrace{f(\theta|Y)}^\text{Posterior} \propto \overbrace{\mathcal{L}(\theta; Y)}^\text{Likelihood} \times \overbrace{f(\theta)}^\text{Prior}
$$
## Bayesian methods provide
1. models for rational, quantitative learning
2. parameter estimates with good statistical properties
3. estimators that work for small and large sample sizes
4. parsimonious descriptions of data, predictions for missing data, and forecasts for future data
5. a coherent computational framework for model estimation, selection and validation
## Classical vs Bayesian
The fundamental difference between classical and Bayesian methods is what is `r myblue("fixed")` and what is `r myred("random")` in an analysis.
`r sk1()`
| Paradigm | Fixed | Random |
|:---------:|:----------------:|:----------------:|
| Classical | param ($\theta$) | data ($Y$) |
| Bayesian | data ($Y$) | param ($\theta$) |
## Why/Why Not Bayesian Statistics?
`r sk1()`
***`r myblue("Pros")`***
1. If $f(\theta)$ & $\mathcal{L}(\theta; Y)$ represent a rational person's beliefs, then Bayes' rule is an optimal method of updating these beliefs given new info (Cox 1946, 1961; Savage 1954, 1972).
2. Provides more intuitive answers in terms of the probability that parameters have particular values.
3. In many complicated statistical problems there are no obvious non-Bayesian inference methods.
## Why/Why Not Bayesian Statistics?
`r sk1()`
***`r myred("Cons")`***
1. It can be hard to formulate prior beliefs mathematically (the choice of $f(\theta)$ is often ad hoc or driven by computational convenience).
2. Posterior distributions can be sensitive to prior choice.
3. Analyses can be computationally costly.
## Steps to Making Inference
1. Research question
2. Data collection
3. Model $Y_i \approx f(X_i)$
4. Estimate the parameter in the model with uncertainty
5. Make inference
The difference between `r mygrn("Classical")` and `r myblue("Bayesian")` lies in step 4:
- `r mygrn("Classical")` uses maximum likelihood estimation\
- `r myblue("Bayesian")` derives a posterior distribution.
## Example: Estimating the probability of a rare event
Suppose we are interested in the prevalence of an infectious disease in a small city. A small random sample of 20 individuals will be checked for infection.
- Interest is in the fraction of infected individuals $$
\theta \in \Theta =[0,1]
$$
- The data records the number of infected individuals $$
y \in \mathcal{Y} =\{0,1, \ldots, 20\}
$$
## Example: Likelihood/sampling model
Before the sample is obtained, the number of infected individuals is unknown.
- Let $Y$ denote this to-be-determined value
- If $\theta$ were known, a sensible `r myblue("sampling")` model is $$
Y|\theta \sim \mathrm{Bin} (20, \theta)
$$
![](bin_sampling.png){fig-align="center"}
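A minimal sketch (not from the original slides) of how a figure like this could be drawn in R, showing the binomial pmf of $Y$ for two assumed values of $\theta$:

```{r, echo=TRUE, eval=FALSE}
# Sketch: binomial sampling distributions of Y for two assumed values of theta
y <- 0:20
plot(y, dbinom(y, size = 20, prob = 0.05), type = "h", lwd = 3,
     xlab = "number infected (y)", ylab = "Pr(Y = y | theta)")
points(y + 0.15, dbinom(y, size = 20, prob = 0.20), type = "h", lwd = 3, col = "grey")
legend("topright", c("theta = 0.05", "theta = 0.20"), lwd = 3,
       col = c("black", "grey"), bty = "n")
```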
## Example: Prior
Other studies from various parts of the country indicate that the infection rate ranges from about 0.05 to 0.20, with an average prevalence of 0.1.
- Moment matching from a beta distribution (a convenient choice) gives the prior $\theta \sim \mathrm{Beta} (2,20)$
```{r, echo=FALSE, echo=FALSE, fig.align="center", fig.height=2, fig.width=3.75, dev.args=list(bg='transparent'), no.main=TRUE}
#\begin{center}
#\includegraphics[scale=0.55,trim=10 50 0 60]{beta_bin_posterior}
#\end{center}
x<-seq(0,1, length=1000)
plot(x, dbeta(x, 2, 20), type="l", col="grey", xlab="", ylab="density", lwd=3, ylim=c(0,8), bty="l", cex.lab=0.75, cex.axis=0.75, mgp=c(2,1,0))
```
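As a quick check (not in the original slides), one can compute the implied prior mean and the prior probability assigned to the reported range in R:

```{r, echo=TRUE}
# Implied prior mean and prior mass between 0.05 and 0.20 for a Beta(2, 20) prior
a <- 2; b <- 20
a / (a + b)                            # implied prior mean (roughly 0.1)
pbeta(0.20, a, b) - pbeta(0.05, a, b)  # prior probability that theta is in [0.05, 0.20]
```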
## Example: Posterior
The prior and sampling model combination: \begin{align*}
\theta & \sim \mathrm{Beta} (a,b) \\
Y|\theta & \sim \mathrm{Bin} (n, \theta)
\end{align*} together with an observed $y$ (the data), leads to the `r myblue("posterior")`
$$
p(\theta|y)= \mathrm{Beta}(a+y, b+n-y)
$$
## Example: Posterior
For our case, we have $a=2$, $b=20$, $n=20$.
If we don't find any infections ($y=0$), our posterior is:
$$
p(\theta |y=0)= \mathrm{Beta}(2, 40)
$$
```{r, echo=FALSE, echo=FALSE, fig.align="center", fig.height=2.25, fig.width=4, dev.args=list(bg='transparent'), no.main=TRUE}
#\begin{center}
#\includegraphics[scale=0.55,trim=10 50 0 60]{beta_bin_posterior}
#\end{center}
x<-seq(0,1, length=1000)
plot(x, dbeta(x, 2, 20), type="l", col="grey", xlab="", ylab="density", lwd=3, ylim=c(0,15), bty="l", cex.lab=0.75, cex.axis=0.75, mgp=c(2,1,0))
lines(x, dbeta(x, 2, 40), lwd=3)
legend("topright", c("posterior (y=0)", "prior"), col=c("black", "grey"), lwd=3, bty="n")
```
## Example: Sensitivity Analysis
`r myred("How influential is our prior?")` The `r myblue("posterior expectation")` is $$
\mathrm{E}\{\theta|Y=y\} = \frac{n}{w+n} \bar{y} + \frac{w}{w+n} \theta_0
$$ a **weighted average** of the sample mean and the prior expectation: \begin{align*}
\theta_0 & = \frac{a}{a+b} ~~~~ \rightarrow \text{ prior expectation (or guess)} \\
w & = a + b ~~~~ \rightarrow \text{ prior confidence}
\end{align*}
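A minimal numerical sketch (not in the original slides) of this decomposition for the infection example, with $a=2$, $b=20$, $n=20$, $y=0$:

```{r, echo=TRUE}
# Posterior expectation as a weighted average of the sample mean and the prior guess
a <- 2; b <- 20; n <- 20; y <- 0
theta0 <- a / (a + b)   # prior expectation
w      <- a + b         # prior confidence (prior "sample size")
ybar   <- y / n         # sample mean
(n / (w + n)) * ybar + (w / (w + n)) * theta0  # weighted-average form
(a + y) / (a + b + n)                          # posterior mean computed directly
```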
## Example: A non-Bayesian approach
A standard estimate of a population proportion, $\theta$, is the sample mean $\bar{y} = y/n$. If $y=0$, then $\bar{y} = 0$.
`r sk1()`
Understanding the sampling uncertainty is crucial (e.g., for reporting to health officials).
The most popular 95% confidence interval for a population proportion is the `r myblue("Wald Interval")`: $$
\bar{y} \pm 1.96 \sqrt{\bar{y}(1-\bar{y})/n}.
$$ This has the correct *`r myblue("asymptotic")`* coverage, but $y=0$ `r myred("is still problematic")`!
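A quick numerical illustration (not in the original slides) of the problem: with $y=0$ the Wald interval collapses to a single point at zero.

```{r, echo=TRUE}
# Wald 95% interval when y = 0, n = 20: it degenerates to [0, 0]
n <- 20; y <- 0; ybar <- y / n
ybar + c(-1, 1) * 1.96 * sqrt(ybar * (1 - ybar) / n)
```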
## Conjugate Bayesian Models
Some sets of priors/likelihoods/posteriors exhibit a special relationship called **`r myblue("conjugacy")`**: when posterior and prior distributions have the same form.
E.g., in our Beta-Binomial/Bernoulli example: \begin{align*}
\theta & \sim \mathrm{Beta} (a,b) \\
Y|\theta & \sim \mathrm{Bin} (n, \theta) \\
\theta | Y & \sim \mathrm{Beta}(a^*, b^*)
\end{align*}
## Are all posteriors in the same family as the priors? ***`r myred("No")`***
`r sk2()`
Conjugacy is a nice special property, but most of the time this isn't the case.
`r sk1()`
In general, obtaining an analytic form of the posterior distribution can be hard or impossible.
## What do you do with a Posterior?
- Summarize important aspects of the posterior
- mean, median, mode, variance...
- Check sensitivity of posterior to prior choice
- Say what range of parameters is consistent with the observed data given our prior information
- Make predictions
## Posterior Summaries (point)
For the Beta-Binomial model, we found that $$
p(\theta | y)= \mathrm{Beta}(a+y, b+n-y).
$$ We can calculate multiple summaries exactly, for example: \begin{align*}
\mathrm{mean}= \mathrm{E}[\theta|Y] & = \frac{a+y}{a+b+n} \\
\mathrm{mode}(\theta|Y) & = \frac{a+y-1}{a+b+n-2} ~~~ \dagger
\end{align*}
`r sk1()` $\dagger$ a.k.a. the *maximum a posteriori estimator* (MAP)
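For example, a short sketch (not in the original slides) evaluating these summaries for the infection example with $a=2$, $b=20$, $n=20$, $y=0$:

```{r, echo=TRUE}
# Posterior mean and mode (MAP) under the Beta(a + y, b + n - y) posterior
a <- 2; b <- 20; n <- 20; y <- 0
c(mean = (a + y) / (a + b + n),
  mode = (a + y - 1) / (a + b + n - 2))
```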
## Prior Sensitivity
The posterior expectation can be written as a weighted average of information from the prior and the data $$
\mathrm{E}\{\theta |Y=y\} = \frac{n}{a + b +n} \bar{y} + \frac{a+b}{a+b+n} \theta_0.
$$ Thus $a$ and $b$ can be interpreted here as ***prior data*** where $a$ is the number of `r myred("prior successes")` and $a+b$ is the `r myblue("prior sample size")`. When $n\gg a+b$ most of our information comes from the data instead of the prior.
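A small illustration (values assumed for the sketch) of the data dominating the prior when $n \gg a+b$:

```{r, echo=TRUE}
# With n much larger than a + b, the posterior mean is close to the sample mean
a <- 2; b <- 20
n <- 2000; y <- 100   # hypothetical large sample, with ybar = 0.05
c(sample_mean = y / n, posterior_mean = (a + y) / (a + b + n))
```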
## Visualizing the prior vs. posterior
We can also check sensitivity visually, since general analytic approaches are not available.
```{r, echo=FALSE, echo=FALSE, echo=FALSE, fig.align="center", fig.height=2.35, fig.width=4.5, dev.args=list(bg='transparent'), no.main=TRUE}
#\begin{center}
#\includegraphics[scale=0.5,trim=30 50 0 20]{prior_sens}
#\end{center}
x<-seq(0,1, length=1000)
n<-20
y<-1
a1<-1
b1<-1
a2<-15
b2<-20
par(mfrow=c(1,3), bty="l", mar = c(3.1, 1.75, 1.1, 1.1))
plot(x, dbeta(x, a1, b1), type="l", col="grey", xlab="",
ylab="density", lwd=3,
ylim=c(0,13), bty="l", cex.lab=0.75, cex.axis=0.75, mgp=c(2,1,0))
lines(x, dbeta(x, a1+y, b1+n-y), lwd=3)
legend("topright", c("posterior (y=1, n=20)", "prior (a=1, b=1)"),
col=c("black", "grey"), lwd=3, bty="n", cex=0.5)
abline(v=(a1+y)/(a1+b1+n), col=2, lty=2)
plot(x, dbeta(x, a2, b2), type="l", col="grey", xlab="",
ylab="density", lwd=3,
ylim=c(0,13), bty="l", cex.lab=0.75, cex.axis=0.75, mgp=c(2,1,0))
lines(x, dbeta(x, a2+y, b2+n-y), lwd=3)
legend("topright", c("posterior (y=1, n=20)", "prior (a=15, b=20)"),
col=c("black", "grey"), lwd=3, bty="n", cex=0.5)
abline(v=(a2+y)/(a2+b2+n), col=2, lty=2)
plot(x, dbeta(x, a2, b2), type="l", col="grey", xlab="",
ylab="density", lwd=3,
ylim=c(0,13), bty="l", cex.lab=0.75, cex.axis=0.75, mgp=c(2,1,0))
lines(x, dbeta(x, a2+5*y, b2+5*n-5*y), lwd=3)
legend("topright", c("posterior (y=5, n=100)", "prior (a=15, b=20)"),
col=c("black", "grey"), lwd=3, bty="n", cex=0.5)
abline(v=(a2+5*y)/(a2+b2+5*n), col=2, lty=2)
```
## Confidence Regions
An interval $[l(y), u(y)]$, based on the observed data $Y=y$, has 95 % Bayesian coverage for $\theta$ if $$
P(l(y) <\theta < u(y)|Y=y)=0.95
$$ The interpretation: it describes your information about the true value of $\theta$ after you have observed $Y=y$.
`r sk1()` Such intervals are typically called `r myblue("credible intervals")`, to distinguish them from frequentist confidence intervals. Both are referred to as CIs.
## Quantile-based (Bayesian) CI
Perhaps the easiest way to obtain a credible interval is to use the posterior quantiles.
To make a $100 \times (1-\alpha)$ % quantile-based CI, find numbers $\theta_{\alpha/2}<\theta_{1- \alpha/2}$ such that
- $P(\theta <\theta_{\alpha/2} |Y=y)=\alpha/2$
- $P(\theta >\theta_{1-\alpha/2} |Y=y)=\alpha/2$
The numbers $\theta_{\alpha/2},\theta_{1- \alpha/2}$ are the $\alpha/2$ and $1-\alpha/2$ posterior quantiles of $\theta$.
## Example: Binomial sampling + uniform prior
Suppose out of $n=10$ conditionally independent draws of a binary random variable we observe $Y=2$ ones (successes).
`r sk1()` Using a uniform prior distribution (a.k.a., $\mathrm{Beta}(1,1)$) for $\theta$, the posterior distribution is $\theta | y=2 \sim \mathrm{Beta}(1+2,1+10-2)$.
------------------------------------------------------------------------
A 95% CI from the 0.025 and 0.975 quantiles of this beta:
```{r, echo=TRUE}
round(qbeta(p=c(0.025, 0.975), 3, 9), 2)
```
```{r, echo=FALSE, fig.align="center", fig.height=2.25, fig.width=4, dev.args=list(bg='transparent'), no.main=TRUE}
#\begin{center}
#\includegraphics[scale=0.5,trim=30 50 0 50]{quantCI}
#\end{center}
x<-seq(0,1, length=1000)
plot(x, dbeta(x, 3, 9), type="l", col="black", xlab="", ylab="density", lwd=3,
ylim=c(0,3.25), bty="l", cex.lab=0.75, cex.axis=0.75, mgp=c(2,1,0))
abline(v=qbeta(0.025, 3, 9), col=2, lty=2, lwd=2)
abline(v=qbeta(0.975, 3, 9), col=2, lty=2, lwd=2)
legend("topright", c("posterior", "95% CI"),
col=c("black", "red"), lwd=3, lty=c(1,2), bty="n", cex=0.75)
```
The posterior probability that $\theta \in [0.06, 0.52]$ is 95%. BUT, there are $\theta$-values outside the CI that have higher probability \[density\] than points inside!
## Alternative: HPD region
A $100 \times (1-\alpha)$ % highest posterior density (HPD) region is the part of parameter space, $s(y)$, such that:
1. $P(\theta \in s(y) |Y=y)= 1-\alpha$
2. If $\theta_a \in s(y)$ and $\theta_b \notin s(y)$ then $P(\theta_a |Y=y)>P(\theta_b |Y=y)$
$\Rightarrow$ all points inside the HPD region have higher probability density than those outside.
------------------------------------------------------------------------
We collect the highest density points with total probability of at least $1-\alpha$:
![](HPD_example){fig-align="center"}
The 95% HPD region is \[0.04, 0.48\], which is narrower than the quantile-based CI, yet both contain 95% probability.
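A minimal numerical sketch (not from the original slides) of how an HPD interval can be found for a unimodal Beta posterior, here the $\mathrm{Beta}(3,9)$ posterior from the previous example; `hpd_beta` is a hypothetical helper that searches for the shortest interval holding the target probability, which for a unimodal density is the HPD interval:

```{r, echo=TRUE}
# Shortest interval containing `level` posterior probability for a Beta(a, b) posterior
hpd_beta <- function(a, b, level = 0.95) {
  width <- function(p_lo) qbeta(p_lo + level, a, b) - qbeta(p_lo, a, b)
  p_lo  <- optimize(width, interval = c(0, 1 - level))$minimum
  c(lower = qbeta(p_lo, a, b), upper = qbeta(p_lo + level, a, b))
}
round(hpd_beta(3, 9), 2)
```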
------------------------------------------------------------------------
`r myred("Exercise: Is a treatment for cancer effective?")`
We have data on $n$ cancer patients that have been given a treatment. Our outcome variable is whether or not it was "effective":
| Patient | Effectiveness | Numerical Data |
|:--------:|:-------------:|:--------------:|
| 1 | N | 0 |
| 2 | N | 0 |
| 3 | Y | 1 |
| $\vdots$ | $\vdots$ | $\vdots$ |
-------------------------------------
The appropriate sampling model for each patient is a Bernoulli: $$Y_i|\theta \stackrel{iid}{\sim} f(Y|\theta) = \text{Bern}(\theta)$$ where $\theta$ is the success rate of the treatment. Write down the likelihood for the $n$ patients. Then, assuming a $\mathrm{Beta}(a,b)$ prior for $\theta$, find the posterior distribution for $\theta|Y$. Does this look familiar?
## Next Steps
Next you'll complete a practical where you conduct a conjugate Bayesian analysis for the mean of a normal distribution on the midge data introduced in the likelihood chapter. You'll also visualize the effect of different prior choices on the posterior distribution.