Polishing pass on LFMCMC vignette

UofUEpiBio · Dec 11, 2024 · b57d0f8 · b57d0f8
1 parent 4011c4b
commit b57d0f8
Showing 1 changed file with 65 additions and 37 deletions.
diff --git a/vignettes/likelihood-free-mcmc.Rmd b/vignettes/likelihood-free-mcmc.Rmd
@@ -17,16 +17,23 @@ knitr::opts_chunk$set(
 )
 ```
 # Introduction
-The purpose of the "lfmcmc" function is to perform a Likelihood-Free Markhov Chain Monte Carlo simulation.
+The purpose of the "LFMCMC" class in epiworldR is to perform a Likelihood-Free Markhov Chain Monte Carlo (LFMCMC) simulation.
+LFMCMC is used to approximate models where the likelihood function is either unavailable or computationally expensive.
+This example assumes a general understanding of LFMCMC.
+To learn more about it, see the Handbook of Markhov Chain Monte Carlo by Brooks et al. (https://doi.org/10.1201/b10905).
+
 In this example, we use LFMCMC to recover the parameters of an SIR model.
 
-# Setup the SIR Model
+# Setup The SIR Model
+Our SIR model will have the following characteristics:
 
-This example uses a social network with parameters listed within the
-`ModelSIR` function.
-The virus name is specified (COVID-19), 1000 agents are initialized, the virus prevalence of 0.1 is declared, the transmission rate for any given agent is 0.3, and the recovery rate is set to 0.3.
-To create this model in epiworldR, use the `ModelSIR` and `agents_smallworld` functions.
+* **Virus Name:** COVID-19
+* **Initial Virus Prevalence:** 0.1
+* **Recovery Rate:** 0.3
+* **Transmission Rate:** 0.3
+* **Number of Agents:** 1,000
 
+We use the `ModelSIR` and `agents_smallworld` functions to construct the model in epiworldR.
 ```{r setup-sir}
 library(epiworldR)
 
@@ -48,8 +55,7 @@ agents_smallworld(
 )
 ```
 
-# Run the Model
-We run the model for 50 days and print the summary statistics.
+Then we run the model for 50 days and print the results.
 ```{r run-sir}
 verbose_off(model_sir)
 
@@ -61,24 +67,26 @@ run(
 
 summary(model_sir)
 ```
-For this example, we will try to recover the Recovery and Transmission rate parameters with LFMCMC.
-In most cases, you'd use an observed dataset, but we're using the simulated data generated by running the model to show we can recover the parameters through LFMCMC.
-```{r get-obs-data}
-sir_model_data <- get_today_total(model_sir)
+
+Note the "Model parameters" and the "Distribution of the population at time 50" from the above output.
+Our goal is to recover the model parameters (Recovery and Transmission rates) through LFMCMC.
+We accomplish this by comparing the population distribution from each simulation run to the "observed" distribution from our model.
+We get this distribution using the `get_today_total` function in epiworldR.
+```{r get-sir-model-data}
+model_sir_data <- get_today_total(model_sir)
 ```
 
-# Setup LFMCMC
-Now that we have our dataset, we can set up the LFMCMC object.
-LFMCMC requires four functions:
+For practical cases, you would use observed data, instead of a model simulation.
+We use a simulation in our example to show the accuracy of LFMCMC in recovering the model parameters.
+Whenever we use the term "observed data" below, we are referring to the model distribution ().
 
-* Simulation function: This runs an epiworld SIR model with a given set of parameters and outputs a simulated dataset from the model run. In our case, we are setting the Recovery and Transmission rate parameters and running the model for 50 days, then returning the final totals for each SIR state. If you were using an observed dataset, this function should return output that is has the same structure as the observed data (e.g., dimensions, value types, etc.)
-* Summary function: This extracts summary statistics from the dataset. This should produce the same output format for both the observed data and the simulation function output. In our case, since we are using the final SIR state totals as our "observed" data which is the same as the output of the simulation function, our summary function simply passes that data through.
-* Proposal function: This returns a new set of parameters. It is called the proposal function, because it is "proposing" the new parameters for the LFMCMC algorithm to try in the simulation function. It does this by taking the parameters from the previous run (`old_params`) and taking a random step away from those values.
-* Kernel function: This scores the results of the latest simulation run against the observed dataset, by comparing the summary statistics (from `summary_fun()`). LFMCMC uses the kernel score and the Hastings Ratio to determine whether or not to accept that run.
+# Setup LFMCMC
+In epiworldR, LFMCMC requires four functions:
 
-We define our LFMCMC functions:
+The **simulation function** runs a model with a given set of parameters and produces output that matches the structure of our observed data.
+For our example, we set the Recovery and Transmission rate parameters, run an SIR model for 50 days, and return the distribution of the population at the end of the run.
 
-```{r lfmcmc-fun-setup}
+```{r simfun}
 simulation_fun <- function(params) {
 
   set_param(model_sir, "Recovery rate", params[1])
@@ -92,40 +100,57 @@ simulation_fun <- function(params) {
   get_today_total(model_sir)
 
 }
+```
+
+The **summary function** extracts summary statistics from the given data.
+This should produce the same output format for both the observed data and the simulated data from `simulation_fun`.
+For our example, since the population distribution is already a summary of the data, our summary function simply passes that data through.
+With more complicated use cases, you might instead compute summary statistics such as the mean or standard deviation.
 
-summary_fun <- function(dat) {
-  return(dat)
+```{r sumfun}
+summary_fun <- function(data) {
+  return(data)
 }
+```
 
+The **proposal function** returns a new set of parameters, which it is "proposing" parameters for the LFMCMC algorithm to try in the simulation function.
+In our example, it takes the parameters from the previous run (`old_params`) and does a random step away from those values.
+
+```{r propfun}
 proposal_fun <- function(old_params) {
   res <- plogis(qlogis(old_params) + rnorm(length(old_params)))
   return(res)
 }
+```
 
+The **kernel function** effectively scores the results of the latest simulation run against the observed data, by comparing the summary statistics from `summary_fun` for each.
+LFMCMC uses the kernel score and the Hastings Ratio to determine whether or not to accept the parameters for that run.
+In our example, since `summary_fun` simply passes the data through, `simulated_stats` and `observed_stats` are our simulated and observed data respectively.
+```{r kernfun}
 kernel_fun <- function(simulated_stats, observed_stats, epsilon) {
   dnorm(sqrt(sum((simulated_stats - observed_stats)^2)))
 }
 ```
 
-Then we initialize the LFMCMC object using the `LFMCMC` function in epiworldR and the various setters for the four LFMCMC functions and the observed data.
+With all four functions defined, we can initialize the simulation object using the `LFMCMC` function in epiworldR along with the appropriate setter functions.
 ```{r lfmcmc-setup}
 lfmcmc_model <- LFMCMC(model_sir) |>
   set_simulation_fun(simulation_fun) |>
   set_summary_fun(summary_fun) |>
   set_proposal_fun(proposal_fun) |>
   set_kernel_fun(kernel_fun) |>
-  set_observed_data(sir_model_data)
+  set_observed_data(model_sir_data)
 ```
 
-# Run LFMCMC simulation
-We set the initial parameters of the simulation with Recovery rate of 0.1 and a Transmission rate of 0.5.
-We will run LFMCMC for 2,000 iterations and use a kernel epsilon of 1.0.
-Now, we can run the LFMCMC simulation with the `run_lfmcmc()` function in epiworldR.
+# Run LFMCMC Simulation
+To run LFMCMC, we need to set the initial model parameters.
+For our example, we use an initial Recovery rate of 0.1 and an initial Transmission rate of 0.5.
+We set the kernel epsilon to 1.0 and run the simulation for 2,000 samples (iterations) using the `run_lfmcmc` function.
 
 ```{r lfmcmc-run}
 initial_params <- c(0.1, 0.5)
-n_samples <- 2000
 epsilon <- 1.0
+n_samples <- 2000
 
 run_lfmcmc(
   lfmcmc = lfmcmc_model,
@@ -135,14 +160,17 @@ run_lfmcmc(
   seed = model_seed
 )
 ```
-# Print results
-We set the name of the LFMCMC statistics using the state names of the SIR model and set the names of the estimated parameters.
-Then we print the results of the simulation.
+
+# Results
+To make the printed results easier to read, we use the `set_params_names` and `set_stats_names` functions before calling `print`.
+We also use a burn-in period of 500 samples.
 ```{r lfmcmc-print}
+set_params_names(lfmcmc_model, c("Recovery rate", "Transmission rate"))
 set_stats_names(lfmcmc_model, get_states(model_sir))
-set_params_names(lfmcmc_model, c("Immune recovery", "Infectiousness"))
 
-print(lfmcmc_model)
+print(lfmcmc_model, burnin = 500)
 ```
-LFMCMC didn't perfectly recover the initial model parameters, but it did produce reasonably close estimates.
-If the dataset were observed and the initial Recovery and Transmission rates unknown, we could estimate them using LFMCMC.
+
+Recall that the observed data came from a model with a Recovery rate of 0.3 and a Transmission rate of 0.3.
+As the above output shows, LFMCMC made a close approximation of the parameters, which resulted in a close approximation of the observed population distribution.
+This example highlights the effectiveness of using LFMCMC for highly complex models.