Author: Steve Walker
Last Updated: 2025-01-10
This is a (currently draft) set of companion materials for a workshop on
using macpan2
for applied public
health modelling.
(Note: I highlight areas that cannot be finalized yet, with notes like this)
- Preliminaries
- Exploration
- Starter Model Library
- Simulating Dynamics
- Relating Model Specifications to Box Diagrams
- Relating Model Specifications to Explicit Dynamics
- Basics of Modifying Model Specifications
- Compare Simulated and Observed Incidence
- Reporting Delays
- Hospital Admissions and Occupancy
- Symptom Status
- Seroprevalence
- Vaccination
- Wastewater Pathogen Concentrations
- Vital Dynamics
- Importation and Stochasticity
- Parameterization
- Inference
- Stratification
Please read the syllabus before participating in this workshop. The instructor will review this syllabus with participants before the sessions start.
MLi: I would like a visualization of the roadmap.
The syllabus provides a roadmap for the workshop, guiding participants
through the exploration-parameterization-inference-stratification
strategy essential for compartmental modeling using the macpan2
package. Each session is organized around these steps, offering a
structured approach to model refinement. Participants will follow a
practical example that demonstrates each phase of the strategy, with
companion materials available for deeper exploration of related skills.
Each session contains exercises designed to reinforce key concepts and
encourage participants to apply lessons to their own compartmental model
diagrams. For those without a pre-existing model or whose models are not
applicable, a detailed example model is provided to ensure a hands-on
learning experience. By the end of the workshop, participants will have
the skills to evaluate the suitability of compartmental modeling for
public health problems, develop models using macpan2, navigate its
documentation to address challenges, and provide constructive feedback
for improving the tool.
MLi: We need to straight out the end goal of the workshop, what do we want people to get out of it? For me, I would like people to bring in a model and they can walk out with a macpan2 implementation of the model they bring in.
MLi: Given it is in-person, I would like a bit of interactive discussions. For example, we can do a brainstorming session of what people want based on their experience and challenges with their own models. We will list all of them and then SW can go through 1 at a time as a group and say, we can do this in macpan2 or we can't do it. This will give us an idea about future developments.
The goal of macpan2
is not to implement cutting edge modelling and
statistical techniques. Instead the goal is to implement well-understood
techniques, optimized for use by epidemiologists who are tasked with
answering real-world public health questions using data-calibrated
compartmental models.
Our effort to present standard tools in a convenient manner has led us to develop concepts that encapsulate how I believe modellers do (or should) think. By building and updating our software with these concepts in mind, I seek to converge on a tool that will help modellers focus more on epidemiology and less on computational details. We highlight these concepts in special boxes.
User Feedback Our modelling concepts are hypotheses that I would like feedback on. What concepts seem helpful/useful and what concepts seem harmful/useless? |
MLi: Do we have a good way to get feedbacks and not via email to SW?
Ideally these concepts would lead to a perfectly intuitive tool. In reality I recognize that there are important technical details that you will need to learn to use the tool effectively. We try to collect the more technical ideas in tip boxes.
Tip boxes like this contain technical information that could interrupt the flow of ideas, but is too important to bury in footnotes. |
If you’re attending this workshop, you’re already a highly skilled modeller. Many of the concepts will be familiar to you, so you may gain more by actively engaging with the tool itself. To support this process of self-discovery, I have designed a series of exercises.
For your first exercise, please familiarize yourself with the package website: https://canmod.github.io/macpan2. Pay particular attention to the quickstart guide and other user guides. |
Throughout these materials I will repeatedly use several coding idioms. Please familiarize yourself with these in advance of the workshop.
We will often use the base R pipe
operator,
|>
, when it improves readability.
MLi: We need to identify what is really necessary vs recommended. People can still use macpan2 without pipes right?
The macpan2
package is designed to be used with two other standard
packages, and so all of the code in these materials assumes that the
following packages are loaded.
library(macpan2) ## obviously
library(ggplot2)
library(dplyr)
library(lubridate)
This document was produced with the following R environment.
sessionInfo() |> print()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.0.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.3 dplyr_1.1.4 ggplot2_3.5.1 macpan2_1.13.0
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.3 knitr_1.49 rlang_1.1.4
## [5] xfun_0.49 generics_0.1.3 jsonlite_1.8.9 glue_1.8.0
## [9] colorspace_2.1-1 TMB_1.9.15 htmltools_0.5.8.1 fansi_1.0.6
## [13] scales_1.3.0 rmarkdown_2.29 grid_4.4.2 tibble_3.2.1
## [17] evaluate_1.0.1 munsell_0.5.1 MASS_7.3-61 fastmap_1.2.0
## [21] yaml_2.3.10 lifecycle_1.0.4 memoise_2.0.1 compiler_4.4.2
## [25] timechange_0.3.0 pkgconfig_2.0.3 lattice_0.22-6 digest_0.6.37
## [29] R6_2.5.1 tidyselect_1.2.1 utf8_1.2.4 oor_0.0.2
## [33] pillar_1.9.0 magrittr_2.0.3 Matrix_1.7-1 withr_3.0.2
## [37] tools_4.4.2 gtable_0.3.6 cachem_1.1.0
Participants will learn how to explore and modify models, and informally compare them with observed data.
MLi: We will have to also prepare for folks without a model/data. If this is the case, I was like to promote IIDDA. We can say, if you want have data or model, here is a very cool database we compiled. If you are interested in reading more, go here for the preprint.
At the core of macpan2 are model specifications, which primarily define the flows between compartments and set default values for parameters. These specifications offer a clear and systematic way to describe the behavior of a system, starting with the foundational elements of compartmental models. While additional complexities, such as vital dynamics, external sources and sinks, or initialization of state variables, are often necessary, I begin with simple specifications.
Throughout the workshop, I provide you with starting points that take care of much of the boilerplate, enabling you to concentrate on the unique aspects of your applications. For instance, you will explore a library of dynamical models, identify simple specifications, and see how they define flows and default parameter values. Hands-on exercises will guide you to experiment with these models, helping you uncover their structure and applications in a variety of public health contexts.
To list available models in the macpan2 library, use the following command:
show_models()
Directory | Title | Description |
---|---|---|
awareness | Awareness Models | Behaviour modifications in response to death |
hiv | HIV | A simple HIV model |
lotka_volterra_competition | Lotka-Volterra | Simple two-species competition model |
lotka_volterra_predator_prey | Lotka-Volterra | Simple predator-prey model |
macpan_base | Macpan Base | Re-implementation of the McMaster group’s COVID-19 model |
nfds | NFDS and Vaccine Design | An ecological model using population genomics to design optimal vaccines as implemented in Colijn et al. (2020) |
seir | Basic SEIR | Simple epidemic model with an exposed class |
shiver | SHIVER = SEIR + H + V | A modified SEIR model with Hospitalization and Vaccination |
si | Basic SI | A very simple epidemic model |
sir | Basic SIR | A very simple epidemic model |
sir_demog | SIR with Demography | An SIR model with birth and death |
sir_mosquito | Mosquito-Vector SIR | SIR model for mosquito vectors |
sir_waning | SIR with Waning Immunity | A basic SIR model with a flow from R back to S |
ww | Wastewater Model | Macpan base with an additional wastewater component |
This function displays the available models, along with their directories, titles, and descriptions.
To examine a specific model, such as the SIR model, load it into R:
sir = mp_tmb_library("starter_models", "sir", package = "macpan2")
print(sir)
## ---------------------
## Default values:
## ---------------------
## quantity value
## beta 0.2
## gamma 0.1
## N 100.0
## I 1.0
## R 0.0
##
## ---------------------
## Before the simulation loop (t = 0):
## ---------------------
## 1: S ~ N - I - R
##
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N",
## abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")
There are three parts to this specification.
- Default values: The default values that model quantities take
before each simulation begins1. In the
sir
example there are two parameters,beta
andgamma
, the total population size,N
, and the initial values of two state variables,I
andR
. - Before the simulation loop: Expressions that get evaluated by
macpan2
before the dynamical simulation loop. In thesir
example the initial value ofS
is computed. Note that the specification could have placedS
in the defaults, but as we will see the choice of whether to (1) place the initial value of a quantity in the defaults or (2) compute it inmacpan2
before the simulation loop begins will depend on a number of considerations. - At every iteration of the simulation loop: Expressions that get
evaluated iteratively during the simulation loop. In the
sir
example there are two flows, which I discuss more throughout the next few sections.
Model specification objects, like sir above, can be produced using several different functions. Above I produced such an object using mp_tmb_library() , which reads pre-existing specifications in from the library. If you were to go the directory that defines this library model you would see the script that actually produces the specification. This script uses the mp_tmb_model_spec() function to do so, and you can use this function directly yourself to produce your own model specifications. Later we will learn about other functions that take a model specification and return a modified version of this specification. This is useful when a model needs to be modified so that it can be compared with data (e.g., account for under-reporting, time-varying transmission rates in response to school closures). |
The sir model illustrates the two kinds of expressions that can be used to define model behaviour: (1) Generic math expressions (e.g., S ~ N - I - R ) and (2) flows among compartments (e.g., mp_per_capita_flows(from ...) ). In generic math expressions, the tilde, ~ , is used to denote assignment. For example, the expression, S ~ N - I - R , says that the result of N - I - R is assigned to the quantity, S . Any function listed here can be used on the right-hand-side of these expressions. Each flow expression corresponds to a single flow between compartments. The tilde-based math expressions and flow expressions can be mixed to provide a great deal of flexibility in model specifications. Throughout I will discuss the relative advantages and disadvantages of using one or the other. |
MLi: I personally don't like this order. I would probably start with showing how to implement an SIR and then be like oh yeah, we have a bunch of templates where we already coded some of the commonly used models. It would be great to show the manual implementation matching up with the boilerplate.
The main thing that you will do with model specifications like sir
is
to simulate dynamics under these models. Because specifications come
with default values for parameters and initial states, they can be
simulated from without any further specification. This does not mean
that model library specifications will typically be ready for any
particular application without modification, and I will discuss many
types of modifications that can be made to these models today. However,
typically the first thing to do when given a new model is to simulate
from it, so you develop this skill immediately.
Follow this quick-start guide to simulate from the sir model using its default values. |
Simple, Consistent, and General Format for Simulation Output Simulations of macpan2 models are always returned in the same format – there are no options. This inflexibility is deliberate. The simulation data format is simple and easily adapted to whatever data organizational preferences you might have. This choice also benefits us because it means we do not need to worry about reinventing data preparation tools (many of which already exist) and focus on epidemiological modelling functionality (of which fewer options exist). This article gives a brief but full description of the data format. |
MLi: Page not found. Unless quickstart shows how to manually make an SIR, I am not comfortable with the current flow. I don't like the idea of right off the bat you use the template. Don't get me wrong, for an user that is probably the case, but for teaching, I would like a manual implementation first.
Model specification objects, such as the sir
object produced in the
previous section, are essentially textual descriptions of compartmental
model flow diagrams. Below I print out the during the simulation loop
portion of the sir
object, and plot the corresponding flow diagram,
highlighting the relationship between the two.
mp_print_during(sir)
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N",
## abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")
One aspect of the previous model printout that might be unclear to you
is the specific mathematical meaning of the mp_per_capita_flow
functions, and their arguments.
mp_print_during(sir)
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N",
## abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")
The technical documentation for this (and related) functions is here, but the following set of definitions summarizes the terminology.
Absolute flow rate: Total number of individuals flowing from one compartment to another in a single time-step. We call absolute flows things like infection and recovery . These names specified using the abs_rate argument in the mp_per_capita_flow() function. The numerical value of the absolute flow rate depends on the per-capita flow rate, but in a way that depends on the state update method.Per-capita flow rate: The rate at which individuals flow from one compartment to another in a single time-step, per individual in the source compartment. For example, the per-capita infection rate is often given by beta * I / N and for recovery is often gamma . These expressions are specified using the rate argument of the mp_per_capita_rate function.State update method: Mathematical framework for computing absolute flow rates from per-capita flow rates. For example, the difference equation (a.k.a. Euler ODE solver, mp_euler()) state update method results in infection = S * (beta * I / N) , with the per-capita flow rate in parentheses. For the state update method using the Runge-Kutta 4 ODE solver (mp_rk4()), the relationship between absolute flow rates and per-capita flow rates is more complex because it accounts (to some extent) for the fact that the state variables change continuously through a single time step. |
Now that these definitions are in place, let’s look at an example. One
can use the
mp_expand()
function to expand the mp_per_capita_flow()
notation to an explicit
form that is passed to the simulation engine.
sir |> mp_print_during()
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N",
## abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")
sir |> mp_expand() |> mp_print_during()
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: infection ~ S * (beta * I/N)
## 2: recovery ~ I * (gamma)
## 3: S ~ S - infection
## 4: I ~ I + infection - recovery
## 5: R ~ R + recovery
This output illustrates that the (default) simulation framework is a discrete time system of difference equations. In the first two expressions in the expanded version the absolute flow rates are computed, which are used in the last three expressions to update the state variables.
For another example, suppose you are more comfortable with treating compartmental models as ODEs. Indeed many modellers read the SIR flow diagram as representing this specific set of ODEs.
MLi: Expand text why one would want to use ODE. For example, people are used to it... bla bla bla
We can easily use the sir
compartmental model object to simulate it as
an ODE using the mp_rk4()
function, which uses the Runge Kutta 4 ODE
solver.
sir |> mp_rk4() |> mp_expand() |> mp_print_during()
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: k1_infection ~ S * (beta * I/N)
## 2: k1_recovery ~ I * (gamma)
## 3: k1_S ~ -k1_infection
## 4: k1_I ~ +k1_infection - k1_recovery
## 5: k1_R ~ +k1_recovery
## 6: k2_infection ~ (S + (k1_S/2)) * (beta * (I + (k1_I/2))/N)
## 7: k2_recovery ~ (I + (k1_I/2)) * (gamma)
## 8: k2_S ~ -k2_infection
## 9: k2_I ~ +k2_infection - k2_recovery
## 10: k2_R ~ +k2_recovery
## 11: k3_infection ~ (S + (k2_S/2)) * (beta * (I + (k2_I/2))/N)
## 12: k3_recovery ~ (I + (k2_I/2)) * (gamma)
## 13: k3_S ~ -k3_infection
## 14: k3_I ~ +k3_infection - k3_recovery
## 15: k3_R ~ +k3_recovery
## 16: k4_infection ~ (S + k3_S) * (beta * (I + k3_I)/N)
## 17: k4_recovery ~ (I + k3_I) * (gamma)
## 18: k4_S ~ -k4_infection
## 19: k4_I ~ +k4_infection - k4_recovery
## 20: k4_R ~ +k4_recovery
## 21: infection ~ (k1_infection + 2 * k2_infection + 2 * k3_infection + k4_infection)/6
## 22: recovery ~ (k1_recovery + 2 * k2_recovery + 2 * k3_recovery + k4_recovery)/6
## 23: S ~ S - infection
## 24: I ~ I + infection - recovery
## 25: R ~ R + recovery
Mix and Match Compartmental Models with State Update Methods Difference equations and ODEs are the two most common state update methods for simulating a compartmental model. The design of macpan2 allows different state update methods to be mixed and matched with different compartmental model specifications. This flexible modularity is a key part of the power of macpan2 – the model specification language is agnostic about the state update method. Therefore each compartmental model can be simulated in any number of ways. The full list of currently available state update methods is here. |
Here is one more example of an SIR model expansion that uses the Euler-multinomial distribution to simulate a compartmental model with process error.
MLi: Write a bit more text. Why do people want it and why is it good to include it. I know we are not teaching why people epi/modelling, but I feel like a few extra lines talking through some points can help instead the current version just saying "use this".
sir |> mp_euler_multinomial() |> mp_expand() |> mp_print_during()
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: infection ~ reulermultinom(S, (beta * I/N))
## 2: recovery ~ reulermultinom(I, gamma)
## 3: S ~ S - infection
## 4: I ~ I + infection - recovery
## 5: R ~ R + recovery
Note that all state update
methods
generate the same last three expressions for this sir
model object.
S ~ S - infection
I ~ I + infection - recovery
R ~ R + recovery
This means that all state update methods can be thought of as discrete time difference equations, but with different approaches to calculating the absolute flow rates per time step.
Embrace Difference Equations The original McMasterPandemic project was based entirely on discrete time models. Simulations from discrete time difference equations requires little math to implement and understand. This simplicity is nice in applied work where the barriers to real-world impact do not usually include mathematical novelty and elegance. Further, when modelling a real system it is natural to compare simulations with observed data, which are measured at discrete times anyways. Difference equations are not without their drawbacks however. Many results in mathematical epidemiology come from ODEs and various stochastic processes, and so when using difference equations epidemiologists might be concerned that their intuition from ODEs will not hold. Difference equations can also be less stable (e.g., state variables going below zero) than ODEs, which can become a real problem when calibrating models using trajectory matching. Therefore we make it easy to check these concerns by using a state update method like mp_rk4() (for a simple ODE solver) or mp_euler_multinomial() (for a simple process error model) to overcome these concerns whenever necessary.When using these alternative state update methods keep in mind that they too can be thought of as difference equation model. Carefully inspect the last three lines of any of the expanded sir models, which constitute the same set of difference equations no matter what state update method is used. This similarity indicates that all state update methods boil down to discrete-time difference equations, but with different approaches to calculating the absolute flow rates (e.g., infection and recovery ). These absolute rate variables describe the changes in the state variables (e.g., S , I , R ) due to different processes over a single time-step, allowing for continuous change within a time-step. This means, for example, that the time-series of the infection variable will contain the incidence over each time-step. So, for example, if each time-step represents one week, the infection variable will contain weekly incidence values.In summary, the choice of state update method will often not matter in applied work, but when it does the state update methods make it simple to explore options. |
MLi: I would go back and see if we can make a shorter version. We can keep this as it is, but I feel like there is an imbalance of efforts (which is not your fault), so the way to resolve this imbalance is to cut (pieces that are way too detailed) and not add (make other sectioned even bigger).
Write out by hand the last four lines of an expanded SEIR model. Read the SEIR model from the model library and test if you were correct. |
TODO: Simulate dynamics with different approaches.
The models in the model library are intended to provide a place to start, so let’s get started doing this.
Follow the advice here to create two copies of the SIR model: one called SI and one called SEIR. Place both of these models in a directory on your computer called macpan-workshop-library . Or just do one model as time permits.Simplify the model called SI so that it is an SI model (check your work here). Add flows to the model called SEIR so that it becomes an SEIR model (check your work here). Simulate a time-series of incidence from one of these models and plot it. |
MLi: Highlight this is basically what "off-the-shelf" model means.
The previous exercise described one approach to modifying existing
models, which involves copying a directory and editing the files in that
directory. Another approach uses the mp_tmb_update()
,
mp_tmb_insert()
,
mp_tmb_delete()
functions, which take a loaded model specification object (like the
sir
object above) and returns a modified version of that object.
Read this help page and be ready to discuss the differences between these functions: https://canmod.github.io/macpan2/reference/mp_tmb_insert. Use these functions to repeat the previous exercise by modifying the sir object within an R session instead of editing a model definition file. |
Now we are getting to the interesting stuff in this first Exploration
section of the Exploration-Parameterization-Inference-Stratification
methodology: using data (and other empirical information) and
simulations together to understand a system. This step is critical in
epidemiological modelling for applied public health work. No matter how
fancy you get with your techniques, your model must always be relevant
to a specific place, time, and public health concern.
Let’s get started by loading in some familiar Ontario COVID-19 data.
release = "https://github.com/canmod/macpan2/releases/download/macpan1.5_data"
covid_on = (release
|> file.path("covid_on.RDS")
|> url()
|> readRDS()
)
covid_on |> head() |> print()
## # A tibble: 6 × 4
## province date var value
## <chr> <date> <chr> <dbl>
## 1 ON 2020-02-04 Hospitalization NA
## 2 ON 2020-02-04 ICU NA
## 3 ON 2020-02-04 Ventilator NA
## 4 ON 2020-02-04 report NA
## 5 ON 2020-02-04 newTests NA
## 6 ON 2020-02-04 death NA
covid_on |> summary()
## province date var value
## Length:11481 Min. :2020-01-16 Length:11481 Min. :-26834981
## Class :character 1st Qu.:2021-01-27 Class :character 1st Qu.: 1
## Mode :character Median :2022-03-10 Mode :character Median : 95
## Mean :2022-03-14 Mean : 638
## 3rd Qu.:2023-05-26 3rd Qu.: 694
## Max. :2024-04-16 Max. : 26834981
## NA's :1793
Hmmm … lots of suspicious things to investigate … but let’s stay focused on case reports, which I will compare with simulated incidence.
reports = (covid_on
|> filter(var == "report")
|> filter(abs(value) < 10^4) ## do not believe numbers higher than 10^4
)
(reports
|> ggplot()
+ geom_line(aes(date, value))
+ theme_bw()
)
Just to start, let’s focus on a single wave so I do not need to worry about time-varying parameters and/or the causes of waves.
MLi: Is this hinting you will continue this exact same example later? If not, change the text accordingly please.
early_reports = filter(reports, date < as.Date("2020-07-01"))
(early_reports
|> ggplot()
+ geom_line(aes(date, value))
+ theme_bw()
)
Before comparing these data with simulations, I update the sir
model
to make it more realistic. We do not need to be perfect at this stage,
just reasonable. The goal is to make an initial plot containing both
simulations and observed data. This will form the foundation for
modifying the model to make it capture important features of the
underlying processes.
MLi: What does reasonable mean here? Expand the chain of thought.
sir_covid = mp_tmb_insert(sir
## Modify the the part of the model that
## is executed once "during" each iteration
## of the simulation loop.
, phase = "during"
## Insert the new expressions at the
## end of each iteration.
## (Inf ~ infinity ~ end)
, at = Inf
## Specify the model for under-reporting as
## a simple fraction of the number of new
## cases every time-step.
, expressions = list(reports ~ report_prob * infection)
## Update defaults to something a little more
## reasonable for early covid in Ontario.
, default = list(
gamma = 1/14 ## 2-week recovery
, beta = 2.5/14 ## R0 = 2.5
, report_prob = 0.1 ## 10% of cases get reported
, N = 1.4e7 ## ontario population
, I = 50 ## start with undetected infections
)
)
print(sir_covid)
## ---------------------
## Default values:
## ---------------------
## quantity value
## beta 1.785714e-01
## gamma 7.142857e-02
## N 1.400000e+07
## I 5.000000e+01
## R 0.000000e+00
## report_prob 1.000000e-01
##
## ---------------------
## Before the simulation loop (t = 0):
## ---------------------
## 1: S ~ N - I - R
##
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N",
## abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")
## 3: reports ~ report_prob * infection
Now we simulate the trajectory of the reports
variable in the model, and plot it against the observed data.
sim_early_reports = (sir_covid
|> mp_simulator(
time_steps = nrow(early_reports)
, outputs = "reports"
)
|> mp_trajectory()
|> mutate(
date = min(early_reports$date) + days(time)
)
)
comparison_early_reports = bind_rows(
list(
simulated = sim_early_reports
, observed = early_reports
)
, .id = "source"
)
(comparison_early_reports
|> ggplot()
+ geom_line(aes(date, value, colour = source))
## constrain y-axis to the range of the
## observed data, otherwise the simulations
## will make it impossible to see the data
+ coord_cartesian(ylim = c(0, max(early_reports$value, na.rm = TRUE)))
+ theme_bw() + theme(legend.position="bottom")
)
It took me a few iterations of model adjustments to make this plot, which illustrates a reasonably good fit before the peak of the wave and quite poor fit after it. Still, the next two concepts argue that this is quite a bit of progress and sets us up nicely for next steps.
MLi: Define "reasonably good fit". What makes you say that? (I am not saying you are wrong, I want you to say a bit more).
The challenge introduced by reporting delays is that we cannot compare the simulated incidence values from our model with observed case reports, because there is a reporting delay. The following toy example shows how the apparent peak in the data follows the simulations.
So before comparing the simulations with observed data, we use convolution to transform the simulated incidence values into a time series of reported cases that is expected by the model. Keep reading to find out specifically how this is done.
MLi: Explain what convolutions are vs a shift in data or whatever people do to handle reporting delays.
Now we can now compare the observed and expected reported cases in a manner that accounts for reporting delays.
Now to explain how the expected reported case curve is computed. In this method we assume that each new case waits a Gamma-distributed random amount of time before it is observed. The mean and coefficient of variation of this distribution can be fitted parameters. During the COVID-19 pandemic the McMaster group found that a Gamma distribution with mean delay of 11 days and coefficient of variation of 0.25 was a reasonable model.
MLi: Gamma is ok, for an example, do we want to use Normal instead? If you use Gamma, do you want to explain it a little? (i.e. non-negative and etc).
The following figure illustrates how a single expected case report is computed, in this case at time 26. The Gamma distribution of delay times is represented as bars giving the probability that a new case at a particular time is reported at time 26, which is given as the dashed horizontal line. Note how the reported number of cases is roughly near (but not identical) to the simulated number of cases at the peak of the delay distribution. This makes sense because we would expect most of our reports now to come from the highest probability time in the past.
In macpan2
we typically have a model that doesn’t account for delayed
reporting and we want to update it to do so before fitting to data. One
can use the mp_tmb_insert_reports()
function for this purpose.
si_with_delays = (si
|> mp_tmb_insert_reports(
incidence_name = "infection"
, mean_delay = 11
, cv_delay = 0.25
, report_prob = 1
, reports_name = "reports"
)
)
(si_with_delays
|> mp_simulator(
time_steps = 50L
, outputs = c("infection", "reports")
)
|> mp_trajectory()
|> ggplot()
+ geom_line(aes(time, value, colour = matrix))
+ theme_bw()
)
Notice how the 11
-day reporting delay specific using the mean_delay
argument of the mp_tmb_insert_reports
function has produced a
reports
variable that peaks about 9
or 10
days after the new
infection
peak. Non-linear
averaging has
shifted the individual-level delay of 11
days to something a little
lower at the population level. Still, the mp_tmb_insert_reports
function has had the desired qualitative effect of shifting the
reports
peak to the right of the infection
peak.
Read the documentation on mp_tmb_insert_reports, and have a look at the following model library examples that utilize this delay modelling functionality: shiver, macpan_base, wastewater. |
When these data are available and of high quality it is typically straightforward to compare them with simulations. One must obviously have a model containing hospital compartments (or modify an existing model to contain such compartments). One must also make sure that admissions are compared with flows into a hospital compartment and that occupancy is compared with a state variable of a hospital compartment.
The shiver model gives an example that utilizes hospital data. |
The simple compartmental models (i.e., si, sir, sir_waning, seir) do not account for variation in symptoms amongst infected individuals. However, the severity of symptoms can have a big impact on what cases get reported.
For some applications it is necessary to distinguish between different
symptom statuses, which we do by splitting up the I
box. This
splitting up can be done by replacing all calls to
mp_per_capita_flows()
that involve I
with flows that involve several
I
boxes (e.g., I_mild
, I_severe
) that group together individuals
with similar symptom status.
The macpan_base model utilizes this technique. A discussion of this technique is also given in the examples on the mp_per_capita_flow() help page. |
Modify the si model so that it has three boxes: S , I_mild , and I_severe . |
In cases where report data are broken down by symptom status, then this
technique of splitting up I
into several boxes might be sufficient.
However, if cases with different symptom statuses are combined in
reports, an additional model component must be added that weights the
relative likelihood that different symptom statuses will be reported.
This additional component is related to the reporting
delay functionality.
It is often the case that serological data can be used to estimate the degree of immunity in the population, which can in turn be used to estimate the number of individuals in recovered, vaccinated, and otherwise immune (or partially immune) compartments. The shiver model description provides an example of this technique, which involves transforming either the raw seroprevalence data so that it is comparable with different compartments or adding a variable to the model that produces a transformation of the variables tracking immune states that is comparable with the raw seroprevalence data.
MLi: Is shiver doing R/N or (N-S)/N for seroprevalence?
MLi: Can you do a better transition instead of jumping right in? A short paragraph saying, oh, you want to do vaccination, you will need x,y,z).
If you happen to know the vaccination schedule (e.g., number of doses administered per time-step for each time-step in the simulation), one simple way to model vaccination is to create a time-varying parameter that moves a known time-dependent number of individuals from a susceptible compartment to a protected compartment. An even simpler variation is just to include a constant number of doses per day, but this approach can be too simplistic especially when investigating questions of vaccine efficiacy.
One complexity with this approach is that it can reduce the number of
susceptible individuals below zero, if the user is not careful. This can
happen when the simulated number of S
individuals drops below the
number of doses per day. This issue could be addressed by using the
minimum of S
and the known number of doses as the realized number of
doses. However this approach has severe drawbacks when calibrating,
because as we will learn in the Parameterization
section macpan2
uses the
TMB package
for optimizing fit to data and TMB becomes much less efficient when we
use discontinuous functions like this. To address this complexity we can
use continuously saturating functions of S
to model the number of
doses, a technique that is covered here with the shiver
model.
This technique is also discussed in the examples on the
mp_per_capita_flow() help
page.
Wastewater models must have two kinds of compartments: those that
represent numbers of people and those that represent the concentration
of a pathogen in wastewater. Flows from a person-box into a
wastewater-box (e.g., shedding from I
to W
) requires one to use the
mp_per_capita_inflow()
function instead of the mp_per_capita_flow()
function (see
here
for details). This is because flowing from I
to W
does not cause a
reduction in the size of I
, because people are not becoming
pathogens. The mp_per_capita_inflow()
function ensures that W
is
increased by this shedding process without decreasing I
. This
technique is illustrated
here.
In many cases birth-death dynamics happen slowly enough relative to transmission dynamics that we can assume that the population size is constant, with no birth or death.
The sir_demog model includes birth and death. You can use this example to add vital dynamics to another model. Neither birth nor death can use mp_per_capita_flow() , but instead mp_per_capita_inflow() and mp_per_capita_outflow() (see here for examples and details). |
In many cases large jurisdictions can be considered closed and deterministic systems without altering conclusions. Small populations however are more strongly impacted by stochasticity.
The awareness model in the library illustrates how macpan2 can be used in these cases. |
Participants will learn how to parameterize models for making inferences about a particular population and public health problem.
We have done this already, because it is so common that the default
values provided in the model
library will
require adjustment for any particular purpose – especially for very
generic models like
sir.
See the sir_covid
model object that we produced above in this
section.
The default
argument of the
mp_tmb_model_spec()
and mp_tmb_update()
functions allows us to
supply prior information on model parameter values, but gives us no
opportunity to express uncertainty about these defaults. Later in the
Optimized Calibration section we will see how
to combine a model specification with distributional
specifications
for the parameters that we would like to estimate through calibration.
The distributional specifications are explained in the context of various use-cases in this article. |
We have covered a lot of gound above, but it has all paved the way towards the topic of formally calibrating a model using data. Before getting to this point we needed to do the following.
- Find a model specification to start with.
- Be able to simulate from this model
- Be able to modify this model so that we could compare simulations with different kinds of observed data – for example:
- Express prior uncertainty about model parameters
Now that we have this out of the way we can create a model calibrator object that combines (1) a model specification, (2) a dataset that can be compared with simulations from this model, and (3) a list of distributional assumptions about model parameters that we would like to estimate. Such a calibrator can be used to fit these parameters to data, and then simulate and forecast observable variables using this calibrated model. These calibrators can also be used to generate confidence intervals for these simulations and forecasts using (regularized) maximum likelihood theory.
This article steps through the process of calibrating a compartmental model specification, using the sanity check of fitting a model to simulations generated from the same model. The sequel to this article illustrates these techniques using real scarlet fever data from the International Infectious Disease Data Archive. |
The recommended approach to model fitting with macpan2
is maximum a
posteriori (MAP)
estimation,
which finds the values of parameters at the peak of a posterior
distribution. This approach allows us to get approximate Bayesian
uncertainty estimates by approximating the posterior distribution as a
multivariate Gaussian centered at the peak obtained by MAP. One is able
to declare that parameters be considered random effects, which means
that they are integrated out of the posterior distribution using the
Laplace
approximation.
Using this approach allows us to reduce the dimensionality of the
posterior distribution used in MAP by only optimizing over the fixed
effect parameters.
This approximate posterior distribution can be used to obtain confidence / credible intervals for any continuous function of model variables. The two main examples of this are as follows.
- Confidence intervals on fitted parameters using
mp_tmb_coef(., conf.int = TRUE)
. - Confidence intervals on fitted trajectories using
mp_trajectory_sd()
This approach to computing confidence intervals using Gaussian approximations to the posterior is a special case of the delta method that is implemented in TMB, which is described here.
MLi: Will need some tips when you hit optimization problem and NaNs. What do you do? Can we make a simple MRE?
Example calibrations of varying levels of complexity Intro to calibration using toy data Intro to calibration using real scarlet fever data For hackers only McMaster COVID model Wastewater model of COVID SHIVER model to COVID SEIR model using toy data |
You are free to declare any value in the model to be a trajectory or
parameter. Trajectories are compared with observed data to produce a
likelihood component (e.g., case reports are distributed with a negative
binomial distribution with dispersion parameter 1
) and parameters are
used to produce a component of the prior distribution (e.g., the
transmission rate is log normally distributed with mean 0.2
and
standard deviation 1
). Available distributional assumptions and
settings are described
here,
but inspecting the examples linked to in the previous FAQ will be more
useful for figuring out how to use this functionality.
Yes.
A common approach for making simple models (i.e., si, sir, sir_waning, seir, shiver) applicable to complex reality, is to use time-varying parameters. A common example from the COVID-19 pandemic involved the variation of the transmission rate through time in response to public health measures.
This article describes several approaches to calibrating time-varying parameters. |
Participants will learn how to make inferences using realistically parameterized models.
If you have a model_calibrator
object fitted to a dataset called
fitted_data
, you can visualize the goodness-of-fit of this model to
the data using ggplot2
in the following manner.
fitted_data = mp_trajectory(model_calibrator)
(observed_data
|> ggplot()
+ geom_point(aes(time, value))
+ geom_line(aes(time, value), data = fitted_data)
+ theme_bw()
)
This section of the article on model fitting illustrates other techniques, such as checking convergence of the optimizer, getting parameter estimates with confidence intervals, and visualizing the goodness-of-fit with confidence intervals. |
Forecasts can be made by extending the simulation bounds, and using the general tools for assessing model fits.
Counter-factuals can be introduced by using the model modification tools, and using the general tools for assessing model fits. Counter-factuals commonly involve making parameters time-varying in the forecast period (e.g., in 30 days the transmission rate will change due to non-pharmaceutical interventions being considered).
This example model includes a discussion of how to compute reproduction numbers and the moments of the generation interval distribution for a general compartmental model.
When one extends one of the simple models (i.e.,
si,
sir,
sir_waning,
seir,
shiver)
to be stratified (e.g., by age), transmission/infection processes
become more
complex.
In these cases one may specify S
, E
, and I
as vectors, and
infection rates as matrix operations involving these vectors and
matrices of transmission parameters. This approach is desecribed
here.
- Preliminaries
- Exploration
- Starter Model Library
- Simulating Dynamics
- Relating Model Specifications to Box Diagrams
- Relating Model Specifications to Explicit Dynamics
- Basics of Modifying Model Specifications
- Compare Simulated and Observed Incidence
- Reporting Delays
- Hospital Admissions and Occupancy
- Symptom Status
- Seroprevalence
- Vaccination
- Wastewater Pathogen Concentrations
- Vital Dynamics
- Importation and Stochasticity
- Parameterization
- Inference
- Stratification
Footnotes
-
These defaults can be overridden but let’s ignore that for now. We will cover the various ways to override defaults later. ↩