Skip to content

Latest commit

 

History

History
1108 lines (913 loc) · 91 KB

companion-materials.md

File metadata and controls

1108 lines (913 loc) · 91 KB

Workshop on macpan2 – Companion Materials

Author: Steve Walker

Last Updated: 2025-03-20

CC BY-NC-SA 4.0

This is a (currently draft) set of companion materials for a workshop on using macpan2 for applied public health modelling.

(Note: I highlight areas that cannot be finalized yet, with notes like this)

Table of Contents

Preliminaries

Syllabus

Please read the syllabus before participating in this workshop. The instructor will review this syllabus with participants before the sessions start.

The syllabus provides a roadmap for the workshop, guiding participants through the exploration-calibration-inference-stratification strategy that can be used for compartmental modeling. By the end of the workshop, participants will have the skills to develop compartmental models using macpan2, navigate its documentation to address challenges, and provide constructive feedback for improving the tool.

Philosophy

The goal of macpan2 is to implement well-understood techniques, for use by epidemiologists who are tasked with using data-calibrated compartmental models to answer public health questions.

Organization

We collect some of the more technical ideas in tip boxes.

Tip boxes like this contain technical information that could interrupt the flow of ideas, but is too important to bury in footnotes.

If you’re attending this workshop, you’re already an experienced modeller. Many of the concepts will be familiar to you, so you may gain more by doing illustrative exercises.

For your first exercise, please familiarize yourself with the package website: https://canmod.github.io/macpan2. Pay particular attention to the quickstart guide and other user guides.

Code Style

Throughout these materials I will repeatedly use several coding idioms. Please familiarize yourself with these in advance of the workshop.

We will often use the base R pipe operator, |>, when it improves readability.

Technical Setup

Instructions for setting your computer for using macpan2 are here.

Exploration

In this section you will learn how to explore and specify compartmental models, and informally compare them with observed data.

At the core of macpan2 are model specifications, which define compartment flows and set default parameter values. These specifications provide a systematic framework for describing system behavior, starting with the foundational elements of compartmental models. While additional complexities (e.g., vital dynamics) are often required, we are going to be focusing on simple models.

Parsimony
Starting simple reflects a good modeling principle: begin with a straightforward model and add complexity as needed for specific data, details, or public health objectives. This approach enables a quick start with macpan2 and a smooth transition to more intricate applications. It also underscores the value of abstract compartmental models stored as reusable templates, which can be efficiently adapted to diverse public health challenges.

Starter Model Library

The macpan2 model library provides some example models as a place to start.

To list available models in the macpan2 library:

show_models()
Directory Title Description
awareness Awareness Models Behaviour modifications in response to death
hiv HIV A simple HIV model
lotka_volterra_competition Lotka-Volterra Simple two-species competition model
lotka_volterra_predator_prey Lotka-Volterra Simple predator-prey model
macpan_base Macpan Base Re-implementation of the McMaster group’s COVID-19 model
nfds NFDS and Vaccine Design An ecological model using population genomics to design optimal vaccines as implemented in Colijn et al. (2020)
seir Basic SEIR Simple epidemic model with an exposed class
shiver SHIVER = SEIR + H + V A modified SEIR model with Hospitalization and Vaccination
si Basic SI A very simple epidemic model
sir Basic SIR A very simple epidemic model
sir_demog SIR with Demography An SIR model with birth and death
sir_mosquito Mosquito-Vector SIR SIR model for mosquito vectors
sir_waning SIR with Waning Immunity (SIRS) A basic SIR model with a flow from R back to S
ww Wastewater Model Macpan base with an additional wastewater component

This function displays the available models, along with their directories, titles, and descriptions. A simpler function just lists the models, mp_list_models. The mp_model_docs function will take you to the documentation for any particular model in this list.

To examine a specific model, such as the SIR model, load it into R:

sir = mp_tmb_library("starter_models", "sir", package = "macpan2")
print(sir)
## ---------------------
## Default values:
##  quantity value
##      beta   0.2
##     gamma   0.1
##         N 100.0
##         I   1.0
##         R   0.0
## ---------------------
## 
## ---------------------
## Before the simulation loop (t = 0):
## ---------------------
## 1: S ~ N - I - R
## 
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N", 
##      abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")

There are three parts to this specification.

  • Default values: The default values that model quantities take before each simulation begins1. In the sir example there are two parameters (the transmission rate beta and removal rate gamma), the total population size, N, and the initial values of two state variables, I and R.
  • Before the simulation loop: Expressions that macpan2 evaluates before the dynamical simulation loop. In the sir example the initial value of S is computed. The specification could have placed S in the defaults: in this case we compute it before the simulation loop begins instead in order to allow it to depend on other model quantities.
  • At every iteration of the simulation loop: Expressions that get evaluated iteratively during the simulation loop. In the sir example there are two flows, which I discuss more throughout the next few sections.
You can use several functions to produce model specification objects, like sir above, can be produced using several different functions. Above I produced such an object using mp_tmb_library(), which reads pre-existing specifications in from the library. If you go to the [directory]s(https://github.com/canmod/macpan2/tree/main/inst/starter_models/sir) that defines this library model you will see the R script that actually produces the specification, using the mp_tmb_model_spec() function; you can use this function yourself to produce your own model specifications. Later we will learn about other functions that take a model specification and return a modified version of this specification. This is useful when a model needs to be modified so that it can be compared with data (e.g., account for under-reporting, time-varying transmission rates in response to school closures).
The sir model illustrates the two kinds of expressions that can be used to define model behaviour: (1) Generic math expressions (e.g., S ~ N - I - R) and (2) flows among compartments (e.g., mp_per_capita_flows(from ...)). In generic math expressions, the tilde, ~, is used to denote assignment. For example, the expression, S ~ N - I - R, says that the result of N - I - R is assigned to the quantity, S. Any function listed here can be used on the right-hand-side of these expressions. Each flow expression corresponds to a single flow between compartments. The tilde-based math expressions and flow expressions can be mixed to provide a great deal of flexibility in model specifications.

Simulating Dynamics

The main thing that you will do with model specifications like sir is to simulate dynamics under these models. Simulation is useful to explore scenarios and to prepare for calibrating models to data. Because specifications come with default values for parameters and initial states, they can be simulated from without any further information. However, you will almost almost need to modify a built-in model specification to apply it to your particular situation; we will discuss some of the ways to modify these models.

Typically the first thing to do to understand and validate a new model is to simulate from it, so we will work on this immediately.

Follow this quick-start guide to simulate from the sir model using its default values.
Simple, Consistent, and General Format for Simulation Output
Simulations of macpan2 models are always return simulation outputs in the same format, a long format data frame where each output variable (e.g., S, I, R) has a separate row for every time. This output format is simple, works well with existing R frameworks (e.g., tidyverse/ggplot2) and can easily be restructured to other formats (e.g., with tidyr::pivot_wider()). This article gives a brief but full description of the data format.

Relating Model Specifications to Box Diagrams

Model specification objects, such as the sir object produced in the previous section, are essentially textual descriptions of compartmental model flow diagrams. Below we compare the during the simulation loop portion of the sir object with the corresponding flow diagram:

mp_print_during(sir)
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N", 
##      abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")

Identify how each aspect of the diagram (compartments, arrows, rates) are represented in the printout of the model specification.

Inspect the model specification again. What questions do you have about each part of the specification?

Relating Model Specifications to Explicit Dynamics

One possibly unclear aspect of the previous model printout is the specific mathematical meaning of the mp_per_capita_flow functions, and their arguments.

mp_print_during(sir)
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N", 
##      abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")

You can find the technical documentation for this function (and related functions) here, but the following set of definitions summarizes the terminology.

Absolute flow rate: Total number of individuals flowing from one compartment to another in a single time-step. We call absolute flows things like infection and recovery. These names specified using the abs_rate argument in the mp_per_capita_flow() function. The numerical value of the absolute flow rate depends on the per-capita flow rate, but in a way that depends on the state update method.

Per-capita flow rate: The rate at which individuals flow from one compartment to another in a single time-step, per individual in the source compartment. For example, the per-capita infection rate is often given by beta * I / N and for recovery is often gamma. These expressions are specified using the rate argument of the mp_per_capita_rate function.

State update method: Mathematical framework for computing absolute flow rates from per-capita flow rates. For example, the difference equation (a.k.a. Euler ODE solver, mp_euler()) state update method results in infection = S * (beta * I / N), with the per-capita flow rate in parentheses. For the state update method using the Runge-Kutta 4 ODE solver (mp_rk4()), the relationship between absolute flow rates and per-capita flow rates is more complex because it accounts (to some extent) for the fact that the state variables change continuously through a single time step.

Now that these definitions are in place, let’s look at an example. One can use the mp_expand() function to expand the mp_per_capita_flow() notation to an explicit form that is passed to the simulation engine.

sir |> mp_print_during()
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N", 
##      abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")
sir |> mp_expand() |> mp_print_during()
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: infection ~ S * (beta * I/N)
## 2: recovery ~ I * (gamma)
## 3: S ~ S - infection
## 4: I ~ I + infection - recovery
## 5: R ~ R + recovery

This output illustrates that the (default) simulation framework is a discrete time system of difference equations. In the first two expressions in the expanded version the absolute flow rates are computed, which are used in the last three expressions to update the state variables.

For another example, suppose you are more comfortable with treating compartmental models as ODEs. Indeed many modellers read the SIR flow diagram as representing this specific set of ODEs.

$$ \begin{align} \frac{dS}{dt} & = -S \times \beta I / N \\ \frac{dI}{dt} & = -I \times \gamma + S \times \beta I / N \\ \frac{dR}{dt} & = I \times \gamma \\ \end{align} $$

We can easily use the sir compartmental model object to simulate it as an ODE using the mp_rk4() function, which uses the Runge Kutta 4 ODE solver.

sir |> mp_rk4() |> mp_expand() |> mp_print_during()
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
##  1: k1_infection ~ S * (beta * I/N)
##  2: k1_recovery ~ I * (gamma)
##  3: k1_S ~ -k1_infection
##  4: k1_I ~ +k1_infection - k1_recovery
##  5: k1_R ~ +k1_recovery
##  6: k2_infection ~ (S + (k1_S/2)) * (beta * (I + (k1_I/2))/N)
##  7: k2_recovery ~ (I + (k1_I/2)) * (gamma)
##  8: k2_S ~ -k2_infection
##  9: k2_I ~ +k2_infection - k2_recovery
## 10: k2_R ~ +k2_recovery
## 11: k3_infection ~ (S + (k2_S/2)) * (beta * (I + (k2_I/2))/N)
## 12: k3_recovery ~ (I + (k2_I/2)) * (gamma)
## 13: k3_S ~ -k3_infection
## 14: k3_I ~ +k3_infection - k3_recovery
## 15: k3_R ~ +k3_recovery
## 16: k4_infection ~ (S + k3_S) * (beta * (I + k3_I)/N)
## 17: k4_recovery ~ (I + k3_I) * (gamma)
## 18: k4_S ~ -k4_infection
## 19: k4_I ~ +k4_infection - k4_recovery
## 20: k4_R ~ +k4_recovery
## 21: infection ~ (k1_infection + 2 * k2_infection + 2 * k3_infection + k4_infection)/6
## 22: recovery ~ (k1_recovery + 2 * k2_recovery + 2 * k3_recovery + k4_recovery)/6
## 23: S ~ S - infection
## 24: I ~ I + infection - recovery
## 25: R ~ R + recovery
Mix and Match Compartmental Models with State Update Methods
Difference equations and ODEs are the two most common state update methods for simulating a compartmental model. The design of macpan2 allows different state update methods to be mixed and matched with different compartmental model specifications. This flexible modularity is a key part of the power of macpan2 – the model specification language is agnostic about the state update method. Therefore each compartmental model can be simulated in any number of ways. The full list of currently available state update methods is here.

Here is one more example of an SIR model expansion that uses the Euler-multinomial distribution to simulate a compartmental model with process error.

sir |> mp_euler_multinomial() |> mp_expand() |> mp_print_during()
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: infection ~ reulermultinom(S, (beta * I/N))
## 2: recovery ~ reulermultinom(I, gamma)
## 3: S ~ S - infection
## 4: I ~ I + infection - recovery
## 5: R ~ R + recovery

Note that all state update methods generate the same last three expressions for this sir model object.

S ~ S - infection
I ~ I + infection - recovery
R ~ R + recovery

This means that all state update methods can be thought of as discrete time difference equations, but with different approaches to calculating the absolute flow rates per time step.

Embrace Difference Equations
The original McMasterPandemic project was based entirely on discrete time models. Simulations from discrete time difference equations requires little math to implement and understand. This simplicity is nice in applied work where the barriers to real-world impact do not usually include mathematical novelty and elegance. Further, when modelling a real system it is natural to compare simulations with observed data, which are measured at discrete times anyways.

Difference equations are not without their drawbacks however. Many results in mathematical epidemiology come from ODEs and various stochastic processes, and so when using difference equations epidemiologists might be concerned that their intuition from ODEs will not hold. Difference equations can also be less stable (e.g., state variables going below zero) than ODEs, which can become a real problem when calibrating models using trajectory matching. Therefore we make it easy to check these concerns by using a state update method like mp_rk4() (for a simple ODE solver) or mp_euler_multinomial() (for a simple process error model) to overcome these concerns whenever necessary.

When using these alternative state update methods keep in mind that they too can be thought of as difference equation model. Carefully inspect the last three lines of any of the expanded sir models, which constitute the same set of difference equations no matter what state update method is used. This similarity indicates that all state update methods boil down to discrete-time difference equations, but with different approaches to calculating the absolute flow rates (e.g., infection and recovery). These absolute rate variables describe the changes in the state variables (e.g., S, I, R) due to different processes over a single time-step, allowing for continuous change within a time-step. This means, for example, that the time-series of the infection variable will contain the incidence over each time-step. So, for example, if each time-step represents one week, the infection variable will contain weekly incidence values.

In summary, the choice of state update method will often not matter in applied work, but when it does the state update methods make it simple to explore options.
Write out by hand the last four lines of an expanded SEIR model.

Read the SEIR model from the model library and test if you were correct.

Basics of Modifying Model Specifications

Follow the advice here to create two copies of the SIR model: one called SI and one called SEIR. Place both of these models in a directory on your computer called macpan-workshop-library. Or just do one model as time permits.

Simplify the model called SI so that it is an SI model (check your work here).

Add flows to the model called SEIR so that it becomes an SEIR model (check your work here).

Simulate a time-series of incidence from one of these models and plot it.

The previous exercise described one approach to modifying existing models, which involves copying a directory and editing the files in that directory. Another approach uses the mp_tmb_update(), mp_tmb_insert(), mp_tmb_delete() functions, which take a loaded model specification object (like the sir object above) and returns a modified version of that object.

Read this help page and be ready to discuss the differences between these functions: https://canmod.github.io/macpan2/reference/mp_tmb_insert.

Use these functions to repeat the previous exercise by modifying the sir object within an R session instead of editing a model definition file.

Compare Simulated and Observed Incidence

Now we will use data and simulations together to understand a system.

Data Prep and Model Modifications Are Two Sides of the Same Coin
To confront a model with data one needs to do two general classes of things: (1) access and prepare data and (2) modify the model so that it is relevant to those data. Typically there is an interplay between how to prepare data and how you set up (or modify) the model. For example, you might choose to deal with under-reporting by dividing the incidence time-series by a known underreporting rate. In this case you wouldn’t need to modify the model, because scaling the data would account for underreporting. On the other hand, you might choose to transform the modelled incidence (number of new cases per time period) into a modelled observed incidence by multiplying the incidence by the reporting probability, and then comparing the simulations with the raw observed data. In this section we will explore a specific path to handling these issues.

Let’s get started by loading in some familiar Ontario COVID-19 data.

release = "https://github.com/canmod/macpan2/releases/download/macpan1.5_data"
covid_on = (release
  |> file.path("covid_on.RDS")
  |> url() 
  |> readRDS()
)
covid_on |> head() |> print()
## # A tibble: 6 × 4
##   province date       var             value
##   <chr>    <date>     <chr>           <dbl>
## 1 ON       2020-02-04 Hospitalization    NA
## 2 ON       2020-02-04 ICU                NA
## 3 ON       2020-02-04 Ventilator         NA
## 4 ON       2020-02-04 report             NA
## 5 ON       2020-02-04 newTests           NA
## 6 ON       2020-02-04 death              NA
covid_on |> summary()
##    province              date                var                value          
##  Length:11481       Min.   :2020-01-16   Length:11481       Min.   :-26834981  
##  Class :character   1st Qu.:2021-01-27   Class :character   1st Qu.:        1  
##  Mode  :character   Median :2022-03-10   Mode  :character   Median :       95  
##                     Mean   :2022-03-14                      Mean   :      638  
##                     3rd Qu.:2023-05-26                      3rd Qu.:      694  
##                     Max.   :2024-04-16                      Max.   : 26834981  
##                                                             NA's   :1793

Hmmm … lots of suspicious things to investigate … but let’s stay focused on case reports, which I will compare with simulated incidence.

reports = (covid_on
    |> filter(var == "report")
    |> filter(abs(value) < 10^4) ## do not believe numbers higher than 10^4
)
(reports
    |> ggplot()
    + geom_line(aes(date, value))
    + theme_bw()
)

Just to start, let’s focus on a single wave so I do not need to worry about time-varying parameters and/or the causes of waves.

early_reports = filter(reports, date < as.Date("2020-07-01"))
(early_reports
    |> ggplot()
    + geom_line(aes(date, value))
    + theme_bw()
)

Before comparing these data with simulations, I update the sir model to make it more realistic. We do not need to be perfect at this stage, just reasonable. The goal is to make an initial plot containing both simulations and observed data. This will form the foundation for modifying the model to make it capture important features of the underlying processes.

sir_covid = mp_tmb_insert(sir

    ## Modify the the part of the model that
    ## is executed once "during" each iteration
    ## of the simulation loop.
    , phase = "during"

    ## Insert the new expressions at the 
    ## end of each iteration.
    ## (Inf ~ infinity ~ end)
    , at = Inf

    ## Specify the model for under-reporting as
    ## a simple fraction of the number of new
    ## cases every time-step.
    , expressions = list(reports ~ report_prob * infection)

    ## Update defaults to something a little more 
    ## reasonable for early covid in Ontario.
    , default = list(
          gamma = 1/14      ## 2-week recovery
        , beta = 2.5/14     ## R0 = 2.5
        , report_prob = 0.1 ## 10% of cases get reported
        , N = 1.4e7         ## ontario population  
        , I = 50            ## start with undetected infections
    )
)
print(sir_covid)
## ---------------------
## Default values:
##     quantity        value
##         beta 1.785714e-01
##        gamma 7.142857e-02
##            N 1.400000e+07
##            I 5.000000e+01
##            R 0.000000e+00
##  report_prob 1.000000e-01
## ---------------------
## 
## ---------------------
## Before the simulation loop (t = 0):
## ---------------------
## 1: S ~ N - I - R
## 
## ---------------------
## At every iteration of the simulation loop (t = 1 to T):
## ---------------------
## 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I / N", 
##      abs_rate = "infection")
## 2: mp_per_capita_flow(from = "I", to = "R", rate = "gamma", abs_rate = "recovery")
## 3: reports ~ report_prob * infection
Take a moment to reconcile the code that produced this model update, with the printout of the updated model.

Did anything surprise you?

Would you have made different modelling choices?

If so, what questions do you have about how to code them?

Now we simulate the trajectory of the reports variable in the model, and plot it against the observed data.

sim_early_reports = (sir_covid 
    |> mp_simulator(
          time_steps = nrow(early_reports)
        , outputs = "reports"
    )
    |> mp_trajectory()
    |> mutate(
        date = min(early_reports$date) + days(time)
    )
)
comparison_early_reports = bind_rows(
      list(
          simulated = sim_early_reports
        , observed = early_reports
      )
    , .id = "source"
)
(comparison_early_reports
    |> ggplot()
    + geom_line(aes(date, value, colour = source))
    
    ## constrain y-axis to the range of the
    ## observed data, otherwise the simulations
    ## will make it impossible to see the data
    + coord_cartesian(ylim = c(0, max(early_reports$value, na.rm = TRUE)))
    
    + theme_bw() + theme(legend.position="bottom")
)

It took me a few iterations of model adjustments to make this plot, which illustrates a reasonably good fit before the peak of the wave and quite poor fit after it. Still, the next two concepts argue that this is quite a bit of progress and sets us up nicely for next steps.

Goodness of Fit Can Come Later
Being able to generate simulations that are comparable to observed data is an important step in-and-of-itself in applied public health modelling, even if the comparison shows that the model is inconsistent with the data. Just having this comparison pipeline sets us up to assess whether subsequent model modifications help explain the system.
Bad Fits Can Be Due to Bad Models and/or Bad Data
It can be tempting to assume that lack of fit can mean that the model is bad, but in applied work this is not nearly good enough. What we want is to make the most appropriate recommendations to public health policy makers, and this goal can be jeopardized by bad data as well as bad modelling. Sometimes the exploration step can involve producing various simulations to build up hypotheses, and these hypotheses can be more useful to policy makers than a model that fits extremely well to a dataset that we believe has low quality. Unfortunately there is no free lunch here, and we need to be wise.

Reporting Delays

The challenge introduced by reporting delays is that we cannot compare the simulated incidence values from our model with observed case reports, because there is a reporting delay. The following toy example shows how the apparent peak in the data follows the simulations.

So before comparing the simulations with observed data, we use convolution to transform the simulated incidence values into a time series of reported cases that is expected by the model. Keep reading to find out specifically how this is done.

Now we can now compare the observed and expected reported cases in a manner that accounts for reporting delays.

Now to explain how the expected reported case curve is computed. In this method we assume that each new case waits a Gamma-distributed random amount of time before it is observed. The mean and coefficient of variation of this distribution can be fitted parameters. During the COVID-19 pandemic the McMaster group found that a Gamma distribution with mean delay of 11 days and coefficient of variation of 0.25 was a reasonable model.

The following figure illustrates how a single expected case report is computed, in this case at time 26. The Gamma distribution of delay times is represented as bars giving the probability that a new case at a particular time is reported at time 26, which is given as the dashed horizontal line. Note how the reported number of cases is roughly near (but not identical) to the simulated number of cases at the peak of the delay distribution. This makes sense because we would expect most of our reports now to come from the highest probability time in the past.

In macpan2 we typically have a model that doesn’t account for delayed reporting and we want to update it to do so before fitting to data. One can use the mp_tmb_insert_reports() function for this purpose.

si_with_delays = (si
 |> mp_tmb_insert_reports(
      incidence_name = "infection"
    , mean_delay = 11
    , cv_delay = 0.25
    , report_prob = 1
    , reports_name = "reports"
 )
)
(si_with_delays
  |> mp_simulator(
      time_steps = 50L
    , outputs = c("infection", "reports")
  )
  |> mp_trajectory()
  |> ggplot()
  + geom_line(aes(time, value, colour = matrix))
  + theme_bw()
)

Notice how the 11-day reporting delay specific using the mean_delay argument of the mp_tmb_insert_reports function has produced a reports variable that peaks about 9 or 10 days after the new infection peak. Non-linear averaging has shifted the individual-level delay of 11 days to something a little lower at the population level. Still, the mp_tmb_insert_reports function has had the desired qualitative effect of shifting the reports peak to the right of the infection peak.

Read the documentation on mp_tmb_insert_reports, and have a look at the following model library examples that utilize this delay modelling functionality: shiver, macpan_base, wastewater.

Hospital Admissions and Occupancy

When these data are available and of high quality it is typically straightforward to compare them with simulations. One must obviously have a model containing hospital compartments (or modify an existing model to contain such compartments). One must also make sure that admissions are compared with flows into a hospital compartment and that occupancy is compared with a state variable of a hospital compartment.

The shiver model gives an example that utilizes hospital data.

Symptom Status

The simple compartmental models (i.e., si, sir, sir_waning, seir) do not account for variation in symptoms amongst infected individuals. However, the severity of symptoms can have a big impact on what cases get reported.

For some applications it is necessary to distinguish between different symptom statuses, which we do by splitting up the I box. This splitting up can be done by replacing all calls to mp_per_capita_flows() that involve I with flows that involve several I boxes (e.g., I_mild, I_severe) that group together individuals with similar symptom status.

The macpan_base model utilizes this technique. A discussion of this technique is also given in the examples on the mp_per_capita_flow() help page.
Modify the si model so that it has three boxes: S, I_mild, and I_severe.

In cases where report data are broken down by symptom status, then this technique of splitting up I into several boxes might be sufficient. However, if cases with different symptom statuses are combined in reports, an additional model component must be added that weights the relative likelihood that different symptom statuses will be reported. This additional component is related to the reporting delay functionality.

Seroprevalence

It is often the case that serological data can be used to estimate the degree of immunity in the population, which can in turn be used to estimate the number of individuals in recovered, vaccinated, and otherwise immune (or partially immune) compartments. The shiver model description provides an example of this technique, which involves transforming either the raw seroprevalence data so that it is comparable with different compartments or adding a variable to the model that produces a transformation of the variables tracking immune states that is comparable with the raw seroprevalence data.

Multiple Trajectories
Models that can simulate multiple empirically observable variables have a better chance of being reliable. For example, a model that only generates case reports will be less able to learn from data as one that can generate seroprevalence as well. In particular, seroprevalence will give better estimates of recovery parameters (for obvious reasons) but also might do so for transmission parameters as well because of better estimates of the size of the susceptible pool.

Vaccination

If you happen to now the vaccination schedule (e.g., number of doses administered per time-step for each time-step in the simulation), one simple way to model vaccination is to create a time-varying parameter that moves a known time-dependent number of individuals from a susceptible compartment to a protected compartment. An even simpler variation is just to include a constant number of doses per day, but this approach can be too simplistic especially when investigating questions of vaccine efficiacy.

One complexity with this approach is that it can reduce the number of susceptible individuals below zero, if the user is not careful. This can happen when the simulated number of S individuals drops below the number of doses per day. This issue could be addressed by using the minimum of S and the known number of doses as the realized number of doses. However this approach has severe drawbacks when calibrating, because as we will learn in the Calibration section macpan2 uses the TMB package for optimizing fit to data and TMB becomes much less efficient when we use discontinuous functions like this. To address this complexity we can use continuously saturating functions of S to model the number of doses, a technique that is covered here with the shiver model. This technique is also discussed in the examples on the mp_per_capita_flow() help page.

Wastewater Pathogen Concentrations

Wastewater models must have two kinds of compartments: those that represent numbers of people and those that represent the concentration of a pathogen in wastewater. Flows from a person-box into a wastewater-box (e.g., shedding from I to W) requires one to use the mp_per_capita_inflow() function instead of the mp_per_capita_flow() function (see here for details). This is because flowing from I to W does not cause a reduction in the size of I, because people are not becoming pathogens. The mp_per_capita_inflow() function ensures that W is increased by this shedding process without decreasing I. This technique is illustrated here.

Vital Dynamics

In many cases birth-death dynamics happen slowly enough relative to transmission dynamics that we can assume that the population size is constant, with no birth or death.

The sir_demog model includes birth and death. You can use this example to add vital dynamics to another model. Neither birth nor death can use mp_per_capita_flow(), but instead mp_per_capita_inflow() and mp_per_capita_outflow() (see here for examples and details).

Importation and Stochasticity

In many cases large jurisdictions can be considered closed and deterministic systems without altering conclusions. Small populations however are more strongly impacted by stochasticity.

The awareness model in the library illustrates how macpan2 can be used in these cases.

Calibration

Participants will learn how to parameterize models for making inferences about a particular population and public health problem.

Modify Default Values

We have done this already, because it is so common that the default values provided in the model library will require adjustment for any particular purpose – especially for very generic models like sir. See the sir_covid model object that we produced above in this section.

Express Uncertainty in Model Parameters

The default argument of the mp_tmb_model_spec() and mp_tmb_update() functions allows us to supply prior information on model parameter values, but gives us no opportunity to express uncertainty about these defaults. Later in the Optimized Calibration section we will see how to combine a model specification with distributional specifications for the parameters that we would like to estimate through calibration.

The distributional specifications are explained in the context of various use-cases in this article.

Optimized Calibration

We have covered a lot of gound above, but it has all paved the way towards the topic of formally calibrating a model using data. Before getting to this point we needed to do the following.

Now that we have this out of the way we can create a model calibrator object that combines (1) a model specification, (2) a dataset that can be compared with simulations from this model, and (3) a list of distributional assumptions about model parameters that we would like to estimate. Such a calibrator can be used to fit these parameters to data, and then simulate and forecast observable variables using this calibrated model. These calibrators can also be used to generate confidence intervals for these simulations and forecasts using (regularized) maximum likelihood theory.

This article steps through the process of calibrating a compartmental model specification, using the sanity check of fitting a model to simulations generated from the same model. The sequel to this article illustrates these techniques using real scarlet fever data from the International Infectious Disease Data Archive.

MAP Estimation

The recommended approach to model fitting with macpan2 is maximum a posteriori (MAP) estimation, which finds the values of parameters at the peak of a posterior distribution. This approach allows us to get approximate Bayesian uncertainty estimates by approximating the posterior distribution as a multivariate Gaussian centered at the peak obtained by MAP. One is able to declare that parameters be considered random effects, which means that they are integrated out of the posterior distribution using the Laplace approximation. Using this approach allows us to reduce the dimensionality of the posterior distribution used in MAP by only optimizing over the fixed effect parameters.

This approximate posterior distribution can be used to obtain confidence / credible intervals for any continuous function of model variables. The two main examples of this are as follows.

  1. Confidence intervals on fitted parameters using mp_tmb_coef(., conf.int = TRUE).
  2. Confidence intervals on fitted trajectories using mp_trajectory_sd()

This approach to computing confidence intervals using Gaussian approximations to the posterior is a special case of the delta method that is implemented in TMB, which is described here.

Example calibrations of varying levels of complexity
Intro to calibration using toy data

Intro to calibration using real scarlet fever data

For hackers only

McMaster COVID model

Wastewater model of COVID

SHIVER model to COVID

SEIR model using toy data

Likelihood and Prior Distribution Assumptions

You are free to declare any value in the model to be a trajectory or parameter. Trajectories are compared with observed data to produce a likelihood component (e.g., case reports are distributed with a negative binomial distribution with dispersion parameter 1) and parameters are used to produce a component of the prior distribution (e.g., the transmission rate is log normally distributed with mean 0.2 and standard deviation 1). Available distributional assumptions and settings are described here, but inspecting the examples linked to in the previous FAQ will be more useful for figuring out how to use this functionality.

Can I hack macpan2 to use MCMC esimation?

Yes.

Calibrating Time-Varying Parameters

A common approach for making simple models (i.e., si, sir, sir_waning, seir, shiver) applicable to complex reality, is to use time-varying parameters. A common example from the COVID-19 pandemic involved the variation of the transmission rate through time in response to public health measures.

This article describes several approaches to calibrating time-varying parameters.

Inference

Participants will learn how to make inferences using realistically parameterized models.

Assessing Model Fits

If you have a model_calibrator object fitted to a dataset called fitted_data, you can visualize the goodness-of-fit of this model to the data using ggplot2 in the following manner.

fitted_data = mp_trajectory(model_calibrator)
(observed_data
  |> ggplot()
  + geom_point(aes(time, value))
  + geom_line(aes(time, value), data = fitted_data)
  + theme_bw()
)
This section of the article on model fitting illustrates other techniques, such as checking convergence of the optimizer, getting parameter estimates with confidence intervals, and visualizing the goodness-of-fit with confidence intervals.

Forecasts

Forecasts can be made by extending the simulation bounds, and using the general tools for assessing model fits.

Counter-Factuals

Counter-factuals can be introduced by using the model modification tools, and using the general tools for assessing model fits. Counter-factuals commonly involve making parameters time-varying in the forecast period (e.g., in 30 days the transmission rate will change due to non-pharmaceutical interventions being considered).

Reproduction Numbers of Generation Interval Distributions

This example model includes a discussion of how to compute reproduction numbers and the moments of the generation interval distribution for a general compartmental model.

Stratification

When one extends one of the simple models (i.e., si, sir, sir_waning, seir, shiver) to be stratified (e.g., by age), transmission/infection processes become more complex. In these cases one may specify S, E, and I as vectors, and infection rates as matrix operations involving these vectors and matrices of transmission parameters. This approach is desecribed here.

Table of Contents

Footnotes

Footnotes

  1. These defaults can be overridden but let’s ignore that for now. We will cover the various ways to override defaults later.