Introduction to the simulation framework • Sima

The goal of the package is to provide a framework for defining simulators for a variety of purposes and applications in the health care domain. Several design choices have been made to make the simulator as general as possible, highly scalable and easy to use. The package also provides tools to calibrate the simulator according to chosen metrics.

The simulators created by the package are dynamic and operate on the individual level with a fixed-increment time progression where the synthetic population is subject to reoccurring events governed by stochastic processes.

Population fundamentals

To define a simulator, an initial population and a list of events are required. The initial population can be defined directly from real data, or it can be synthetic. In R, the population should be defined as a data.table with a row for each individual and each column corresponding to a status variable. There are no restrictions on the population, except that a column named “alive” must always exist in the table, with values 0 and 1 indicating whether the individual is alive or not. The simulator also uses a status variable called “last_event” to denote which event was applied last before the individual died. This can be used for example to determine the cause of death in the case where the simulator contains multiple events that are able to cause mortality.

Event fundamentals

Events are categorized into two distinct groups: manipulation events and accumulation events. Manipulation events modify the status of the population that is currently still alive in the simulation. This type of events can be for example aging, occurrence of a disease or mortality. Accumulation events on the other hand introduce new individuals to the population while keeping the status of the existing population otherwise intact. Accumulation events can model for example immigration or births. Each event has access to its own set of parameters that can be used to define statistical models and to enable the behavior of the event to be calibrated. Even functionality is essentially defined as an R expression that has access to the status of the synthetic population and the parameters of the event in question.

Constructing events

Both types of events are defined as R6 classes with the following constructors

ManipulationEvent$new(name, description, parameters, derived, allocated, mechanism)
AccumulationEvent$new(name, description, parameters, derived, allocated, mechanism)

For both types of events, the arguments of the constructor are defined as follows:

name is a character string that serves as a unique identifier.
description is an optional argument that provides a text-based explanation about the functionality and purpose of the event
parameters is a named list of parameters
derived is a named list of parameters, that are computed based on parameters.
allocated a named list of expression that can be used to define persistent private storage for the event.
mechanism is an R expression that defines what the event does. This expression can reference parameters, derived parameters, or allocated objects by name

As an example, consider a simulator where events are applied daily and we wish to model the aging of individuals in the population. This can be accomplished with the following manipulation event:

aging <- ManipulationEvent$new(
  name = "Aging",
  description = "Aging of individuals",
  parameters = list(
    day = 1.0 / 365.0
  )
  mechanism = expression({
    status[alive == 1L, age := age + day]
  })
)

In this instance we define a simple parameters day to describe how much a person ages per iteration in the simulation. We assume the existence of a status variable alive in the population that describes whether the person is still alive at present. Note that the parameter day can be simply referenced by name within the mechanism expression. The population status is a data.table so the appropriate syntax is required.

In the case of manipulation events, the mechanism should not have a return value. In the case of accumulation events however, the return value should be a data.table with the same columns as status with each row depicting those individuals that are to be added to the population.

Constructing a Simulator

The simulator is defined as an R6 class with the following constructor

Simulator$new(initializer, man_events, acc_events, seeds, keep_init, ...)

The arguments of the constructor are defined as follows:

initializer is a function that returns the initial population as a data.table.
man_events is a list of ManipulationEvent objects.
acc_events is a list of AccumulationEvent objects.
seeds is a vector of RNG seed numbers.
keep_init is logical value that indicates whether the initial status of the population should be stored.
... are the additional arguments passed to initializer

Calibration

After defining the events and the initial population, the next task is typically to obtain parameter values for the events such that some measure of the synthetic population matches that of an external validation data set as closely as possible. To this end, events with parameters can be configured after initialization. Furthermore, it is possible to run multiple simulations with different parameter configurations while keeping the initial population intact and using the same seed values for the pseudo random number generation each time such that the any difference in output will be purely a result of the varying parameter values. Standard R optimization tools such as optim can be used.

The Simulator class method configure makes it easy to run the simulation with different parameter values while keeping the initial population intact between runs.

Simulator$configure(N, pars, output, seeds = 123, ...)

The arguments for the method are as follows:

N is the length of the simulation for a single function evaluation in the optimization. The value that should be used is highly dependent on target feature of the calibration, the objective function and the size of population.
pars is a named list corresponding to the event names to be calibrated. Each element should also be a named list contain the names of the specific parameters and their values to be changed.
output is a function that is applied to the synthetic population after the simulation. The first argument should accept the status data.table.
seeds are the seed values to use for random number generation.
... are the additional arguments passed to output.

For the standard optim function, the objective function to use and the calibration process itself take the following form

obj_optim <- function(par, validation) {
  pars <- list("Event 1" = list("x1_1" = x1_1_value, ...),
               "Event 2" = list("x2_1" = x2_1_value, ...),
               ...
               "Event n" = list("xn_1" = xn_1_value, ...))
  out <- sim$configure(N, p, output)
  return(objective(out, validation))
}

result <- optim(par = par_init, fn = obj_optim, validation = some_data)

Note that this is not actual usable R code, but simply a pseudo-code representation in R syntax, where the ellipsis simply denotes a list of n events or p_n parameters for a specific event. The idea is that the user has to define which event corresponds to which set of parameters and their values. For example, for the event named “Event 1”, we set its parameter x1_1 to have the value x1_1_value along with any other parameters and values that could be included within .... Next, the configure method is used to compute a summary statistics from the synthetic population. This is then compared to a set of validation data, provided in validation by the actual objective function, denoted here by objective. For other functions beside optim, alternate formulations will likely have to be used. See the vignette("example") for a practical demonstration of the entire calibration process.

Running the simulator

Simulation sequences of varying lengths can be carried out with the Simulator class method run.

Simulator$run(N, show_progress = TRUE)

Here, N is an integer value indicating the length of the simulation. The current status of the population can be accessed at any time via the method get_status. The argument show_progress enables a progress bar for the simulation, when not using parallel computation.

Parallel computation

The simulator can be configured to take advantage of multiple computing cores via the class method start_cluster.

Simulator$start_cluster(cl, nc, export, packages, interface = c("doParallel", "doMPI"))

The arguments are:

cl is a cluster object.
nc denotes the number of workers to use.
export is a character vector of object names in the global environment that should be made available to the workers, such as user-defined functions or additional data.
packages is a character vector of package names that should be loaded on the workers.
interface gives the foreach backend to use, can be either doParallel or doMPI.

The population is split into chunks of equal size (if possible) and distributed to the available workers along with other necessary data. The implementation takes advantage of the foreach package for this purpose and any interface to this looping construct can be used fundamentally, but only doParallel and doMPI are currently supported. The following example demonstrates the process of setting up the computation cluster in the case of doParallel:

nc <- parallel::detectCores()
cl <- parallel::makeCluster(nc)
doParallel::registerDoParallel(cl)

sim$start_cluster(cl, nc, interface = "doParallel")
# Carry out calibration / simulations
sim$stop_cluster()