The goal of the package is to provide a framework for defining simulators for a variety of purposes and applications in the health care domain. Several design choices have been made to make the simulator as general as possible, highly scalable and easy to use. The package also provides tools to calibrate the simulator according to chosen metrics.
The simulators created by the package are dynamic and operate on the individual level with a fixed-increment time progression where the synthetic population is subject to reoccurring events governed by stochastic processes.
To define a simulator, an initial population and a list of events are required. The initial population can be defined directly from real data, or it can be synthetic. In R, the population should be defined as a data.table
with a row for each individual and each column corresponding to a status variable. There are no restrictions on the population, except that a column named “alive” must always exist in the table, with values 0 and 1 indicating whether the individual is alive or not. The simulator also uses a status variable called “last_event” to denote which event was applied last before the individual died. This can be used for example to determine the cause of death in the case where the simulator contains multiple events that are able to cause mortality.
Events are categorized into two distinct groups: manipulation events and accumulation events. Manipulation events modify the status of the population that is currently still alive in the simulation. This type of events can be for example aging, occurrence of a disease or mortality. Accumulation events on the other hand introduce new individuals to the population while keeping the status of the existing population otherwise intact. Accumulation events can model for example immigration or births. Each event has access to its own set of parameters that can be used to define statistical models and to enable the behavior of the event to be calibrated. Even functionality is essentially defined as an R expression that has access to the status of the synthetic population and the parameters of the event in question.
Both types of events are defined as R6 classes with the following constructors
ManipulationEvent$new(name, description, parameters, derived, allocated, mechanism)
AccumulationEvent$new(name, description, parameters, derived, allocated, mechanism)
For both types of events, the arguments of the constructor are defined as follows:
name
is a character string that serves as a unique identifier.description
is an optional argument that provides a text-based explanation about the functionality and purpose of the eventparameters
is a named list of parametersderived
is a named list of parameters, that are computed based on parameters
.allocated
a named list of expression that can be used to define persistent private storage for the event.mechanism
is an R expression that defines what the event does. This expression can reference parameters
, derived
parameters, or allocated
objects by nameAs an example, consider a simulator where events are applied daily and we wish to model the aging of individuals in the population. This can be accomplished with the following manipulation event:
<- ManipulationEvent$new(
aging name = "Aging",
description = "Aging of individuals",
parameters = list(
day = 1.0 / 365.0
)mechanism = expression({
== 1L, age := age + day]
status[alive
}) )
In this instance we define a simple parameters day
to describe how much a person ages per iteration in the simulation. We assume the existence of a status variable alive
in the population that describes whether the person is still alive at present. Note that the parameter day
can be simply referenced by name within the mechanism
expression. The population status
is a data.table
so the appropriate syntax is required.
In the case of manipulation events, the mechanism
should not have a return value. In the case of accumulation events however, the return value should be a data.table
with the same columns as status
with each row depicting those individuals that are to be added to the population.
The simulator is defined as an R6 class with the following constructor
Simulator$new(initializer, man_events, acc_events, seeds, keep_init, ...)
The arguments of the constructor are defined as follows:
initializer
is a function that returns the initial population as a data.table
.man_events
is a list of ManipulationEvent
objects.acc_events
is a list of AccumulationEvent
objects.seeds
is a vector of RNG seed numbers.keep_init
is logical value that indicates whether the initial status of the population should be stored....
are the additional arguments passed to initializer
After defining the events and the initial population, the next task is typically to obtain parameter values for the events such that some measure of the synthetic population matches that of an external validation data set as closely as possible. To this end, events with parameters can be configured after initialization. Furthermore, it is possible to run multiple simulations with different parameter configurations while keeping the initial population intact and using the same seed values for the pseudo random number generation each time such that the any difference in output will be purely a result of the varying parameter values. Standard R optimization tools such as optim
can be used.
The Simulator
class method configure
makes it easy to run the simulation with different parameter values while keeping the initial population intact between runs.
Simulator$configure(N, pars, output, seeds = 123, ...)
The arguments for the method are as follows:
N
is the length of the simulation for a single function evaluation in the optimization. The value that should be used is highly dependent on target feature of the calibration, the objective function and the size of population.pars
is a named list corresponding to the event names to be calibrated. Each element should also be a named list contain the names of the specific parameters and their values to be changed.output
is a function that is applied to the synthetic population after the simulation. The first argument should accept the status data.table
.seeds
are the seed values to use for random number generation....
are the additional arguments passed to output
.For the standard optim
function, the objective function to use and the calibration process itself take the following form
<- function(par, validation) {
obj_optim <- list("Event 1" = list("x1_1" = x1_1_value, ...),
pars "Event 2" = list("x2_1" = x2_1_value, ...),
..."Event n" = list("xn_1" = xn_1_value, ...))
<- sim$configure(N, p, output)
out return(objective(out, validation))
}
<- optim(par = par_init, fn = obj_optim, validation = some_data) result
Note that this is not actual usable R code, but simply a pseudo-code representation in R syntax, where the ellipsis simply denotes a list of n
events or p_n
parameters for a specific event. The idea is that the user has to define which event corresponds to which set of parameters and their values. For example, for the event named “Event 1”, we set its parameter x1_1
to have the value x1_1_value
along with any other parameters and values that could be included within ...
. Next, the configure
method is used to compute a summary statistics from the synthetic population. This is then compared to a set of validation data, provided in validation
by the actual objective function, denoted here by objective
. For other functions beside optim
, alternate formulations will likely have to be used. See the vignette("example")
for a practical demonstration of the entire calibration process.
Simulation sequences of varying lengths can be carried out with the Simulator
class method run
.
Simulator$run(N, show_progress = TRUE)
Here, N
is an integer value indicating the length of the simulation. The current status of the population can be accessed at any time via the method get_status
. The argument show_progress
enables a progress bar for the simulation, when not using parallel computation.
The simulator can be configured to take advantage of multiple computing cores via the class method start_cluster
.
Simulator$start_cluster(cl, nc, export, packages, interface = c("doParallel", "doMPI"))
The arguments are:
cl
is a cluster object.nc
denotes the number of workers to use.export
is a character vector of object names in the global environment that should be made available to the workers, such as user-defined functions or additional data.packages
is a character vector of package names that should be loaded on the workers.interface
gives the foreach
backend to use, can be either doParallel
or doMPI
.The population is split into chunks of equal size (if possible) and distributed to the available workers along with other necessary data. The implementation takes advantage of the foreach
package for this purpose and any interface to this looping construct can be used fundamentally, but only doParallel
and doMPI
are currently supported. The following example demonstrates the process of setting up the computation cluster in the case of doParallel
:
nc <- parallel::detectCores()
cl <- parallel::makeCluster(nc)
doParallel::registerDoParallel(cl)
sim$start_cluster(cl, nc, interface = "doParallel")
# Carry out calibration / simulations
sim$stop_cluster()