
Support different random variable distribution types within Synthea

Andy Gregorowicz edited this page Aug 28, 2020 · 6 revisions

Synthea Random Variable Distribution Types

Synthea Generic Module Framework states allow users to generate clinical records by using random variables to select values for various aspects of the simulation. Currently, the random variables in GMF states are uniformly distributed. Observations of many natural processes will follow different distributions, such as Gaussian or Poisson distributions. This page discusses approaches for supporting random variables with different distribution types in GMF states.

Considerations

A few things to keep in mind when evaluating approaches for supporting different distribution types:

  • Existing GMF JSON - Synthea already has a substantial number of modules written. Any breaking change will require fixing the existing modules. Depending on the type of change, it may be possible to automate this process.
  • Naming conventions - Most, but not all, random variables in Synthea have their characteristics defined in a property called range. This makes sense for a uniform distribution, but not for other distribution types.
  • Existing simulation code - Each subclass of State that generates a random observation handles it differently.
  • Units - Some places co-locate the unit with the variable parameters, others have it in a separate place
  • Existing module builder code - Any changes made in GMF will need to be reflected in the module builder web application

Existing GMF States with Random Variables

State        | Random Variable Property | Units co-located | Notes
Delay        | range                    | yes              | also allows for exact values
Observation  | range                    | no               | also allows for exact values
Procedure    | duration                 | yes              |
SetAttribute | range                    | no               |
Symptom      | range                    | no               | also allows for exact values
VitalSign    | range                    | no               | also allows for exact values

Potential Solutions

Move units out of ranges/durations

On the Java side, a random variable that has units on the same level is loaded into a RangeWithUnits class; otherwise it is loaded into a Range class. Because new distribution types will have different parameters, each is likely to get its own Java class. If the units stay with the distribution properties, each distribution will probably need two classes, for example Gaussian and GaussianWithUnits. Moving the units out avoids this. The downside is that the move breaks existing GMF JSON, but it would be straightforward to write a script that translates existing modules.

  • Pros
    • Cleaner implementation of new distribution types
    • More uniform state definitions
  • Cons
    • Breaks existing GMF JSON format
    • Would require translation of existing modules
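The translation step mentioned above could look roughly like the following sketch. It assumes a module state has already been parsed into nested maps, that the unit lives under a key named "unit" inside range or duration, and that the unit should land at the state level under the same key. The class name UnitHoister and both of those key choices are hypothetical, not part of Synthea.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: moves "unit" out of a state's "range" or "duration" object. */
class UnitHoister {

  /**
   * Returns a copy of the state in which any "unit" nested inside the named
   * property has been moved up to the state level. The input is not modified.
   */
  static Map<String, Object> hoistUnit(Map<String, Object> state, String property) {
    Map<String, Object> result = new LinkedHashMap<>(state);
    Object value = state.get(property);
    if (!(value instanceof Map)) {
      return result; // exact value or missing property: nothing to move
    }
    // Copy the nested object so the caller's data stays untouched.
    Map<String, Object> inner = new LinkedHashMap<>();
    for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
      inner.put(String.valueOf(entry.getKey()), entry.getValue());
    }
    Object unit = inner.remove("unit");
    result.put(property, inner);
    if (unit != null) {
      result.put("unit", unit); // assumed target location for the unit
    }
    return result;
  }
}
```

Applied to a Delay state whose range holds low, high and unit, this would leave low and high inside range and place unit next to it, so a single class per distribution could be loaded regardless of units.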

How to detect distribution type

Add a type property

Add an extra property called type to the description of the distribution. Because existing random variables are uniformly distributed, all existing values would have to be updated to include a "type": "uniform".

  • Pros
    • Easy for the Gson related code to figure out what is going on
    • Different distribution types could have the same parameter names
  • Cons
    • Breaks existing GMF JSON format
    • Would require translation of existing modules
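As a rough illustration of why an explicit type makes loading simple, the sketch below dispatches on the type value and samples accordingly. It works on an already-parsed map rather than Gson types, and the type names "uniform" and "gaussian", the parameter names mean and standardDeviation, and the class name TypedDistribution are all assumptions made for this example.

```java
import java.util.Map;
import java.util.Random;

/** Hypothetical sampler that dispatches on an explicit "type" property. */
class TypedDistribution {

  /** Draws one value from the distribution described by spec. */
  static double sample(Map<String, Object> spec, Random random) {
    Object type = spec.get("type");
    if (!(type instanceof String)) {
      throw new IllegalArgumentException("distribution is missing a \"type\"");
    }
    switch ((String) type) {
      case "uniform": {
        double low = number(spec, "low");
        double high = number(spec, "high");
        return low + (high - low) * random.nextDouble();
      }
      case "gaussian": {
        double mean = number(spec, "mean");
        double sd = number(spec, "standardDeviation");
        return mean + sd * random.nextGaussian();
      }
      default:
        throw new IllegalArgumentException("unknown distribution type: " + type);
    }
  }

  /** Reads a required numeric parameter from the description. */
  private static double number(Map<String, Object> spec, String key) {
    Object value = spec.get(key);
    if (!(value instanceof Number)) {
      throw new IllegalArgumentException("missing numeric parameter: " + key);
    }
    return ((Number) value).doubleValue();
  }
}
```

Since the type is read first, two distributions could reuse a parameter name without any ambiguity, which is the second advantage listed above.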

Auto-detect distribution type based on property names

Look at the properties specified for the distribution and select a distribution type based on the property names. For example, if a distribution description in JSON has a low and a high property, use a uniform distribution.

  • Pros
    • Backwards compatible
  • Cons
    • Requires distributions to specify their properties in a way that does not conflict
    • Potential for user confusion if a distribution is picked that they didn't expect
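A minimal sketch of this kind of detection is shown below. The low/high pair comes from the existing range format; the Gaussian property names mean and standardDeviation, the returned labels, and the class name DistributionDetector are assumptions for illustration only. Rejecting descriptions that match more than one distribution is one possible way to reduce the confusion noted above.

```java
import java.util.Set;

/** Hypothetical detector that infers a distribution type from property names. */
class DistributionDetector {

  /** Returns an assumed type label, or throws if the names are ambiguous or unknown. */
  static String detect(Set<String> propertyNames) {
    boolean looksUniform =
        propertyNames.contains("low") && propertyNames.contains("high");
    boolean looksGaussian =
        propertyNames.contains("mean") && propertyNames.contains("standardDeviation");

    if (looksUniform && looksGaussian) {
      // Conflicting property sets: refuse to guess.
      throw new IllegalArgumentException("ambiguous distribution: " + propertyNames);
    }
    if (looksUniform) {
      return "uniform";
    }
    if (looksGaussian) {
      return "gaussian";
    }
    throw new IllegalArgumentException("unrecognized distribution: " + propertyNames);
  }
}
```

Extra properties such as unit are ignored here, so existing range objects with or without units would still be detected as uniform.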

Dealing with range and duration

Existing states have their random variable properties in either range or duration. The name range works well for a uniform distribution, but it is an awkward parent property for a Gaussian distribution described by a mean and standard deviation.

Live with the awkwardness

Keep the existing property names.

  • Pros
    • Backwards compatible
  • Cons
    • Feels gross: the property names would not describe non-uniform distributions

Rename the properties

Change range and duration properties to something like distribution.

  • Pros
    • GMF representation would make more sense
  • Cons
    • Breaks existing GMF JSON format
    • Would require translation of existing modules

Deprecate range and duration, provide new distribution property

Leave the existing range and duration properties alone, but note that they are deprecated and will be removed in a future release. Create a new property that supports all of the desired distributions.

  • Pros
    • Backwards compatible
  • Cons
    • Implementation becomes more of a mess, since both the old and new properties must be supported until the old ones are removed
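To give a sense of the extra branching this option implies, the sketch below prefers the new property and falls back to the deprecated ones with a warning. The property name distribution follows the suggestion above; the class name DistributionResolver, the warning text, and the choice to return the raw parsed value are assumptions for this example.

```java
import java.util.Map;

/** Hypothetical lookup that prefers "distribution" over deprecated properties. */
class DistributionResolver {

  /** Returns the random variable description for a state, or null if none is present. */
  static Object resolve(Map<String, Object> state) {
    if (state.containsKey("distribution")) {
      return state.get("distribution");
    }
    // Fall back to the deprecated property names, warning the module author.
    for (String legacy : new String[] {"range", "duration"}) {
      if (state.containsKey(legacy)) {
        System.err.println(
            "Warning: \"" + legacy + "\" is deprecated; use \"distribution\" instead");
        return state.get(legacy);
      }
    }
    return null;
  }
}
```

Every state that reads a random variable would need a path like this until the deprecated properties are removed.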