Transpiler Architecture
- Introduction
- Overview of Transpilation
- Phases
- Additional Details
The SDEverywhere transpiler converts models written in the Vensim modeling language to either C or JavaScript. The transpiler supports the Vensim language features and Vensim library functions that are most commonly used in models, including subscripts.
The transpiler is published in the `@sdeverywhere/compile` package, which is used by the `sde` command line tool (from the `@sdeverywhere/cli` package). The source code can be found in the `packages/compile` directory in this repo.
The `@sdeverywhere/compile` package is written in ECMAScript 2015 (also known as ES6), a modern, standardized version of JavaScript. Much of the code is written in a functional programming style using the Ramda toolkit. (Most other packages in the SDEverywhere repo are written in TypeScript, but the `cli` and `compile` packages are currently written in JavaScript.)
Note that the term "SDEverywhere" generally refers to the collection of libraries and tools that are developed in this repo, but in this document it is used as shorthand for the SDEverywhere transpiler (the `compile` package).
From a high level perspective, the transpiler can be thought of as a black box that takes model files as input, performs some computation, and generates C or JavaScript files as output.
graph TB;
input["Model files"]
compiler("Transpiler")
output["C or JS files"]
input-->compiler
compiler-->output
style input stroke:none,fill:green,color:white
style output stroke:none,fill:royalblue,color:white
The next diagram shows the above sequence in terms of the actual files and high-level function.
We can see that the input files consist of one or more Vensim model (`.mdl`) files, zero or more exogenous data files (e.g., in `.xlsx`, `.csv`, or `.dat` format), and a `spec.json` file that tells the transpiler what input/output variables to include, where to find the data files, and so on.
These input files are fed to the `parseAndGenerate` function, which returns the generated code (the content of a `.c` or `.js` file) as output.
graph TB;
input_mdl["{model}.mdl"]
input_data[".xlsx | .csv | .dat"]
input_spec["spec.json"]
compiler("parseAndGenerate<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
output_c["{model}.c"]
output_js["{model}.js"]
input_mdl-->compiler
input_data-->compiler
input_spec-->compiler
compiler-->output_c
compiler-->output_js
classDef input stroke:none,fill:green,color:white
classDef output stroke:none,fill:royalblue,color:white
class input_mdl,input_data,input_spec input
class output_c,output_js output
The following diagram shows the phases of the transpilation process (i.e., the `parseAndGenerate` function from above) in more detail, including the intermediate files/objects that are passed from one phase to the next.
The remainder of this document will explain each of these phases in more detail.
graph TB;
input_mdl["{model}.mdl"]
preprocessor("preprocessVensimModel<br/><div style='font-size:0.8em'>(@sdeverywhere/parse)</div>")
intermediate_mdl["processed.mdl"]
parser("parseVensimModel<br/><div style='font-size:0.8em'>(@sdeverywhere/parse)</div>")
intermediate_ast["AST"]
reader("readDimensionDefs + readVariables<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
intermediate_model1["Unresolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
analyzer("analyze<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
intermediate_model2["Resolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
generator("generateCode<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
output_c["processed.c"]
input_mdl-->preprocessor
preprocessor-->intermediate_mdl
intermediate_mdl-->parser
parser-->intermediate_ast
intermediate_ast-->reader
reader-->intermediate_model1
intermediate_model1-->analyzer
analyzer-->intermediate_model2
intermediate_model2-->generator
generator-->output_c
classDef input stroke:none,fill:green,color:white
classDef intermediate stroke:none,fill:orangered,color:white
classDef output stroke:none,fill:royalblue,color:white
class input_mdl input
class intermediate_mdl,intermediate_ast,intermediate_model1,intermediate_model2 intermediate
class output_c output
graph LR;
input_mdl["{model}.mdl"]
preprocessor("preprocessVensimModel<br/><div style='font-size:0.8em'>(@sdeverywhere/parse)</div>")
intermediate_mdl["processed.mdl"]
input_mdl-->preprocessor
preprocessor-->intermediate_mdl
classDef input stroke:none,fill:green,color:white
classDef intermediate stroke:none,fill:orangered,color:white
class input_mdl input
class intermediate_mdl intermediate
In the first phase, the Vensim model file(s) are preprocessed to make the definitions easier for the parsing phase to digest.
In the common case where there is a single `.mdl` file, that file is passed through a preprocessor.
The preprocessor removes some things that the parser grammar can't yet handle, such as macros, tabbed arrays, and the graph/sketch/view definitions in the private section of the `.mdl` file.
The preprocessor produces a new `.mdl` file that contains only the relevant dimension definitions and equations needed for the parsing phase.
In the case of a complex model that consists of multiple "submodels" (i.e., multiple `.mdl` files), the `sde flatten` command must be used to preprocess all `.mdl` files and combine duplicate definitions to produce a single `.mdl` file that contains the resolved dimension definitions and equations.
graph LR;
intermediate_mdl["processed.mdl"]
parser("parseVensimModel<br/><div style='font-size:0.8em'>(@sdeverywhere/parse)</div>")
intermediate_ast["AST"]
intermediate_mdl-->parser
parser-->intermediate_ast
classDef intermediate stroke:none,fill:orangered,color:white
class intermediate_mdl,intermediate_ast intermediate
In the second phase, a single preprocessed `.mdl` file is passed to the `parseVensimModel` function (part of the `@sdeverywhere/parse` package), which parses the model definitions and produces an abstract syntax tree (AST).
The AST is an in-memory representation of the model that allows later phases to work with the model definitions in a way that is not strongly tied to the source file format. Though currently we only have support for Vensim models as an input format, we plan to add support for the XMILE format used by Stella. The AST is designed to be file format agnostic, meaning that once a model file is parsed into an AST, the later phases of the transpiler can work with that AST structure without needing separate, special-cased logic for Vensim models and XMILE models.
Internally, the `parseVensimModel` function uses the `antlr4-vensim` package to parse the dimension definitions and equations from the `.mdl` file and produce the AST.
graph LR;
intermediate_ast["AST"]
reader("readDimensionDefs + readVariables<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
intermediate_model1["Unresolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
intermediate_ast-->reader
reader-->intermediate_model1
classDef intermediate stroke:none,fill:orangered,color:white
class intermediate_ast,intermediate_model1 intermediate
The AST from the second phase is the input to the third phase. This phase has two parts.
First, the dimension definitions ("subscript ranges" in Vensim terminology) from the AST are read.
For each dimension definition, a corresponding `Subscript` object is created and managed by the `subscript` module.
For more on subscripts and dimensions, consult the Dimensions and Subscripts section below.
Second, the equations and data variable definitions from the AST are read.
For each equation or data variable, a corresponding `Variable` object is created and managed by the `model` module.
For more on variables and the `Variable` class, consult the Variables section below.
During this phase, the `Variable` objects are not fully resolved.
They contain a reference to the parsed `Equation`, and the left-hand side (i.e., the name of the variable) is determined, but the right-hand side of the equation has not yet been examined.
That will happen in the next phase.
Syntactically, an equation can be one of three things: a variable, a lookup, or a constant list.
The `readVariable` function creates a separate variable for each constant in a constant list.
Subscripts are put into normal form.
When a variable is added to the model, the `Model` object checks to see if there is an index subscript on the LHS.
If so, the variable is a non-apply-to-all array, and is added to the `nonAtoANames` list indexed by the variable name, with a value of an array of flags for each subscript in normal order, indicating whether the subscript is an index or not.
A subscripted constant variable can be defined with all of the constants in a list on the RHS.
This notation is handled as a top-level alternative for the RHS in the grammar.
When `readVariables` finds a constant list, it creates new `Variable` instances, one for each index in the constant list.
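The constant list expansion described above can be sketched roughly as follows. This is only an illustration: `expandConstantList` and the record shape it returns are hypothetical and do not reflect the real `Variable` class or the actual `readVariables` implementation.

```javascript
// Hypothetical helper: expands `x[DimA] = 1, 2, 3` into one record per index.
// The record shape is illustrative and is not the real Variable class.
function expandConstantList(varName, indices, constants) {
  if (indices.length !== constants.length) {
    throw new Error(`${varName}: expected ${indices.length} constants, got ${constants.length}`)
  }
  return indices.map((index, i) => ({
    varName,
    subscripts: [index],
    modelFormula: String(constants[i])
  }))
}

const expanded = expandConstantList('_x', ['_a1', '_a2', '_a3'], [1, 2, 3])
const expandedLines = expanded.map(v => `${v.varName}[${v.subscripts[0]}] = ${v.modelFormula}`)
console.log(expandedLines)
// [ '_x[_a1] = 1', '_x[_a2] = 2', '_x[_a3] = 3' ]
```

Each resulting record has an index subscript on its LHS, which is what makes the variable a non-apply-to-all array in the terms used above.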
graph LR;
intermediate_model1["Unresolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
analyzer("analyze<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
intermediate_model2["Resolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
intermediate_model1-->analyzer
analyzer-->intermediate_model2
classDef intermediate stroke:none,fill:orangered,color:white
class intermediate_model1,intermediate_model2 intermediate
In the fourth phase, the `Variable` objects created during the third phase are analyzed, and the right-hand side of each equation is examined to determine which variables and functions it references.
During this phase, the variable type (`varType`) for each `Variable` instance is determined.
Function call arguments are validated to make sure the number and format of the arguments match what is expected by the functions (as specified by the Vensim documentation, in the case of Vensim models).
For complex function calls, this phase may store additional metadata in the `Variable` object that prepares that variable for the code generation phase.
This phase will throw an error if it encounters inconsistencies, such as variable dependencies that cannot be resolved (unknown variable references) or unresolved data variables (for which the data cannot be found in the associated external data files).
At the completion of this phase, if everything is valid, the `Variable` objects are considered fully resolved and ready to be passed to the code generation phase.
When `readEquation` finds lookup syntax on the RHS, it creates a lookup variable by setting the points, optional range, and variable type in the `Variable`.
If a variable has no references, the variable type is set to `const`.
If a function name such as `_INTEG` is encountered, the variable type is set to `level`.
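As a rough sketch of the classification rules just described, the function below assigns a type from a few summary facts about an equation. The `classifyVarType` name, the input shape, the order of the checks, and the `'lookup'` and `'aux'` strings are all assumptions made for illustration; only `const` and `level` are named in the text above.

```javascript
// Hypothetical summary of an analyzed equation:
//   points        - lookup points, if the RHS used lookup syntax
//   functionNames - canonical names of the functions called on the RHS
//   references    - refIds of the variables referenced on the RHS
function classifyVarType({ points = null, functionNames = [], references = [] }) {
  if (points !== null) return 'lookup' // assumed type string
  if (functionNames.includes('_INTEG')) return 'level'
  if (references.length === 0) return 'const'
  return 'aux' // assumed default for everything else
}

console.log(classifyVarType({ references: [] })) // const
console.log(classifyVarType({ functionNames: ['_INTEG'], references: ['_births'] })) // level
console.log(classifyVarType({ references: ['_x'] })) // aux
```

The level check is placed before the no-references check here so that a level whose arguments are all literals is still treated as a level; whether the real analyzer orders the checks this way is not stated in this document.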
If the variable is non-apply-to-all, and it has a dimension subscript on the RHS in the same position as an index subscript on the LHS, then the equation references each element of the non-apply-to-all variable separately, one for each index in the dimension.
The `readEquation` function constructs a `refId` for each of the expanded variables and adds it to the `expandedRefIds` list.
The references are added later in the `addReferencesToList` function.
After the first part of this phase is complete, the `Variable` objects form a dependency tree.
The `spec.json` file is consulted to see which input and output variables are specified for inclusion in the generated model code.
The `removeUnusedVariables` function walks the dependency tree (consulting the `references` and `initReferences` properties of each `Variable` object).
Only the variables specified in the `outputVarNames` array from the `spec.json` file and their dependencies are retained in the generated model.
After the `removeUnusedVariables` function completes, the `variables` array in the `model` module will contain only the retained `Variable` objects; the rest are discarded.
To help illustrate this, consider the following model (defined in pseudo-Vensim code):
w = 5 ~~|
x = 4 ~~|
y = x + 3 ~~|
z = 1 ~~|
u = y * 2 ~~|
Suppose the `spec.json` file has:
{
"outputVarNames": ["u", "z"]
}
In this case, the generated model will include:
- `u` (because it is a declared "output")
- `y` (because it is referenced by `u`)
- `x` (because it is referenced by `y`)
- `z` (because it is a declared "output")
The generated model will not include `w` because it is not referenced in the dependency tree.
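The pruning in this example can be reproduced with a small reachability walk. The sketch below is not the actual `removeUnusedVariables` code: `retainReferencedVariables` is a hypothetical name, and the plain objects stand in for `Variable` instances, with `references` and `initReferences` holding plain variable names rather than full refIds.

```javascript
// Hypothetical, simplified stand-ins for the Variable objects of the example model.
const exampleVars = [
  { varName: 'w', references: [], initReferences: [] },
  { varName: 'x', references: [], initReferences: [] },
  { varName: 'y', references: ['x'], initReferences: [] },
  { varName: 'z', references: [], initReferences: [] },
  { varName: 'u', references: ['y'], initReferences: [] }
]

// Return only the variables reachable from the declared outputs.
function retainReferencedVariables(allVars, outputVarNames) {
  const byName = new Map(allVars.map(v => [v.varName, v]))
  const retained = new Set()
  const pending = [...outputVarNames]
  while (pending.length > 0) {
    const name = pending.pop()
    if (retained.has(name) || !byName.has(name)) continue
    retained.add(name)
    const v = byName.get(name)
    pending.push(...v.references, ...v.initReferences)
  }
  // Preserve the original definition order of the retained variables.
  return allVars.filter(v => retained.has(v.varName))
}

const keptNames = retainReferencedVariables(exampleVars, ['u', 'z']).map(v => v.varName)
console.log(keptNames) // [ 'x', 'y', 'z', 'u' ] (w is dropped)
```

Following `initReferences` as well as `references` matters for levels, whose initial value expression may be the only place a variable is used.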
graph LR;
intermediate_model2["Resolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
generator("generateCode<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
output_c["processed.c"]
intermediate_model2-->generator
generator-->output_c
classDef intermediate stroke:none,fill:orangered,color:white
classDef output stroke:none,fill:royalblue,color:white
class intermediate_model2 intermediate
class output_c output
In the last phase, the analyzed equations are used to generate C or JavaScript code that can be used to run the model.
The generated code is divided into distinct sections and functions:
- variable declaration
- lookup/data initialization (`initLookups`)
- constant initialization (`initConstants`)
- level variable initialization (`initLevels`)
- level variable evaluation (`evalLevels`)
- aux variable evaluation (`evalAux`)
- input variable handling (`setInputsFromBuffer`)
- output variable handling (`storeOutputData`)
For each section, the `generateCode` function calls `generateEquation` for each `Variable` instance to generate lines of code for that section and variable:
- There will be one C variable declaration (either a `double` or `Lookup*`) generated for each `Variable` instance.
- For each data variable or lookup, there will be code generated that initializes the `Lookup` data structure corresponding to that variable.
- For each constant value, there will be code generated that initializes the constant value once at the start of the model run.
- For each level variable, there will be code generated that initializes the level at the start of the model run and separate code that evaluates the level variable at each time step.
- For each aux variable, there will be code generated that evaluates the variable at each time step.
- The `setInputsFromBuffer` function will contain one line for each input variable declared in the `spec.json` file (in the `inputVarNames` array).
- The `storeOutputData` function will contain one line for each output variable declared in the `spec.json` file (in the `outputVarNames` array).
The code generator gets lists of variables for each section of the program and calls the `generateEquation` function to generate code for each variable.
The `Model` object supplies the variable lists, relying on the following internal functions:
- `varsOfType` returns `Variable` instances for a given `varType`.
- `sortVarsOfType` returns aux or level `Variable` instances sorted in dependency order using eval time references.
- `sortInitVars` does the same using init time references. The other difference is that aux and level vars are evaluated separately at eval time, while a mixture of level vars and the aux vars they depend on are evaluated at init time.
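Sorting in dependency order amounts to a topological sort over the reference lists. The following is a generic depth-first sketch of that idea, not the algorithm used by `sortVarsOfType`; the `sortInDependencyOrder` name and the plain objects with `refId` and `references` fields are assumptions for illustration.

```javascript
// Returns the variables ordered so that each one appears after the variables it references.
function sortInDependencyOrder(vars) {
  const byId = new Map(vars.map(v => [v.refId, v]))
  const state = new Map() // refId -> 'visiting' | 'done'
  const sorted = []

  function visit(v) {
    const s = state.get(v.refId)
    if (s === 'done') return
    if (s === 'visiting') throw new Error(`circular dependency at ${v.refId}`)
    state.set(v.refId, 'visiting')
    for (const ref of v.references) {
      const dep = byId.get(ref)
      // References to variables outside this list (e.g., constants) are skipped.
      if (dep) visit(dep)
    }
    state.set(v.refId, 'done')
    sorted.push(v)
  }

  vars.forEach(v => visit(v))
  return sorted
}

const auxVars = [
  { refId: '_c', references: ['_b'] },
  { refId: '_b', references: ['_a', '_k'] }, // _k is assumed to be a constant
  { refId: '_a', references: [] }
]
const sortedIds = sortInDependencyOrder(auxVars).map(v => v.refId)
console.log(sortedIds) // [ '_a', '_b', '_c' ]
```

The same routine could be run over init time references instead of eval time references to get an ordering for the init section.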
The `generateEquation` function maintains a context object (see `GenExprContext` type) that has a number of properties that hold intermediate results as the AST is visited.
Code is generated differently in the init section of the program.
This is controlled by the `mode` flag in the `GenExprContext`.
Array functions such as `SUM` require the creation of a temporary variable and a loop.
These intermediate variables are tracked in the `GenExprContext`.
Subscripted variables are also evaluated in a loop.
The subscript loop opening and closing code are tracked in the `GenExprContext`, as is the array function code.
Array functions mark one dimension that the function operates over.
The dimension is marked by a `!` character at the end of the dimension name.
If this is detected, the `!` is removed and the name of the marked dimension is saved in the `markedDimIds` array in the `GenExprContext`.
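The handling of the `!` marker can be illustrated with a short helper. This is a sketch only: `extractMarkedDims` is a hypothetical function, and the returned object merely imitates the `markedDimIds` bookkeeping that the real `GenExprContext` performs.

```javascript
// Strip the trailing `!` from marked dimension names and record which ones were marked.
function extractMarkedDims(subscriptNames) {
  const markedDimIds = []
  const subscripts = subscriptNames.map(name => {
    if (name.endsWith('!')) {
      const dimId = name.slice(0, -1)
      markedDimIds.push(dimId)
      return dimId
    }
    return name
  })
  return { subscripts, markedDimIds }
}

// For an expression like SUM(a[DimA, DimB!]), in canonical names:
const marked = extractMarkedDims(['_dima', '_dimb!'])
console.log(marked.subscripts) // [ '_dima', '_dimb' ]
console.log(marked.markedDimIds) // [ '_dimb' ]
```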
SDEverywhere uses XMILE terminology in most cases. A Vensim subscript range becomes a "dimension" that has "indices". (The XMILE specification has "element" as the child of "dimension" in the model XML format, but uses "index" informally, so SDEverywhere sticks with "index".) XMILE does not include the notion of subranges. SDEverywhere calls subranges "subdimensions".
Vensim refers to variables and equations interchangeably. This usually makes sense, since most variables are defined by a single equation. In SDEverywhere, models define variables with equations. However, a subscripted variable may be defined by multiple equations. In XMILE terminology, an apply-to-all array has an equation that defines all indices of the variable. There is just one array variable. A non-apply-to-all array is defined by different equations for each index. This means there are multiple variables, one for each index.
The `Variable` class is the heart of SDEverywhere.
An equation has a left-hand side (LHS), usually the variable name, and a right-hand side (RHS), usually a formula expression that is evaluated to determine the variable's value.
The RHS could also be a Vensim lookup (a set of data points) or a constant array.
For more detail, consult the Variables section below.
- A Vensim "subscript range definition" defines an SDEverywhere dimension.
- Subscript range definitions give a list of subscripts that can include dimensions, indices, or both.
- A dimension can map to multiple dimensions listed in the mapping value.
- In a subscript range definition with a mapping, the map-from dimension is on the left, and the map-to dimensions are on the right after the `->` marker.
- In an equation, the map-to dimension is on the LHS, and a map-from dimension is on the RHS.
- Dimensions cannot be defined in a mapping.
- A subscript is not an index if it is defined as a dimension.
- An index in a map-from dimension can be mapped to a dimension with multiple indices in the map-to dimension.
- When a map-to dimension lists subscripts, it has the same semantics as a regular subscript range definition.
- The reasons to list subscripts in a map-to dimension are to map subscripts in a different order than the dimension's definition, or to map an index in the map-from dimension to more than one index in the map-to dimension.
- dimension: subscripts
- dimension: subscripts -> dimensions
- dimension: subscripts -> (dimension: map-to subscripts)
The dimensions given to the right of the `->` marker are the "mapping value" of the mapping.
Here is a mapped dimension with three subscripts in the map-from and map-to dimensions.
DimA: R1, R2, R3 -> (EFGroups: Group1, Group2, Group3)
But `EFGroups` has four subscripts!
EFGroups: DimF, E1, E2, E3
The map-to dimension does not really have three subscripts. The mapping must list three subscripts to match the number of indices in the map-from dimension. But the total number of indices in the map-to dimension can be greater than in the map-from dimension.
These dimensions are simple lists of indices.
DimE: E1, E2, E3
DimF: F1, F2, F3
DimR: R1, R2, R3
If we expand the `DimF` dimension in the `EFGroups` map-to dimension, we see that `EFGroups` has a total of six indices.
EFGroups: F1, F2, F3, E1, E2, E3
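Expanding a dimension into its indices is a simple recursive substitution. The sketch below assumes a hypothetical `expandIndices` helper and a `Map` of definitions keyed by dimension name; any subscript that is not itself a key is treated as an index.

```javascript
// Dimension definitions transcribed from the example above.
const dimDefs = new Map([
  ['DimE', ['E1', 'E2', 'E3']],
  ['DimF', ['F1', 'F2', 'F3']],
  ['EFGroups', ['DimF', 'E1', 'E2', 'E3']]
])

// Replace each subscript that names a dimension with that dimension's indices.
function expandIndices(dimName, defs) {
  return defs.get(dimName).flatMap(sub => (defs.has(sub) ? expandIndices(sub, defs) : [sub]))
}

const efIndices = expandIndices('EFGroups', dimDefs)
console.log(efIndices) // [ 'F1', 'F2', 'F3', 'E1', 'E2', 'E3' ]
```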
Therefore, the mapping in `DimA` maps the three indices in `DimA` to the six indices in `EFGroups` in a different order than they occur in the definition of `EFGroups`, through other dimensions `Group1`, `Group2`, and `Group3`.
DimA: R1, R2, R3 -> (EFGroups: Group1, Group2, Group3)
Group1: F1, E1
Group2: F2, E2
Group3: F3, E3
R1 → Group1 → F1, E1
R2 → Group2 → F2, E2
R3 → Group3 → F3, E3
What this mapping accomplishes is to group the subscripts in `EFGroups` in a different way when it occurs in an equation with `DimA`. For instance:
x[EFGroups] = a[DimA] * 10
Notice that in an equation, the map-to dimension is on the LHS and the map-from dimension is on the RHS, the opposite of how they occur in the subscript range definition. This subscripted equation is evaluated as follows when expanded over its indices by SDEverywhere:
x[F1] = a[R1] * 10
x[E1] = a[R1] * 10
x[F2] = a[R2] * 10
x[E2] = a[R2] * 10
x[F3] = a[R3] * 10
x[E3] = a[R3] * 10
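The expansion above can be reproduced by pairing each map-from index with the indices of the group in the same position. This sketch hard-codes the example's definitions; `expandMappedEquation` and its parameters are hypothetical and are not part of the SDEverywhere API.

```javascript
// Definitions transcribed from the mapping example.
const groupDefs = {
  Group1: ['F1', 'E1'],
  Group2: ['F2', 'E2'],
  Group3: ['F3', 'E3']
}
const dimAIndices = ['R1', 'R2', 'R3']
const mapToSubscripts = ['Group1', 'Group2', 'Group3'] // matched to DimA by position

// Emit one expanded equation per map-to index, using the map-from index on the RHS.
function expandMappedEquation(fromIndices, toSubscripts, groups) {
  const lines = []
  fromIndices.forEach((fromIndex, i) => {
    for (const toIndex of groups[toSubscripts[i]]) {
      lines.push(`x[${toIndex}] = a[${fromIndex}] * 10`)
    }
  })
  return lines
}

const mappedLines = expandMappedEquation(dimAIndices, mapToSubscripts, groupDefs)
console.log(mappedLines.join('\n'))
```

The printed lines match the six expanded equations listed above, in the same order.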
The `Variable` class is defined in the `variable` module and contains the parsed `Equation` along with other metadata that was determined during the "read" and "analyze" phases.
The `parsedEqn` property holds a reference to the parsed `Equation` from the AST.
This enables the code generator to walk the subtree for the variable.
In the `Variable` object, the `modelLHS` and `modelFormula` properties preserve the Vensim variable name (left-hand side of the equation, aka LHS) and the Vensim formula (right-hand side, aka RHS).
Everywhere else, names of variables are in a canonical format compatible with the C programming language.
The Vensim name is converted to lower case (it is case insensitive), spaces are replaced with underscores, and an underscore is prepended to the name.
Vensim function names are similar, but are upper-cased instead.
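A minimal sketch of these naming rules is shown below. The two function names are hypothetical, and the sketch covers only the rules stated above (case, spaces, leading underscore); handling of other characters that are not valid in C identifiers is not shown.

```javascript
// Canonical variable name: lower case, spaces to underscores, leading underscore.
function toCanonicalVarName(vensimName) {
  return '_' + vensimName.trim().toLowerCase().replace(/ /g, '_')
}

// Canonical function name: same rules, but upper case.
function toCanonicalFunctionName(vensimName) {
  return '_' + vensimName.trim().toUpperCase().replace(/ /g, '_')
}

console.log(toCanonicalVarName('Total Population')) // _total_population
console.log(toCanonicalFunctionName('integ')) // _INTEG
```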
The unsubscripted form of the Vensim variable name, in canonical format, is saved in the `varName` property.
If there are subscripts in the LHS, the maximal canonical dimension names in sorted "normal" order establish subscript families by position in the `families` property.
The subscripts are saved as canonical dimension or index names in the LHS in normal order in the `subscripts` property.
Lookup variables do not have a formula.
Instead, they have a list of 2D points and an optional range.
These are saved in the `points` and `range` properties.
Each variable has a `refId` property that gives the variable's LHS in a normal form that can be used in lists of references.
The `refId` is the same as the `varName` for unsubscripted variables.
A subscripted variable can include both dimension and index subscripts on the LHS.
When another variable refers to the subscripted variable, we add its `refId` to the list of references.
The normal form for a `refId` has the canonical name of each dimension or index sorted by their subscript families, separated by commas in a single pair of brackets, for example: `_a[_dima,_dimb]`.
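The normal form can be sketched as follows. The `makeRefId` helper and its input shape are hypothetical, each subscript is assumed to already carry the canonical name of its family, and sorting family names alphabetically is an assumption about what "normal order" means here.

```javascript
// Build a refId from a canonical variable name and its LHS subscripts.
// Each subscript is { name, family }, both in canonical form.
function makeRefId(varName, subscripts) {
  if (subscripts.length === 0) return varName
  const sorted = [...subscripts].sort((a, b) => {
    if (a.family < b.family) return -1
    if (a.family > b.family) return 1
    return 0
  })
  return `${varName}[${sorted.map(s => s.name).join(',')}]`
}

const applyToAllRefId = makeRefId('_a', [
  { name: '_dimb', family: '_dimb' },
  { name: '_dima', family: '_dima' }
])
const indexedRefId = makeRefId('_a', [
  { name: '_b1', family: '_dimb' },
  { name: '_dima', family: '_dima' }
])
console.log(applyToAllRefId) // _a[_dima,_dimb]
console.log(indexedRefId) // _a[_dima,_b1]
console.log(makeRefId('_x', [])) // _x
```

Sorting by family rather than by subscript name keeps an index such as `_b1` in the position of its dimension, so refIds for the same variable line up regardless of the order in which the subscripts were written.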
The `references` array property lists the refIds of variables that this variable's formula references.
This determines the dependency order and thus evaluation order during code generation.
Some Vensim functions such as `_INTEG` have a special initialization argument that is evaluated before the normal run loop.
The references in the expression for this argument are stored in the `initReferences` property and do not appear in `references` unless they occur elsewhere in the formula.
The `varType` property holds the variable type, which determines where the variable is evaluated in the sim’s run loop.
The Vensim variable types that SDEverywhere supports are:
- constant
- auxiliary
- level
- lookup
- initial
- data
Lookups may occur as function arguments as well as variables in their own right.
When this happens, the code generator generates an internal lookup variable to hold the lookup's points.
The name of the generated variable is saved in the `lookupArgVarName` property.
It replaces the lookup as the function argument when code is generated.
`SMOOTH*` calls are replaced by a generated level variable named in `smoothVarName`.
`DELAY3*` calls are replaced by a level named in `delayVarName` and an aux variable named in `delayTimeVarName`.
Each section of a complete model program in C is written in sequence. The decl section declares C variables, including arrays of the proper size. The init section initializes constant variables and evaluates levels and the auxiliary variables necessary to evaluate them. The eval section is the main run loop. It evaluates aux variables and then outputs the state. The time is advanced to the next time step. Levels are evaluated next, and then the loop is finished. The input/output section has the code that sends output variable values to the output channel and optionally sets input values when the program starts.
graph TB;
A["Declare variables"]
B["Initialize constants"]
C["Set input variable values"]
D["Initialize levels"]
E["Evaluate aux variables<br/><div style='color:red;font-size:0.8em'>time = <em>t</em>"]
F["Capture output variable values"]
G["Advance the time<br/><div style='color:red;font-size:0.8em'>time = <em>t</em> + time step"]
H["Evaluate level variables"]
A-->B
B-->C
C-->D
D-->E
E-->F
H-->E
F-->G
G-->H
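The sequence in the diagram above can be sketched as a small hand-written run loop. This is not generated output: the toy model (one level, one aux, one constant), the `runModel` function, and the use of Euler integration for the level update are all assumptions chosen to make the ordering of the steps concrete.

```javascript
// Toy model (hypothetical):
//   growth rate = 0.5
//   births      = population * growth rate
//   population  = INTEG(births, 100)
function runModel({ initialTime, finalTime, timeStep }) {
  // Declare variables
  let time, growthRate, population, births
  const outputs = []

  // Initialize constants (input values would be applied after this step)
  growthRate = 0.5

  // Initialize levels
  time = initialTime
  population = 100

  while (time <= finalTime) {
    // Evaluate aux variables at time = t
    births = population * growthRate
    // Capture output variable values
    outputs.push({ time, population })
    // Advance the time to t + time step
    time += timeStep
    // Evaluate level variables (Euler integration is assumed here)
    population = population + births * timeStep
  }
  return outputs
}

const runOutputs = runModel({ initialTime: 0, finalTime: 2, timeStep: 1 })
console.log(runOutputs.map(o => o.population)) // [ 100, 150, 225 ]
```

Because outputs are captured before the levels are updated, the value recorded for each time step reflects the state at that time rather than the state after integration.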