Formulas serve as a convenience syntax to describe computations that are supposed to be done on a DataFrame. They come in three forms:
- formulas that compute a new Column (map operations)
- formulas that reduce one or more columns to a single value (reduce operations)
- simple constants or pure (arithmetic) operations, i.e. things that can be evaluated without the context of a DataFrame
This means we can write prototypes for these three operations as:
- map operations require one or more datatypes to read and one result datatype. In theory there may be multiple read datatypes, with the caveat that all "read" datatypes must be convertible to the "result" datatype. In addition they of course require some operation op to be performed using all "read" tensors t_i. Finally, a name outname is needed for the new (or overwritten) column in the dataframe.
proc(df: DataFrame): Column =
  let
    t1 = df["col1", readType1]
    ...
    tn = df["colN", readTypeN]
  var res = newTensor[resultType](df.len)
  for idx in 0 ..< df.len:
    res[idx] = op(t1, ..., tN)
  result = toColumn res
- reduce operations are a bit simpler. The type that will be returned is always going to be a Value. The "read" datatypes are required for all tensors t_i, same as for map operations. An operation has to be supplied that acts on all tensors and reduces them to a single scalar value. (NOTE: we could extend it to return a specific type, same as we deduce the return type of res in the above case. But in that case the signature of the proc changes, which means we'd need to allow generic FormulaNodes. That brings the problem of taking varargs[FormulaNode], because that cannot work anymore. So it's a bit messy and the overhead of storing a number in a Value should be negligible.)
proc(df: DataFrame): Value =
  let
    t1 = df["col1", readType1]
    ...
    tn = df["colN", readTypeN]
  let res = op(t1, ..., tN) # `op` reduces all tensors to a single scalar
  result = %~ res
- pure operations are self explanatory; it's just Nim code that can be evaluated without additional input. The only complication is how to detect that an expression really is just that.
In the current implementation of the formula macro the types of all input tensors t_i have to be the same. As mentioned above this is an artificial requirement in principle, as long as each type can be converted to the output type (or in case of boolean return types to the comparison type).
Before jumping into the macro craze again to make changes to the macro, we will now:
- design all types of operations that are expected to work given our DSL
- write helper procedures to allow constructing formulas manually without the use of the DSL (which is not really supported right now, for no good reason)
- extend the spec to allow multiline arguments and possibly a smarter looping using loop fusion (forEach if no string / Value types used, else regular for)
The actual FormulaNode type needs to store some additional information about the kind of formula, its name and value / closure.
type
  FormulaKind* = enum
    fkVariable, fkAssign, fkVector, fkScalar
  FormulaNode* = object
    name*: string # stringification of whole formula. Only for printing and
                  # debugging
    case kind*: FormulaKind
    of fkVariable:
      # just some constant value. Result of a simple computation as a `Value`
      # This is mainly used to rename columns / provide a constant value
      val*: Value
    of fkAssign:
      lhs*: string # can this be something else?
      rhs*: Value
    of fkVector:
      colName*: string
      resType*: ColKind
      fnV*: proc(df: DataFrame): Column
    of fkScalar:
      valName*: string
      valKind*: ValueKind
      fnS*: proc(c: DataFrame): Value
Differentiating between "pure" and "impure" formulas:
By having a test on "pureness" of the formula, that means checking for references to columns of a DataFrame or usage of any of the following symbols / expressions:

~  <<  <-  <dtype> -> <dtype2>:  `someCol`  c"someCol"  df["foo"]  df["foo"][idx]  idx

If none of the above is present, we can dispatch to the formula template above and create a FormulaNode of kind fkVariable with the value being the result of the expression as a Value.
Essentially pure formulas are nothing but a template.
f{1}
f{1.0}
f{"hello"}
f{1 + 2}
block:
  let x = 2
  f{1 + x}
block:
  proc foo(): int = 2
  f{foo()}
  f{1 + foo()}
  proc bar(x: int): int = 1 + x
  f{bar(2)}
  f{3 + bar(3)}
block:
  # control flow
  let x = 2
  f{if x > 1:
      5
    else:
      10}
Possible implementation:
import macros
template formula(body: untyped): untyped =
  body
macro `{}`(arg, stmts: untyped): untyped =
  if arg.kind == nnkIdent and arg.strVal == "f":
    result = formula(stmts)
  else:
    error("Invalid formula!")
echo f{1 + 2}
let x = 2
echo f{if x > 1:
         5
       else:
         10}
This works fine. Invalid code will just generate a regular Nim compilation error.
This means something like the following:
import macros, ggplotnim
proc initVariable[T](x: T): FormulaNode =
  result = FormulaNode(name: "foo", kind: fkVariable,
                       val: %~ x)
proc arePureFormula(n: NimNode): bool = true # a stub
macro foo(arg, stmts: untyped): untyped =
  if arg.kind == nnkIdent and arg.strVal == "f":
    if stmts.arePureFormula():
      result = quote do:
        initVariable(`stmts`)
  else:
    error("Invalid formula!")
let x = foo(f, 1 + 2)
echo x.repr
let a = 2
let y = foo(f):
  if a > 1:
    5
  else:
    10
echo y.repr
where we have replaced the f{} syntax by a foo() call so that we can actually import ggplotnim and access the actual FormulaNode type. In this version the formula template is not even needed, because we can just insert the pure (implementation pending) formulas into the new generic initVariable.
If a formula is detected as being impure, we first have to determine the actual formula kind. This can be done in the same manner as currently done in compileFormula.
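For illustration, a kind detection based on the top level infix operator might look like this (detectFormulaKind is a hypothetical helper sketch, not the actual compileFormula logic):

proc detectFormulaKind(stmts: NimNode): FormulaKind =
  ## `~` implies a vector (map) formula, `<<` a scalar (reducing) formula,
  ## `<-` a plain assignment. Any other impure body defaults to a vector
  ## formula (e.g. a filter condition).
  result = fkVector
  if stmts.kind == nnkInfix and stmts[0].kind == nnkIdent:
    case stmts[0].strVal
    of "~": result = fkVector
    of "<<": result = fkScalar
    of "<-": result = fkAssign
    else: discard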
Before we get into any of this, let’s try to write some helpers that allow us to construct formulas manually of specific types.
import ggplotnim, sequtils
import arraymancer / laser / strided_iteration / foreach
proc initVectorFormula[T](fnV: proc(df: DataFrame): Column): FormulaNode =
  result = FormulaNode(name: "foo", kind: fkVector, resType: toColKind(T), fnV: fnV)
proc initScalarFormula[T](fnS: proc(df: DataFrame): Value): FormulaNode =
  result = FormulaNode(name: "foo", kind: fkScalar, valKind: toValKind(T), fnS: fnS)
template vecFn(dtype, body: untyped): untyped =
  let cl = proc(df: DataFrame): Column =
    let df {.inject.} = df
    body
  initVectorFormula[dtype](cl)
template scalarFn(dtype, body: untyped): untyped =
  let cl = proc(df: DataFrame): Value =
    let df {.inject.} = df
    body
  initScalarFormula[dtype](cl)
let f = vecFn(int):
  let t = df["x", int]
  result = toColumn t.map_inline(x * 2)
let f2 = vecFn(int):
  let t = df["x", int]
  var res = newTensor[int](df.len)
  for idx in 0 ..< df.len:
    res[idx] = t[idx] * 2
  result = toColumn res
let f3 = vecFn(int):
  let t = df["x", int]
  var res = newTensor[int](df.len)
  forEach x in res, y in t:
    x = y * 2
  result = toColumn res
let fs = scalarFn(int):
  let t = df["x", int]
  var res: int
  forEach x in t:
    res += x
  result = %~ res
echo f.repr
let df = seqsToDf({"x" : toSeq(0 .. 10)})
echo f.evaluate(df)
echo f2.evaluate(df)
echo f3.evaluate(df)
echo fs.reduce(df)
In summary the closure implementation of a vector formula consists of 3 parts:
- the preface:
  let
    t1 = df["t1", dtype1]
    ...
    tn = df["tn", dtypen]
  var res = newTensor[resType](df.len)
- the loop:
  for / forEach / map_inline / ...:
    # some code using `res`
- the result:
  result = toColumn res
Complications are thus:
- how do we determine the datatypes of the input tensors?
- how do we determine the datatype of the result?
and for the vecFn template:
- how do we determine the column type so that vecFn could create a FormulaNode on its own?
The implementation of reducing formulas is essentially the same, with the following changes to each of the 3 stages:
- preface:
  var res: resType
- the loop:
  # some code modifying `res`
- the result:
  result = %~ res
and in many cases the whole body can be reduced to a single line, if the reducing operation simply calls a procedure (e.g. mean).
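Using the scalarFn helper from above, such a single line reducing body could look like this (a sketch; mean here is arraymancer's mean for Tensor[float]):

let fMean = scalarFn(float):
  result = %~ mean(df["x", float])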
import macros
f{
  preface:
    t in df["foo", int] # t refers to each element of `foo` in the loop
    u in df["bar", float]
    v = df["baz", int] # v refers to the ``Tensor`` `baz`
  dtype: float
  name: "fooBar"
  loop:
    t.float * u + v[idx].float
}
f{
  preface:
    t in df["foo", int] # t refers to each element of `foo` in the loop
    u in df["bar", float]
    v = df["baz", int] # v refers to the ``Tensor`` `baz`
    r in result
  dtype: bool
  name: "filterme"
  loop:
    r = t.float > u and v[idx] < 2.2
}
f{
  preface:
    t in df["foo", float] # t refers to each element of `foo` in the loop
  dtype: bool
  name: "noNan"
  loop:
    not classify(t) == fcNan
}
# then
"noNan" ~ not classify(t) == fcNan
# is just a short form of
name: "noNan"
loop:
  not classify(t) == fcNan
Have a data type to store this information at CT:
type
  ## either: `t in df["foo", int]`
  ## or: `t = df["foo", int]`
  Assign = object
    element: NimNode # e.g. `t`
    tensor: NimNode # either `t` or `t_T` if `element` is used
    col: NimNode # name of the column
    colType: NimNode # e.g. `float`
  Preface = object
    args: seq[Assign]
  FormulaCT = object
    preface: Preface
    resType: NimNode # e.g. `ident"float"`
    name: string # name of the formula body as lisp
    loop: NimNode # loop needs to be patched to remove accented quotes etc.
And a function to convert FormulaCT into a vector / scalar closure procedure.
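A rough sketch of what such a conversion could look like for the vector case (compileVectorFormula is a hypothetical name and the generated layout an assumption, not the final implementation):

proc compileVectorFormula(fct: FormulaCT): NimNode =
  ## Assembles preface, loop and result into a `proc(df: DataFrame): Column`.
  var preface = nnkLetSection.newTree()
  for a in fct.preface.args:
    # generates `let <tensor> = df[<col>, <colType>]`
    preface.add nnkIdentDefs.newTree(
      a.tensor, newEmptyNode(),
      nnkBracketExpr.newTree(ident"df", a.col, a.colType))
  let
    resTyp = fct.resType
    loop = fct.loop
    idx = ident"idx" # plain ident so the loop body can refer to `idx`
    df = ident"df"   # plain ident so the preface binds the argument
  result = quote do:
    proc(`df`: DataFrame): Column =
      `preface`
      var res = newTensor[`resTyp`](`df`.len)
      for `idx` in 0 ..< `df`.len:
        res[`idx`] = `loop`
      result = toColumn res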
Accented quotes / call string literals are converted into Assign:
`foo` -> Assign(
  element: fooEl, # gensym
  tensor: fooT, # gensym
  col: foo,
  colType: ..., # heuristics or type hint
)
# same with `c"foo"`
f{`x`}
f{`x` + `y`}
f{"foo" ~ `x` + `y`}
f{"foo" << `x` + `y`}
f{"foo" ~ `x` + fn(`y`)}
f{"foo" ~ fn(`x`)}
let bar = "hello"
f{"foo" ~ fn(`x`, bar)}
# for filtering
f{not classify(`x`) == fcNaN}
f{not isNull(`x`)}
f{isBool(`x`)}
f{`x` > 5}
f{df["x"][idx] > 5 and `y` < 4}
f{"newCol" ~ if `x` < 3:
    foo(`x`)
  else:
    bar(`x`)}
and many more…
One of the major issues in the current implementation is our hacky pass from untyped to typed macros. In principle such a pass is fine, but we cannot pass multiple procedures to a typed macro, due to the following Nim issue: nim-lang/Nim#13913
However, we can work around it using an approach such as:
import macros, math, tables
proc fn(x: int, s: string): string = $x & s
var EncounteredSymbols {.compileTime.}: Table[string, seq[NimNode]]
macro doTyped(name: static string, arg: typed): untyped =
  if name notin EncounteredSymbols:
    EncounteredSymbols[name] = newSeq[NimNode]()
  case arg.kind
  of nnkClosedSymChoice, nnkOpenSymChoice:
    for ch in arg:
      EncounteredSymbols[name].add ch
  else:
    EncounteredSymbols[name].add arg
macro foo(n: static string, fns: varargs[untyped]): untyped =
  result = newStmtList()
  for t in fns:
    result.add quote do:
      doTyped(`n`, `t`)
foo("bar", sqrt, pow, ln, fn)
static:
  for k, val in EncounteredSymbols:
    for s in val:
      echo s.getImpl.treeRepr
It requires neither {.experimental: "dynamicBindSym".} nor a macro taking multiple typed parameters. Instead we have a compilation as follows:
- an untyped macro, which extracts all symbols (nnkCall, nnkCommand?) used in the body; it outputs calls as:
  - add each symbol using a doTyped equivalent to some global Table, where the key is the name of the formula
  - gen code to call another macro with the same arguments and the name of the just inserted key
- said macro can now read all typed symbols from the Table
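A minimal sketch of this two-stage pipeline, reusing the addSymbols macro and the extractSymbols helper shown below (the macro names and the use of the stringified formula as table key are assumptions):

macro compileFormulaTyped(name: string, stmts: untyped): untyped =
  # second stage: every symbol stored under this key is now typed
  for sym in TypedSymbols[name.repr]:
    echo sym.getImpl.treeRepr
  result = stmts # here the actual closure would be generated

macro formulaUntyped(stmts: untyped): untyped =
  # first stage: emit `addSymbols` calls for all extracted symbols, then
  # invoke the typed stage with the same body and the key just inserted
  let name = newLit stmts.repr
  result = newStmtList()
  for sym in extractSymbols(stmts):
    result.add nnkCall.newTree(bindSym"addSymbols", name, sym)
  result.add nnkCall.newTree(bindSym"compileFormulaTyped", name, stmts)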
The final implementation is a bit more complicated in some sense, because it requires us to skip the identifiers df, idx as well as nnkBracketExpr of df[<someCol>]([idx]), but in the latter case take all identifiers found in a chain of nnkDotExpr. nnkDotExpr however has to be kept as is (as a chain of calls) iff it only contains identifiers.
var TypedSymbols {.compileTime.}: Table[string, seq[NimNode]]
macro addSymbols(name: string, n: typed): untyped =
  if name.repr notin TypedSymbols:
    TypedSymbols[name.repr] = newSeq[NimNode]()
  TypedSymbols[name.repr].add n
proc extractSymbols(n: NimNode): seq[NimNode] =
  case n.kind
  of nnkIdent, nnkSym:
    # take any identifier or symbol
    if n.strVal notin ["df", "idx"]: # these are reserved identifiers
      result.add n
  of nnkBracketExpr:
    # check if contains df[<something>], df[<something>][idx]
    if not ((n[0].kind == nnkIdent and n[0].strVal == "df") or
            (n[0].kind == nnkBracketExpr and
             n[0][0].kind == nnkIdent and n[0][0].strVal == "df" and
             n[1].kind == nnkIdent and n[1].strVal == "idx")):
      result.add n
  of nnkDotExpr:
    ## If `DotExpr` consists only of Idents during the untyped pass,
    ## it's either field access or multiple calls taking no arguments.
    ## In that case we can just keep the chain and pass it to the typed
    ## macro. In case other things are contained (possibly `df[<...>]` or
    ## a regular call) take the individual fields.
    ## For something like `ms.trans` in ggplotnim (`trans` field of a scale)
    ## we need to pass `ms.trans` to typed macro!
    proc isAllIdent(n: NimNode): bool =
      result = true
      case n.kind
      of nnkIdent: discard
      of nnkDotExpr:
        if n[1].kind != nnkIdent: return false
        result = isAllIdent(n[0])
      else: return false
    let allIdent = isAllIdent(n)
    if allIdent:
      result.add n
    else:
      # add all identifiers found
      for ch in n:
        result.add extractSymbols(ch)
  of nnkAccQuoted, nnkCallStrLit:
    # do not look at these, since they are untyped identifiers referring to
    # DF columns
    return
  else:
    for i in 0 ..< n.len:
      result.add extractSymbols(n[i])
With this information in place, we can determine more type information than before. There are still 4 different ways for us to determine types:
- if explicit usage of <<, ~, <-: gives us information about whether the total computation is a tensor or a scalar
- heuristics based on boolean / numerical operators, which we can use to deduce the result of operations (but in case of boolean operations not of the input!)
- the newly gained type information from typed symbols: we can walk over the found symbols and check all open sym choices for things that:
  - take either scalars or tensors as input
  - from these look at the output to determine the required inputs for DF columns
- type hints of float -> int kind
The latter should be extended in a way similar to what’s explained in Impure formulas (formulas accessing DF).
Type lookup has to work in the following way. Once we have filled the TypedSymbols table, we walk over it in combination with the loop body. Using the loop body we find all symbols that are in TypedSymbols and check what their arguments are. Using the arguments:
- number of arguments
- types (reference to column, index or general symbol)
we can find the correct overloads. The overload is important in combination with the body, because it tells us what to replace each column reference by.
For explicit index references, via df["someCol"][idx], we have won, because we only have to consider overloads which take scalars and the access will just be replaced either by index or by value (depending on type).
For explicit column references, via df["someCol"], we have also won, except we now look for something taking Tensor[T].
For the general column reference using nnkAccQuoted or nnkCallStrLit we have to check what kind of overloads exist, scalar vs. tensor. In case of multiple overloads that match (which seems very unusual), we throw a CT error that we cannot determine the overload unambiguously and ask the user to use either df["foo"] or df["foo"][idx].
The result for the input types will be stored in each Assign. The final output type will be the "union" of all encountered types that were chosen.
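A possible heuristic for this type "union" might look as follows (a sketch assuming the chosen types arrive as plain ident nodes; not the actual implementation):

proc typeUnion(types: seq[NimNode]): NimNode =
  ## Reduces all encountered input types to a single common output type:
  ## a string forces string, any float forces float, otherwise int.
  var names: seq[string]
  for t in types:
    names.add t.strVal
  if "string" in names: result = ident"string"
  elif "float" in names: result = ident"float"
  elif "int" in names: result = ident"int"
  else: result = ident"Value" # fall back to the variant type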
What if we have:
f{"col" ~ foo($`x`.toBool)}
StmtList
  CurlyExpr
    Ident "f"
    Infix
      Ident "~"
      StrLit "col"
      Call
        Ident "foo"
        Prefix
          Ident "$"
          DotExpr
            AccQuoted
              Ident "x"
            Ident "toBool"
So we walk over the loop body and keep the information about the last encountered ident that is one of:
- nnkCall: index 0
- nnkCommand: index 0
- nnkPrefix: index 0
The name as string is enough; from it we can ask the TypedSymbols table to extract the symbol.
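Such a lookup name could be recovered with a small helper along these lines (a hypothetical sketch, not part of the current code):

proc symbolName(n: NimNode): string =
  ## Returns the name under which a node is keyed in `TypedSymbols`.
  case n.kind
  of nnkIdent, nnkSym: result = n.strVal
  of nnkCall, nnkCommand, nnkPrefix: result = symbolName(n[0])
  else: result = ""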
For nnkInfix and nnkDotExpr this is a bit different. In those cases we have to use information from the later nodes.
In case of nnkInfix the type information is special. We can assume that both sides of an nnkInfix will be of the same type as long as the infix is a boolean or mathematical operation. We can use this information to fill in holes of type information. Assume the following:
f{`x` < 5}
Assuming the user writes correct code (it's not our fault if they don't :) ), we can deduce from this that `x` should be read as an integer (or arguably as a float and the user was just lazy; this needs to be considered). The AST of this expression looks something like the following in slightly simplified form:
Infix
  Ident "<"
  AccQuoted
    Ident "x"
  IntLit 5
Thus, after extracting the typed symbols (< and 5) we can study the types of the full tree:
- call:
  Infix            # ? infix so look at children
    Ident "<"      # boolean operator, look for valid types: Tensor, Tensor and Number, Number
    AccQuoted      # ?
      Ident "x"
    IntLit 5
- call for nnkAccQuoted:
  Infix
    Ident "<"
    AccQuoted      # just a column reference, type unknown, bubble up
      Ident "x"
    IntLit 5
- call for nnkIntLit:
  Infix
    Ident "<"
    AccQuoted
      Ident "x"
    IntLit 5       # type is clearly int
Which leaves us with the following final tree:
Infix
  Ident "<"        # bool: Number, Number
  AccQuoted        # unknown
    Ident "x"
  IntLit 5         # int
With simple constraint solving (why not use Prolog huh?), we can uniquely determine that "unknown" has to be int too.
How do we easily do that? We look at the type signature of the < symbol and find the type constraint A == B. From node 2 we know B == int. Thus A == int follows from transitivity.
The problem with this is that solving these transitive problems might get hard for complex trees. For nnkInfix it's pretty easy, because there will always only be 2 child nodes of interest and they will usually have the same type.
For (possibly nested) nnkDotExpr we need to consider the last node of the chain.
We define a helper type to store type information about all symbols we study:
We define a helper type to store type information about all symbols we study:
type
  TypedNode = ref object
    node: NimNode # the node that has these types
    input: seq[NimNode] # types of input
    output: NimNode # output type
    children: seq[TypedNode] # nested `TypedNode`s for the child nodes
which we use to extract information:
proc isColumn(n: TypedNode): bool =
  n.node.kind in {nnkAccQuoted, nnkCallStrLit, nnkBracketExpr}
proc foo(n: NimNode, tab: Table[string, NimNode]): TypedNode =
  case n.kind
  of nnkIdent, nnkSym, nnkIntLit .. nnkFloat64Lit, nnkStrLit:
    # determine type of this field
    let nSym = tab[buildFormula(n)]
    let typ = findType(nSym)
    new(result)
    result.node = n
    result.input = typ.inputTypes
    result.output = typ.resType
  of nnkAccQuoted, nnkCallStrLit, nnkBracketExpr:
    new(result)
    result.node = n
    result.input = @[] # column reference: types not known yet
    result.output = newEmptyNode()
  else:
    new(result)
    result.node = newTree(n.kind)
    for ch in n:
      result.children.add foo(ch, tab)
This should produce a tree something like the following (if there were a way to pretty print it):
Infix
  Ident "=="              SomeNumber, SomeNumber -> bool
  DotExpr
    Call
      Ident "isNull"      Value -> Value
      BracketExpr         T (due to `idx` present)
        BracketExpr
          Ident "df"
          Ident "localCol"
        Ident "idx"
    Ident "toBool"        Value -> bool
  Ident "false"           bool
From this typed tree we can do 2 things in one pass over the typed tree:
- extract all reducing symbols from the for loop (Lift, discussed below)
- extract all Assign by deducing the likely types from "the environment"
For the latter, "the environment" depends on the nim node kind. E.g. for nnkInfix we look at the other typed node. For calls we look at the type of the arguments. If they are generic we go "one level up" and use the information from there if possible.
Additional heuristics:
- if SomeNumber: choose float
- if float and int: choose float
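Encoded as a helper these heuristics might look like the following (a sketch assuming both candidate types are plain idents):

proc resolveHeuristic(a, b: NimNode): NimNode =
  ## Picks a concrete type from two candidate type nodes.
  let (sa, sb) = (a.strVal, b.strVal)
  if "SomeNumber" in [sa, sb]: result = ident"float"
  elif "float" in [sa, sb] and "int" in [sa, sb]: result = ident"float"
  elif sa == sb: result = a
  else: result = newEmptyNode() # cannot decide at this level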
So for nnkCall, nnkCommand, nnkPrefix the type is directly determined from the type of the proc iff it is not a generic procedure.
A last resort resolution could be heuristics for individual nodes. For instance a symbol is being investigated that can both take and return either T or Tensor[T]. In that case we can use heuristics:
- is numeric: float (a Tensor product does not really make too much sense in the context of a DF formula), should be infix
- is boolean: bool return type, should be infix
For an arbitrary proc with overloads like:
proc foo[T: SomeNumber](x: T): T
proc foo[T](x: Tensor[T]): Tensor[T]
we would give up and throw an ambiguous identifier error.
This is for single symbols though.
One of the most important "optimizations" (if one should even call it that) is lifting operations: operations which perform a reducing computation on a column within a for loop, such that the reduced value acts as a scalar to be used within the loop.
Consider the following example formula:
f{"xNoMean" ~ `x` - mean(`x`)}
which will certainly come up in practice to remove the mean from all values in column "x". It is implicitly implied that mean must of course take the whole tensor contained in "x" instead of a single value, i.e. a form of broadcasting is performed.
Now the naive transformation into a closure yields the following code:
proc (df: DataFrame): Column =
  let
    colT = df["x", float]
  var res = newTensor[float](df.len)
  forEach r in res, x in colT:
    r = x - mean(colT)
  result = toColumn res
which is of course extremely wasteful! We recompute the mean df.len times! That's that many iterations over a tensor.
Thus, our macro code needs to analyze the symbols found in the formula and automatically lift such symbols to temporary variables computed once out of the for loop. Instead of the above we need to generate:
proc (df: DataFrame): Column =
  let
    colT = df["x", float]
    meanX = mean(colT)
  var res = newTensor[float](df.len)
  forEach r in res, x in colT:
    r = x - meanX
  result = toColumn res
How can this be done?
In our current implementation we keep track of all symbols in the body. The solution is relatively simple: we analyze the body of the formula for symbols which have reducing characteristics, i.e. signatures like:
Tensor[T] -> T
For any such symbol we find, we store the whole NimNode in a Lift object:
type
  Lift = object
    toLift: NimNode # the whole nim node to be lifted out
    liftedNode: NimNode # the name of the symbol the lifted node is assigned to
The toLift body then simply needs to be replaced in the same fashion as the rest of the loop body for column references. In this case the column reference to be inserted in the toLift node will always be the Assign.tensor field.
The lifted toLift node will then be replaced by liftedNode, which is just the name of the temporary result variable. Its name is derived from the name of the symbol to be lifted (usually toLift[0]).
In addition to the above example of performing an action on a full column, another case should be lifted, namely access of a local sequence / tensor that is being reduced. An example from the axionMass.nim file:
.filter(f{float: `Energy` >= energies.min and `Energy` <= energies.max})
where this code will recalculate energies.min/max for every single element it tests!
Fortunately, the lifting of these can work essentially in the same way. As long as we can deduce the types of the symbols involved, we should be able to deduce that energies.min has type signature seq/Tensor[T] -> T. With that knowledge we can lift out the full nnkDotExpr node.
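To make this concrete, the closure generated for the filter above might look like this after lifting (a sketch; energies is captured from the enclosing scope and the tensor names are illustrative):

proc (df: DataFrame): Column =
  let
    energyT = df["Energy", float]
    eMin = energies.min # lifted: computed once instead of df.len times
    eMax = energies.max # lifted: computed once instead of df.len times
  var res = newTensor[bool](df.len)
  forEach r in res, e in energyT:
    r = e >= eMin and e <= eMax
  result = toColumn res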
Maybe we could have an additional pass over the tree like extractNodesToLift or something like this.
Subnodes that do not reference a column can always be lifted out of a loop, because they reduce to a constant value.
For the time being this will be left out of the current implementation.
Main issue that's left: we can have multiple Assign for a single column reference. In some cases that might be byIndex and in others byTensor.
- get(preface) only returns the first assignment
- we also generate multiple symbols and statements to get tensors from the preface
We could imagine introducing different nnkCallStrLit prefixes like the following:
c"foo" <- refers to a column access
t"foo" <- refers to the full column (t = tensor)
Maybe a breaking change for c"foo" to e"foo" might be thinkable, where e would stand for element and then we could use c for the full column?
In implementing the above functionality I stumbled over many problems. While we are able to determine types for symbols, these are, as isolated symbols, not particularly useful.
There are firstly two cases:
- the symbol is unique and thus we can determine the required type of the DF access argument from it.
- the symbol is an unresolved overload. In this case we may be unable to resolve the ambiguity if there are symbols taking either tensors or scalars. For such cases we need a syntax to differentiate between column / index access regardless!
While dealing with these two is possible, if complex (throw a CT error in the latter case), there is an additional factor making things much worse: generics.
For generic procedures we (possibly) lack every kind of information about the types. We may not even know if T may refer only to scalars or to both tensors and scalars. While we can make restrictions such as T referring to scalars and requiring Tensor[T] for tensor procs, this becomes extremely complicated as we delegate way too much to the signature of procedures.
In the end all this results in something that is akin to writing our own “compiler”. Just worse, because doing all this using macros only is problematic to say the least. In practice we have to deal with all sorts of procedures that are perfectly valid as if written in a manual loop, but simply have weird signatures.
Thus, the solution will be more explicit. There’s no point in trying to make everything as succinct as possible. Often explicit is better than implicit as we know as users of a language with a very strong type system!
We will delegate type checking to the Nim compiler, with the following implicit / explicit rules:
- any reference to a DF column using the existing syntax of:
  `someCol`
  c"someCol"
  # and to an extent the explicit variants:
  df["someCol"][idx]
  df[stringVar][idx]
  will keep referring to indices of the referred columns. This is because the majority of use cases acts on indices, due to the fact that our DF is column based after all.
- an additional syntax is added to avoid the explicit df access, while staying reasonably succinct:
  # replace by `df["SomeCol"][idx]`
  f{"test" ~ idx("SomeCol")}
  f{"test" ~ idx(SomeDeclaredVariable)}
  f{"test" ~ idx(`SomeCol`)}
  f{"test" ~ idx("SomeCol", float)}
  f{"test" ~ idx(SomeDeclaredVariable, float)}
  f{"test" ~ idx(`SomeCol`, float)}
  # replace by `df["SomeCol"]`, lift these out of the loop body
  f{"test" ~ col("SomeCol")}
  f{"test" ~ col(SomeDeclaredVariable)}
  f{"test" ~ col(`SomeCol`)}
  f{"test" ~ col("SomeCol", float)}
  f{"test" ~ col(SomeDeclaredVariable, float)}
  f{"test" ~ col(`SomeCol`, float)}
  ## allows for
  f{"test" ~ `Energy` * 10.0 * idx(MyColRef, float) > mean(col(`SomeCol`, float))}
i.e. col refers to a column with the given argument (supporting variables, strings and accented quotes) while idx refers to indices (which means the implicit default is equivalent to idx(`someCol`)). Both of these in addition allow a second argument referring to a type, in which case this is the type with which to read the tensor from the DF, useful to combine multiple columns with different types.
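For the combined example above, the generated closure might then look roughly like this (a sketch; tensor names are illustrative and the mean(col(...)) part is lifted as described below):

proc (df: DataFrame): Column =
  let
    energyT = df["Energy", float]
    myColT = df[MyColRef, float]
    someColT = df["SomeCol", float]
    liftedMean = mean(someColT) # full `col` access, computed once
  var res = newTensor[bool](df.len)
  for idx in 0 ..< df.len:
    res[idx] = energyT[idx] * 10.0 * myColT[idx] > liftedMean
  result = toColumn res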
In particular the col syntax allows for something else as well: lifting computations out of the loop. Given the col syntax introduced in the section on type determination above, we can more easily perform lifting.
Essentially the logic for lifting works as follows:
- recurse on the tree and check if col access is used
- if so, lift out the parent tree containing the col access
  - determine the parent tree by: go up until the first infix tree is found and take everything below the infix
Take the following contrived example:
f{$(`numbers`) & " m/s" != mean(col(`someCol`, float)).formatFloat(precision = 2) & " m/s"}
Here we know we need to:
- LHS: use indices for the "numbers" column, convert each value to string and append " m/s"
- RHS: the mean computation should not happen in the loop. We can lift out the whole part up to the next infix node, which is & on the RHS in this case. Any operation that acts on the reduced value can, and should, be lifted.
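A sketch of the closure this could compile to after lifting (names illustrative; note that the whole formatFloat(...) & " m/s" chain is lifted together with mean, since it only acts on the reduced value):

proc (df: DataFrame): Column =
  let
    numbersT = df["numbers", float]
    someColT = df["someCol", float]
    # lifted: everything up to the next infix node on the RHS
    liftedRhs = mean(someColT).formatFloat(precision = 2) & " m/s"
  var res = newTensor[bool](df.len)
  forEach r in res, n in numbersT:
    r = ($n & " m/s") != liftedRhs
  result = toColumn res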
To generate a forEach statement from the formula macro instead of a normal for loop, we need to understand what code we have to generate. Let's look at the AST of a forEach call:
import macros
macro dumpAtRT(body: untyped): untyped =
  let s = body.treeRepr
  result = quote do:
    echo `s`
dumpAtRT:
  forEach a in x, b in y, c in z:
    echo a, b, c
So it's simply a nnkCommand with a bunch of nnkInfix arguments. Easy enough to generate.
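Building that command node from a list of (element, tensor) pairs could then be done with a helper like this (a hypothetical sketch; the body is assumed to be a nnkStmtList):

proc genForEach(args: seq[(NimNode, NimNode)], body: NimNode): NimNode =
  ## Builds `forEach el1 in t1, el2 in t2, ...: body`.
  result = nnkCommand.newTree(ident"forEach")
  for (el, t) in args:
    result.add nnkInfix.newTree(ident"in", el, t)
  result.add body # the loop body as trailing stmt list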