Skip to content

Latest commit

 

History

History
1071 lines (936 loc) · 32.5 KB

formula_dev_notes.org

File metadata and controls

1071 lines (936 loc) · 32.5 KB

“Spec” for DF formulas, aka development ideas and notes

Formulas serve as a convenience syntax to describe computations that are supposed to be done on a DataFrame. These come in three forms:

  1. formulas that compute a new Column (map operations)
  2. formulas that reduce one or more columns to a single value (reduce operations)
  3. simple constants or pure (arithmetic) operations, i.e. things that can be evaluated without the context of a DataFrame

This means we can write prototypes for these three operations as:

  1. map operations require both a datatype to read and one result data type. In theory there may be multiple read data types with the caveat that all “read” datatypes must be convertible to the “result” datatype. In addition they of course require some operationn op to be done using all “read” tensors t_i. Finally, a name for the new (or overwritten) column in the dataframe is needed outname.
proc(df: DataFrame): Column = 
  let
    t1 = df["col1", readType1]
    ...
    tn = df["colN", readTypeN] 
  var res = newTensor[resultType](df.len)
  for idx in 0 ..< df.len:
    res[idx] = op(t1, ..., tN)
  result = toColumn res
  1. reduce operations are a bit simpler. The type that will be returned is always going to be a Value . The “read” datatypes are required same as for map operations for all tensors t_i. An operation has to be supplied to perform an action on all tensors that reduces to a single scalar value. (NOTE: we could extend it to make it return a specific type, same as we deduce the return type of res in the above case. But in that case the signature of the proc changes, which means we’d need to allow generic FormulaNodes. That brings the problem of taking varargs[FormulaNode], because that cannot work anymore. So it’s a bit messy and the overhead of storing a number in a value should be negligible).
proc(df: DataFrame): Value = 
  let
    t1 = df["col1", readType1]
    ...
    tn = df["colN", readTypeN] 
  res[idx] = op(t1, ..., tN)
  result = %~ res
  1. are self explanatory, it’s just Nim code that can be evaluated w/o additional input. The complication about this is how to detect it is just that.

In the current implementation of the formula macro the types of all input tensors t_i has to be the same. As mentioned above this is an artificial requirement in principle as long as the type can be converted to the output type (or in case of boolean return types to the comparison type).

Before jumping into the macro craze again to make changes to the macro, we will now:

  • design all types of operations that are expected to work given our DSL
  • write helper procedures to allow constructing formulas manually without the use of the DSL (which is not really supported right now, for no good reason)
  • extend the spec to allow multiline arguments and possibly a smarter looping using loop fusion (forEach if no string / Value types used, else regular for)

The actual FormulaNode type needs to store some additional information about the kind of formula, its name and value / closure.

The FormulaNode type

type
  FormulaKind* = enum
    fkVariable, fkAssign, fkVector, fkScalar

  FormulaNode* = object
    name*: string # stringification of whole formula. Only for printing and
                  # debugging
    case kind*: FormulaKind
    of fkVariable:
      # just some constant value. Result of a simple computation as a `Value`
      # This is mainly used to rename columns / provide a constant value
      val*: Value
    of fkAssign:
      lhs*: string # can this be something else?
      rhs*: Value
    of fkVector:
      colName*: string
      resType*: ColKind
      fnV*: proc(df: DataFrame): Column
    of fkScalar:
      valName*: string
      valKind*: ValueKind
      fnS*: proc(c: DataFrame): Value

Expressions to be supported

Differentiating between “pure” and “impure” formulas:

By having a test on “pureness” of the formula (that means checking for references to columns of a DataFrame or usage of any of the following symbols / expressions:

~
<<
<-
<dtype> -> <dtype2>:
`someCol`
c"someCol"
df["foo"]
df["foo"][idx]
idx

If none of the above is present, we can dispatch to the formula template above and assign create a FormulaNode of type fkVariable with the value being the result of the expression as a Value.

Pure Nim expressions

Essentially pure formulas are nothing but a template.

f{1}
f{1.0}
f{"hello"}
f{1 + 2}
block:
  let x = 2
  f{1 + x}
block:
  proc foo(): int = 2
  f{foo()}
  f{1 + foo()}
  proc bar(x: int): int = 1 + x
  f{bar(2)}
  f{3 + bar(3)}
block:
  # control flow
  let x = 2
  f{if x > 1: 
      5
    else:
      10}

Possible implementation:

import macros

template formula(body: untyped): untyped =
  body

macro `{}`(arg, stmts: untyped): untyped = 
  if arg.kind == nnkIdent and arg.strVal == "f":
    result = formula(stmts)
  else:
    error("Invalid formula!")

echo f{1 + 2}
let x = 2
echo f{if x > 1: 
         5
       else:
         10}

This works fine. Invalid code will just generate a regular Nim compilation error.

This means something like the following:

import macros, ggplotnim

proc initVariable[T](x: T): FormulaNode =
  result = FormulaNode(name: "foo", kind: fkVariable, 
                       val: %~ x)

proc arePureFormula(n: NimNode): bool = true # a stub

macro foo(arg, stmts: untyped): untyped = 
  if arg.kind == nnkIdent and arg.strVal == "f":
    if stmts.arePureFormula():
      result = quote do:
        initVariable(`stmts`)
  else:
    error("Invalid formula!")

let x = foo(f, 1 + 2)
echo x.repr
let a = 2
let y = foo(f):
  if a > 1:
    5
  else:
    10
echo y.repr

where we have replaced the f{} syntax by a foo() call so that we can actually import ggplotnim and access the actual FormulaNode type. In this version the formula template is not even needed, because we can just insert the pure (implementation pending) formulas to the new generic initVariable.

Impure formulas (formulas accessing DF)

If a formula is detected as being impure, we first have to determine the actual formula kind. This can be done in the same manner as currently done in compileFormula.

Before we get into any of this, let’s try to write some helpers that allow us to construct formulas manually of specific types.

import ggplotnim, sequtils
import arraymancer / laser / strided_iteration / foreach
proc initVectorFormula[T](fnV: proc(df: DataFrame): Column): FormulaNode =
  result = FormulaNode(name: "foo", kind: fkVector, resType: toColKind(T), fnV: fnV)

proc initScalarFormula[T](fnS: proc(df: DataFrame): Value): FormulaNode =
  result = FormulaNode(name: "foo", kind: fkScalar, valKind: toValKind(T), fnS: fnS)

template vecFn(dtype, body: untyped): untyped =
  let cl = proc(df: DataFrame): Column = 
    let df {.inject.} = df
    body
  initVectorFormula[dtype](cl)

template scalarFn(dtype, body: untyped): untyped =
  let cl = proc(df: DataFrame): Value =
    let df {.inject.} = df
    body
  initScalarFormula[dtype](cl)

let f = vecFn(int):
  let t = df["x", int]
  result = toColumn t.map_inline(x * 2)

let f2 = vecFn(int):
  let t = df["x", int]
  var res = newTensor[int](df.len)
  for idx in 0 ..< df.len:
    res[idx] = t[idx] * 2
  result = toColumn res

let f3 = vecFn(int):
  let t = df["x", int]
  var res = newTensor[int](df.len)
  forEach x in res, y in t:
    x = y * 2
  result = toColumn res

let fs = scalarFn(int):
  let t = df["x", int]
  var res: int
  forEach x in t:
    res += x
  result = %~ res

echo f.repr

let df = seqsToDf({"x" : toSeq(0 .. 10)})
echo f.evaluate(df)
echo f2.evaluate(df)
echo f3.evaluate(df)
echo fs.reduce(df)

In summary the closure implementation of a vector formula consits of 3 parts:

  • the preface:
    let
      t1 = df["t1", dtype1]
      ...
      tn = df["tn", dtypen]
    var res = newTensor[resType](df.len
        
  • the loop:
    for / forEach / map_inline / ...:
      # some code using `res`
        
  • the result:
    result = toColumn res
        

Complications are thus:

  • how do we determine the datatypes of the input tensors?
  • how do we determine the datatype of the result?

and for the vecFn template:

  • how do we determine the column type so that vecFn could be independent to create a FormulaNode?

The implementation of reducing formulas is essentially the same, with the following changes to each of the 3 stages:

  • preface:
    var res: resType
        
  • the loop:
    # some code modifying `res`
        
  • the result:
    result = %~ res
        

and in many cases the whole body can be reduced to a single line, if the reducing operation simply calls a procedure (e.g. mean).

import macros

f{
  preface:
    t in df["foo", int] # t refers to each element of `foo` in the loop
    u in df["bar", float]
    v = df["baz", int] # v refers to the ``Tensor`` `baz`
  dtype: float
  name: "fooBar"
  loop:
    t.float * u + v[idx].float
}

f{
  preface:
    t in df["foo", int] # t refers to each element of `foo` in the loop
    u in df["bar", float]
    v = df["baz", int] # v refers to the ``Tensor`` `baz`
    r in result
  dtype: bool
  name: "filterme"
  loop:
    r = t.float > u and v[idx] < 2.2
}

f{
  preface:
    t in df["foo", float] # t refers to each element of `foo` in the loop
  dtype: bool
  name: "noNan"
  loop:
    not classify(t) == fcNan
}
# then
"noNan" ~ not classify(t) == fcNan
# is just a short form of 
name: "noNan"
loop:
  not classify(t) == fcNan

Have a data type to store this information at CT:

type
  ## either: `t in df["foo", int]`
  ## or: `t = df["foo", int]`
  Assign = object
    element: NimNode # e.g. `t`
    tensor: NimNode # either `t` or `t_T` if `elmenent` used
    col: NimNode # name of the column
    colType: NimNode # e.g. `float`
  Preface = object
    args: seq[Assign]
  FormulaCT = object
    preface: Preface
    resType: NimNode # e.g. `ident"float"`
    name: string # name of the formula body as lisp
    loop: NimNode # loop needs to be patched to remove accented quotes etc

And a function to convert FormulaCT into a vector / scalar closure procedure.

Accented quotes / call string literals are converted into Assign:

`foo` -> Assign(
  element: fooEl, # gensym
  tensor: fooT, # gensym
  col: foo,
  colType, # heuristics or type hint
)
# same with `c"foo"`

Operations to be supported

f{`x`}
f{`x` + `y`}
f{"foo" ~ `x` + `y`}
f{"foo" << `x` + `y`}
f{"foo" ~ `x` + fn(`y`)}
f{"foo" ~ fn(`x`)}
let bar = "hello"
f{"foo" ~ fn(`x`, bar)}
# for filtering
f{not classify(`x`) == fcNaN}
f{not isNull(`x`)}
f{isBool(`x`)}
f{`x` > 5}
f{df["x"][idx] > 5 and `y` < 4}
f{"newCol" ~ if `x` < 3:
               foo(`x`)
             else:
               bar(`x)}

and many more…

Automatic type determination

One of the major issues in the current implementation is our hacky pass from untyped to typed macros. In principle this is not an issue, but the problem is that we cannot pass multiple procedures to a typed macro, due to the following Nim issue: nim-lang/Nim#13913

However, we can work around it using an approach such as:

import macros, math, tables

proc fn(x: int, s: string): string = $x & s
var EncounteredSymbols {.compileTime.}: Table[string, seq[NimNode]]

macro doTyped(name: static string, arg: typed): untyped =
  if name notin EncounteredSymbols:
    EncounteredSymbols[name] = newSeq[NimNode]()
  case arg.kind
  of nnkClosedSymChoice, nnkOpenSymChoice:
    for ch in arg:
      EncounteredSymbols[name].add ch
  else:
    EncounteredSymbols[name].add arg

macro foo(n: static string, fns: varargs[untyped]): untyped =
  result = newStmtList()
  for t in fns:
    result.add quote do:
      doTyped(`n`, `t`)

foo("bar", sqrt, pow, ln, fn)
static:
  for k, val in EncounteredSymbols:
    for s in val:
      echo s.getImpl.treeRepr

It requires neither {.experimental: "dynamicBindSym".} nor a macro taking multiple typed parameters. Instead we have a compilation as follows:

  • an untyped macro, which extracts all symbols (nnkCall, nnkCommand?) used in the body, it outputs calls as:
    1. add each symbol using a doTyped equivalent to some global Table where the key is the name of the formula
    2. gen code to call another macro with the same arguments and the name of the just inserted key
    3. said macro can now read all typed symbols from the Table

The final implementation is a bit more complicated in some sense, because it requires us to skip identifiers df, idx as well as nnkBracketExpr of df[<someCol>]([idx]), but in the latter case take all identifiers found in a chain of nnkDotExpr.

nnkDotExpr however has to be kept as is (as a chain of calls) iff it only contains identifiers.

var TypedSymbols {.compileTime.}: Table[string, seq[NimNode]]

macro addSymbols(name: string, n: typed): untyped =
  if name.repr notin TypedSymbols:
    TypedSymbols[name.repr] = newSeq[NimNode]()
  TypedSymbols[name.repr].add n

proc extractSymbols(n: NimNode): seq[NimNode] =
  case n.kind
  of nnkIdent, nnkSym:
    # take any identifier or symbol
    if n.strVal notin ["df", "idx"]: # these are reserved identifiers
      result.add n
  of nnkBracketExpr:
    # check if contains df[<something>], df[<something>][idx]
    if not ((n[0].kind == nnkIdent and n[0].strVal == "df") or
            (n[0].kind == nnkBracketExpr and
             n[0][0].kind == nnkIdent and n[0][0].strVal == "df" and
             n[1].kind == nnkIdent and n[1].strVal == "idx")):
      result.add n
  of nnkDotExpr:
    ## If `DotExpr` consists only of Idents during the untyped pass,
    ## it's either field access or multiple calls taking no arguments.
    ## In that case we can just keep the chain and pass it to the typed
    ## macro. In case other things are contained (possibly `df[<...>]` or
    ## a regular call) take the individual fields.
    ## For something like `ms.trans` in ggplotnim (`trans` field of a scale)
    ## we need to pass `ms.trans` to typed macro!
    proc isAllIdent(n: NimNode): bool =
      result = true
      case n.kind
      of nnkIdent: discard
      of nnkDotExpr:
        if n[1].kind != nnkIdent: return false
        result = isAllIdent(n[0])
      else: return false
    let allIdent = isAllIdent(n)
    if allIdent:
      result.add n
    else:
      # add all identifiers found
      for ch in n:
        result.add extractSymbols(ch)
  of nnkAccQuoted, nnkCallStrLit:
    # do not look at these, since they are untyped identifiers referring to
    # DF columns
    return
  else:
    for i in 0 ..< n.len:
      result.add extractSymbols(n[i])

With this information in place, we can determine more type information than before. There are still 4 different ways for us to determine types:

  1. if explicit usage of <<, ~, <-: gives us information about whether the total computation is a tensor or a scalar
  2. heuristics based on boolean / numerical operators, which we can use to deduce the result of operations (but in case of boolean operations not of the input!)
  3. the newly gained type information from typed symbols: we can walk over the found symbols and check all open sym choices for things that:
    • take either scalars or tensors as input
    • from these look at the output

    to determine the required inputs for DF columns

  4. type hints of float -> int kind

The latter should be extended in a way similar to what’s explained in Impure formulas (formulas accessing DF).

Type lookup has to work in the following way. Once we have filled the TypedSymbols table, we walk over it in combination with the loop body. Using the loop body we find all symbols that are in TypedSymbols and check what their arguments are. Using the arguments:

  • number of arguments
  • types (reference to column, index or general symbol)

can find the correct overloads. The overload is important in combination with the body, because it allows us to find the correct way to replace each column reference by.

For explicit index references, via df["someCol"][idx] we have won, because we only have to consider overloads which take scalars and the access will just be replaced either by index or by value (depending on type).

For explicit column refernce, via df["someCol"] we also have won, except we now look for something taking Tensor[T].

For the general column reference using nnkAccQuoted or nnkCallStrLit we have to check what kind of overloads exist. Scalar vs. tensor. In case of multiple overloads that match (which seems very unusual), we throw a CT error that we cannot determine the overload unambiguously and ask the user to use either df["foo"] or df["foo"][idx].

The result for the input types will be stored in each Assign. The final output type will be the “union” of all encountered types that were chosen.

What if we have:

f{"col" ~ foo($`x`.toBool)}
StmtList
  CurlyExpr
    Ident "f"
    Infix
      Ident "~"
      StrLit "col"
      Call
        Ident "foo"
        Prefix
          Ident "$"
          DotExpr
            AccQuoted
              Ident "x"
            Ident "toBool"

So we walk over the loop body and keep the information about the last encountered ident that is one of:

  • nnkCall: index 0
  • nnkCommand: index 0
  • nnkPrefix: index 0

Name as string is enough, from it we can ask TypedSymbol table to extract symbol.

For nnkInfix and nnkDotExpr this is a bit different. In those cases we have to use information from the later nodes.

In case of nnkInfix the type information is special. We can assume that both sides of an nnkInfix will be of the same type as long as the infix is a boolean or mathematical operation. We can use this information to fill in holes of type information. Assume the following:

f{`x` < 5}

Assuming the user writes correct code (it’s not our fault if they don’t :) ), we can deduce from this that `x` should be read as an integer (or arguably as a float and the user was just lazy / this needs to be considered). The AST of this expression looks something like the following in slightly simplified form:

Infix
  Ident "<"
  AccQuoted
    Ident "x"
  IntLit 5

Thus, after extracting the typed symbols (< and 5) we can study the types of the full tree:

  1. call:
Infix # ? infix so look at children
  Ident "<" # boolean operator, look for valid types: Tensor, Tensor and Number, Number
  AccQuoted # ?
    Ident "x"
  IntLit 5 
  1. call for nnkAccQuote:
Infix 
  Ident "<" 
  AccQuoted # just a column reference, type unknown, bubble up
    Ident "x"
  IntLit 5 
  1. call for nnkIntLit:
Infix 
  Ident "<" 
  AccQuoted 
    Ident "x"
  IntLit 5 # type is clearly int

Which leaves us with the following final tree:

Infix 
  Ident "<"  # bool: Number, Number
  AccQuoted # unknown
    Ident "x" 
  IntLit 5 # int

With simple constraint solving (why not use Prolog huh?), we can uniquely determine that “unknown” has to be int too. How do we easily do that?

We look at the type signature of the < symbol, find type A == B, From node 2 we know B == int. Thus it follows A == int from transitivity.

The problem with this is that probably solving thes transitive problems might get hard for complex trees? For nnkInfix it’s pretty easy, because there will always only be 2 child nodes of interest and they will usually have the same type.

For (possibly nested) nnkDotExpr we need to consider the last

We define a helper type to store type information about all symbols we study:

type
  TypedNode = ref object
    node: NimNode # the node that has these types
    input: seq[NimNode] # types of input 
    output: NimNode # output type
    children: seq[ProcType]

which we use to extract information:

proc isColumn(n: TypedNode): bool = n.node.kind in {nnkAccQuoted, nnkCallStrLit, nnkBracketExpr}

proc foo(n: NimNode, tab: Table[string, NimNode]): TypedNode =
  case n.kind
  of nnkIdent, nnkSym, nnkIntLit .. nnkFloat64Lit, nnkStrLit:
    # determine type of this field 
    let nSym = tab[buildFormula(n)
    let typ = findType(nSym)
    new(result)
    result.node = n
    result.input = typ.inputTypes
    result.output = typ.resType
  of nnkAccQuoted, nnkCallStrLit, nnkBracketExpr:
    new(result)
    result.node = n
    result.input = newEmptyNode()
    result.output = newEmptyNode()
  else:
    new(result)
    result.node = newTree(n.kind)
    result
    for ch in n:
      result.children.add foo(ch, tab)

This should produce a tree something like the following (if there was a way to pretty print).

Infix
  Ident "==" SomeNumber, SomeNumber -> bool
  DotExpr
    Call
      Ident "isNull" Value -> Value
      BracketExpr T (due to `idx` present)
        BracketExpr
          Ident "df"
          Ident "localCol"
        Ident "idx" 
    Ident "toBool" Value -> bool
  Ident "false" bool

From this typed tree we can do 2 things in one pass over the typed tree:

  1. extract all reducing symbols from the for loop (Lift discussed below)
  2. extract all Assign by deducing the likely types from “the environment”.

For the latter “the environment” depends on the nim node kind. E.g. for nnkInfix we look at the other typed node. For calls we look at the type of the arguments. If they are generic we go “one level up” and use the information from there if possible.

Additional heuristics:

  • if SomeNumber: choose float
  • if float and int: choose float

So for nnkCall, nnkCommand, nnkPrefix the type is directly determined from the type of the proc iff it is not a generic procedure. 2

A last resort resolution could be heuristics for individual nodes. For instance a symbol is being investigated that can both take and return either T or Tensor[T]. In that case we can use heuristics:

  • is numeric: float (Tensor product does not really make too much sense in the context of a DF formula), should be infix
  • is boolean: bool return type, should be infix

For an arbitrary proc with overload like:

proc foo[T: SomeNumber](x: T): T 
proc foo[T](x: Tensor[T]): Tensor[T]

we would give up and throw an ambiguous identifier error.

This is for single symbols though.

modify findTypes to return types expected in above code

Lifting reducing calls from loop

One of the most important “optimizations” (if one should even call it as such) is lifting operations, which perform reducing operations on a column within a for loop, such that the reduced value acts as a scalar to be used within the loop.

Consider the following example formula:

f{"xNoMean" ~ `x` - mean(`x`)}

which will certainly come up in practice to remove the mean from all values in column “x”. It is implicitly implied that mean must of course take the whole tensor contained in “x” instead of a single value, i.e. a form of broadcasting is performed.

Now the naive transformation into a closure yields the following code:

proc (df: DataFrame): Column = 
  let
    colT = df["x", float]
  var res = newTensor[float](df.len)
  forEach r in res, x in colT:
    r = x - mean(colT)
  result = toColumn res

which is of course extremely wasteful! We recompute the mean df.len times! That’s that many iterations over a tensor.

Thus, our macro code needs to analyze the symbols found in the formula and automatically lift such symbols to temporary variables computed once out of the for loop. Instead of the above we need to generate:

proc (df: DataFrame): Column = 
  let
    colT = df["x", float]
    meanX = mean(colT)
  var res = newTensor[float](df.len)
  forEach r in res, x in colT:
    r = x - meanX
  result = toColumn res

How can this be done?

In our current implementation we keep track of all symbols in the body. The solution is relatively simple. We analyze the body of the formula for symbols, which have reducing characteristics, i.e. signatures like:

Tensor[T] -> T

For any such symbol we find, we store the whole =NimNode= in a Lift object:

type
  Lift = object
    toLift: NimNode # the whole nim node to be lifted out
    liftedNode: NimNode # the name of the symbol the lifted node is assigned to

The toLift body then simply needs to be replaced in the same fashion as the rest of the loop body for column references. In this case the column reference to be inserted in the toLift node will always be the Assign.tensor field.

The lifted toLift node will then be replaced by liftedNode, which is just the name of the temporary result variable. It’s name is derived from the name of the symbol to be lifted (usually toLift[0]).

In addition to the above example of performing an action on a full column, another case should be lifted. Namely access of a local sequence / tensor that is being reduced. An example from the axionMass.nim file:

.filter(f{float: `Energy` >= energies.min and `Energy` <= energies.max})

where this code will recalculate energies.min/max for every single element it tests!

Fortunately, the lifting of these can work essentially in the same way. As long as we can deduce types of the symbols involved, we should be able to deduce that energies.min has type signature seq/Tensor[T] -> T. With that knowledge we can lift out the full nnkDotExpr node.

Maybe we could have an additional pass over the tree like extractNodesToLift or something like this.

Subnodes that do not reference a column can always be lifted out of a loop, because they reduce to a constant value.

For the time being this will be left out of the current implementation.

Have to change replaceByElement and friends

Main issue that’s left: We can have multiple Assign for a single column refence. In some cases that might be byIndex and in others byTensor.

  1. get(preface) only returns the first assignment
  2. we also generate multiple symbols and statements to get tensors from the preface

Different call string literals

We could imagine to introduce different nnkCallStrLit like the following:

c"foo" <- refers to a column access
t"foo" <- refers to the full column (t = tensor)

Maybe a breaking change for c"foo" to e"foo" might be thinkable, where e would stand for element and then we could use c for full column?

Thoughts on type determination

In implementing the above functionality I stumbled over many problems. While we are able to determine types for symbols, these are as isolated symbols not particularly useful.

There are firstly two cases:

  • the symbol is unique and thus we can determine the required type of the DF access argument from it.
  • the symbol is an unresolved overload. In this case we may be unable to resolve the ambiguity in case there is are symbols taking either tensors or scalars. For such cases we need a syntax to differentiate between column / index access regardless!

While dealing with these two is possible, if complex (throw CT error in latter case), there is an additional factor making things much worse: generics. For generic procedures we lack (possibly) every kind of information about the types. We may not even know if T may refer only to scalars or either tensors and scalars. While we can make restrictions such as T referring to scalars and requiring Tensor[T] for tensor procs, this becomes extremely complicated as we delegate way too much to the signature of procedures.

In the end all this results in something that is akin to writing our own “compiler”. Just worse, because doing all this using macros only is problematic to say the least. In practice we have to deal with all sorts of procedures that are perfectly valid as if written in a manual loop, but simply have weird signatures.

Thus, the solution will be more explicit. There’s no point in trying to make everything as succinct as possible. Often explicit is better than implicit as we know as users of a language with a very strong type system!

We will delegate type checking to the Nim compiler, with the following implicit / explicit rules:

  • any reference to a DF column using the existing syntax of:
    `someCol`
    c"someCol"
    # and to an extent the explicit variants:
    df["someCol"][idx]
    df[stringVar][idx]
        

    will keep referring to indices of the referred columns. This is because the majority of use cases acts on indices, due to the fact that our DF is column based after all

  • an additional syntax is added to avoid the explicit df access, while staying reasonably succinct:
    # replace by `df["SomeCol"][idx]`
    f{"test" ~ idx("SomeCol")}
    f{"test" ~ idx(SomeDeclaredVariable)}
    f{"test" ~ idx(`SomeCol`)}
    f{"test" ~ idx("SomeCol", float)}
    f{"test" ~ idx(SomeDeclaredVariable, float)}
    f{"test" ~ idx(`SomeCol`, float)}
    # replace by `df["SomeCol"]`, lift these out of the loop body
    f{"test" ~ col("SomeCol")}
    f{"test" ~ col(SomeDeclaredVariable)}
    f{"test" ~ col(`SomeCol`)}
    f{"test" ~ col("SomeCol", float)}
    f{"test" ~ col(SomeDeclaredVariable, float)}
    f{"test" ~ col(`SomeCol`, float)}
    ## allows for
    f{"test" ~ `Energy` * 10.0 * idx(MyColRef, float) > mean(col(`SomeCol`, float))}   
        

    i.e. col refers to a column with the given argument (supporting variables, strings and accented quotes) while idx refers to indices (which means the implicit default is equivalent to idx(`someCol`). Both of these in addition allow a second argument referring to a type, in which case this is the type with which to read the tensor from the DF, useful to combine multiple columns with different types.

In particular the col syntax allows for something else as well. Lifting computations out of the loop.

Lifting column based computations out of the loop

Given the col syntax introduced in <a href=”Thoughts on type determination”>Thoughts on type determination, we can more easily perform lifting.

Essentially the logic for lifting works as follows:

  • recurse on tree and check if col access is used
  • if so, lift out parent tree containing col access until.
    • determine parent tree by: go up until the first infix tree is found and take infix

    Take the following contrived example:

    f{$(`numbers`) & " m/s" != mean(col(`someCol`, float)).formatFloat(precision = 2) & " m/s"
        

    Here we know we need to:

    • LHS: use indices for “numbers” column, convert each value to string and append " m/s"
    • RHS: the mean computation should not happen in the loop. We can lift out the whole part up to the next infix node, which is & on the RHS in this case. Any operation that acts on the reduced value can, and should, be lifted.

Replacing for by forEach

To generate a forEach statement from the formula macro instead of a normal for loop, we need to understand what code we have to generate. Let’s look at the AST of a forEach call:

import macros

macro dumpAtRT(body: untyped): untyped =
  let s = body.treeRepr
  result = quote do:
    echo `s`

dumpAtRT:
  forEach a in x, b in y, c in z:
    echo a, b, c

So it’s simply a nnkCommand with a bunch of nnkInfix arguments. Easy enough to generate.