c4: Clojure protocols for Drake

Jeffrey Hulten edited this page Jan 25, 2024 · 7 revisions

Overview

The c4 protocol is a specialized Clojure-based protocol for Drake that supports:

  • Drake steps written in Clojure, to transform your data
  • treating file rows as a sequence of hash-maps (something at which Clojure is phenomenal)
  • querying data from Factual's public API
  • convenient support for files formatted as CSV, TSV, or JSON

The c4 protocol is meant mainly for local files that are organized as line-by-line records of data and that are well below terabyte scale.

c4row: Transform individual rows

The c4row protocol lets you write Clojure code that transforms an individual row. Every input row is run through this transform, and the result is written to the output file.

Your code can refer to the row var for the input row, which is automatically provided to you. Your code is expected to return the output row, which will be written to the output file.

The c4row protocol is meant especially for cases where you have a self-contained row transform that you want applied to every single row in the input file, and you don't need any higher level control over the rows.

For example:

; Add a FullName column, using FirstName and LastName
out.csv <- in.csv [c4row]
  (assoc row "FullName"
    (format "%s %s" (row "FirstName") (row "LastName")))

If this step had been run with an in.csv and in.csv.header like...

in.csv

Brandon,Yoshimoto
Aaron,Crow
Artem,Boytsov

in.csv.header

FirstName
LastName

... the out.csv and out.csv.header files would look like:

out.csv

Brandon,Yoshimoto,Brandon Yoshimoto
Aaron,Crow,Aaron Crow
Artem,Boytsov,Artem Boytsov

out.csv.header

FirstName
LastName
FullName
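
To try the same transform outside of Drake, the logic can be sketched in plain Clojure. This is only an illustration under stated assumptions: the names header, lines, parse-row, and add-full-name are hypothetical, the naive comma split stands in for real CSV parsing, and none of it is meant to describe how c4 is implemented internally.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-ins for the contents of in.csv.header and in.csv.
(def header ["FirstName" "LastName"])
(def lines  ["Brandon,Yoshimoto" "Aaron,Crow" "Artem,Boytsov"])

;; Naive parsing: pair each header name with the matching field.
;; Quoting and escaping rules of real CSV files are ignored here.
(defn parse-row [line]
  (zipmap header (str/split line #",")))

;; The body of the c4row step, wrapped in a function of `row`.
(defn add-full-name [row]
  (assoc row "FullName"
    (format "%s %s" (row "FirstName") (row "LastName"))))

;; Apply the transform to every row, as c4row does for each input row.
(def out-rows (map (comp add-full-name parse-row) lines))

;; Print each output row in column order.
(doseq [row out-rows]
  (println (str/join "," (map row ["FirstName" "LastName" "FullName"]))))
```

Because each row is an ordinary hash-map, adding a column is just an assoc; the new key becomes the extra FullName entry in the output header.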

c4rows: Handle the sequence of input rows

The c4rows protocol allows you to work with all rows from the input file as a sequence. Your code can refer to the predefined var rows to get the sequence, and your code is expected to return a sequence of rows you want written to the output file.

The c4rows protocol is meant especially for when you want control of the input rows as a sequence, for example to apply filtering before you transform.

For example:

; For rows where LastName has more than 5 letters,
; add a FullName column, using FirstName and LastName.
; Skip all rows where LastName doesn't have more than
; 5 letters.
out.csv <- in.csv [c4rows]
  (map
    (fn [row]
      (assoc row "FullName"
        (format "%s %s" (row "FirstName") (row "LastName"))))
    (filter #(> (count (% "LastName")) 5) rows))

If this step had been run with an in.csv like the following (with an in.csv.header naming the FirstName and LastName columns)...

Brandon,Yoshimoto
Aaron,Crow
Artem,Boytsov

... the out.csv file would look like:

Brandon,Yoshimoto,Brandon Yoshimoto
Artem,Boytsov,Artem Boytsov
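
The filter-then-transform shape of this step can also be tried in plain Clojure. The sketch below assumes the rows arrive as hash-maps keyed by column name; the in-memory rows vector and the helper names long-last-name? and add-full-name are hypothetical and only serve the illustration.

```clojure
;; Hypothetical in-memory rows, shaped like the hash-maps a c4rows step sees.
(def rows
  [{"FirstName" "Brandon" "LastName" "Yoshimoto"}
   {"FirstName" "Aaron"   "LastName" "Crow"}
   {"FirstName" "Artem"   "LastName" "Boytsov"}])

;; Keep only rows whose LastName has more than 5 letters.
(defn long-last-name? [row]
  (> (count (row "LastName")) 5))

;; Add a FullName column built from FirstName and LastName.
(defn add-full-name [row]
  (assoc row "FullName"
    (format "%s %s" (row "FirstName") (row "LastName"))))

;; Filter first, then transform the surviving rows.
(def out-rows
  (->> rows
       (filter long-last-name?)
       (map add-full-name)))

(doseq [row out-rows]
  (println (row "FullName")))
```

Filtering before mapping means the transform only runs on rows that will actually be written, which is the kind of sequence-level control c4row does not offer.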

File Format Support

The c4 protocol provides native support for these file formats, based on file extension:

  • CSV (e.g., in.csv)
  • TSV (e.g., out.tsv)
  • JSON (e.g., in.json)

As long as you name your input files and output files following this convention, c4 will know how to parse and format them. You don't need to do anything special. Your c4 code only needs to be concerned with rows represented as hash-maps.

This also means c4 will allow you to easily translate from any supported file format to any other. Specifically, all these step specifications will work:

out.tsv <- in.csv [c4row]

out.json <- in.csv [c4row]

out.csv <- in.tsv [c4row]

out.json <- in.tsv [c4row]

out.csv <- in.json [c4row]

out.tsv <- in.json [c4row]

In fact, it's perfectly valid to have a c4 step with no commands that serves only to translate from one file format to another. That is, any of the above will work as a complete step in Drake.
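
As a rough mental model of extension-based format selection, the sketch below maps a file name to a format keyword. The file-format and step-formats functions are hypothetical helpers written for this page, not part of c4 or Drake, and the exact matching rules c4 applies (for example, case sensitivity) are not covered here.

```clojure
;; Hypothetical helper: choose a format keyword from the file extension.
;; Matching is case-sensitive in this sketch; c4's actual rules may differ.
(defn file-format [filename]
  (let [ext (second (re-find #"\.([^.]+)$" filename))]
    (case ext
      "csv"  :csv
      "tsv"  :tsv
      "json" :json
      (throw (ex-info "Unsupported file extension" {:filename filename})))))

;; Hypothetical helper: describe a conversion step as [input-format output-format].
(defn step-formats [output input]
  [(file-format input) (file-format output)])

(println (step-formats "out.json" "in.csv"))
```

Since both sides of a step resolve to a format independently, any input format can be paired with any output format, which is why all six combinations listed above are valid.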

How to require runtime dependencies

c4 does not currently support pulling in external libraries not already included in Drake. Drake's plugin mechanism can be used as a workaround for this.

However, c4 does allow you to pull in namespaces and functions that are available at runtime but not already in the c4 namespace. This is done with the with-ns macro.

Here's an example c4 step that uses with-ns to pull in clojure.data/diff and clojure.set/union:

out.json <- in.json [c4row]
  (with-ns {clojure.data [diff]
            clojure.set  [union]}
    {:myunion (union (row "col1") (row "col2"))
     :mydiff  (diff  (row "col1") (row "col2"))})
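
To see what the two functions return, the same expressions can be evaluated in a plain Clojure REPL, using require in place of with-ns. The sample row below is hypothetical: it assumes both columns hold sets, whereas the values a real step sees depend on how the input JSON is parsed.

```clojure
(require '[clojure.data :refer [diff]]
         '[clojure.set :refer [union]])

;; Hypothetical row whose two columns hold sets.
(def row {"col1" #{1 2 3} "col2" #{3 4}})

(def result
  {:myunion (union (row "col1") (row "col2"))
   ;; diff returns [only-in-first only-in-second in-both].
   :mydiff  (diff  (row "col1") (row "col2"))})

(println result)
```

With these inputs, :myunion is the set of 1, 2, 3, and 4, and :mydiff is a three-element vector holding the items only in col1, the items only in col2, and the items common to both.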