Query Compilation 101

Prerequisite videos

Materialized performance 101 (skip forward to 5:13)
Materialized internals 101.
Introduction to one-off queries.
Materialize Decorrelation explained in Jamie Brandon’s Blog.

Current Compiler Pipeline

Representations:

SQL — source language
AST — a parsed version of a SQL query.
HIR — high-level intermediate representation.
MIR — mid-level intermediate representation.
LIR — low-level intermediate representation.
TDO — target language (timely & differential operators).

Transformations in the compile-time lifecycle of a dataflow.

SQL ⇒ AST.
- Parsing the SQL query.
AST ⇒ AST
- Resolving names against the catalog.
  - CatalogItemType lists the kinds of objects that can be resolved against the catalog.
AST ⇒ HIR.
- Resolving column references, column aliases, and table aliases
- If the SQL query is a one-off, the outermost TopK is converted to a RowSetFinishing at this point.
- EXPLAIN RAW PLAN returns the result of transformations up to this point.
HIR ⇒ HIR
- Predecorrelation rewrites:
  - Split out subquery conditions out as a separate predicate.
  - Try to rewrite other types of subqueries into EXISTS subqueries.
HIR ⇒ MIR.
- Decorrelation:
  - Correlated queries are rewritten as graphs with join and distinct.
- Lowering — express SQL-specific concepts as dataflow sub-graphs:
  - Outer joins are decomposed into multiple inner joins (see README.md).
  - Machinery for introducing defaults in empty global aggregates.
  - Machinery for introducing errors for SELECT subqueries with more than one return value.
- EXPLAIN DECORRELATED PLAN returns the result of transformations up to this point.
MIR ⇒ MIR.
- If the query is a view definition, run per-view logical optimizations against the SQL query. The catalog stores the result of transformations up to this point.
- Construct a dataflow for the query:
  - If the query depends on not-materialized views, the definitions of the not-materialized views get inlined.
  - For each materialized view that a query depends on, import all of its materializations. (This corresponds to all indexes on that view, which you can see if you call SHOW INDEXES IN <view>).
- Run optimizations against the dataflow:
  - Per-view logical.
  - Cross-view logical.
    - Propagating source information up: optimize_dataflow_monotonic
    - Pushing optimizations down to sources: LinearOperators
      - CSV decoding
      - Upsert
    - View inlining.
    - Theoretically supports producing more than one index/sink in the same dataflow.
  - Per-view logical (second round).
  - Per-view physical.
- EXPLAIN OPTIMIZED PLAN returns the result of transformations up to this point.
MIR ⇒ LIR.
- Decisions are made regarding rendering.
  - All aggregations are created equal in MIR, but from the rendering perspective, aggregations are evaluated differently according to what data needs to be kept to recalculate the aggregation after receiving a diff. A pictorial version can be found here.
  - Joins are broken down into multiple stages, and filters + projects run between each stage to shrink the intermediate result.
- RelationTypes (column types + unique keys) are discarded since we do no key or type of validation at render time.
- EXPLAIN PHYSICAL PLAN returns the result of transformations up to this point.
LIR ⇒ TDO.

For a one-off query, we run all the transformations until the MIR stage. Then we determine whether we need to serve the query on the "slow path", that is, creating a temporary dataflow and then deleting it. If we don't need to serve the query on the "slow path", then we can skip the MIR ⇒ LIR and the LIR ⇒ TDO steps. Existing "fast paths" include:

reading from an existing dataflow.
the adapter itself spitting out a constant set of rows.

Currently, the optimization team is mostly concerned with the HIR ⇒ MIR and MIR ⇒ MIR stages.

Testing

Integration tests

Sqllogictest
- Philip’s RQG tests will be in this format.
  - Add Philip to any PR where query plans may change.
- A PR can be merged if it passes Fast SLT.
- A PR does not need to pass Full SLT tests (test/sqllogictest/sqlite) to be merged.
  - Full SLT tests take 2-3 hours.
  - You can manually initiate full SLT tests on your branch here.
Testdrive
- We generally do not use testdrive except to see linear operators in action.

Unit tests

Datadriven
- Transform unit tests currently allow:
  - testing each transformation independently of the others.
  - Printing out which block of transformations change the plan and how.
- Unit tests in the mz-expr crate currently allow:
  - Testing the simplifying MirScalarExpr, predicates, join equivalences.
  - Testing MapFilterProject.
- There is a DSL to specifying arbitrary MIRs.
- DSL to specify arbitrary enums and structs.

Performance tests

TPCH

Tooling

mzt — can be used to create repositories of plans and write up a markdown that explains something based on those plans (see Alexander’s mzt-repos for example).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

101-query-compilation.md

101-query-compilation.md

Query Compilation 101

Prerequisite videos

Current Compiler Pipeline

Testing

Integration tests

Unit tests

Performance tests

Tooling

Files

101-query-compilation.md

Latest commit

History

101-query-compilation.md

File metadata and controls

Query Compilation 101

Prerequisite videos

Current Compiler Pipeline

Testing

Integration tests

Unit tests

Performance tests

Tooling