Introduce QueryBuilder #162

ienkovich · 2023-01-13T20:06:24Z

This PR introduces a public API for building a relational algebra DAG. It aims to provide access to the low-level HDK IR API in a much friendlier way than creating nodes and expressions directly through their constructors.

Many builder methods have a lot of overloads to allow a Python-like variety of input operand types. I'm not sure we need that many overloads, so we might revisit this in the future and reduce their number.

The coverage is not even close to 100%. For nodes, it doesn't cover TableFunction and LogicalUnion. For expressions, it leaves out a lot, including string expressions, subqueries, window functions, the function operator, etc. It is still enough to cover NYC Taxi and TPCH Q1 and Q3.

This API will be exposed in PyHDK to provide similar interfaces for Python.

The patch is quite big, but most of it is the TPCH data generator (I had to add it instead of the data itself because of license issues) and builder tests.

ienkovich · 2023-01-13T20:09:52Z

@vlad-penkin Could you please check that the TPCH data generator is added correctly from a licensing point of view?

ienkovich · 2023-01-20T22:08:53Z

@leshikus I've tried to reproduce the ASAN failure but failed to build with ASAN enabled in a conda environment because of errors in folly (I see the same errors when trying an ASAN build on the main branch). Is ASAN build enabled for docker only for now? Do we use folly in ASAN build?

Signed-off-by: ienkovich <[email protected]>

leshikus · 2023-01-24T17:51:14Z

> Is ASAN build enabled for docker only for now?

I've mostly used the conda version for ASAN and have never tried it in docker.

kurapov-peter

I've skimmed through most of the PR except for the builder itself. Looks reasonable as far as I can tell. Just a couple comments.

kurapov-peter · 2023-01-23T16:35:14Z

omniscidb/QueryEngine/ArrowResultSetConverter.cpp

-      return arrow::decimal(type->as<hdk::ir::DecimalType>()->precision(),
+      // No reason to use 256-bit decimals since we always import 64-bit values.
+      CHECK_EQ(type->size(), 8);
+      return arrow::decimal(std::min(type->as<hdk::ir::DecimalType>()->precision(), 38),


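38 is the max precision for a 128-bit decimal, which is what arrow::decimal creates here (a 64-bit value holds at most 18 full decimal digits). In our result, we can get a higher precision, especially when multiplication is used.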
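To make the digit counts concrete, here is a minimal sketch of why the clamp matters. It assumes the common SQL rule that a product of two decimals is typed with the sum of the operand precisions; whether HDK uses exactly this rule is an assumption, and `productPrecision` and `clampPrecision` are hypothetical helpers, not functions from this PR.

```cpp
#include <algorithm>
#include <cassert>

// Arrow's 128-bit decimal type supports at most 38 digits of precision.
constexpr int kMaxDecimal128Precision = 38;
// Every 18-digit decimal number fits into int64_t; not every 19-digit one does.
constexpr int kMaxInt64Digits = 18;

// ASSUMPTION: the product of DECIMAL(p1, s1) and DECIMAL(p2, s2) is typed
// with precision p1 + p2. This is a common SQL convention, not a rule
// confirmed by the PR.
inline int productPrecision(int lhs_precision, int rhs_precision) {
  assert(lhs_precision > 0 && rhs_precision > 0);
  return lhs_precision + rhs_precision;
}

// Mirrors the std::min(..., 38) clamp from the diff: the nominal precision
// may exceed what a 128-bit Arrow decimal type can declare.
inline int clampPrecision(int precision) {
  return std::min(precision, kMaxDecimal128Precision);
}
```

Under that assumption, multiplying two DECIMAL(30, x) columns yields a nominal precision of 60, which has to be reduced to 38 before it can be passed to arrow::decimal.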

omniscidb/SchemaMgr/SchemaMgr.cpp

kurapov-peter · 2023-01-24T09:52:35Z

omniscidb/IR/Context.cpp

+      return null();
+    }
+    // Boolean.
+    if (std::regex_match(val_lower, match_res, std::regex("bool(\\[nn\\])?"))) {


Do we define the string format formally somehow?

Not really. The best documentation, for now, is the appropriate test suite and the docstrings in the proposed Python API: https://github.com/intel-ai/hdk/pull/170/files#diff-d8b8cca193af0d6768b0e0809b22ef0871563d378ab087d5f6a95e0863019eafR1198
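For readers without the test suite at hand, the boolean case from the hunk above can be sketched in isolation. This is only an illustration: `BoolTypeSpec` and `parseBoolType` are hypothetical names, and reading the `[nn]` suffix as "non-nullable" is an assumption, not something the hunk states.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type; the real code returns an IR type object instead.
struct BoolTypeSpec {
  bool nullable;
};

// Parses "bool" or "bool[nn]" (case-insensitive) using the same pattern as
// the hunk. ASSUMPTION: the "[nn]" suffix marks a non-nullable type.
inline std::optional<BoolTypeSpec> parseBoolType(std::string str) {
  std::transform(str.begin(), str.end(), str.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  static const std::regex bool_re("bool(\\[nn\\])?");
  std::smatch match_res;
  // regex_match requires the whole string to match, so "boolean" is rejected.
  if (!std::regex_match(str, match_res, bool_re)) {
    return std::nullopt;
  }
  // Whole match plus one capture group.
  assert(match_res.size() == 2);
  return BoolTypeSpec{!match_res[1].matched};
}
```

With this sketch, `parseBoolType("bool")` yields a nullable type, `parseBoolType("BOOL[NN]")` a non-nullable one, and any other string yields an empty optional.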

kurapov-peter

Added some inline questions/comments on the API. A general question: what do we provide expression comparison operators for?

omniscidb/QueryBuilder/QueryBuilder.h

kurapov-peter · 2023-01-26T17:21:44Z

omniscidb/QueryBuilder/QueryBuilder.h

+  NullSortedPosition null_pos_;
+};
+
+class BuilderNode {


Could you please elaborate on the abstractions' relation to each other and the reasoning behind them?

BuilderNode wraps hdk::ir::Node and provides an additional building interface to it. BuilderExpr does the same for hdk::ir::Expr.
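The wrapping relation can be sketched roughly as follows. This is a simplified illustration, not the code from QueryBuilder.h: `Node` and `Expr` here are hypothetical stand-ins for hdk::ir::Node and hdk::ir::Expr, and the only builder method shown is a column reference lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the IR classes (not the real HDK types).
struct Node {
  std::vector<std::string> columns;
};
struct Expr {
  std::shared_ptr<const Node> source;  // node the column comes from
  std::size_t index;                   // column position in that node
};

// Wraps an IR expression; further building methods would live here.
class BuilderExpr {
 public:
  explicit BuilderExpr(std::shared_ptr<const Expr> expr) : expr_(std::move(expr)) {
    assert(expr_ != nullptr);
  }
  const Expr& expr() const { return *expr_; }

 private:
  std::shared_ptr<const Expr> expr_;
};

// Wraps an IR node and hands out BuilderExpr objects that refer to it.
class BuilderNode {
 public:
  explicit BuilderNode(std::shared_ptr<const Node> node) : node_(std::move(node)) {
    assert(node_ != nullptr);
  }

  // Builds a reference to the column with the given name.
  BuilderExpr ref(const std::string& name) const {
    for (std::size_t i = 0; i < node_->columns.size(); ++i) {
      if (node_->columns[i] == name) {
        return BuilderExpr(std::make_shared<Expr>(Expr{node_, i}));
      }
    }
    throw std::invalid_argument("unknown column: " + name);
  }

  // Shorthand so that node["x"] is equivalent to node.ref("x").
  BuilderExpr operator[](const std::string& name) const { return ref(name); }

 private:
  std::shared_ptr<const Node> node_;
};
```

The point of the split is that the IR objects stay plain and immutable while the wrappers carry the convenience interface.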

kurapov-peter · 2023-01-26T17:36:36Z

omniscidb/QueryBuilder/QueryBuilder.h

+  InvalidQueryError(std::string desc) : Error(std::move(desc)) {}
+};
+
+class QueryBuilder;


Does a QueryBuilder belong in hdk::ir namespace and not just hdk?

QueryBuilder interfaces are purely for manipulating IR, so I guess this is the right namespace. I'm not sure about these two-level namespaces in the first place, but apparently we already have class name collisions (e.g. we have two ColumnRef classes).

kurapov-peter · 2023-01-26T17:38:14Z

omniscidb/QueryBuilder/QueryBuilder.h

+
+class QueryBuilder;
+class BuilderNode;
+class ExprRewriter;


What is the rewriter exposed for?

For the BuilderExpr::rewrite interface.

What is the rewrite interface used for? I think I saw it for input replacement, but what are the major use cases?

kurapov-peter · 2023-01-26T17:39:47Z

omniscidb/QueryBuilder/QueryBuilder.h

+  BuilderExpr rewrite(ExprRewriter& rewriter) const;
+
+  BuilderExpr operator!() const;
+  BuilderExpr operator-() const;


What does it semantically mean "to subtract two builder expressions"?

This is a unary minus. It lets you write neg_val = -node.ref("value"). There is also a binary operator that lets you build subtractions like diff = node["x"] - node["y"].
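To illustrate the difference between the two overloads, here is a small sketch of how unary and binary minus can each build a new expression tree node. The `Expr` struct and the `ref` and `toString` helpers are hypothetical and exist only for this example; the real BuilderExpr creates HDK IR expressions.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical expression tree node (not the real hdk::ir::Expr).
struct Expr {
  std::string op;    // "ref", "neg", or "minus"
  std::string name;  // column name, used by "ref" only
  std::shared_ptr<const Expr> lhs;
  std::shared_ptr<const Expr> rhs;
};
using ExprPtr = std::shared_ptr<const Expr>;

inline std::string toString(const ExprPtr& e) {
  if (e->op == "ref") return e->name;
  if (e->op == "neg") return "(-" + toString(e->lhs) + ")";
  return "(" + toString(e->lhs) + " - " + toString(e->rhs) + ")";
}

class BuilderExpr {
 public:
  explicit BuilderExpr(ExprPtr expr) : expr_(std::move(expr)) {
    assert(expr_ != nullptr);
  }

  // Hypothetical factory for a column reference.
  static BuilderExpr ref(std::string name) {
    return BuilderExpr(
        std::make_shared<Expr>(Expr{"ref", std::move(name), nullptr, nullptr}));
  }

  // Unary minus: wraps this expression into a negation node.
  BuilderExpr operator-() const {
    return BuilderExpr(std::make_shared<Expr>(Expr{"neg", "", expr_, nullptr}));
  }

  // Binary minus: builds a subtraction node from two operands.
  BuilderExpr operator-(const BuilderExpr& rhs) const {
    return BuilderExpr(std::make_shared<Expr>(Expr{"minus", "", expr_, rhs.expr_}));
  }

  std::string toString() const { return ::toString(expr_); }

 private:
  ExprPtr expr_;
};
```

Neither operator evaluates anything. Each call only records a new node, so `-(x - y)` produces a negation node whose operand is a subtraction node.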

kurapov-peter · 2023-01-26T17:50:59Z

omniscidb/QueryBuilder/QueryBuilder.h

+
+  BuilderNode scan(TableInfoPtr table_info) const;
+
+  Context& ctx_;


Is it possible to reuse a chain of operators built with a different builder? Assuming they share the schema and data source.

IR contexts were introduced for the isolation of IRs built in different threads. If you work on the same thread and have two builders working with the same context, then you can mix their expressions, but I don't see why you would need it.

So far, we don't actually manage contexts nicely in our code and always use the same default context. That means we shouldn't try to build multiple queries from different threads. I hope this will change some day and we will start creating query-scoped contexts. For Python, I think we will continue using a single global context for all queries.

kurapov-peter

Added some comments on the QueryBuilder implementation.

kurapov-peter · 2023-01-31T18:44:50Z