-
Notifications
You must be signed in to change notification settings - Fork 14
Conversation
@vlad-penkin Could you please check that TPCH data generator is added correctly from the licenses point of view? |
@leshikus I've tried to reproduce the ASAN failure but actually failed to build with enabled ASAN in conda environment with some errors in folly (and I see the same errors trying ASAN build on the main branch). Is ASAN build enabled for docker only for now? Do we use folly in ASAN build? |
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
acaa5a0
to
c012268
Compare
I've mostly used conda version for ASAN and never tried it in docker |
Signed-off-by: ienkovich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've skimmed through most of the PR except for the builder itself. Looks reasonable as far as I can tell. Just a couple comments.
return arrow::decimal(type->as<hdk::ir::DecimalType>()->precision(), | ||
// No reason to use 256-bit decimals since we always import 64-bit values. | ||
CHECK_EQ(type->size(), 8); | ||
return arrow::decimal(std::min(type->as<hdk::ir::DecimalType>()->precision(), 38), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 38
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
38 is the max precision for a 64-bit decimal. In our result, we can get higher precision especially when multiplication is used.
return null(); | ||
} | ||
// Boolean. | ||
if (std::regex_match(val_lower, match_res, std::regex("bool(\\[nn\\])?"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we define the string format formally somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. The best documentation, for now, is the appropriate test suite and docstrings in the proposed python API: https://github.com/intel-ai/hdk/pull/170/files#diff-d8b8cca193af0d6768b0e0809b22ef0871563d378ab087d5f6a95e0863019eafR1198
Signed-off-by: ienkovich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some inline questions/comments on the API. A general question: what do we provide expression comparison operators for?
NullSortedPosition null_pos_; | ||
}; | ||
|
||
class BuilderNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate on the abstractions' relation to each other and the reasoning behind them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BuilderNode
wraps hdk::ir::Node
and provides an additional building interface to it. BuilderExpr
makes the same for hdk::ir::Expr
.
InvalidQueryError(std::string desc) : Error(std::move(desc)) {} | ||
}; | ||
|
||
class QueryBuilder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does a QueryBuilder
belong in hdk::ir
namespace and not just hdk
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QueryBuilder
interfaces are purely to manipulate IR, so I guess this is the right namespace. I'm not sure about these two-level namespaces in the first place, but apparently, we already have class names intersections (e.g. we have two ColumnRef
classes)
|
||
class QueryBuilder; | ||
class BuilderNode; | ||
class ExprRewriter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rewriter exposed for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For BuilderExpr::rewrite
interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rewrite
interface used for? I think I saw it for input replacement, but what are the major use cases?
BuilderExpr rewrite(ExprRewriter& rewriter) const; | ||
|
||
BuilderExpr operator!() const; | ||
BuilderExpr operator-() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it semantically mean "to subtract two builder expressions"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a unary minus. It allows to use it as neg_val = -node.ref("value")
. There is also a binary operator which allows you to build subtractions like diff = node["x"] - node["y"]
.
|
||
BuilderNode scan(TableInfoPtr table_info) const; | ||
|
||
Context& ctx_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to reuse a chain of operators built with a different builder? Assuming they share the schema and data source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IR contexts were introduced for the isolation of IRs built in different threads. If you work on the same thread and have two builders working with the same context, then you can mix their expressions, but I don't see why you would need it.
So far, we don't actually manage contexts nicely in our code and always use the same default context. That means we shouldn't try to build multiple queries from different threads. I hope it will change some days and we start creating query-scoped contexts. For Python, I think we will continue using a single global context for all queries.
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments on the QueryBuilder
implementation.
return getNodeColumnRef(proj, (unsigned)i); | ||
} | ||
} | ||
return nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these situations not an exception in general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is handled on a higher level. This code is caused by the lack of field names in join nodes. For joins, we use the same method to search for a field in its input nodes. If we get nullptr
from all inputs then an exception is thrown.
} | ||
} | ||
|
||
// Build the resulting vector adding suffixes to auto-names when needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a auto-named expression have an empty name? If so, this can be misleading as chooseName
can add a prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say, that for non-empty names suffix is _N
, and for empty names, the suffix is expr_N
, we never add prefixes to non-empty names.
} | ||
|
||
BuilderExpr BuilderExpr::min() const { | ||
if (!expr_->type()->isNumber() && !expr_->type()->isDateTime()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't support min/max aggregate for strings.
} | ||
|
||
BuilderExpr BuilderExpr::count(bool is_distinct) const { | ||
if (!expr_->is<hdk::ir::ColumnRef>()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a scan over the whole table be represented by a column ref? I'm thinking of something similar to select (*)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, column ref is just a reference to a column in a scalar expression. E.g. SELECT x FROM table;
would have a single column reference in a projection. SELECT * FROM table;
would require as many column references, as the table has columns.
} | ||
|
||
auto kind = agg_names.at(agg_str_lower); | ||
if (kind == AggType::kApproxQuantile && val == HUGE_VAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What value is huge? 3? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HUGE_VAL
is a constant used as +INF
double value. In this function, we use this value by default to allow users to skip val
arg for most of the aggregates. And if APPROX QUANTILE
aggregate is used but val
is missing, then an exception is thrown.
|
||
BuilderExpr BuilderExpr::div(const BuilderExpr& rhs) const { | ||
try { | ||
auto bin_oper = Analyzer::normalizeOperExpr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it check for a constant div by 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Normalization only makes sure we produce an expression we can handle in our codegen. It is mostly to add required casts to operands and choose the proper type for the result. I don't know if it makes sense to check for a division by zero because just having it is not an error (expression might be conditional and never execute).
One more general question. Do we intend to allow any IR extensions? |
I don't see any reason to have IR extensions for now. We hardly can provide convenient interfaces to integrate user-defined expressions into our codegen and it's even harder to do that for nodes. UDF is currently the only way to add user-defined features. |
Signed-off-by: ienkovich <[email protected]>
Signed-off-by: ienkovich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I think I went through most of it, but we probably need yet another pair of eyes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Did we get an answer on tpch data gen licensing? @vlad-penkin ?
Signed-off-by: ienkovich <[email protected]>
This PR introduces a public API for building Relational Algebra DAG. The aim of this API is to provide an access to low-level HDK IR API but in a much more friendly way comparing to direct nodes and expressions creation via their constructors.
Many builder methods have a lot of overloads to allow a Python-like variety for input operand types. I'm not sure we need that many overloads, so we might revise it in the future and reduce it.
The coverage is not even close to 100%. For nodes it doesn't cover TableFunction and LogicalUnion. For expressions it actually doesn't cover a lot of things including string expressions, subqueries, window functions, function operator, etc. But it is still good enough to cover NYC Taxi and TPCH Q1 and Q3.
This API will be exposed in PyHDK to provide similar interfaces for Python.
The patch is quite big, but most of it is TPCH data generator (I had to add it instead of data itself because of license issues) and builder tests.