-
Notifications
You must be signed in to change notification settings - Fork 14
Conversation
Signed-off-by: ienkovich <[email protected]>
e5cd6ca
to
e543fa5
Compare
} | ||
if (count == 0) { | ||
throw InvalidQueryError() | ||
<< "Non-zero integer argument is expected for topK aggregate. Provided: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't negative topK
confusing? Sure, in python, this is common semantics, but not in C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was choosing between introducing two aggregates topK
and bottomK
and using a single aggregate for both cases. I didn't want to have == hdk::ir::AggType::kTopK || ==hdk::ir::AggType::kBottomK
everywhere instead of == hdk::ir::AggType::kTopK
. Also, non-negative values would make me add additional parameters to runtime functions and TargetInfo
for the type of aggregate. And I guess negative values would be expected to be accepted in the top-level Python API anyway. So I chose to have a single aggregate allowing negative parameters with topK
and bottomK
APIs on the builder level.
@@ -658,7 +680,7 @@ BuilderExpr BuilderExpr::lastValue() const { | |||
return {builder_, expr, name, true}; | |||
} | |||
|
|||
BuilderExpr BuilderExpr::agg(const std::string& agg_str, const BuilderExpr& arg) const { | |||
BuilderExpr BuilderExpr::agg(const std::string& agg_str, BuilderExpr arg) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it not constant anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's because of arg = arg.uminus()
for the bottom_k
case.
@@ -211,7 +212,7 @@ bool QueryMemoryDescriptor::operator==(const QueryMemoryDescriptor& other) const | |||
if (has_nulls_ != other.has_nulls_) { | |||
return false; | |||
} | |||
if (count_distinct_descriptors_.size() != count_distinct_descriptors_.size()) { | |||
if (count_distinct_descriptors_.size() != other.count_distinct_descriptors_.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice!
This patch introduces new aggregates to get top/bottom K elements. The result is a variable-length array. Nulls are excluded, so the resulting array can be empty (but never null).
Initially, I wanted to use our regular layout for variable-length arrays, but then I thought that to get the top 2 integers we would need two 8-byte slots + 8 bytes for array data. That is 3x of the actual stored data and we would need to allocate it for all entries and pay additional initialization costs. So instead of that, I used null-terminated arrays with a known max size. Also, if an array is small enough (<=8 bytes), it is inlined into the slot and we don't pay any additional memory at all. E.g. for the top 2 32-bit integers we would simply have a 8-byte slot and no additional allocations. For the top 3 32-bit integers we would have an 8-byte slot with a pointer to 12-byte array data.
To minimize the number of comparisons and copies, TopK arrays actually hold heaps. We build a min-heap for TopK and a max-heap for BottomK. Therefore insertion is log(N) copies at most. I'm not sure if it is worth the effort, because it has no benefit for a small number of elements. But it still can be beneficial for bigger arrays, so I kept the heaps. This can be revised and adjusted later when we have some real cases to analyze.
It seemed useful to not only have an array with top K elements but also have it sorted in the final. I do this sort on a fetch. We copy data on fetch anyway, so it is a good time to reverse and sort. If we want this sort to be included in execution for more accurate perf measurement, we can make it a part of the reduction. Though it would make reduction necessary even if we have a single fragment and some special reduction would be required in case of the partitioned aggregation with TopK. We can also get rid of heaps and store sorted arrays from the first place (at least for small arrays).
Similar to quantile, TopK disables columnar format. There is no real problem preventing such support, but using the columnar format for aggregations is usually a bad idea overall (at least on CPU), so I decided it doesn't worth the effort.