Skip to content
This repository was archived by the owner on May 9, 2024. It is now read-only.

Add TopK/BottomK aggregate. #606

Merged
merged 1 commit into from
Aug 7, 2023
Merged

Add TopK/BottomK aggregate. #606

merged 1 commit into from
Aug 7, 2023

Conversation

ienkovich
Copy link
Contributor

This patch introduces new aggregates to get top/bottom K elements. The result is a variable-length array. Nulls are excluded, so the resulting array can be empty (but never null).

Initially, I wanted to use our regular layout for variable-length arrays, but then I thought that to get the top 2 integers we would need two 8-byte slots + 8 bytes for array data. That is 3x of the actual stored data and we would need to allocate it for all entries and pay additional initialization costs. So instead of that, I used null-terminated arrays with a known max size. Also, if an array is small enough (<=8 bytes), it is inlined into the slot and we don't pay any additional memory at all. E.g. for the top 2 32-bit integers we would simply have a 8-byte slot and no additional allocations. For the top 3 32-bit integers we would have an 8-byte slot with a pointer to 12-byte array data.

To minimize the number of comparisons and copies, TopK arrays actually hold heaps. We build a min-heap for TopK and a max-heap for BottomK. Therefore insertion is log(N) copies at most. I'm not sure if it is worth the effort, because it has no benefit for a small number of elements. But it still can be beneficial for bigger arrays, so I kept the heaps. This can be revised and adjusted later when we have some real cases to analyze.

It seemed useful to not only have an array with top K elements but also have it sorted in the final. I do this sort on a fetch. We copy data on fetch anyway, so it is a good time to reverse and sort. If we want this sort to be included in execution for more accurate perf measurement, we can make it a part of the reduction. Though it would make reduction necessary even if we have a single fragment and some special reduction would be required in case of the partitioned aggregation with TopK. We can also get rid of heaps and store sorted arrays from the first place (at least for small arrays).

Similar to quantile, TopK disables columnar format. There is no real problem preventing such support, but using the columnar format for aggregations is usually a bad idea overall (at least on CPU), so I decided it doesn't worth the effort.

}
if (count == 0) {
throw InvalidQueryError()
<< "Non-zero integer argument is expected for topK aggregate. Provided: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't negative topK confusing? Sure, in python, this is common semantics, but not in C++.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was choosing between introducing two aggregates topK and bottomK and using a single aggregate for both cases. I didn't want to have == hdk::ir::AggType::kTopK || ==hdk::ir::AggType::kBottomK everywhere instead of == hdk::ir::AggType::kTopK. Also, non-negative values would make me add additional parameters to runtime functions and TargetInfo for the type of aggregate. And I guess negative values would be expected to be accepted in the top-level Python API anyway. So I chose to have a single aggregate allowing negative parameters with topK and bottomK APIs on the builder level.

@@ -658,7 +680,7 @@ BuilderExpr BuilderExpr::lastValue() const {
return {builder_, expr, name, true};
}

BuilderExpr BuilderExpr::agg(const std::string& agg_str, const BuilderExpr& arg) const {
BuilderExpr BuilderExpr::agg(const std::string& agg_str, BuilderExpr arg) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it not constant anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because of arg = arg.uminus() for the bottom_k case.

@@ -211,7 +212,7 @@ bool QueryMemoryDescriptor::operator==(const QueryMemoryDescriptor& other) const
if (has_nulls_ != other.has_nulls_) {
return false;
}
if (count_distinct_descriptors_.size() != count_distinct_descriptors_.size()) {
if (count_distinct_descriptors_.size() != other.count_distinct_descriptors_.size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

Copy link
Contributor

@alexbaden alexbaden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice!

@ienkovich ienkovich merged commit cd9f802 into main Aug 7, 2023
@ienkovich ienkovich deleted the ienkovich/topk branch August 7, 2023 18:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants