Add TopK/BottomK aggregate. #606

ienkovich · 2023-08-01T00:09:54Z

This patch introduces new aggregates to get top/bottom K elements. The result is a variable-length array. Nulls are excluded, so the resulting array can be empty (but never null).

Initially, I wanted to use our regular layout for variable-length arrays, but then I thought that to get the top 2 integers we would need two 8-byte slots + 8 bytes for array data. That is 3x of the actual stored data and we would need to allocate it for all entries and pay additional initialization costs. So instead of that, I used null-terminated arrays with a known max size. Also, if an array is small enough (<=8 bytes), it is inlined into the slot and we don't pay any additional memory at all. E.g. for the top 2 32-bit integers we would simply have a 8-byte slot and no additional allocations. For the top 3 32-bit integers we would have an 8-byte slot with a pointer to 12-byte array data.

To minimize the number of comparisons and copies, TopK arrays actually hold heaps. We build a min-heap for TopK and a max-heap for BottomK. Therefore insertion is log(N) copies at most. I'm not sure if it is worth the effort, because it has no benefit for a small number of elements. But it still can be beneficial for bigger arrays, so I kept the heaps. This can be revised and adjusted later when we have some real cases to analyze.

It seemed useful to not only have an array with top K elements but also have it sorted in the final. I do this sort on a fetch. We copy data on fetch anyway, so it is a good time to reverse and sort. If we want this sort to be included in execution for more accurate perf measurement, we can make it a part of the reduction. Though it would make reduction necessary even if we have a single fragment and some special reduction would be required in case of the partitioned aggregation with TopK. We can also get rid of heaps and store sorted arrays from the first place (at least for small arrays).

Similar to quantile, TopK disables columnar format. There is no real problem preventing such support, but using the columnar format for aggregations is usually a bad idea overall (at least on CPU), so I decided it doesn't worth the effort.

Signed-off-by: ienkovich <[email protected]>

kurapov-peter · 2023-08-04T16:46:19Z

omniscidb/QueryBuilder/QueryBuilder.cpp

+  }
+  if (count == 0) {
+    throw InvalidQueryError()
+        << "Non-zero integer argument is expected for topK aggregate. Provided: "


Isn't negative topK confusing? Sure, in python, this is common semantics, but not in C++.

I was choosing between introducing two aggregates topK and bottomK and using a single aggregate for both cases. I didn't want to have == hdk::ir::AggType::kTopK || ==hdk::ir::AggType::kBottomK everywhere instead of == hdk::ir::AggType::kTopK. Also, non-negative values would make me add additional parameters to runtime functions and TargetInfo for the type of aggregate. And I guess negative values would be expected to be accepted in the top-level Python API anyway. So I chose to have a single aggregate allowing negative parameters with topK and bottomK APIs on the builder level.

kurapov-peter · 2023-08-04T16:50:20Z

omniscidb/QueryBuilder/QueryBuilder.cpp

@@ -658,7 +680,7 @@ BuilderExpr BuilderExpr::lastValue() const {
  return {builder_, expr, name, true};
 }

-BuilderExpr BuilderExpr::agg(const std::string& agg_str, const BuilderExpr& arg) const {
+BuilderExpr BuilderExpr::agg(const std::string& agg_str, BuilderExpr arg) const {


Why is it not constant anymore?

That's because of arg = arg.uminus() for the bottom_k case.

kurapov-peter · 2023-08-04T17:01:09Z

omniscidb/ResultSet/QueryMemoryDescriptor.cpp

@@ -211,7 +212,7 @@ bool QueryMemoryDescriptor::operator==(const QueryMemoryDescriptor& other) const
  if (has_nulls_ != other.has_nulls_) {
    return false;
  }
-  if (count_distinct_descriptors_.size() != count_distinct_descriptors_.size()) {
+  if (count_distinct_descriptors_.size() != other.count_distinct_descriptors_.size()) {


alexbaden

Looks nice!

ienkovich requested review from alexbaden and kurapov-peter August 1, 2023 00:09

Add TopK/BottomK aggregate.

e543fa5

Signed-off-by: ienkovich <[email protected]>

ienkovich force-pushed the ienkovich/topk branch from e5cd6ca to e543fa5 Compare August 1, 2023 01:04

kurapov-peter reviewed Aug 4, 2023

View reviewed changes

alexbaden approved these changes Aug 7, 2023

View reviewed changes

ienkovich merged commit cd9f802 into main Aug 7, 2023

ienkovich deleted the ienkovich/topk branch August 7, 2023 18:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add TopK/BottomK aggregate. #606

Add TopK/BottomK aggregate. #606

ienkovich commented Aug 1, 2023

kurapov-peter Aug 4, 2023

ienkovich Aug 4, 2023

kurapov-peter Aug 4, 2023

ienkovich Aug 4, 2023

kurapov-peter Aug 4, 2023

alexbaden left a comment

Add TopK/BottomK aggregate. #606

Add TopK/BottomK aggregate. #606

Conversation

ienkovich commented Aug 1, 2023

kurapov-peter Aug 4, 2023

Choose a reason for hiding this comment

ienkovich Aug 4, 2023

Choose a reason for hiding this comment

kurapov-peter Aug 4, 2023

Choose a reason for hiding this comment

ienkovich Aug 4, 2023

Choose a reason for hiding this comment

kurapov-peter Aug 4, 2023

Choose a reason for hiding this comment

alexbaden left a comment

Choose a reason for hiding this comment