Contributor

@haohuaijin haohuaijin commented Jan 5, 2026

Which issue does this PR close?

Rationale for this change

See issue #19638.

What changes are included in this PR?

  1. Introduced a LimitOptions struct for the limit field that carries both the limit and an optional descending ordering direction
  2. Extended the TopKAggregation optimizer rule to DISTINCT queries: it now recognizes GROUP BY queries without aggregates and sets the descending flag based on the ordering direction
  3. Enhanced GroupedTopKAggregateStream to handle DISTINCT by using the group key as both the priority queue key and value
  4. Updated the Proto definitions to add an optional descending field to the AggLimit message for serialization/deserialization
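
As a rough illustration of item 1, a limit that also records an ordering direction might look like the sketch below. The field names, types, and constructor methods here are assumptions made for illustration; they are not taken from the PR's actual definition.

```rust
/// Hypothetical sketch of a limit that also records the sort direction.
/// The struct in the PR may use different names and types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LimitOptions {
    /// Maximum number of groups to keep.
    pub limit: usize,
    /// `Some(true)` for descending order, `Some(false)` for ascending,
    /// `None` when no ordering direction applies.
    pub descending: Option<bool>,
}

impl LimitOptions {
    /// Plain limit with no ordering direction attached.
    pub fn new(limit: usize) -> Self {
        Self { limit, descending: None }
    }

    /// Attach an ordering direction (assumed builder-style helper).
    pub fn with_descending(mut self, descending: bool) -> Self {
        self.descending = Some(descending);
        self
    }
}
```

Keeping the direction optional lets existing callers that only know a row limit keep working, while the DISTINCT path can set the flag from the ORDER BY direction.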

benchmark result

[image: benchmark results]

Are these changes tested?

Yes, added a test case in aggregates_topk.slt.

Are there any user-facing changes?

No.

@github-actions github-actions bot added labels on Jan 5, 2026: logical-expr (Logical plan and expressions), optimizer (Optimizer rules), core (Core DataFusion crate), sqllogictest (SQL Logic Tests, .slt), proto (Related to proto crate), physical-plan (Changes to the physical-plan crate)
@github-actions github-actions bot removed the logical-expr (Logical plan and expressions) label on Jan 5, 2026
Comment on lines +214 to +219
let mut cols = self.priority_map.emit()?;
// For DISTINCT case (no aggregate expressions), only use the group key column
// since the schema only has one field and key/value are the same
if self.aggregate_arguments.is_empty() {
    cols.truncate(1);
}
Contributor Author

We can further improve this part. For a query such as

select distinct id from t order by id limit 10

there are no aggregates, so we only need to maintain the top-k heap and can skip the group keys.
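
To make the idea concrete, here is a minimal standalone sketch of a DISTINCT ... ORDER BY ... LIMIT evaluation that keeps only a bounded ordered set and no separate group-key table. It is not DataFusion code: the function name distinct_top_k is hypothetical, and a BTreeSet stands in for the top-k heap because it deduplicates and orders values in one structure.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the first `limit` distinct values in the
/// requested order, holding at most `limit + 1` values at any time.
fn distinct_top_k(values: &[i64], limit: usize, descending: bool) -> Vec<i64> {
    let mut kept: BTreeSet<i64> = BTreeSet::new();
    for &v in values {
        // Inserting a value that is already present is a no-op,
        // which gives DISTINCT semantics without a group-key map.
        kept.insert(v);
        if kept.len() > limit {
            // Evict the value that sorts last in the requested order.
            if descending {
                kept.pop_first();
            } else {
                kept.pop_last();
            }
        }
    }
    if descending {
        kept.into_iter().rev().collect()
    } else {
        kept.into_iter().collect()
    }
}
```

For the input [5, 3, 9, 3, 1, 7, 1, 8] with a limit of 3, this returns [1, 3, 5] in ascending order and [9, 8, 7] in descending order. Memory stays proportional to the limit rather than to the number of distinct values.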

Contributor

@kosiew kosiew left a comment

LGTM!

Contributor Author

haohuaijin commented Jan 14, 2026

Thanks for your review, @kosiew. I have already applied the suggestion.

Contributor

@alamb alamb left a comment

/// Create deterministic data for DISTINCT benchmarks with predictable trace_ids
/// This ensures consistent results across benchmark runs
#[allow(dead_code)]
Contributor

Why is this marked dead?

Contributor Author

Without this, clippy emits a warning.

Contributor

🤔 -- I am not sure why but maybe it has to do with feature flags or something

Contributor

I think it is code that is only used for benchmarks that I wasn't sure how else to allow. Perhaps it needs to be moved into the benchmark?

Labels

core (Core DataFusion crate), optimizer (Optimizer rules), physical-plan (Changes to the physical-plan crate), proto (Related to proto crate), sqllogictest (SQL Logic Tests, .slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support DISTINCT ORDER BY LIMIT query use GroupedTopKAggregateStream

4 participants