feat: support `SELECT DISTINCT id FROM t ORDER BY id LIMIT n` query use GroupedTopKAggregateStream #19653

haohuaijin · 2026-01-05T15:53:04Z

Which issue does this PR close?

Closes #19638 (Support DISTINCT ORDER BY LIMIT query use GroupedTopKAggregateStream)

Rationale for this change

see issue #19638

What changes are included in this PR?

- Introduced a `LimitOptions` struct for the limit field, carrying both the limit and an optional descending ordering direction
- Extended the `TopKAggregation` optimizer rule to DISTINCT queries by recognizing GROUP BY queries without aggregates and setting the descending flag based on the ordering direction
- Enhanced `GroupedTopKAggregateStream` to handle DISTINCT by using the group key as both the priority queue key and value
- Updated the Proto definitions to add an optional descending field to the `AggLimit` message for serialization/deserialization
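The rule extension described above can be pictured with a minimal sketch. Everything here is hypothetical and only illustrates the decision: the field names of `LimitOptions`, the simplified `AggSortLimit` plan description, and the restriction to a single group key that matches the sort column are assumptions, not the PR's actual code.

```rust
/// Hypothetical stand-in for the PR's `LimitOptions`; field names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LimitOptions {
    /// Maximum number of groups to emit.
    pub limit: usize,
    /// `Some(true)` for ORDER BY ... DESC, `Some(false)` for ASC.
    pub descending: Option<bool>,
}

/// Hypothetical, heavily simplified view of an `Aggregate -> Sort -> Limit` plan.
pub struct AggSortLimit<'a> {
    pub group_by: &'a [&'a str],
    pub aggregate_count: usize,
    pub order_by: &'a str,
    pub descending: bool,
    pub fetch: Option<usize>,
}

/// Returns limit options when the plan looks like
/// `SELECT DISTINCT key ... ORDER BY key LIMIT n`.
pub fn distinct_limit_options(plan: &AggSortLimit<'_>) -> Option<LimitOptions> {
    // No LIMIT means there is nothing to push down.
    let limit = plan.fetch?;
    // DISTINCT shows up as a GROUP BY with no aggregate expressions.
    if plan.aggregate_count != 0 {
        return None;
    }
    // Assumption for this sketch: exactly one group key, equal to the sort column.
    match plan.group_by {
        [key] if *key == plan.order_by => Some(LimitOptions {
            limit,
            descending: Some(plan.descending),
        }),
        _ => None,
    }
}
```

The descending flag travels with the limit so that the stream knows which end of the ordering to keep.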

benchmark result

Are these changes tested?

Yes, added a test case in aggregates_topk.slt

Are there any user-facing changes?

no


haohuaijin · 2026-01-10T08:23:12Z

datafusion/physical-plan/src/aggregates/topk_stream.rs

+                        let mut cols = self.priority_map.emit()?;
+                        // For DISTINCT case (no aggregate expressions), only use the group key column
+                        // since the schema only has one field and key/value are the same
+                        if self.aggregate_arguments.is_empty() {
+                            cols.truncate(1);
+                        }


We can improve this part further. For the query

select distinct id from t order by id limit 10

there is no aggregate, so we only need to maintain the top-K heap and can skip tracking the group keys separately.
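To illustrate the idea in this comment, here is a minimal standalone sketch of a distinct top-K that keeps only one bounded ordered structure, with no separate key-to-value map. It uses a standard-library `BTreeSet` in place of DataFusion's priority map, and the function name `distinct_top_k` is made up for this example; it is not how `GroupedTopKAggregateStream` is implemented.

```rust
use std::collections::BTreeSet;

/// Returns the first `limit` distinct ids in ascending order, or in
/// descending order when `descending` is true.
pub fn distinct_top_k(ids: &[i64], limit: usize, descending: bool) -> Vec<i64> {
    // The set holds at most `limit` ids; the id is both the key and the value.
    let mut kept: BTreeSet<i64> = BTreeSet::new();
    for &id in ids {
        // `insert` returns false for duplicates, which need no further work.
        if !kept.insert(id) {
            continue;
        }
        if kept.len() > limit {
            // Evict the id that sorts last in the requested order:
            // the smallest for DESC, the largest for ASC.
            if descending {
                kept.pop_first();
            } else {
                kept.pop_last();
            }
        }
    }
    if descending {
        kept.into_iter().rev().collect()
    } else {
        kept.into_iter().collect()
    }
}
```

For example, `distinct_top_k(&[5, 3, 5, 1, 3, 9, 1], 3, false)` keeps `1, 3, 5`, and with `descending` set to true it keeps `9, 5, 3`.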

kosiew

LGTM!

datafusion/core/benches/topk_aggregate.rs

haohuaijin · 2026-01-14T14:17:10Z

Thanks for your review @kosiew, I have applied the suggestion.

alamb

FYI @avantgardnerio

alamb · 2026-01-16T12:18:56Z

datafusion/core/benches/data_utils/mod.rs

+
+/// Create deterministic data for DISTINCT benchmarks with predictable trace_ids
+/// This ensures consistent results across benchmark runs
+#[allow(dead_code)]


Why is this marked dead?

Without this, clippy emits a warning.

🤔 -- I am not sure why but maybe it has to do with feature flags or something

I think it is code that is only used for benchmarks that I wasn't sure how else to allow. Perhaps it needs to be moved into the benchmark?
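One possible explanation, offered only as a guess since the thread does not settle it: when a helper module is compiled into several separate targets, any target that does not call a given helper may flag it as unused. The sketch below reproduces that shape in a single file; the module contents and function names are hypothetical and are not taken from `data_utils/mod.rs`.

```rust
/// Stand-in for a helper module shared by several targets (contents hypothetical).
mod data_utils {
    /// Helper that this target actually calls.
    pub fn make_ids(n: u64) -> Vec<u64> {
        (0..n).collect()
    }

    /// Helper intended for a different target. If nothing in this target
    /// calls it, the build may report it as dead code unless the lint is
    /// allowed explicitly.
    #[allow(dead_code)]
    pub fn make_distinct_ids(n: u64, cardinality: u64) -> Vec<u64> {
        (0..n).map(|i| i % cardinality).collect()
    }
}
```

Moving such a helper into the one benchmark that uses it, as suggested above, would make the attribute unnecessary.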

feat: support SELECT DISTINCT id FROM t ORDER BY id LIMIT n query use…

af19d9f

… GroupedTopKAggregateStream

github-actions bot added the following labels on Jan 5, 2026: logical-expr (Logical plan and expressions), optimizer (Optimizer rules), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), proto (Related to proto crate), physical-plan (Changes to the physical-plan crate)

haohuaijin added 2 commits January 5, 2026 23:53

Merge branch 'main' into topk-distinct

f8d8ac7

update

a790c43

github-actions bot removed the logical-expr (Logical plan and expressions) label on Jan 5, 2026

haohuaijin added 5 commits January 7, 2026 14:20

Merge branch 'main' into topk-distinct

8862990

Merge branch 'main' into topk-distinct

b6dca88

Merge branch 'main' into topk-distinct

5fffc64

fix merge issue

c2e7c33

update

da76832

kosiew approved these changes Jan 14, 2026

haohuaijin added 2 commits January 14, 2026 22:16

apply suggestion

0ea86bc

Merge branch 'main' into topk-distinct

8fd0376

Merge branch 'main' into topk-distinct

f99e51b

alamb reviewed Jan 16, 2026
