@yhuang-db
Contributor

What changes were proposed in this pull request?

This PR adds a `nullCounter` alongside the Frequent Items sketch used by the `approx_top_k` aggregation, so that the function returns a NULL item together with its count whenever NULL is among the top k most frequent items.
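
For illustration only, the idea can be sketched in plain Scala. This is not the code in this PR: `NullAwareTopK` is a hypothetical name, an exact `Map` stands in for the approximate Frequent Items sketch, and NULL is modeled as `None`.

```scala
// Hypothetical stand-in for the aggregate buffer: exact counts in a Map
// replace the approximate sketch, and a separate Long tracks NULL inputs.
final case class NullAwareTopK[T](
    counts: Map[T, Long] = Map.empty[T, Long],
    nullCount: Long = 0L) {

  // A NULL input (None) only bumps the null counter; other values go to the map.
  def update(item: Option[T]): NullAwareTopK[T] = item match {
    case Some(v) => copy(counts = counts.updated(v, counts.getOrElse(v, 0L) + 1L))
    case None => copy(nullCount = nullCount + 1L)
  }

  // NULL competes with the other items and is returned only if it ranks in the top k.
  def topK(k: Int): Seq[(Option[T], Long)] = {
    val nonNull: Seq[(Option[T], Long)] = counts.toSeq.map { case (v, c) => (Some(v), c) }
    val candidates = if (nullCount > 0) nonNull :+ ((None, nullCount)) else nonNull
    candidates.sortBy { case (_, c) => -c }.take(k)
  }
}

object NullAwareTopKDemo extends App {
  val rows: Seq[Option[String]] = Seq(Some("a"), None, Some("b"), None, None, Some("b"))
  val buffer = rows.foldLeft(NullAwareTopK[String]())((b, r) => b.update(r))
  // NULL (None) was seen 3 times, "b" twice, "a" once, so the top 2 are NULL and "b".
  println(buffer.topK(2))
}
```

Keeping the null count outside the item structure means NULL never has to be stored as a key, which is presumably why a separate counter is used here.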

Why are the changes needed?

NULL values can be meaningful in some use cases, and users may want NULL included in the `approx_top_k` output.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests on handling null values.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Oct 18, 2025
@HyukjinKwon changed the title from "[SPARK-53947][SQL]Count null in approx_top_k" to "[SPARK-53947][SQL] Count null in approx_top_k" Oct 20, 2025
- override def update(buffer: ItemsSketch[Any], input: InternalRow): ItemsSketch[Any] =
-   ApproxTopK.updateSketchBuffer(expr, buffer, input)
+ override def update(buffer: ApproxTopKAggregateBuffer[Any], input: InternalRow):
+ ApproxTopKAggregateBuffer[Any] =
Member

nit: indent

override def merge(
buffer: ApproxTopKAggregateBuffer[Any],
input: ApproxTopKAggregateBuffer[Any]):
ApproxTopKAggregateBuffer[Any] =
Member

nit: indent

}

- test("SPARK-52515: does not count NULL values") {
+ test("SPARK-52515: count NULL values") {

checkAnswer(res, Row(Seq(Row("b", 3), Row("a", 2))))
}

test("SPARK-52515: null is the last in top k") {

@gengliangwang
Member

Thanks, merging to master

gengliangwang pushed a commit that referenced this pull request Oct 21, 2025
…e NULLs

### What changes were proposed in this pull request?

As a follow-up of #52655, add NULL handling in approx_top_k_accumulate/estimate/combine.
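
As a rough sketch of what NULL handling across the three functions could look like (an assumption for illustration, not the implementation in this commit): `SketchState`, `accumulate`, `combine`, and `estimate` below are hypothetical names, and an exact `Map` replaces the real sketch.

```scala
// Hypothetical stand-in for the intermediate state produced by accumulate.
final case class SketchState(counts: Map[String, Long], nullCount: Long)

object TopKPipeline {
  // accumulate: fold one group of rows into a state, counting NULLs (None) separately.
  def accumulate(rows: Seq[Option[String]]): SketchState =
    rows.foldLeft(SketchState(Map.empty, 0L)) {
      case (s, Some(v)) => s.copy(counts = s.counts.updated(v, s.counts.getOrElse(v, 0L) + 1L))
      case (s, None) => s.copy(nullCount = s.nullCount + 1L)
    }

  // combine: merge two states by adding item counts and adding the null counters.
  def combine(a: SketchState, b: SketchState): SketchState = {
    val keys = a.counts.keySet ++ b.counts.keySet
    val merged = keys.map(k => k -> (a.counts.getOrElse(k, 0L) + b.counts.getOrElse(k, 0L))).toMap
    SketchState(merged, a.nullCount + b.nullCount)
  }

  // estimate: rank NULL together with the other items and keep the k most frequent.
  def estimate(s: SketchState, k: Int): Seq[(Option[String], Long)] = {
    val nonNull: Seq[(Option[String], Long)] = s.counts.toSeq.map { case (v, c) => (Some(v), c) }
    val all = if (s.nullCount > 0) nonNull :+ ((None, s.nullCount)) else nonNull
    all.sortBy(-_._2).take(k)
  }
}
```

In this sketch the null count has to travel with the state through every stage; if `combine` dropped it, NULLs counted during `accumulate` would be lost before `estimate` runs.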

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit tests on null handling for accumulate, combine and estimate.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #52673 from yhuang-db/accumulate_estimate_count_null.

Authored-by: yhuang-db <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
Closes apache#52655 from yhuang-db/approx_top_k_count_null.

Authored-by: yhuang-db <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…e NULLs

Closes apache#52673 from yhuang-db/accumulate_estimate_count_null.

Authored-by: yhuang-db <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>