-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-53947][SQL] Count null in approx_top_k #52655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| override def update(buffer: ItemsSketch[Any], input: InternalRow): ItemsSketch[Any] = | ||
| ApproxTopK.updateSketchBuffer(expr, buffer, input) | ||
| override def update(buffer: ApproxTopKAggregateBuffer[Any], input: InternalRow): | ||
| ApproxTopKAggregateBuffer[Any] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
| override def merge( | ||
| buffer: ApproxTopKAggregateBuffer[Any], | ||
| input: ApproxTopKAggregateBuffer[Any]): | ||
| ApproxTopKAggregateBuffer[Any] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
| } | ||
|
|
||
| test("SPARK-52515: does not count NULL values") { | ||
| test("SPARK-52515: count NULL values") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| checkAnswer(res, Row(Seq(Row("b", 3), Row("a", 2)))) | ||
| } | ||
|
|
||
| test("SPARK-52515: null is the last in top k") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Thanks, merging to master |
…e NULLs ### What changes were proposed in this pull request? As a follow-up of #52655, add NULL handling in approx_top_k_accumulate/estimate/combine. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit tests on null handling for accumulate, combine and estimate. ### Was this patch authored or co-authored using generative AI tooling? No Closes #52673 from yhuang-db/accumulate_estimate_count_null. Authored-by: yhuang-db <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request? This PR proposes to add a nullCounter associated with the Frequent Item Sketch in `approx_top_k` aggregation, so that now the function will return null item and null count if NULL value is among the top_k frequent items. ### Why are the changes needed? NULL value could be meaningful in some use cases and users might want to include NULL in the approx_top_k output. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests on handling null values. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52655 from yhuang-db/approx_top_k_count_null. Authored-by: yhuang-db <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
…e NULLs ### What changes were proposed in this pull request? As a follow-up of apache#52655, add NULL handling in approx_top_k_accumulate/estimate/combine. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit tests on null handling for accumulate, combine and estimate. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52673 from yhuang-db/accumulate_estimate_count_null. Authored-by: yhuang-db <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
What changes were proposed in this pull request?
This PR proposes to add a nullCounter associated with the Frequent Item Sketch in
approx_top_kaggregation, so that now the function will return null item and null count if NULL value is among the top_k frequent items.Why are the changes needed?
NULL value could be meaningful in some use cases and users might want to include NULL in the approx_top_k output.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit tests on handling null values.
Was this patch authored or co-authored using generative AI tooling?
No.