-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17641][SQL] Collect_list/Collect_set should not collect null values. #15208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @mengxr |
|
Test build #65806 has finished for PR 15208 at commit
|
sameeragarwal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending jenkins
| override def update(b: MutableRow, input: InternalRow): Unit = { | ||
| buffer += child.eval(input) | ||
| val value = child.eval(input) | ||
| if (value != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be great to add a comment here that this mimics the hive semantics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
Test build #65992 has finished for PR 15208 at commit
|
|
retest this please |
|
Test build #66000 has finished for PR 15208 at commit
|
|
Merging in master/2.0. |
…alues. ## What changes were proposed in this pull request? We added native versions of `collect_set` and `collect_list` in Spark 2.0. These currently also (try to) collect null values, this is different from the original Hive implementation. This PR fixes this by adding a null check to the `Collect.update` method. ## How was this patch tested? Added a regression test to `DataFrameAggregateSuite`. Author: Herman van Hovell <[email protected]> Closes #15208 from hvanhovell/SPARK-17641. (cherry picked from commit 7d09232) Signed-off-by: Reynold Xin <[email protected]>
What changes were proposed in this pull request?
We added native versions of
collect_setandcollect_listin Spark 2.0. These currently also (try to) collect null values, this is different from the original Hive implementation. This PR fixes this by adding a null check to theCollect.updatemethod.How was this patch tested?
Added a regression test to
DataFrameAggregateSuite.