Skip to content

Conversation

@hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

We added native versions of collect_set and collect_list in Spark 2.0. These currently also (try to) collect null values, this is different from the original Hive implementation. This PR fixes this by adding a null check to the Collect.update method.

How was this patch tested?

Added a regression test to DataFrameAggregateSuite.

@hvanhovell
Copy link
Contributor Author

cc @mengxr

@SparkQA
Copy link

SparkQA commented Sep 23, 2016

Test build #65806 has finished for PR 15208 at commit 37c4539.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@sameeragarwal sameeragarwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending jenkins

override def update(b: MutableRow, input: InternalRow): Unit = {
buffer += child.eval(input)
val value = child.eval(input)
if (value != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be great to add a comment here that this mimics the hive semantics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@SparkQA
Copy link

SparkQA commented Sep 27, 2016

Test build #65992 has finished for PR 15208 at commit 80b2166.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 27, 2016

Test build #66000 has finished for PR 15208 at commit 80b2166.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Sep 28, 2016

Merging in master/2.0.

asfgit pushed a commit that referenced this pull request Sep 28, 2016
…alues.

## What changes were proposed in this pull request?
We added native versions of `collect_set` and `collect_list` in Spark 2.0. These currently also (try to) collect null values, this is different from the original Hive implementation. This PR fixes this by adding a null check to the `Collect.update` method.

## How was this patch tested?
Added a regression test to `DataFrameAggregateSuite`.

Author: Herman van Hovell <[email protected]>

Closes #15208 from hvanhovell/SPARK-17641.

(cherry picked from commit 7d09232)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in 7d09232 Sep 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants