Skip to content

Conversation

@AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu AngersZhuuuu commented Sep 14, 2021

What changes were proposed in this pull request?

For query

select array_distinct(array(cast('nan' as double), cast('nan' as double)))

This returns [NaN, NaN], but it should return [NaN].
This issue is caused by OpenHashSet can't handle Double.NaN and Float.NaN too.
In this pr fix this based on #33955

Why are the changes needed?

Fix bug

Does this PR introduce any user-facing change?

ArrayDistinct won't show duplicated NaN value

How was this patch tested?

Added UT

@github-actions github-actions bot added the SQL label Sep 14, 2021
@AngersZhuuuu AngersZhuuuu changed the title [SPARK-36742][SQL] ArrayDistinct handle duplicated Double.NaN and Float.Nan [SPARK-36741][SQL] ArrayDistinct handle duplicated Double.NaN and Float.Nan Sep 14, 2021
@SparkQA
Copy link

SparkQA commented Sep 14, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47765/

@SparkQA
Copy link

SparkQA commented Sep 14, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47770/

@SparkQA
Copy link

SparkQA commented Sep 14, 2021

Test build #143262 has finished for PR 33993 at commit 0ac9924.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val elem = array.get(i, elementType)
if (isNaN(elem)) {
if (!hs.containsNaN) {
arrayBuffer += elem
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this, let's wait for the decision at the first PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this, let's wait for the decision at the first PR.

Done

@SparkQA
Copy link

SparkQA commented Sep 14, 2021

Test build #143267 has finished for PR 33993 at commit 63763df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

ping @cloud-fan @dongjoon-hyun

}

val processArray = withArrayNullAssignment(
def withNaNCheck(body: String): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we move these codegen utils to SQLOpenHashSet as well to reduce duplicated code?

Copy link
Contributor Author

@AngersZhuuuu AngersZhuuuu Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we move these codegen utils to SQLOpenHashSet as well to reduce duplicated code?

How about to do this after all done. Not only this can be refactored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Updated, How about current?

@SparkQA
Copy link

SparkQA commented Sep 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47809/

@SparkQA
Copy link

SparkQA commented Sep 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47809/

@SparkQA
Copy link

SparkQA commented Sep 15, 2021

Test build #143307 has finished for PR 33993 at commit f5c5452.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47848/

@SparkQA
Copy link

SparkQA commented Sep 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47848/

}

def valueNaN(dataType: DataType): Any = {
def isNaNFuncAndValueNaN(dataType: DataType, valueName: String): Option[(String, String)] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we can remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, Done

@SparkQA
Copy link

SparkQA commented Sep 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47850/

@SparkQA
Copy link

SparkQA commented Sep 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47850/

@SparkQA
Copy link

SparkQA commented Sep 16, 2021

Test build #143345 has finished for PR 33993 at commit 202cf4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47851/

@SparkQA
Copy link

SparkQA commented Sep 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47867/

@SparkQA
Copy link

SparkQA commented Sep 16, 2021

Test build #143360 has finished for PR 33993 at commit 389c9fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47892/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47894/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47892/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47894/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47895/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47895/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47902/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47902/

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Test build #143385 has finished for PR 33993 at commit d8e80fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Test build #143387 has finished for PR 33993 at commit 21e9422.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 17, 2021

Test build #143391 has finished for PR 33993 at commit 434a2be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

ping @cloud-fan

@cloud-fan cloud-fan closed this in e356f6a Sep 17, 2021
cloud-fan pushed a commit that referenced this pull request Sep 17, 2021
…at.Nan

### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value

### How was this patch tested?
Added UT

Closes #33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Sep 17, 2021
…at.Nan

### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value

### How was this patch tested?
Added UT

Closes #33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Sep 17, 2021
…at.Nan

For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955

Fix bug

ArrayDistinct won't show duplicated `NaN` value

Added UT

Closes #33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to master/3.2/3.1/3.0!

fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…at.Nan

### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on apache#33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value

### How was this patch tested?
Added UT

Closes apache#33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants