Add set based operations for arrays: array_intersect, array_union, array_except, and arrays_overlap for running on GPU
#5958
Conversation
…t still needed Signed-off-by: Navin Kumar <[email protected]>
```
'arrays_overlap(array(), b)',
'arrays_overlap(a, a)',
)
)
```
There should be an empty line at the end of the file as an implicit convention.
Another example:

What's wrong with my …
@NVnavkumar Did you generate your last example by running on the GPU? I did mine on the CPU. If the correct behavior for comparing NaNs is that they are always considered unequal, then I need to fix the JNI layer to pass in the correct NaN comparison parameter.

Both examples were on the CPU, Spark 3.2.1.

Tried with Spark 3.1.3. But in Spark 3.1.1, there is this behavior:

Do we have any config for this? If we can retrieve the config in Spark, then we can select the right parameter to call the cudf API. Similar to this: spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala, line 3368 at commit f2d6157.
This is interesting, because in the JDK itself, `Double.equals` considers NaN equal to NaN. So I guess Spark wasn't using it before 3.1.3.
It looks like they switched to SQLOpenHashSet in 3.1.3 and added the NaN check. Here's the PR: apache/spark#33955, and the original Spark issue with NaN: https://issues.apache.org/jira/browse/SPARK-36702
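To make the version difference concrete, here is a small Python model of the two NaN semantics under discussion. This is a sketch only, not the plugin or Spark code; the `intersect` helper and its `nan_equal` flag are hypothetical. `nan_equal=False` mimics raw IEEE-754 comparison (the Spark 3.1.1/3.1.2 behavior described above, where NaN never matches), and `nan_equal=True` mimics the SQLOpenHashSet behavior in Spark 3.1.3+, where NaN is treated as equal to itself.

```python
import math

def intersect(a, b, nan_equal):
    """Hypothetical model of array_intersect with configurable NaN semantics."""
    def contains(xs, v):
        for x in xs:
            # Optionally treat NaN as equal to NaN (Spark >= 3.1.3 behavior).
            if nan_equal and isinstance(x, float) and isinstance(v, float) \
                    and math.isnan(x) and math.isnan(v):
                return True
            if x == v:  # IEEE-754: NaN == NaN is always False
                return True
        return False

    out = []
    for v in a:
        # Keep elements of a that appear in b, de-duplicated (set semantics).
        if contains(b, v) and not contains(out, v):
            out.append(v)
    return out

nan = float('nan')
print(intersect([1.0, nan], [nan, 2.0], nan_equal=False))  # [] : NaNs never match
print(intersect([1.0, nan], [nan, 2.0], nan_equal=True))   # [nan]: NaNs match
```

Under this model, the cudf NaN comparison parameter mentioned above selects between the two branches of `contains`.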
I think I will try to xfail the test for 3.1.1 and 3.1.2 and file a follow-up issue for the NaN case.
build

build
revans2 left a comment:
My main concern is documentation. We have some new functionality that is not 100% compatible with Spark. We should, at a minimum, document that it is not compatible and exactly in what ways. Ideally, if there are issues with how Spark does something (like with -0.0) and we have filed an issue with Spark, we should include all of that in the documentation.
build
Fixes #4932, #5228, #5188, and #5222. Requires rapidsai/cudf#11043 and rapidsai/cudf#11143 to be merged.
This implements the SQL functions for set-based operations on arrays: `array_intersect`, `array_union`, `array_except`, and `arrays_overlap` for running on the GPU. A few caveats:

- There is a minor bug ([BUG] Exception calling `collect()` when partitioning with arrays with null values using `array_union(...)`, #5957) that came up when testing `array_union`, and it looks like a bug in the partitioning code when using `collect()`.
- The 3 operations `array_intersect`, `array_union`, and `array_except` cannot guarantee the same element order when run on the GPU vs the CPU (because they are set-based operations), so we wrap these operations in a `sort_array(...)` call for testing purposes.
- `arrays_overlap` returns a boolean result value, so there is no need to sort, but the implementation is complicated by the requirements of the function (see https://spark.apache.org/docs/3.2.1/api/sql/index.html#arrays_overlap): "arrays_overlap(a1, a2) - Returns true if a1 contains at least a non-null element present also in a2. If the arrays have no common element and they are both non-empty and either of them contains a null element, null is returned, false otherwise."
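The three-valued `arrays_overlap` contract quoted above can be modeled in a few lines of Python (a sketch only, using `None` to stand in for SQL NULL; the actual GPU implementation runs through cudf and the JNI layer):

```python
def arrays_overlap(a1, a2):
    """Model of the Spark SQL arrays_overlap contract quoted above.

    True  -> a1 and a2 share at least one non-null element.
    None  -> no common element, both arrays non-empty, and either
             side contains a null (SQL NULL result).
    False -> otherwise (including when either array is empty).
    """
    s2 = {x for x in a2 if x is not None}
    if any(x is not None and x in s2 for x in a1):
        return True
    if a1 and a2 and (None in a1 or None in a2):
        return None
    return False

print(arrays_overlap([1, 2], [2, 3]))     # True: common element 2
print(arrays_overlap([1, None], [3, 4]))  # None: no overlap, but a null present
print(arrays_overlap([], [None]))         # False: an empty array short-circuits
```

The middle case is why the result column needs a validity mask rather than a plain boolean, and why no `sort_array(...)` normalization is needed when testing this function.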