Add set based operations for arrays: array_intersect, array_union, array_except, and arrays_overlap for running on GPU
#5958
Conversation
…t still needed Signed-off-by: Navin Kumar <[email protected]>
```
'arrays_overlap(array(), b)',
'arrays_overlap(a, a)',
)
)
```
There should be an empty line at the end of the file as an implicit convention.
Another example:

What's wrong with my …
@NVnavkumar Did you generate your last example by running on the GPU? I did mine on the CPU. If the correct behavior for comparing NaNs is that they are always considered unequal, then I need to fix the JNI layer to pass in the correct NaN comparison parameter.

Both examples were on the CPU, Spark 3.2.1.

Tried with Spark 3.1.3. But in Spark 3.1.1, there is this behavior:

Do we have any config for this? If we can retrieve the config in Spark, then we can select the right parameter to call the cudf API. Similar to this: spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala, line 3368 at commit f2d6157.
This is interesting, because in the JDK itself, `Double.equals` considers NaN equal to NaN. So I guess Spark wasn't using it before 3.1.3.
It looks like they switched to SQLOpenHashSet in 3.1.3 and added the NaN check. Here's the PR: apache/spark#33955, and the original Spark issue with NaN: https://issues.apache.org/jira/browse/SPARK-36702
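To make the version difference concrete, here is a small Python model of the two NaN semantics under discussion. This is a sketch only, not the plugin or Spark code; the `intersect` helper and its `nan_equal` flag are hypothetical. `nan_equal=False` mimics raw IEEE-754 comparison (the Spark 3.1.1/3.1.2 behavior described above, where NaN never matches), and `nan_equal=True` mimics the SQLOpenHashSet behavior in Spark 3.1.3+, where NaN is treated as equal to itself.

```python
import math

def intersect(a, b, nan_equal):
    """Hypothetical model of array_intersect with configurable NaN semantics."""
    def contains(xs, v):
        for x in xs:
            # Optionally treat NaN as equal to NaN (Spark >= 3.1.3 behavior).
            if nan_equal and isinstance(x, float) and isinstance(v, float) \
                    and math.isnan(x) and math.isnan(v):
                return True
            if x == v:  # IEEE-754: NaN == NaN is always False
                return True
        return False

    out = []
    for v in a:
        # Keep elements of a that appear in b, de-duplicated (set semantics).
        if contains(b, v) and not contains(out, v):
            out.append(v)
    return out

nan = float('nan')
print(intersect([1.0, nan], [nan, 2.0], nan_equal=False))  # [] : NaNs never match
print(intersect([1.0, nan], [nan, 2.0], nan_equal=True))   # [nan]: NaNs match
```

Under this model, the cudf NaN comparison parameter mentioned above selects between the two branches of `contains`.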
I think I will try to xfail the test for 3.1.1 and 3.1.2 and file a follow-up issue for the NaN case.
build

build
revans2 left a comment:
My main concern is documentation. We have some new functionality that is not 100% compatible with Spark. We should, at a minimum, document that it is not compatible and exactly in what ways. Ideally, if there are issues with how Spark does something (like with -0.0) and we have filed an issue with Spark, we should include all of that in the documentation.
build
Fixes #4932, #5228, #5188, and #5222. Requires rapidsai/cudf#11043 and rapidsai/cudf#11143 to be merged.
This implements the SQL functions for set-based operations on arrays: `array_intersect`, `array_union`, `array_except`, and `arrays_overlap` for running on the GPU. A few caveats:

- There is a minor bug ([BUG] Exception calling `collect()` when partitioning with arrays with null values using `array_union(...)`, #5957) that came up when testing `array_union`, and it looks like a bug in the partitioning code when using `collect()`.
- The 3 operations `array_intersect`, `array_union`, and `array_except` cannot guarantee the same element order when run on the GPU vs the CPU (because they are set-based operations), so we wrap these operations in a `sort_array(...)` call for testing purposes.
- `arrays_overlap` returns a boolean result value, so there is no need to sort, but the implementation is complicated by the requirements of the function (see https://spark.apache.org/docs/3.2.1/api/sql/index.html#arrays_overlap): "arrays_overlap(a1, a2) - Returns true if a1 contains at least a non-null element present also in a2. If the arrays have no common element and they are both non-empty and either of them contains a null element, null is returned, false otherwise."
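The three-valued `arrays_overlap` contract quoted above can be modeled in a few lines of Python (a sketch only, using `None` to stand in for SQL NULL; the actual GPU implementation runs through cudf and the JNI layer):

```python
def arrays_overlap(a1, a2):
    """Model of the Spark SQL arrays_overlap contract quoted above.

    True  -> a1 and a2 share at least one non-null element.
    None  -> no common element, both arrays non-empty, and either
             side contains a null (SQL NULL result).
    False -> otherwise (including when either array is empty).
    """
    s2 = {x for x in a2 if x is not None}
    if any(x is not None and x in s2 for x in a1):
        return True
    if a1 and a2 and (None in a1 or None in a2):
        return None
    return False

print(arrays_overlap([1, 2], [2, 3]))     # True: common element 2
print(arrays_overlap([1, None], [3, 4]))  # None: no overlap, but a null present
print(arrays_overlap([], [None]))         # False: an empty array short-circuits
```

The middle case is why the result column needs a validity mask rather than a plain boolean, and why no `sort_array(...)` normalization is needed when testing this function.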