Refactor semi_anti_join#11100
Conversation
semi_anti_join to expose semi_join_containssemi_anti_join to expose left_semi_join_contains
|
Benchmark comparing before and after the modification: So the performance is not always better, although in most cases we gain >10% speedup. The slower places are due to null search. I'm still trying to optimize it. |
PointKernel
left a comment
There was a problem hiding this comment.
Some non-blocking nitpick. Thanks @ttnghia ! Great Work!
cpp/src/search/contains_table.cu
Outdated
| // Otherwise, we have only one nullable column and can use its null mask directly. | ||
| auto const row_bitmask = | ||
| haystack_nullable_columns.size() > 1 | ||
| ? std::move( |
There was a problem hiding this comment.
move is necessary here because we need to access to the .first element of the bitmask_and result.
There was a problem hiding this comment.
There was a problem hiding this comment.
Wow I didn't know that. Thanks 😄
There was a problem hiding this comment.
Sorry, I was wrong, the .first element here is a device_buffer instead.
It seems that by not using std::move, the device_buffer is incorrectly copied (??) and CI test failed:
13:43:43 [ RUN ] DeathTest.CudaFatalError
13:43:43 /workspace/.conda-bld/work/cpp/tests/error/error_handling_test.cu:102: Failure
13:43:43 Death test: call_kernel()
13:43:43 Result: failed to die.
13:43:43 Error msg:
13:43:43 [ DEATH ]
There was a problem hiding this comment.
Let me try one more time to see if removing std::move actually can get rid of the death test failure.
There was a problem hiding this comment.
Maybe the failure here is just random? I just checked rmm::device_buffer and see that its copy constructor is deleted so there should not be any copying.
Let's wait to see the CI test result.
There was a problem hiding this comment.
Okay that test failed again. So indeed removing std::move is not related.
There was a problem hiding this comment.
Now all CI tests passed (std::move removed)! Definitely the test failed randomly. I'm going to merge this soon.
|
Rerun tests. |
Unfortunately Jake has no time to review again
|
@gpucibot merge |
This PR adds the following APIs for set operations: * `lists::have_overlap` * `lists::intersect_distinct` * `lists::union_distinct` * `lists::difference_distinct` ### Name Convention Except for the first API (`lists::have_overlap`) that returns a boolean column, the suffix `_distinct` of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements. --- Depends on: * #10945 * #11017 * NVIDIA/cuCollections#175 * #11052 * #11118 * #11100 * #11149 Closes #10409. Authors: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Michael Wang (https://github.com/isVoid) - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #11043
A (left) semi-join between the left and right tables returns a set of rows in the left table that has matching rows (i.e., compared equally) in the right table. As such, for each row in the left table, it needs to check if that row has a match in the right table.
Such check is very generic and has applications in many other places, not just in semi-join. This PR exposes that check functionality as a new
cudf::detail::contains(table_view, table_view)for internal usage.Closes #11037.
Depends on:
static_multimap::pair_containsNVIDIA/cuCollections#175