Support `nan_equality` in `cudf::distinct` #11118
rapids-bot[bot] merged 115 commits into rapidsai:branch-22.08
bdice left a comment
A few small suggestions. Otherwise this is a straightforward extension of the comparator API to forward the NaN policy through. Nice work.
    map.get_device_view(), key_hasher, key_equal, keep, reduction_results.begin()});
    auto const row_comp = cudf::experimental::row::equality::self_comparator(preprocessed_input);

    auto const reduce_by_row = [&](auto const value_comp) {
Along the lines of what @PointKernel was suggesting -- one alternative I considered was making this lambda actually allocate and return the output vector, rather than binding in reduction_results as an output iterator and returning void. That felt like an implicit "output parameter" in the lambda. If the lambda were a real function, we'd avoid the output parameter and return the rmm::device_uvector directly from the function. I don't have strong feelings on this, so feel free to keep it as-is.
edit: Initially wrote "IIFE" where I meant lambda. Fixed.
If we did that, however, almost the entire body of hash_reduce_by_row becomes a lambda and the part at the end is just a nans_equal dispatcher. 😛
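The two shapes discussed in this thread can be compared in a small host-only sketch. Everything here is an assumption made for illustration: `nan_policy`, the two comparator structs, and both function names are hypothetical stand-ins, `std::vector` replaces `rmm::device_uvector`, and the loop (index of the first earlier equal element) is only a placeholder for the real per-row reduction.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the NaN policy flag; not the libcudf type.
enum class nan_policy { ALL_EQUAL, UNEQUAL };

// Hypothetical value comparators, one per policy.
struct nan_equal_comparator {
  bool operator()(double a, double b) const
  {
    return (std::isnan(a) && std::isnan(b)) || a == b;
  }
};
struct nan_unequal_comparator {
  bool operator()(double a, double b) const { return a == b; }  // NaN == NaN is false
};

// Shape 1: the lambda writes into a captured buffer and returns void.
inline std::vector<std::size_t> reduce_with_output_param(std::vector<double> const& values,
                                                         nan_policy policy)
{
  std::vector<std::size_t> results(values.size());
  auto const reduce_by_row = [&](auto const value_comp) {
    for (std::size_t i = 0; i < values.size(); ++i) {
      results[i] = i;  // default: the element is its own first occurrence
      for (std::size_t j = 0; j < i; ++j) {
        if (value_comp(values[j], values[i])) {
          results[i] = j;
          break;
        }
      }
    }
  };
  if (policy == nan_policy::ALL_EQUAL) {
    reduce_by_row(nan_equal_comparator{});
  } else {
    reduce_by_row(nan_unequal_comparator{});
  }
  return results;
}

// Shape 2: the lambda allocates and returns the results, so the tail of the
// function is reduced to a dispatcher on the NaN policy.
inline std::vector<std::size_t> reduce_with_return_value(std::vector<double> const& values,
                                                         nan_policy policy)
{
  auto const reduce_by_row = [&](auto const value_comp) {
    std::vector<std::size_t> results(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
      results[i] = i;
      for (std::size_t j = 0; j < i; ++j) {
        if (value_comp(values[j], values[i])) {
          results[i] = j;
          break;
        }
      }
    }
    return results;
  };
  return policy == nan_policy::ALL_EQUAL ? reduce_by_row(nan_equal_comparator{})
                                         : reduce_by_row(nan_unequal_comparator{});
}
```

Both shapes produce the same results. The second avoids the implicit output parameter, at the cost of moving nearly the whole function body into the lambda, which is the trade-off raised in the reply above.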
Thanks all. I really appreciate your help with polishing everything 🍬

@gpucibot merge
This PR adds the following APIs for set operations:
* `lists::have_overlap`
* `lists::intersect_distinct`
* `lists::union_distinct`
* `lists::difference_distinct`

### Name Convention
Except for the first API (`lists::have_overlap`), which returns a boolean column, the `_distinct` suffix of the remaining APIs denotes that their results are lists columns in which every list row has been post-processed to remove duplicates. As such, their results are effectively "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
* #10945
* #11017
* NVIDIA/cuCollections#175
* #11052
* #11118
* #11100
* #11149

Closes #10409.

Authors:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Michael Wang (https://github.com/isVoid)
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)

URL: #11043
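To illustrate the naming convention described in #11043, here is a host-only sketch of the four operations applied to a single pair of list rows. It is an assumption-laden model, not the libcudf implementation: `list_row`, `to_set`, and the `_row`-suffixed functions are hypothetical names, elements are plain `int`, and results come back sorted only because `std::set` is used as a shortcut.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical model of one list row; the real APIs operate on lists columns.
using list_row = std::vector<int>;

// Helper (hypothetical): drop duplicates by converting a row to an ordered set.
inline std::set<int> to_set(list_row const& row) { return {row.begin(), row.end()}; }

// Models the boolean result for one row pair: do the rows share any element?
inline bool have_overlap_row(list_row const& lhs, list_row const& rhs)
{
  auto const rhs_set = to_set(rhs);
  return std::any_of(lhs.begin(), lhs.end(), [&](int v) { return rhs_set.count(v) > 0; });
}

// Elements present in both rows, with duplicates removed.
inline list_row intersect_distinct_row(list_row const& lhs, list_row const& rhs)
{
  auto const a = to_set(lhs);
  auto const b = to_set(rhs);
  list_row out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
  return out;
}

// Elements present in either row, with duplicates removed.
inline list_row union_distinct_row(list_row const& lhs, list_row const& rhs)
{
  auto const a = to_set(lhs);
  auto const b = to_set(rhs);
  list_row out;
  std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
  return out;
}

// Elements of lhs that do not appear in rhs, with duplicates removed.
inline list_row difference_distinct_row(list_row const& lhs, list_row const& rhs)
{
  auto const a = to_set(lhs);
  auto const b = to_set(rhs);
  list_row out;
  std::set_difference(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
  return out;
}
```

For the rows `{1, 2, 2, 3}` and `{3, 3, 4}`, the overlap check is true and the three `_distinct` results contain no repeated elements, which is the "set" property the suffix is meant to signal.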
This adds a `nan_equality` parameter to `cudf::distinct`, allowing the caller to specify the desired behavior for floating-point data: whether `NaN` should compare equal to other `NaN` values or not.

Depends on #11052 (built on top of it).
Closes #11092.
This is a blocker for set-like operations (#11043) and also the last blocker for #11053.
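The effect of the NaN policy on a distinct operation can be modeled on the host with a minimal sketch. The names `nan_policy`, `values_equal`, and `distinct_host` are hypothetical and introduced only for illustration; the quadratic loop and the keep-first-occurrence ordering are simplifications and say nothing about how `cudf::distinct` orders or computes its output.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for the NaN policy; not the libcudf type.
enum class nan_policy { ALL_EQUAL, UNEQUAL };

// Compare two values under the chosen policy.
inline bool values_equal(double lhs, double rhs, nan_policy policy)
{
  if (policy == nan_policy::ALL_EQUAL && std::isnan(lhs) && std::isnan(rhs)) { return true; }
  return lhs == rhs;  // IEEE 754 comparison: NaN == NaN is false
}

// Keep the first occurrence of each distinct value (O(n^2), for clarity only).
inline std::vector<double> distinct_host(std::vector<double> const& input, nan_policy policy)
{
  std::vector<double> out;
  for (double const value : input) {
    bool seen = false;
    for (double const kept : out) {
      if (values_equal(value, kept, policy)) {
        seen = true;
        break;
      }
    }
    if (!seen) { out.push_back(value); }
  }
  return out;
}
```

With the input `{1.0, NaN, 2.0, NaN, 1.0}`, treating all NaNs as equal leaves three values, while treating them as unequal keeps both NaNs and leaves four.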