Skip to content

Support nan_equality in cudf::distinct#11118

Merged
rapids-bot[bot] merged 115 commits intorapidsai:branch-22.08from
ttnghia:distinct_with_nans_equality
Jun 23, 2022
Merged

Support nan_equality in cudf::distinct#11118
rapids-bot[bot] merged 115 commits intorapidsai:branch-22.08from
ttnghia:distinct_with_nans_equality

Conversation

@ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Jun 16, 2022

This adds nan_equality parameter to cudf::distinct, allowing to specify the desired behavior when dealing with floating-point data: NaN should be compared equally to other NaN or not.

Depends on #11052 (built on top of it).
Closes #11092.
This is a blocker for set-like operations (#11043) and also the last blocker for #11053.

ttnghia added 2 commits June 22, 2022 06:18
…ality

# Conflicts:
#	cpp/include/cudf/stream_compaction.hpp
#	cpp/src/stream_compaction/distinct.cu
@github-actions github-actions bot added the CMake CMake build issue label Jun 22, 2022
ttnghia added 3 commits June 22, 2022 09:46
# Conflicts:
#	cpp/src/stream_compaction/distinct.cu
…ality

# Conflicts:
#	cpp/include/cudf/detail/stream_compaction.hpp
#	cpp/src/stream_compaction/distinct.cu
# Conflicts:
#	cpp/benchmarks/stream_compaction/distinct.cpp
#	cpp/include/cudf/stream_compaction.hpp
#	cpp/src/dictionary/detail/concatenate.cu
#	cpp/src/dictionary/set_keys.cu
#	cpp/src/stream_compaction/distinct.cu
#	cpp/src/stream_compaction/distinct_reduce.cu
#	cpp/src/stream_compaction/distinct_reduce.cuh
#	cpp/src/transform/encode.cu
#	cpp/tests/stream_compaction/distinct_tests.cpp
@ttnghia ttnghia force-pushed the distinct_with_nans_equality branch from 97a6cb6 to be3b2fe Compare June 22, 2022 19:04
@github-actions github-actions bot removed the CMake CMake build issue label Jun 22, 2022
@ttnghia ttnghia added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress 0 - Blocked Cannot progress due to external reasons labels Jun 22, 2022
@ttnghia ttnghia marked this pull request as ready for review June 22, 2022 19:12
@ttnghia ttnghia requested a review from a team as a code owner June 22, 2022 19:12
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small suggestions. Otherwise this is a straightforward extension of the comparator API to forward the NaN policy through. Nice work.

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. LGTM!

map.get_device_view(), key_hasher, key_equal, keep, reduction_results.begin()});
auto const row_comp = cudf::experimental::row::equality::self_comparator(preprocessed_input);

auto const reduce_by_row = [&](auto const value_comp) {
Copy link
Contributor

@bdice bdice Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along the lines of what @PointKernel was suggesting -- one alternative I considered was making this lambda actually allocate and return the output vector, rather than binding in reduction_results as an output iterator and returning void. That felt like an implicit "output parameter" in the lambda. If the lambda were a real function, we'd avoid the output parameter and return the rmm::device_uvector directly from the function. I don't have strong feelings on this, so feel free to keep it as-is.

edit: Initially wrote "IIFE" where I meant lambda. Fixed.

Copy link
Contributor

@bdice bdice Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did that, however, almost the entire body of hash_reduce_by_row becomes a lambda and the part at the end is just a nans_equal dispatcher. 😛

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 23, 2022

Thanks all. I really appreciate your help with polishing everything 🍬

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 23, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 80d7cc7 into rapidsai:branch-22.08 Jun 23, 2022
@ttnghia ttnghia deleted the distinct_with_nans_equality branch June 23, 2022 03:08
@ttnghia ttnghia mentioned this pull request Jun 23, 2022
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2022
This PR adds the following APIs for set operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`

### Name Convention
Except for the first API (`lists::have_overlap`) that returns a boolean column, the suffix `_distinct` of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
 * #10945
 * #11017
 * NVIDIA/cuCollections#175
 * #11052
 * #11118
 * #11100
 * #11149

Closes #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11043
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Support nan_equality in cudf::distinct

3 participants