Skip to content

Fully support nested types in cudf::contains#10656

Merged
rapids-bot[bot] merged 394 commits intorapidsai:branch-22.10from
ttnghia:support_structs_in_contains
Aug 17, 2022
Merged

Fully support nested types in cudf::contains#10656
rapids-bot[bot] merged 394 commits intorapidsai:branch-22.10from
ttnghia:support_structs_in_contains

Conversation

@ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Apr 14, 2022

This extends the cudf::contains API to support nested types (lists + structs) with arbitrarily nested levels. As such, cudf::contains will work with literally any type of input data.

In addition, this fixes null handling of cudf::contains with structs column + struct scalar input when the structs column contains null rows at the top level while the scalar key is valid but all nulls at children levels.

Closes: #8965
Depends on:

@ttnghia ttnghia added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Apr 14, 2022
@ttnghia ttnghia self-assigned this Apr 14, 2022
@codecov

This comment was marked as off-topic.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jul 29, 2022

Alright, this PR is awakened. Now it has a totally new implementation, which is just around 10 LOC.

null_equality::EQUAL,
nan_equality::ALL_EQUAL,
stream,
mr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why this detail function returns a device_uvector instead of a column?
Will the result ever be larger than what a column supports?

Copy link
Contributor Author

@ttnghia ttnghia Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That detail::contains(table_view, table_view) function was initially refactored out of semi-join (#11100). Now, it is also used in several other places but the result is only used for temporary computation and never be returned to the user. Thus, device_uvector is enough at this time.

It is still unclear whether we will expose this function to the public.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also say that (especially for internal APIs like this, but occasionally even otherwise) I prefer using a strongly-typed object like a device_uvector rather than the type-erased (and nullable) column. We have more information available and convey it. Even for public APIs I like uvectors when we don't need type erasure or nullability because the returned values are clearly defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This discussion is also related to #11356.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have only very minor suggestions at this point. Thank you!

ttnghia and others added 4 commits August 16, 2022 21:55
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
@ttnghia ttnghia requested a review from PointKernel August 17, 2022 04:29
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ttnghia
Copy link
Contributor Author

ttnghia commented Aug 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 127d574 into rapidsai:branch-22.10 Aug 17, 2022
@ttnghia ttnghia deleted the support_structs_in_contains branch August 17, 2022 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Add nested struct support for cudf::contains

9 participants