Add Python wrappers for c.parallel merge_sort API #3763
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
🟥 CI finished in 5m 51s: Pass: 0%/1 | Total: 5m 51s | Avg: 5m 51s | Max: 5m 51s

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

🏃 Runner counts (total jobs: 1)

| # | Runner |
|---|---|
| 1 | linux-amd64-gpu-rtx2080-latest-1 |
…void pointer before calling merge sort
This reverts commit 3a0a9a3.
```cpp
printf("\nEXCEPTION in cccl_device_merge_sort(): merge sort output cannot be an iterator\n");
fflush(stdout);
error = CUDA_ERROR_UNKNOWN;
}
```
See #3722
Could you please add `// See #3722` as a comment? Super simple, but much more discoverable.
c/parallel/src/util/types.cpp (outdated)
```cpp
case cccl_type_enum::UINT8:
  return "::cuda::std::uint8_t";
case cccl_type_enum::UINT16:
  return "::cuda::std::uint16_t";
case cccl_type_enum::UINT32:
  return "::cuda::std::uint32_t";
case cccl_type_enum::UINT64:
  // was: return "::cuda::std::uint64_t"; changed by this PR to:
  return "unsigned long";
```
These had to be changed because creating iterators in Python with types `np.int64` and `np.uint64` resulted in compilation errors.
This change concerns me, and I am not sure whether it is correct. On Windows, IIRC, `long` is 32-bit, at least for some targets.

What is the compilation error that was encountered?
The compilation error occurs when we try to instantiate `DeviceMergeSortBlockSortKernel` with an iterator:

```
/home/coder/.local/share/venvs/cccl/lib/python3.12/site-packages/cuda/cccl/include/cub/agent/agent_merge_sort.cuh(206): error: no instance of overloaded function "cub::CUB_300000_SM_890::BlockLoad<T, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z>::Load [with T=input_keys_iterator_state_t::value_type, BLOCK_DIM_X=256, ITEMS_PER_THREAD=2, ALGORITHM=cub::CUB_300000_SM_890::BLOCK_LOAD_WARP_TRANSPOSE, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1]" matches the argument list
            argument types are: (input_keys_iterator_state_t, long [2], int, input_keys_iterator_state_t::value_type)
            object type is: cub::CUB_300000_SM_890::BlockLoad<input_keys_iterator_state_t::value_type, 256, 2, cub::CUB_300000_SM_890::BLOCK_LOAD_WARP_TRANSPOSE, 1, 1>
  BlockLoadKeys(storage.load_keys).Load(keys_in + tile_base, keys_local, num_remaining, *(keys_in + tile_base));
```
It seems when I pass in an iterator with datatype int64, the value type of the iterator is `::cuda::std::int64_t` but `KeyT` is `long`.

Looking at this in some more detail, I think the issue comes from a mismatch between `cccl_type_enum_to_name` and `cccl_type_enum_to_string`. The former uses `nvrtcGetTypeName`, which returns `long` when `::cuda::std::int64_t` is passed (see https://godbolt.org/z/bqWznEYGP), while the latter just returns the string `"::cuda::std::int64_t"`.
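To see the mismatch in isolation, here is a minimal host-only sketch (mine, not from the PR; assumes a CUDA toolkit with the NVRTC headers installed, and linking against `nvrtc`):

```cpp
// Demonstrates why the two helpers disagreed: nvrtcGetTypeName demangles the
// host compiler's typeid name, so a fixed-width alias like std::int64_t
// collapses to its underlying builtin type instead of the alias spelling.
#define NVRTC_GET_TYPE_NAME 1 // nvrtcGetTypeName is only provided when this macro is set
#include <nvrtc.h>

#include <cstdint>
#include <cstdio>
#include <string>

int main()
{
  std::string name;
  if (nvrtcGetTypeName<std::int64_t>(&name) == NVRTC_SUCCESS)
  {
    // Prints "long" on LP64 Linux/GCC; the "::cuda::std::int64_t" spelling is gone.
    std::printf("%s\n", name.c_str());
  }
  return 0;
}
```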
I think I could just replace usages of `cccl_type_enum_to_name` with the string version, but is there a specific reason why `nvrtcGetTypeName` would return `long`? Since they behave differently, I'm not sure if having both functions was intentional.
> This change concerns me

Yeah, agree this is very worrisome. I'd try very hard to avoid this change.

(I stared at the error message and types.h, types.cpp for a while but I'm failing to make the connection. Unfortunately I'm without a workstation at the moment and cannot reproduce/troubleshoot the error.)
I reverted this change and combined the two functions into one. I don't think there was any reason to have both functions to begin with, and this way we ensure consistency in what types are returned.
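For illustration only, the combined helper could look roughly like this (a sketch using the names from this thread, not the PR's actual code):

```cpp
// One switch that always emits the fixed-width alias spellings, so the string
// used for NVRTC codegen and the string used for formatting can never diverge.
std::string cccl_type_enum_to_name(cccl_type_enum type)
{
  switch (type)
  {
    case cccl_type_enum::INT64:
      return "::cuda::std::int64_t";
    case cccl_type_enum::UINT64:
      return "::cuda::std::uint64_t";
    // ... remaining enumerators follow the same pattern ...
    default:
      return "void"; // illustrative fallback; the real code may assert instead
  }
}
```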
I assume one problem could be that `int64_t` could be implemented as either `long` or `long long`, which are two distinct data types even though they have the same range.
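A compile-time check makes the ABI dependence concrete (illustrative, not from the PR):

```cpp
#include <cstdint>
#include <type_traits>

// int64_t is an alias whose underlying type is an ABI choice: typically
// `long` on LP64 Linux and `long long` on LLP64 Windows. Same width and
// range, but distinct types, so hard-coding one spelling breaks the other.
constexpr bool is_long      = std::is_same_v<std::int64_t, long>;
constexpr bool is_long_long = std::is_same_v<std::int64_t, long long>;
static_assert(is_long || is_long_long, "int64_t maps to one of the two on mainstream ABIs");
```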
Right, avoid using `long` or `long long`, because we want to support Windows soon-ish.
c/parallel/test/test_merge_sort.cpp (outdated)
```cpp
//   std::sort(expected_items.begin(), expected_items.end());
//   REQUIRE(expected_keys == std::vector<TestType>(input_keys_it));
//   REQUIRE(expected_items == std::vector<item_t>(input_items_it));
// }
```
This is a test that shows output iterators also don't work with items.
This code should either be enabled or removed.
Removed; this is tracked in #3722.
Ralf suggested adding them back but ifdeffing them out, which I think is a good idea too.
🟨 CI finished in 13m 53s: Pass: 66%/3 | Total: 20m 58s | Avg: 6m 59s | Max: 10m 37s | Hits: 98%/296

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

🏃 Runner counts (total jobs: 3)

| # | Runner |
|---|---|
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-cpu16 |
…aErrorUnsupportedPtxVersion error
🟩 CI finished in 2h 42m: Pass: 100%/93 | Total: 1d 11h | Avg: 23m 12s | Max: 1h 05m | Hits: 94%/134373

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 93)

| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟩 CI finished in 1h 08m: Pass: 100%/93 | Total: 15h 53m | Avg: 10m 14s | Max: 34m 02s | Hits: 95%/134373

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 93)

| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
c/parallel/src/merge_sort.cu (outdated)
```cpp
if (cccl_iterator_kind_t::iterator == d_out_keys.type || cccl_iterator_kind_t::iterator == d_out_items.type)
{
  fflush(stderr);
  printf("\nEXCEPTION in cccl_device_merge_sort(): merge sort output cannot be an iterator\n");
```
`EXCEPTION` was meant to signal that an exception was caught. (I started this not-very-good pattern...) Could you please change this to `ERROR`?
c/parallel/src/merge_sort.cu (outdated)
```cpp
fflush(stderr);
printf("\nEXCEPTION in cccl_device_merge_sort(): merge sort output cannot be an iterator\n");
fflush(stdout);
error = CUDA_ERROR_UNKNOWN;
```
Did you mean to return here?

```cpp
return CUDA_ERROR_UNKNOWN;
```
Yes I did; fixed.
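Taken together with the other review comments (the `ERROR` wording and the `// See #3722` breadcrumb), the fixed guard would look roughly like this; a sketch, not necessarily the PR's final code:

```cpp
// See #3722: merge sort output cannot be an iterator yet.
if (cccl_iterator_kind_t::iterator == d_out_keys.type || cccl_iterator_kind_t::iterator == d_out_items.type)
{
  fflush(stderr);
  printf("\nERROR in cccl_device_merge_sort(): merge sort output cannot be an iterator\n");
  fflush(stdout);
  return CUDA_ERROR_UNKNOWN; // fail fast instead of assigning to `error` and falling through
}
```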
```
@@ -209,6 +222,20 @@ def _iterator_to_cccl_iter(it: IteratorBase) -> Iterator:
    )


def _none_to_cccl_iter() -> Iterator:
    # Create a null int pointer. Any type could be used here, we just need to pass NULL.
    info = _numpy_type_to_info(np.int32)
```
Could we use `np.void` here, to make the intent more clear? If not, my second pick would be `np.char`.

(This isn't a big deal, but I'd definitely try to make the intent more clear. E.g. seeing size 1 or 0 while debugging some issue in the core code later is a nice clue in that far-removed-from-here context.)
`np.void` didn't work since this calls `numba.from_dtype()` internally, so I ended up using `np.uint8`.
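A small illustration of that constraint (assumes numpy and numba are installed; simplified from the wrapper's actual call chain):

```python
import numba
import numpy as np

# Fixed-size numeric dtypes map cleanly to Numba types, so a 1-byte uint8
# works as a stand-in for "NULL pointer, element type irrelevant".
print(numba.from_dtype(np.dtype(np.uint8)))  # -> uint8

# np.void has no fixed-size Numba equivalent, so the conversion fails.
try:
    numba.from_dtype(np.dtype(np.void))
except Exception as exc:
    print("np.void rejected:", exc)
```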
```python
from ..typing import DeviceArrayLike


def _dtype_validation(dt1, dt2):
```
Removed, and also updated the code to follow the new patterns more closely.
I still see the _dtype_validation() function here, although it isn't called anymore. Oversight?
```
@@ -254,36 +254,6 @@ TEST_CASE("DeviceMergeSort::SortKeys works with input iterators", "[merge_sort]"
  REQUIRE(expected_keys == std::vector<TestType>(input_keys_ptr));
}


// TEST_CASE("DeviceMergeSort::SortKeys works with output iterators", "[merge_sort]")
```
Two ideas:

- Definitely add a comment to explain why the code is commented out instead of being removed entirely.
- Maybe use `#ifdef NEVER_DEFINED` instead of comments; then at least you'll still get the syntax highlighting, and other tooling will potentially still see this as C++ code.
That's a good idea; I added them back.
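For concreteness, the suggested pattern applied to the test above would look something like this (test body elided):

```cpp
// Guarded rather than commented out, so editors and tooling still parse the
// code. NEVER_DEFINED is intentionally never defined; see #3722 for tracking.
#ifdef NEVER_DEFINED
TEST_CASE("DeviceMergeSort::SortKeys works with output iterators", "[merge_sort]")
{
  // ... body elided; enable once iterator outputs are supported ...
}
#endif
```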
…_name` due to inconsistencies with the datatype being returned for INT64 and UINT64
🟩 CI finished in 1h 03m: Pass: 100%/93 | Total: 15h 45m | Avg: 10m 09s | Max: 34m 45s | Hits: 95%/134371

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 93)

| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
```cpp
{
  case cccl_type_enum::INT8:
    check(nvrtcGetTypeName<::cuda::std::int8_t*>(&result));
```
Nader, your change looks good to me, especially because it passes all tests. However, it does change the behavior, therefore it might be good to ask @gevtushenko (original author) specifically: this approach

```cpp
check(nvrtcGetTypeName<::cuda::std::int8_t*>(&result));
```

was introduced with PR #2256 (in c/src/reduce.cu at the time). What was the original intent?

I'm totally speculating here: do we maybe want to keep both functions, maybe rename this one to `cccl_type_enum_to_nvrtc_type_name()`, and add a `bool is_pointer = false` argument to the other one? (But then again, all tests pass...)
I spoke with Georgii, and he said that having two functions was an oversight on his part, and that this change is fine.
Perfect, thanks!
🟩 CI finished in 1h 09m: Pass: 100%/93 | Total: 15h 50m | Avg: 10m 13s | Max: 35m 20s | Hits: 95%/134371

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 93)

| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟩 CI finished in 1h 02m: Pass: 100%/93 | Total: 15h 34m | Avg: 10m 03s | Max: 34m 33s | Hits: 95%/134371

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 93)

| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
* Fix issue with converting types to strings in c.parallel merge_sort
* Add option to specify prefix for iterator methods to avoid name collisions
* Return error if output iterators are passed to c.parallel merge_sort
* Use `launcher_factory.PtxVersion()` in dispatch merge sort due to cudaErrorUnsupportedPtxVersion error
* Remove `cccl_type_enum_to_string` and replace with `cccl_type_enum_to_name` due to inconsistencies with the datatype being returned for INT64 and UINT64
Description
Closes #3459.
Checklist