Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

DeviceRadixSort fails when begin_bit = end_bit = 0 (for large inputs) #353

Closed
benbarsdell opened this issue Aug 2, 2021 · 4 comments · Fixed by #481
Closed

DeviceRadixSort fails when begin_bit = end_bit = 0 (for large inputs) #353

benbarsdell opened this issue Aug 2, 2021 · 4 comments · Fixed by #481
Assignees
Labels
P2: nice to have Desired, but not necessary. type: bug: functional Does not work as intended.
Milestone

Comments

@benbarsdell
Copy link
Member

We had a painful issue in TensorFlow that turned out to be because we were passing begin_bit = end_bit = 0 (all keys were zero in our case). CUB failed with "Invalid configuration error", and debugging was difficult because the failing kernel launch was not logged even with debug_synchronous=true.

Some isolated testing show that CUB succeeds and gives the correct answer for small inputs (i.e., single-block), but for large inputs it either produces the wrong value (CUDA <= 11.2) or returns "Invalid configuration error" (CUDA >= 11.3).

Here is a minimal reproducer (remove the ".txt" suffix):
test_cub_bits_bug.cu.txt

It would be great if this could be fixed, and if logging could be added for all kernel launches (specifically in the Onesweep path).

@gevtushenko
Copy link
Collaborator

Hello, @benbarsdell! Thank you for reporting this. As stated in the documentation, key bits should be different:

An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified.

I think that's the reason there are no tests for this particular case:

for (int begin_bit = 0; begin_bit <= 1; begin_bit++)
{
    // Iterate end bit
    for (int end_bit = begin_bit + 1;

As I understand, you expect something like cudaMemcpyAsync to be performed in this case. I think we could generalize API to this case at some point. Can you use some wrapper function until then?

@benbarsdell
Copy link
Member Author

Yes I've worked around it, so it's not blocking us.

I think it would be useful to generalize the API to support it. It's not actually clear to me that that docstring excludes this case, because it does not say that the subrange must be non-empty. We are calling it with end_bit = Log2Ceiling(N), for integer keys in the range [0, N), which results in end_bit = 0 when N = 1.
There is also the fact that it already works in the single-block path.

@alliepiper
Copy link
Collaborator

This seems reasonable. We should handle the case where begin_bit == end_bit to simplify generic usecases and add tests for this.

I opened #355 to track the missing log output.

@alliepiper alliepiper modified the milestones: 1.14.0, 1.15.0 Aug 17, 2021
@alliepiper alliepiper modified the milestones: 1.15.0, 1.16.0 Oct 15, 2021
@alliepiper alliepiper modified the milestones: 1.16.0, 1.17.0 Feb 7, 2022
@alliepiper alliepiper modified the milestones: 1.17.0, 2.0.0 May 7, 2022
@canonizer
Copy link
Contributor

#481 should fix this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P2: nice to have Desired, but not necessary. type: bug: functional Does not work as intended.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants