This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
We had a painful issue in TensorFlow that turned out to be because we were passing begin_bit = end_bit = 0 (all keys were zero in our case). CUB failed with "Invalid configuration error", and debugging was difficult because the failing kernel launch was not logged even with debug_synchronous=true.
Some isolated testing shows that CUB succeeds and gives the correct answer for small (i.e., single-block) inputs, but for large inputs it either produces the wrong result (CUDA <= 11.2) or returns "Invalid configuration error" (CUDA >= 11.3).
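A sketch of the failing call pattern (not the attached reproducer; sizes and names here are illustrative, and the error behavior is as reported above):

```cuda
#include <cub/cub.cuh>

int main() {
  const int num_items = 1 << 20;  // large enough to take the multi-block path
  int *d_keys_in, *d_keys_out;
  cudaMalloc(&d_keys_in, num_items * sizeof(int));
  cudaMalloc(&d_keys_out, num_items * sizeof(int));
  cudaMemset(d_keys_in, 0, num_items * sizeof(int));  // all keys are zero

  // Two-phase CUB call: first query temp storage size, then sort.
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                 num_items, /*begin_bit=*/0, /*end_bit=*/0);
  cudaMalloc(&d_temp, temp_bytes);
  cudaError_t err = cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes,
                                                   d_keys_in, d_keys_out,
                                                   num_items, 0, 0);
  // Per the report above, on CUDA >= 11.3 this returns an
  // "Invalid configuration error" rather than copying the keys through.
  return (int)err;
}
```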
Hello, @benbarsdell! Thank you for reporting this. As stated in the documentation, key bits should be different:
An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified.
I think that's the reason there are no tests for this particular case:
for (int begin_bit = 0; begin_bit <= 1; begin_bit++)
{
    // Iterate end bit
    for (int end_bit = begin_bit + 1;
As I understand it, you expect something like cudaMemcpyAsync to be performed in this case. I think we could generalize the API to this case at some point. Can you use a wrapper function until then?
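A minimal sketch of such a wrapper (the name SortKeysOrCopy and the dummy 1-byte allocation are our choices, not part of CUB): when the bit-range [begin_bit, end_bit) is empty, all keys compare equal, so the sorted order is just the input order and a device-to-device copy suffices.

```cuda
#include <cub/cub.cuh>

template <typename KeyT>
cudaError_t SortKeysOrCopy(void* d_temp_storage, size_t& temp_storage_bytes,
                           const KeyT* d_keys_in, KeyT* d_keys_out,
                           int num_items, int begin_bit, int end_bit,
                           cudaStream_t stream = 0)
{
  if (begin_bit == end_bit)  // empty bit-range: output equals input
  {
    if (d_temp_storage == nullptr)
    {
      // Dummy size so the usual two-phase (query, then run) protocol still works.
      temp_storage_bytes = 1;
      return cudaSuccess;
    }
    return cudaMemcpyAsync(d_keys_out, d_keys_in, num_items * sizeof(KeyT),
                           cudaMemcpyDeviceToDevice, stream);
  }
  return cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                        d_keys_in, d_keys_out, num_items,
                                        begin_bit, end_bit, stream);
}
```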
Yes, I've worked around it, so it's not blocking us.
I think it would be useful to generalize the API to support it. It's not actually clear to me that the docstring excludes this case, because it does not say that the subrange must be non-empty. We are calling it with end_bit = Log2Ceiling(N), for integer keys in the range [0, N), which results in end_bit = 0 when N = 1.
There is also the fact that it already works in the single-block path.
Here is a minimal reproducer (remove the ".txt" suffix):
test_cub_bits_bug.cu.txt
It would be great if this could be fixed, and if logging could be added for all kernel launches (specifically in the Onesweep path).