CCCL C library should have its own error codes #3819

leofang · 2025-02-14T21:44:22Z

... and not piggyback on the CUDA driver type CUresult.

rwgk · 2025-02-18T23:52:20Z

@leofang pointed me here (thanks!) when I mentioned my concerns regarding cuda.parallel error handling.

I feel a little guilty because I introduced this:

199f2a5#diff-538dee7986a27b676e82be8b584faabbfbafee2c8c9ab9c4e429628297981b23R341-R343

-  catch (...)
+  catch (const std::exception& exc)
   {
+    fflush(stderr);
+    printf("\nEXCEPTION in cccl_device_reduce_build(): %s\n", exc.what());
+    fflush(stdout);
     error = CUDA_ERROR_UNKNOWN;
   }

It was meant to be a stop-gap, so that I could at least see what the exceptions are (especially because I was new to the code base), but now (main @ 02d2396) that pattern is spreading as we're adding more algorithms:

c/parallel/src/merge_sort.cu:    printf("\nEXCEPTION in cccl_device_merge_sort_build(): %s\n", exc.what());
c/parallel/src/merge_sort.cu:    printf("\nEXCEPTION in cccl_device_reduce(): %s\n", exc.what());
c/parallel/src/merge_sort.cu:    printf("\nEXCEPTION in cccl_device_merge_sort_cleanup(): %s\n", exc.what());
c/parallel/src/reduce.cu:    printf("\nEXCEPTION in cccl_device_reduce_build(): %s\n", exc.what());
c/parallel/src/reduce.cu:    printf("\nEXCEPTION in cccl_device_reduce(): %s\n", exc.what());
c/parallel/src/reduce.cu:    printf("\nEXCEPTION in cccl_device_reduce_cleanup(): %s\n", exc.what());
c/parallel/src/scan.cu:    printf("\nEXCEPTION in cccl_device_scan_build(): %s\n", exc.what());
c/parallel/src/scan.cu:    printf("\nEXCEPTION in cccl_device_scan(): %s\n", exc.what());
c/parallel/src/scan.cu:    printf("\nEXCEPTION in cccl_device_scan_cleanup(): %s\n", exc.what());
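One way to keep that catch block from being copied into every entry point is to route all of them through a single exception-translation helper. The following is only a sketch: `translate_exceptions`, `status_t`, `status_success`, `status_unknown`, and `example_build` are hypothetical names that do not exist in the code base, and `status_t` stands in for whatever status type the C API returns (CUresult today).

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for the status type returned by the C API.
using status_t = int;
constexpr status_t status_success = 0;
constexpr status_t status_unknown = 1; // placeholder for CUDA_ERROR_UNKNOWN

// Runs `body` and converts any escaping exception into a status code,
// so no exception crosses the C boundary. The reporting policy lives here
// only, which makes it a single place to change later.
template <typename Body>
status_t translate_exceptions(const char* function_name, Body&& body) noexcept
{
  try
  {
    return body();
  }
  catch (const std::exception& exc)
  {
    std::fprintf(stderr, "\nEXCEPTION in %s(): %s\n", function_name, exc.what());
  }
  catch (...)
  {
    std::fprintf(stderr, "\nEXCEPTION in %s(): unknown exception type\n", function_name);
  }
  std::fflush(stderr);
  return status_unknown;
}

// Hypothetical entry point showing how a build/run/cleanup function
// could use the helper instead of its own try/catch.
status_t example_build(bool fail)
{
  return translate_exceptions(__func__, [&]() -> status_t {
    if (fail)
    {
      throw std::runtime_error("simulated build failure");
    }
    return status_success;
  });
}
```

This still prints, so it does not fix the underlying problem, but replacing the printing with a proper error mechanism would then be a one-place change. Passing `__func__` also avoids hand-written function names in the messages.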

While reviewing #3763 I saw that the same printf approach is used there even more generally, for a plain argument check with no exception involved:

+  if (cccl_iterator_kind_t::iterator == d_out_keys.type || cccl_iterator_kind_t::iterator == d_out_items.type)
+  {
+    // See https://github.com/NVIDIA/cccl/issues/3722
+    fflush(stderr);
+    printf("\nERROR in cccl_device_merge_sort(): merge sort output cannot be an iterator\n");
+    fflush(stdout);
+    return CUDA_ERROR_UNKNOWN;
+  }

Error handling via printf is certainly better than no messages at all, but printing an error message and continuing is tech debt of the more severe kind if it keeps spreading. Eventually this code will be called from higher layers (e.g. cupy), where these prints will be overlooked, redirected to /dev/null, etc., and will quite likely be incomprehensible to people a few steps up the food chain.

I think what we need is a CCCL_C_ERROR system of error codes with associated cccl_set_error()/cccl_get_error() or similar.
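To make the idea concrete, here is a minimal sketch of what such a system could look like. Everything in it is an assumption for discussion: the `cccl_c_error_t` enum and its values, the exact signatures of `cccl_set_error()` / `cccl_get_error()`, the choice of thread-local storage, and the `example_merge_sort_check` call site (modeled on the output-iterator check quoted above) are all hypothetical.

```cpp
#include <string>

extern "C" {
// Hypothetical library-owned status codes, independent of CUresult.
typedef enum cccl_c_error_t
{
  CCCL_C_SUCCESS = 0,
  CCCL_C_ERROR_INVALID_ARGUMENT,
  CCCL_C_ERROR_BUILD_FAILED,
  CCCL_C_ERROR_UNKNOWN
} cccl_c_error_t;
}

namespace
{
// Last error recorded on the calling thread (assumed design choice).
thread_local cccl_c_error_t last_error_code = CCCL_C_SUCCESS;
thread_local std::string last_error_message;
} // namespace

// Records an error code and a human-readable message for this thread.
extern "C" void cccl_set_error(cccl_c_error_t code, const char* message)
{
  last_error_code    = code;
  last_error_message = (message != nullptr) ? message : "";
}

// Returns the last recorded code. If `message` is non-null, it receives a
// pointer that stays valid until the next cccl_set_error() on this thread.
extern "C" cccl_c_error_t cccl_get_error(const char** message)
{
  if (message != nullptr)
  {
    *message = last_error_message.c_str();
  }
  return last_error_code;
}

// Hypothetical call site: report the problem through the error system
// instead of printing it and returning CUDA_ERROR_UNKNOWN.
extern "C" cccl_c_error_t example_merge_sort_check(bool output_is_iterator)
{
  if (output_is_iterator)
  {
    cccl_set_error(CCCL_C_ERROR_INVALID_ARGUMENT,
                   "cccl_device_merge_sort(): merge sort output cannot be an iterator");
    return CCCL_C_ERROR_INVALID_ARGUMENT;
  }
  cccl_set_error(CCCL_C_SUCCESS, "");
  return CCCL_C_SUCCESS;
}
```

A higher layer (e.g. a Python binding) could then call `cccl_get_error()` after a failing call and raise an exception carrying the message, rather than relying on anyone reading stdout.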

rwgk · 2025-02-18T23:52:57Z

@gevtushenko @swhina @NaderAlAwar @oleksandr-pavlyk for vis

github-project-automation bot added this to CCCL Feb 14, 2025

github-project-automation bot moved this to Todo in CCCL Feb 14, 2025

leofang added the CCCL-C For all items related to the C library label Feb 14, 2025
