Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CCCL C library should have its own error codes #3819

Open
leofang opened this issue Feb 14, 2025 · 2 comments
Open

CCCL C library should have its own error codes #3819

leofang opened this issue Feb 14, 2025 · 2 comments
Labels
CCCL-C For all items related to the C library

Comments

@leofang
Copy link
Member

leofang commented Feb 14, 2025

... and not piggyback on the CUDA driver type CUresult.

@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 14, 2025
@leofang leofang added the CCCL-C For all items related to the C library label Feb 14, 2025
@rwgk
Copy link
Contributor

rwgk commented Feb 18, 2025

@leofang pointed me here (thanks!) when I mentioned my concerns regarding cuda.parallel error handling.

I feel a little guilty because I introduced this:

199f2a5#diff-538dee7986a27b676e82be8b584faabbfbafee2c8c9ab9c4e429628297981b23R341-R343

-  catch (...)
+  catch (const std::exception& exc)
   {
+    fflush(stderr);
+    printf("\nEXCEPTION in cccl_device_reduce_build(): %s\n", exc.what());
+    fflush(stdout);
     error = CUDA_ERROR_UNKNOWN;
   }

It was meant to be a stop-gap, so that I could at least see what the exceptions are (especially because I was new to the code base), but now (main @ 02d2396) that pattern is spreading as we're adding more algorithms:

c/parallel/src/merge_sort.cu:    printf("\nEXCEPTION in cccl_device_merge_sort_build(): %s\n", exc.what());
c/parallel/src/merge_sort.cu:    printf("\nEXCEPTION in cccl_device_reduce(): %s\n", exc.what());
c/parallel/src/merge_sort.cu:    printf("\nEXCEPTION in cccl_device_merge_sort_cleanup(): %s\n", exc.what());
c/parallel/src/reduce.cu:    printf("\nEXCEPTION in cccl_device_reduce_build(): %s\n", exc.what());
c/parallel/src/reduce.cu:    printf("\nEXCEPTION in cccl_device_reduce(): %s\n", exc.what());
c/parallel/src/reduce.cu:    printf("\nEXCEPTION in cccl_device_reduce_cleanup(): %s\n", exc.what());
c/parallel/src/scan.cu:    printf("\nEXCEPTION in cccl_device_scan_build(): %s\n", exc.what());
c/parallel/src/scan.cu:    printf("\nEXCEPTION in cccl_device_scan(): %s\n", exc.what());
c/parallel/src/scan.cu:    printf("\nEXCEPTION in cccl_device_scan_cleanup(): %s\n", exc.what());

While reviewing #3763 I saw that the approach is used there even more generally (no exception being caught there):

+  if (cccl_iterator_kind_t::iterator == d_out_keys.type || cccl_iterator_kind_t::iterator == d_out_items.type)
+  {
+    // See https://github.com/NVIDIA/cccl/issues/3722
+    fflush(stderr);
+    printf("\nERROR in cccl_device_merge_sort(): merge sort output cannot be an iterator\n");
+    fflush(stdout);
+    return CUDA_ERROR_UNKNOWN;
+  }

Error handling via printf is sure better than no messages at all, but printing error messages and continuing is tech-debt of the more severe kind if it keeps spreading. Eventually this code will be used from higher layers (e.g. cupy), these prints will get overlooked, redirected to /dev/null, etc and quite likely be incomprehensible to people a few steps up the food chain.

I think what we need is a CCCL_C_ERROR system of error codes with associated cccl_set_error()/cccl_get_error() or similar.

@rwgk
Copy link
Contributor

rwgk commented Feb 18, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CCCL-C For all items related to the C library
Projects
Status: Todo
Development

No branches or pull requests

2 participants