[STF] Low level interface for the cuda_kernel(_chain) construct #5319

Merged
caugonnet merged 39 commits into NVIDIA:main from caugonnet:stf_cuda_kernel_lowlevel
Aug 4, 2025

Conversation

@caugonnet
Contributor

Description

Implement a low-level interface for the cuda_kernel(_chain) construct so that it can later be called from C/Python.
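
As a rough illustration of why an untyped entry point helps language bindings, the sketch below contrasts a variadic C++ front end with one that takes an array of argument pointers, similar in spirit to the `void**` kernel parameter list accepted by the CUDA launch APIs. All names here (`launch_desc_sketch`, `make_desc_variadic`, `make_desc_lowlevel`) are hypothetical and are not the identifiers introduced by this PR; the code needs C++17 and no CUDA toolkit.

```cpp
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a kernel launch description.
// This is NOT the STF cuda_kernel_desc type, only a sketch of the idea.
struct launch_desc_sketch
{
  std::vector<const void*> arg_ptrs; // one pointer per kernel argument
};

// Typed, variadic front end: convenient from C++, but templates cannot be
// called from C or wrapped directly by a Python binding.
template <typename... Args>
launch_desc_sketch make_desc_variadic(const std::tuple<Args...>& args)
{
  launch_desc_sketch d;
  // Record the address of each tuple element; the tuple must outlive d.
  std::apply([&](const Args&... a) { (d.arg_ptrs.push_back(&a), ...); }, args);
  return d;
}

// Untyped, low-level front end: a plain array of pointers plus a count,
// which a C caller can build without knowing any C++ types.
launch_desc_sketch make_desc_lowlevel(const void* const* args, std::size_t count)
{
  launch_desc_sketch d;
  d.arg_ptrs.assign(args, args + count);
  return d;
}
```

Both functions end up with the same pointer list; the second one simply moves the packing step to the caller, which is what makes it usable across a C boundary. An empty tuple or a count of zero yields a description with no arguments.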

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@caugonnet caugonnet requested review from a team as code owners July 20, 2025 06:43
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jul 20, 2025
@caugonnet caugonnet requested review from gonidelis and wmaxey July 20, 2025 06:43
@copy-pr-bot
Contributor

copy-pr-bot bot commented Jul 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@caugonnet caugonnet requested a review from griwes July 20, 2025 06:43
@caugonnet caugonnet self-assigned this Jul 20, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Jul 20, 2025
@caugonnet
Contributor Author

/ok to test 316167f

@github-actions
Contributor

🟩 CI finished in 32m 04s: Pass: 100%/32 | Total: 7h 03m | Avg: 13m 13s | Max: 29m 27s | Hits: 75%/15978
  • 🟩 cudax: Pass: 100%/28 | Total: 6h 48m | Avg: 14m 35s | Max: 29m 27s | Hits: 75%/15978

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  5h 52m | Avg: 14m 41s | Max: 29m 27s | Hits:  76%/13522 
      🟩 arm64              Pass: 100%/4   | Total: 56m 00s | Avg: 14m 00s | Max: 15m 47s | Hits:  69%/2456  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 37m 39s | Avg: 12m 33s | Max: 14m 38s | Hits:  75%/1537  
      🟩 12.9               Pass: 100%/25  | Total:  6h 10m | Avg: 14m 49s | Max: 29m 27s | Hits:  75%/14441 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 37m 39s | Avg: 12m 33s | Max: 14m 38s | Hits:  75%/1537  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  6h 10m | Avg: 14m 49s | Max: 29m 27s | Hits:  75%/14441 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  6h 48m | Avg: 14m 35s | Max: 29m 27s | Hits:  75%/15978 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total: 24m 59s | Avg: 12m 29s | Max: 13m 21s | Hits:  70%/1230  
      🟩 Clang15            Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s | Hits:  70%/614   
      🟩 Clang16            Pass: 100%/1   | Total: 15m 15s | Avg: 15m 15s | Max: 15m 15s | Hits:  70%/614   
      🟩 Clang17            Pass: 100%/1   | Total: 15m 21s | Avg: 15m 21s | Max: 15m 21s | Hits:  70%/614   
      🟩 Clang18            Pass: 100%/1   | Total: 15m 02s | Avg: 15m 02s | Max: 15m 02s | Hits:  70%/614   
      🟩 Clang19            Pass: 100%/4   | Total: 49m 06s | Avg: 12m 16s | Max: 15m 12s | Hits:  77%/2456  
      🟩 GCC10              Pass: 100%/2   | Total: 30m 31s | Avg: 15m 15s | Max: 15m 53s | Hits:  69%/1230  
      🟩 GCC11              Pass: 100%/1   | Total: 16m 05s | Avg: 16m 05s | Max: 16m 05s | Hits:  69%/614   
      🟩 GCC12              Pass: 100%/1   | Total: 17m 47s | Avg: 17m 47s | Max: 17m 47s | Hits:  69%/614   
      🟩 GCC13              Pass: 100%/8   | Total:  1h 47m | Avg: 13m 23s | Max: 16m 58s | Hits:  77%/4912  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 23s | Avg: 11m 23s | Max: 11m 23s | Hits:  95%/309   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 33m 06s | Avg: 11m 02s | Max: 11m 47s | Hits:  95%/933   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 57m 33s | Avg: 28m 46s | Max: 29m 27s | Hits:  67%/1224  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total:  2h 14m | Avg: 13m 29s | Max: 15m 21s | Hits:  73%/6142  
      🟩 GCC                Pass: 100%/12  | Total:  2h 51m | Avg: 14m 17s | Max: 17m 47s | Hits:  74%/7370  
      🟩 MSVC               Pass: 100%/4   | Total: 44m 29s | Avg: 11m 07s | Max: 11m 47s | Hits:  95%/1242  
      🟩 NVHPC              Pass: 100%/2   | Total: 57m 33s | Avg: 28m 46s | Max: 29m 27s | Hits:  67%/1224  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 19m 51s | Avg:  9m 55s | Max: 12m 33s | Hits:  84%/1228  
      🟩 rtx2080            Pass: 100%/26  | Total:  6h 28m | Avg: 14m 56s | Max: 29m 27s | Hits:  74%/14750 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  6h 22m | Avg: 15m 18s | Max: 29m 27s | Hits:  71%/14136 
      🟩 Test               Pass: 100%/3   | Total: 25m 54s | Avg:  8m 38s | Max: 10m 34s | Hits:  99%/1842  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 19m 51s | Avg:  9m 55s | Max: 12m 33s | Hits:  84%/1228  
      🟩 90;90a             Pass: 100%/2   | Total: 25m 28s | Avg: 12m 44s | Max: 14m 39s | Hits:  78%/925   
      🟩 100;120            Pass: 100%/2   | Total: 25m 25s | Avg: 12m 42s | Max: 14m 55s | Hits:  78%/925   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 54m 58s | Avg: 18m 19s | Max: 28m 06s | Hits:  69%/1840  
      🟩 20                 Pass: 100%/25  | Total:  5h 53m | Avg: 14m 08s | Max: 29m 27s | Hits:  75%/14138 
    
  • 🟩 packaging: Pass: 100%/4 | Total: 14m 39s | Avg: 3m 39s | Max: 3m 57s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 14m 39s | Avg:  3m 39s | Max:  3m 57s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  3m 57s
      🟩 12.9               Pass: 100%/2   | Total:  7m 17s | Avg:  3m 38s | Max:  3m 55s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  3m 57s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  7m 17s | Avg:  3m 38s | Max:  3m 55s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 14m 39s | Avg:  3m 39s | Max:  3m 57s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 57s | Avg:  3m 57s | Max:  3m 57s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 22s | Avg:  3m 22s | Max:  3m 22s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 25s | Avg:  3m 25s | Max:  3m 25s
      🟩 GCC13              Pass: 100%/1   | Total:  3m 55s | Avg:  3m 55s | Max:  3m 55s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  7m 19s | Avg:  3m 39s | Max:  3m 57s
      🟩 GCC                Pass: 100%/2   | Total:  7m 20s | Avg:  3m 40s | Max:  3m 55s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 14m 39s | Avg:  3m 39s | Max:  3m 57s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 14m 39s | Avg:  3m 39s | Max:  3m 57s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
CCCL Packaging
libcu++
CUB
Thrust
+/- CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- CCCL Packaging
libcu++
CUB
Thrust
+/- CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 32)

# Runner
17 linux-amd64-cpu16
6 linux-amd64-gpu-rtx2080-latest-1
4 linux-arm64-cpu16
4 windows-amd64-cpu16
1 linux-amd64-gpu-h100-latest-1

Comment on lines 124 to 125
dim3 gridDim;
dim3 blockDim;
Contributor

Do these have a default ctor, and if not, should we add a = {}?
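
For context on this question: to my understanding, `dim3` in C++ has a constructor whose three arguments default to 1, so a bare member is already initialized, though this is worth confirming against the CUDA headers in use. The sketch below uses a hypothetical stand-in type (`dim3_like`) rather than the real `dim3` so that it compiles without the CUDA toolkit, and contrasts it with a plain aggregate where `= {}` does change the outcome. `launch_config_sketch` and `plain_dims` are likewise illustrative names, not types from this PR.

```cpp
// Hypothetical stand-in mirroring how dim3 is assumed to behave in C++:
// a constructor whose arguments all default to 1.
struct dim3_like
{
  unsigned x, y, z;
  constexpr dim3_like(unsigned vx = 1, unsigned vy = 1, unsigned vz = 1)
      : x(vx)
      , y(vy)
      , z(vz)
  {}
};

// Plain aggregate with no constructor, for contrast.
struct plain_dims
{
  unsigned x, y, z;
};

// Illustrative struct, not the one under review.
struct launch_config_sketch
{
  dim3_like gridDim;       // default constructor runs: (1, 1, 1)
  dim3_like blockDim = {}; // same result; the initializer only documents intent
  plain_dims raw = {};     // zero-initialized; without "= {}" the fields would be indeterminate
};
```

Under that assumption, adding `= {}` to the two `dim3` members would be harmless but would not change their values.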

@andralex
Contributor

Made a pass, good stuff @caugonnet. Thanks @bernhardmgruber for taking a look!

@caugonnet
Contributor Author

/ok to test c3a4c99

@copy-pr-bot
Contributor

copy-pr-bot bot commented Aug 4, 2025

/ok to test c3a4c99

@caugonnet, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

…pe.cuh

Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>
@caugonnet
Contributor Author

/ok to test 9a9cde3

@copy-pr-bot
Contributor

copy-pr-bot bot commented Aug 4, 2025

/ok to test 9a9cde3

@caugonnet, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@caugonnet
Contributor Author

/ok to test dec6b46

@github-actions
Contributor

github-actions bot commented Aug 4, 2025

🟩 CI finished in 32m 30s: Pass: 100%/32 | Total: 7h 41m | Avg: 14m 24s | Max: 32m 25s | Hits: 72%/15390
  • 🟩 cudax: Pass: 100%/28 | Total: 7h 27m | Avg: 15m 59s | Max: 32m 25s | Hits: 72%/15390

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  6h 24m | Avg: 16m 01s | Max: 32m 25s | Hits:  72%/13018 
      🟩 arm64              Pass: 100%/4   | Total:  1h 02m | Avg: 15m 42s | Max: 16m 20s | Hits:  68%/2372  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 37m 52s | Avg: 12m 37s | Max: 14m 10s | Hits:  72%/1474  
      🟩 12.9               Pass: 100%/25  | Total:  6h 49m | Avg: 16m 23s | Max: 32m 25s | Hits:  72%/13916 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 37m 52s | Avg: 12m 37s | Max: 14m 10s | Hits:  72%/1474  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  6h 49m | Avg: 16m 23s | Max: 32m 25s | Hits:  72%/13916 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  7h 27m | Avg: 15m 59s | Max: 32m 25s | Hits:  72%/15390 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total: 28m 44s | Avg: 14m 22s | Max: 16m 29s | Hits:  69%/1188  
      🟩 Clang15            Pass: 100%/1   | Total: 17m 26s | Avg: 17m 26s | Max: 17m 26s | Hits:  67%/593   
      🟩 Clang16            Pass: 100%/1   | Total: 17m 03s | Avg: 17m 03s | Max: 17m 03s | Hits:  61%/593   
      🟩 Clang17            Pass: 100%/1   | Total: 16m 01s | Avg: 16m 01s | Max: 16m 01s | Hits:  66%/593   
      🟩 Clang18            Pass: 100%/1   | Total: 17m 03s | Avg: 17m 03s | Max: 17m 03s | Hits:  65%/593   
      🟩 Clang19            Pass: 100%/4   | Total: 54m 48s | Avg: 13m 42s | Max: 15m 38s | Hits:  76%/2372  
      🟩 GCC10              Pass: 100%/2   | Total: 31m 56s | Avg: 15m 58s | Max: 17m 46s | Hits:  68%/1188  
      🟩 GCC11              Pass: 100%/1   | Total: 17m 31s | Avg: 17m 31s | Max: 17m 31s | Hits:  61%/593   
      🟩 GCC12              Pass: 100%/1   | Total: 19m 32s | Avg: 19m 32s | Max: 19m 32s | Hits:  68%/593   
      🟩 GCC13              Pass: 100%/8   | Total:  1h 56m | Avg: 14m 34s | Max: 18m 36s | Hits:  75%/4744  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 27s | Avg: 11m 27s | Max: 11m 27s | Hits:  89%/288   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 38m 02s | Avg: 12m 40s | Max: 13m 22s | Hits:  91%/870   
      🟩 NVHPC25.5          Pass: 100%/2   | Total:  1h 01m | Avg: 30m 42s | Max: 32m 25s | Hits:  61%/1182  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total:  2h 31m | Avg: 15m 06s | Max: 17m 26s | Hits:  70%/5932  
      🟩 GCC                Pass: 100%/12  | Total:  3h 05m | Avg: 15m 27s | Max: 19m 32s | Hits:  72%/7118  
      🟩 MSVC               Pass: 100%/4   | Total: 49m 29s | Avg: 12m 22s | Max: 13m 22s | Hits:  90%/1158  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 01m | Avg: 30m 42s | Max: 32m 25s | Hits:  61%/1182  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 21m 15s | Avg: 10m 37s | Max: 12m 13s | Hits:  84%/1186  
      🟩 rtx2080            Pass: 100%/26  | Total:  7h 06m | Avg: 16m 23s | Max: 32m 25s | Hits:  71%/14204 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  6h 56m | Avg: 16m 40s | Max: 32m 25s | Hits:  68%/13611 
      🟩 Test               Pass: 100%/3   | Total: 30m 34s | Avg: 10m 11s | Max: 12m 37s | Hits:  99%/1779  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 21m 15s | Avg: 10m 37s | Max: 12m 13s | Hits:  84%/1186  
      🟩 90;90a             Pass: 100%/2   | Total: 29m 09s | Avg: 14m 34s | Max: 15m 47s | Hits:  76%/883   
      🟩 100;120            Pass: 100%/2   | Total: 27m 14s | Avg: 13m 37s | Max: 15m 46s | Hits:  74%/883   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total:  1h 00m | Avg: 20m 02s | Max: 28m 59s | Hits:  65%/1777  
      🟩 20                 Pass: 100%/25  | Total:  6h 27m | Avg: 15m 29s | Max: 32m 25s | Hits:  73%/13613 
    
  • 🟩 packaging: Pass: 100%/4 | Total: 13m 46s | Avg: 3m 26s | Max: 3m 31s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 31s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  6m 52s | Avg:  3m 26s | Max:  3m 27s
      🟩 12.9               Pass: 100%/2   | Total:  6m 54s | Avg:  3m 27s | Max:  3m 31s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  6m 52s | Avg:  3m 26s | Max:  3m 27s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  6m 54s | Avg:  3m 27s | Max:  3m 31s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 31s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 27s | Avg:  3m 27s | Max:  3m 27s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 31s | Avg:  3m 31s | Max:  3m 31s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 25s | Avg:  3m 25s | Max:  3m 25s
      🟩 GCC13              Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  6m 58s | Avg:  3m 29s | Max:  3m 31s
      🟩 GCC                Pass: 100%/2   | Total:  6m 48s | Avg:  3m 24s | Max:  3m 25s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 31s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 31s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
CCCL Packaging
libcu++
CUB
Thrust
+/- CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- CCCL Packaging
libcu++
CUB
Thrust
+/- CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 32)

# Runner
17 linux-amd64-cpu16
6 linux-amd64-gpu-rtx2080-latest-1
4 linux-arm64-cpu16
4 windows-amd64-cpu16
1 linux-amd64-gpu-h100-latest-1

@caugonnet caugonnet enabled auto-merge (squash) August 4, 2025 14:57
@caugonnet caugonnet merged commit 8e6c018 into NVIDIA:main Aug 4, 2025
45 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Aug 4, 2025
davebayer pushed a commit to davebayer/cccl that referenced this pull request Sep 23, 2025
…IA#5319)

* Allow CUfunction (driver API) in the cuda_kernel(_chain) API

* clang-format

* We have a std::tuple not a cuda::std::tuple (yet)

* If CUDASTF_CUDA_KERNEL_DEBUG is set, we display the number of registers used by kernels

* Support CUkernel in addition to CUfunction

* Add a test with CUfunction and CUkernel

* Check whether CUkernel is supported

* use _CCCL_ASSERT instead of assert to avoid an unused variable error

* cudaGetKernel was added in CUDA 12.1

* clang-format

* Extract the start and end phase of the ->* operator

* There is no need to store untyped_t as we now store the task with its type

* Implement the low level interface for cuda_kernel(_chain) with a way to avoid using the ->* operator

* - Add a test to ensure we can put no arguments in the cuda_kernel_desc constructor
- Implement a low level API to describe cuda_kernel_desc with an array of
  pointers rather than a variadic interface (and use it in a test)

* simpler code, and do not check for CUDA_VERSION >= 12

* Simpler code

Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>

* Update cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh

Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>

* clang-format

* Do not test for CUDA_VERSION >= 12

* Add missing const

* add missing template

* Use _CCCL_CTK_AT_LEAST

* replace std::visit by std::get_if in get_num_registers

* use ::std::get_if instead of ::std::visit

* Add some tests for get_num_registers which actually fails (thanks @davebayer for finding this)

* Fix the method to get the number of registers for CUkernel

* Update cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh

Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>

---------

Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>
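
The commit notes above mention replacing `std::visit` with `std::get_if` in `get_num_registers`. A minimal sketch of that pattern follows, assuming the kernel handle is held in a `std::variant` with one alternative per handle kind. The handle types and the function name (`function_handle_sketch`, `kernel_handle_sketch`, `num_registers_sketch`) are hypothetical placeholders, not the driver API types or the actual STF implementation.

```cpp
#include <variant>

// Hypothetical placeholders for the two kinds of kernel handle.
// They carry a register count directly so the sketch runs without CUDA.
struct function_handle_sketch
{
  int num_regs;
};

struct kernel_handle_sketch
{
  int num_regs;
};

using any_kernel_sketch = std::variant<function_handle_sketch, kernel_handle_sketch>;

// Query the register count by probing each alternative with std::get_if,
// which returns nullptr when the variant holds a different alternative.
int num_registers_sketch(const any_kernel_sketch& k)
{
  if (const auto* f = std::get_if<function_handle_sketch>(&k))
  {
    return f->num_regs;
  }
  if (const auto* kh = std::get_if<kernel_handle_sketch>(&k))
  {
    return kh->num_regs;
  }
  return -1; // only reachable if the variant is valueless
}
```

Each branch here can contain code specific to its handle kind, without needing a visitor that accepts every alternative.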

Labels

stf Sequential Task Flow programming model

Projects

Archived in project

3 participants