fix: replace griddepcontrol inline PTX with CUDA runtime API #2720
he-yufeng wants to merge 1 commit into flashinfer-ai:main from
Conversation
Replace all `asm volatile` griddepcontrol.wait and launch_dependents with their CUDA runtime API equivalents:

- `griddepcontrol.wait` -> `cudaGridDependencySynchronize()`
- `griddepcontrol.launch_dependents` -> `cudaTriggerProgrammaticLaunchCompletion()`

The inline PTX variants lacked a memory clobber, which is undefined behavior: the compiler is free to reorder memory accesses across the barrier, potentially causing incorrect results in fused/PDL kernels. The CUDA runtime wrappers (available since CUDA 12.0, the same toolkit that introduced the PTX instructions) include proper compiler barriers and are the recommended approach per NVIDIA.

Fixes flashinfer-ai#2558
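As a hedged illustration of the substitution pattern (the kernel below is hypothetical and not taken from the PR diff; only the two runtime calls and the SM90 guard mirror the actual change):

```cuda
#include <cuda_runtime.h>

// Hypothetical PDL-style kernel showing the before/after substitution.
__global__ void fused_stage(float* out, const float* in, int n) {
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  // Before: asm volatile("griddepcontrol.wait;");  // no "memory" clobber
  // After: the runtime wrapper, which also acts as a compiler barrier.
  cudaGridDependencySynchronize();
#endif

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;

#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  // Before: asm volatile("griddepcontrol.launch_dependents;");
  cudaTriggerProgrammaticLaunchCompletion();
#endif
}
```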
Summary of Changes (Gemini Code Assist): This pull request modernizes the CUDA kernel implementations by migrating from direct PTX assembly instructions for grid dependency control to the official CUDA runtime API functions. This change improves code readability and maintainability, and ensures better compatibility with future CUDA toolkit versions, addressing a known issue related to the use of inline assembly for these operations.
📝 Walkthrough: This pull request replaces inline CUDA assembly instructions for grid dependency control with corresponding CUDA runtime API calls across 19 files. The replacements follow a consistent pattern.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
Details: This replaces all 79 occurrences of inline PTX `griddepcontrol`.
Why: The inline PTX lacked a memory clobber.
Scope: Covers all C/C++ files on current main.
Code Review
This pull request is a large-scale refactoring that replaces deprecated griddepcontrol inline PTX assembly instructions with their modern CUDA runtime API equivalents: cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion. This is a valuable improvement for code quality, enhancing readability, maintainability, and ensuring forward compatibility with future CUDA versions. I have reviewed the changes across all 19 files and found them to be correct and consistently applied. No issues were found.
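For context, these device-side calls only take effect when the dependent kernel is launched with programmatic dependent launch (PDL) enabled on the host side. A minimal sketch of such a launch, assuming CUDA 12.0+ (the kernel names and launch dimensions are illustrative, not from this PR):

```cuda
#include <cuda_runtime.h>

__global__ void producer();
__global__ void consumer();

void launch_pdl_pair(cudaStream_t stream) {
  producer<<<108, 256, 0, stream>>>();

  // Opt the dependent kernel into programmatic dependent launch, so that
  // cudaGridDependencySynchronize() in consumer waits for the producer's
  // cudaTriggerProgrammaticLaunchCompletion().
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t cfg{};
  cfg.gridDim = dim3(108);
  cfg.blockDim = dim3(256);
  cfg.stream = stream;
  cfg.attrs = &attr;
  cfg.numAttrs = 1;
  cudaLaunchKernelEx(&cfg, consumer);
}
```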
Note: Security Review did not run due to the size of the PR.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh (1)
Lines 1074-1079: ⚠️ Potential issue | 🟠 Major: Remove duplicate synchronization calls in the `moereduce_allreduce_fusion_kernel_oneshot_lamport` kernel. The kernel has redundant calls that execute sequentially:

- `cudaGridDependencySynchronize()` at lines 937 and 969
- `cudaTriggerProgrammaticLaunchCompletion()` at lines 1074 and 1078

Remove the outer guards at lines 936–938 and 1077–1079, keeping only the calls within the main SM90 block (lines 969 and 1074).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh` around lines 1074 - 1079, In the moereduce_allreduce_fusion_kernel_oneshot_lamport kernel there are duplicate sync calls; remove the outer conditional guards that call cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion() so only the calls inside the SM90-specific block (the guard using defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) remain; ensure you leave the cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion invocations that are inside the SM90 block and delete the redundant ones outside it.
🧹 Nitpick comments (1)
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (1)
Lines 156-158: These PDL runtime API calls are already guarded for SM90+ architectures, making them safe for compilation. The calls at lines 156-158 and 249-251 are protected by `#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))` guards. Since `cudaGridDependencySynchronize()` and `cudaTriggerProgrammaticLaunchCompletion()` were introduced in CUDA Toolkit 12.0 and SM90 (Hopper) also requires CUDA 12.0+, the implicit guarantee holds.

However, for clarity and explicitness, consider adding an explicit `CUDART_VERSION >= 12000` guard alongside the existing SM90 check to make the toolkit version requirement self-documenting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh` around lines 156 - 158, The preprocessor guards around the PDL runtime calls (cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion) rely only on SM90 checks; update both conditionals that currently read like "#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))" to also require the CUDA runtime version by adding a "&& (defined(CUDART_VERSION) && (CUDART_VERSION >= 12000))" clause so the code explicitly documents and enforces CUDA Toolkit 12.0+ when calling cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion.
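The combined guard this prompt describes would look roughly as follows (a sketch of the suggested preprocessor condition, not the committed code):

```cuda
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) && \
    (defined(CUDART_VERSION) && (CUDART_VERSION >= 12000))
  cudaGridDependencySynchronize();
  // ... kernel body ...
  cudaTriggerProgrammaticLaunchCompletion();
#endif
```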
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/xqa/utils.cuh`:
- Around line 769-779: The runtime calls in preExit() and acqBulk() are only
guarded by __CUDA_ARCH__ but must also be gated by the CUDA compiler version;
wrap the cudaTriggerProgrammaticLaunchCompletion() call in preExit and
cudaGridDependencySynchronize() call in acqBulk with an additional compile-time
check for __CUDACC_VER_MAJOR__ >= 12 (i.e. require both (__CUDA_ARCH__ >= 900)
and (__CUDACC_VER_MAJOR__ >= 12)) so these APIs are only used when the toolchain
supports CUDA 12+; update the preExit and acqBulk macros accordingly to match
the pattern used elsewhere (e.g., include/flashinfer/*.cuh).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0fc62ef-cc14-47bc-a8e6-14621f9bc629
📒 Files selected for processing (19)
- csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
- csrc/fused_moe/noAuxTcKernels.cu
- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/arch/grid_dependency_control.h
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
- csrc/xqa/utils.cuh
- include/flashinfer/activation.cuh
- include/flashinfer/attention/blackwell/kernel/sm100_fmha_fwd_kernel_tma_warpspecialized.hpp
- include/flashinfer/attention/blackwell/plan.cuh
- include/flashinfer/attention/cascade.cuh
- include/flashinfer/attention/decode.cuh
- include/flashinfer/attention/prefill.cuh
- include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh
- include/flashinfer/gemm/dsv3_router_gemm.cuh
- include/flashinfer/gemm/group_gemm_fp8_groupwise_sm100.cuh
- include/flashinfer/gemm/group_gemm_mxfp4_groupwise_sm100.cuh
- include/flashinfer/norm.cuh
- include/flashinfer/pos_enc.cuh
- include/flashinfer/sampling.cuh
- include/flashinfer/trtllm/fmha/lse.cuh
```diff
 __device__ inline void preExit() {
 #if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
-  asm volatile("griddepcontrol.launch_dependents;\n");
+  cudaTriggerProgrammaticLaunchCompletion();
 #endif
 }

 __device__ inline void acqBulk() {
 #if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
-  asm volatile("griddepcontrol.wait;\n");
+  cudaGridDependencySynchronize();
 #endif
 }
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Programmatic launch API usage and guards =="
rg -n -C2 'cudaGridDependencySynchronize|cudaTriggerProgrammaticLaunchCompletion|__CUDACC_VER_MAJOR__' csrc include
echo
echo "== Repository CUDA-version references =="
fd '^(README.*|CMakeLists\.txt|.*\.cmake|Dockerfile.*)$' . -X rg -n 'CUDA|11\.8|12\.' {} 2>/dev/null || true
```

Repository: flashinfer-ai/flashinfer
Length of output: 50380
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Repository CUDA version constraints =="
find . -maxdepth 2 \( -name "CMakeLists.txt" -o -name "README.md" -o -name "setup.py" -o -name "*.cmake" \) -type f | head -20 | while read f; do
  echo "--- $f ---"
  grep -i "cuda" "$f" | grep -E "(11|12|version|minimum)" | head -10 || true
done
echo
echo "== CUDA version checks in recently modified kernels =="
rg -n "__CUDACC_VER" csrc/fused_moe/ csrc/xqa/ include/flashinfer/ -A1 -B1 | head -80
```

Repository: flashinfer-ai/flashinfer
Length of output: 7420
Restore the CUDA 12 guard around these runtime calls.
Lines 771 and 777 switch from inline PTX to CUDA runtime APIs (cudaTriggerProgrammaticLaunchCompletion and cudaGridDependencySynchronize), but only guard on __CUDA_ARCH__. The repository officially supports CUDA 12.6+ and every other use of these APIs in the codebase gates them with __CUDACC_VER_MAJOR__ >= 12 (see include/flashinfer/sampling.cuh, include/flashinfer/norm.cuh, include/flashinfer/trtllm/fmha/lse.cuh). Without the version guard, compilation can fail on older toolchains.
Suggested patch:

```diff
 __device__ inline void preExit() {
-#if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
+#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
   cudaTriggerProgrammaticLaunchCompletion();
 #endif
 }

 __device__ inline void acqBulk() {
-#if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
+#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
   cudaGridDependencySynchronize();
 #endif
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
__device__ inline void preExit() {
#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  cudaTriggerProgrammaticLaunchCompletion();
#endif
}

__device__ inline void acqBulk() {
#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  cudaGridDependencySynchronize();
#endif
}
```
Fixes #2558 - replaces asm volatile griddepcontrol with cudaGridDependencySynchronize/cudaTriggerProgrammaticLaunchCompletion across 19 files