fix: replace griddepcontrol inline PTX with CUDA runtime API #2720
he-yufeng wants to merge 1 commit into flashinfer-ai:main from
Conversation
Replace all `asm volatile` griddepcontrol.wait and launch_dependents with their CUDA runtime API equivalents:

- `griddepcontrol.wait` -> `cudaGridDependencySynchronize()`
- `griddepcontrol.launch_dependents` -> `cudaTriggerProgrammaticLaunchCompletion()`

The inline PTX variants lacked a memory clobber, which is undefined behavior: the compiler is free to reorder memory accesses across the barrier, potentially causing incorrect results in fused/PDL kernels. The CUDA runtime wrappers (available since CUDA 12.0, the same toolkit that introduced the PTX instructions) include proper compiler barriers and are the recommended approach per NVIDIA.

Fixes flashinfer-ai#2558
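As a hedged illustration of the substitution pattern (the kernel below is hypothetical and not taken from the PR diff; only the two runtime calls and the SM90 guard mirror the actual change):

```cuda
#include <cuda_runtime.h>

// Hypothetical PDL-style kernel showing the before/after substitution.
__global__ void fused_stage(float* out, const float* in, int n) {
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  // Before: asm volatile("griddepcontrol.wait;");  // no "memory" clobber
  // After: the runtime wrapper, which also acts as a compiler barrier.
  cudaGridDependencySynchronize();
#endif

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;

#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  // Before: asm volatile("griddepcontrol.launch_dependents;");
  cudaTriggerProgrammaticLaunchCompletion();
#endif
}
```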
Summary of Changes (Gemini Code Assist): This pull request modernizes the CUDA kernel implementations by migrating from direct PTX assembly instructions for grid dependency control to the official CUDA runtime API functions. This change improves code readability and maintainability, and ensures better compatibility with future CUDA toolkit versions, addressing a known issue related to the use of inline assembly for these operations.
📝 Walkthrough: This pull request replaces inline CUDA assembly instructions for grid dependency control with corresponding CUDA runtime API calls across 19 files. The replacements follow a consistent pattern.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
Details: This replaces all 79 occurrences of inline PTX `griddepcontrol`.
Why: The inline PTX lacked a memory clobber.
Scope: Covers all C/C++ files on current main.
Code Review
This pull request is a large-scale refactoring that replaces deprecated griddepcontrol inline PTX assembly instructions with their modern CUDA runtime API equivalents: cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion. This is a valuable improvement for code quality, enhancing readability, maintainability, and ensuring forward compatibility with future CUDA versions. I have reviewed the changes across all 19 files and found them to be correct and consistently applied. No issues were found.
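For context, these device-side calls only take effect when the dependent kernel is launched with programmatic dependent launch (PDL) enabled on the host side. A minimal sketch of such a launch, assuming CUDA 12.0+ (the kernel names and launch dimensions are illustrative, not from this PR):

```cuda
#include <cuda_runtime.h>

__global__ void producer();
__global__ void consumer();

void launch_pdl_pair(cudaStream_t stream) {
  producer<<<108, 256, 0, stream>>>();

  // Opt the dependent kernel into programmatic dependent launch, so that
  // cudaGridDependencySynchronize() in consumer waits for the producer's
  // cudaTriggerProgrammaticLaunchCompletion().
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t cfg{};
  cfg.gridDim = dim3(108);
  cfg.blockDim = dim3(256);
  cfg.stream = stream;
  cfg.attrs = &attr;
  cfg.numAttrs = 1;
  cudaLaunchKernelEx(&cfg, consumer);
}
```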
Note: Security Review did not run due to the size of the PR.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh (1)
Lines 1074-1079: ⚠️ Potential issue | 🟠 Major: Remove duplicate synchronization calls in the `moereduce_allreduce_fusion_kernel_oneshot_lamport` kernel. The kernel has redundant calls that execute sequentially:

- `cudaGridDependencySynchronize()` at lines 937 and 969
- `cudaTriggerProgrammaticLaunchCompletion()` at lines 1074 and 1078

Remove the outer guards at lines 936–938 and 1077–1079, keeping only the calls within the main SM90 block (lines 969 and 1074).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh` around lines 1074 - 1079, In the moereduce_allreduce_fusion_kernel_oneshot_lamport kernel there are duplicate sync calls; remove the outer conditional guards that call cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion() so only the calls inside the SM90-specific block (the guard using defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) remain; ensure you leave the cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion invocations that are inside the SM90 block and delete the redundant ones outside it.
🧹 Nitpick comments (1)
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (1)
Lines 156-158: These PDL runtime API calls are already guarded for SM90+ architectures, making them safe for compilation. The calls at lines 156-158 and 249-251 are protected by `#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))` guards. Since `cudaGridDependencySynchronize()` and `cudaTriggerProgrammaticLaunchCompletion()` were introduced in CUDA Toolkit 12.0 and SM90 (Hopper) also requires CUDA 12.0+, the implicit guarantee holds.

However, for clarity and explicitness, consider adding an explicit `CUDART_VERSION >= 12000` guard alongside the existing SM90 check to make the toolkit version requirement self-documenting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh` around lines 156 - 158, The preprocessor guards around the PDL runtime calls (cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion) rely only on SM90 checks; update both conditionals that currently read like "#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))" to also require the CUDA runtime version by adding a "&& (defined(CUDART_VERSION) && (CUDART_VERSION >= 12000))" clause so the code explicitly documents and enforces CUDA Toolkit 12.0+ when calling cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion.
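The combined guard this prompt describes would look roughly as follows (a sketch of the suggested preprocessor condition, not the committed code):

```cuda
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) && \
    (defined(CUDART_VERSION) && (CUDART_VERSION >= 12000))
  cudaGridDependencySynchronize();
  // ... kernel body ...
  cudaTriggerProgrammaticLaunchCompletion();
#endif
```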
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/xqa/utils.cuh`:
- Around line 769-779: The runtime calls in preExit() and acqBulk() are only
guarded by __CUDA_ARCH__ but must also be gated by the CUDA compiler version;
wrap the cudaTriggerProgrammaticLaunchCompletion() call in preExit and
cudaGridDependencySynchronize() call in acqBulk with an additional compile-time
check for __CUDACC_VER_MAJOR__ >= 12 (i.e. require both (__CUDA_ARCH__ >= 900)
and (__CUDACC_VER_MAJOR__ >= 12)) so these APIs are only used when the toolchain
supports CUDA 12+; update the preExit and acqBulk macros accordingly to match
the pattern used elsewhere (e.g., include/flashinfer/*.cuh).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0fc62ef-cc14-47bc-a8e6-14621f9bc629
📒 Files selected for processing (19)
- csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
- csrc/fused_moe/noAuxTcKernels.cu
- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/arch/grid_dependency_control.h
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
- csrc/xqa/utils.cuh
- include/flashinfer/activation.cuh
- include/flashinfer/attention/blackwell/kernel/sm100_fmha_fwd_kernel_tma_warpspecialized.hpp
- include/flashinfer/attention/blackwell/plan.cuh
- include/flashinfer/attention/cascade.cuh
- include/flashinfer/attention/decode.cuh
- include/flashinfer/attention/prefill.cuh
- include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh
- include/flashinfer/gemm/dsv3_router_gemm.cuh
- include/flashinfer/gemm/group_gemm_fp8_groupwise_sm100.cuh
- include/flashinfer/gemm/group_gemm_mxfp4_groupwise_sm100.cuh
- include/flashinfer/norm.cuh
- include/flashinfer/pos_enc.cuh
- include/flashinfer/sampling.cuh
- include/flashinfer/trtllm/fmha/lse.cuh
```diff
 __device__ inline void preExit() {
 #if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
-  asm volatile("griddepcontrol.launch_dependents;\n");
+  cudaTriggerProgrammaticLaunchCompletion();
 #endif
 }

 __device__ inline void acqBulk() {
 #if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
-  asm volatile("griddepcontrol.wait;\n");
+  cudaGridDependencySynchronize();
 #endif
 }
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Programmatic launch API usage and guards =="
rg -n -C2 'cudaGridDependencySynchronize|cudaTriggerProgrammaticLaunchCompletion|__CUDACC_VER_MAJOR__' csrc include
echo
echo "== Repository CUDA-version references =="
fd '^(README.*|CMakeLists\.txt|.*\.cmake|Dockerfile.*)$' . -X rg -n 'CUDA|11\.8|12\.' {} 2>/dev/null || true
```

Repository: flashinfer-ai/flashinfer
Length of output: 50380
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Repository CUDA version constraints =="
find . -maxdepth 2 \( -name "CMakeLists.txt" -o -name "README.md" -o -name "setup.py" -o -name "*.cmake" \) -type f | head -20 | while read f; do
  echo "--- $f ---"
  grep -i "cuda" "$f" | grep -E "(11|12|version|minimum)" | head -10 || true
done
echo
echo "== CUDA version checks in recently modified kernels =="
rg -n "__CUDACC_VER" csrc/fused_moe/ csrc/xqa/ include/flashinfer/ -A1 -B1 | head -80
```

Repository: flashinfer-ai/flashinfer
Length of output: 7420
Restore the CUDA 12 guard around these runtime calls.
Lines 771 and 777 switch from inline PTX to CUDA runtime APIs (cudaTriggerProgrammaticLaunchCompletion and cudaGridDependencySynchronize), but only guard on __CUDA_ARCH__. The repository officially supports CUDA 12.6+ and every other use of these APIs in the codebase gates them with __CUDACC_VER_MAJOR__ >= 12 (see include/flashinfer/sampling.cuh, include/flashinfer/norm.cuh, include/flashinfer/trtllm/fmha/lse.cuh). Without the version guard, compilation can fail on older toolchains.
Suggested patch:

```diff
 __device__ inline void preExit() {
-#if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
+#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
   cudaTriggerProgrammaticLaunchCompletion();
 #endif
 }

 __device__ inline void acqBulk() {
-#if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
+#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
   cudaGridDependencySynchronize();
 #endif
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
__device__ inline void preExit() {
#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  cudaTriggerProgrammaticLaunchCompletion();
#endif
}

__device__ inline void acqBulk() {
#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  cudaGridDependencySynchronize();
#endif
}
```
Fixes #2558 - replaces asm volatile griddepcontrol with cudaGridDependencySynchronize/cudaTriggerProgrammaticLaunchCompletion across 19 files