
fix: replace griddepcontrol inline PTX with CUDA runtime API#2720

Open
he-yufeng wants to merge 1 commit into flashinfer-ai:main from he-yufeng:fix/griddepcontrol-use-cuda-api

Conversation


@he-yufeng he-yufeng commented Mar 7, 2026

Fixes #2558 - replaces asm volatile griddepcontrol with cudaGridDependencySynchronize/cudaTriggerProgrammaticLaunchCompletion across 19 files

Summary by CodeRabbit

  • Refactor
    • Modernized synchronization mechanisms across GPU kernels to use standard CUDA runtime API calls instead of low-level assembly instructions, improving code maintainability and compatibility with newer CUDA toolchains while preserving all existing functionality.

Replace all asm volatile griddepcontrol.wait and launch_dependents
with their CUDA runtime API equivalents:

- griddepcontrol.wait -> cudaGridDependencySynchronize()
- griddepcontrol.launch_dependents -> cudaTriggerProgrammaticLaunchCompletion()

The inline PTX variants lacked a memory clobber, which is undefined
behavior -- the compiler is free to reorder memory accesses across the
barrier, potentially causing incorrect results in fused/PDL kernels.

The CUDA runtime wrappers (available since CUDA 12.0, same toolkit that
introduced the PTX instructions) include proper compiler barriers and
are the recommended approach per NVIDIA.

Fixes flashinfer-ai#2558
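
For illustration, the per-kernel change follows roughly this shape (a hedged sketch, not an excerpt from the diff; `scale_kernel` and its parameters are invented for the example, and the surrounding architecture guards vary per file):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel illustrating the substitution. Previously the two
// barrier points were written as inline PTX without a "memory" clobber:
//   asm volatile("griddepcontrol.wait;");
//   asm volatile("griddepcontrol.launch_dependents;");
__global__ void scale_kernel(const float* in, float* out, int n) {
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  // Block until grids this kernel programmatically depends on have
  // completed; also acts as a compiler barrier for memory accesses.
  cudaGridDependencySynchronize();
#endif
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  // Signal early that dependent grids may begin launching (PDL).
  cudaTriggerProgrammaticLaunchCompletion();
#endif
}
```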
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request modernizes the CUDA kernel implementations by migrating from direct PTX assembly instructions for grid dependency control to the official CUDA runtime API functions. This change improves code readability, maintainability, and ensures better compatibility with future CUDA toolkit versions, addressing a known issue related to the use of inline assembly for these operations.

Highlights

  • CUDA API Modernization: Replaced inline PTX assembly calls for griddepcontrol.wait and griddepcontrol.launch_dependents with their respective CUDA runtime API equivalents, cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion().
  • Broad Impact: Applied these replacements across 19 different CUDA kernel files, ensuring consistent use of the new API for grid dependency control throughout the codebase.


Changelog
  • csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in buildMinLatencyActiveExpertMapsKernel, fusedBuildExpertMapsSortFirstTokenKernel, blockExpertPrefixSumKernel, globalExpertPrefixSumLargeKernel, globalExpertPrefixSumKernel, mergeExpertPrefixSumKernel, computeStridesTmaWarpSpecializedKernel, expandInputRowsKernel, finalizeMoeRoutingKernel, finalizeMoeRoutingNoFillingKernel, and doActivationKernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in buildMinLatencyActiveExpertMapsKernel, fusedBuildExpertMapsSortFirstTokenKernel, blockExpertPrefixSumKernel, globalExpertPrefixSumLargeKernel, globalExpertPrefixSumKernel, mergeExpertPrefixSumKernel, computeStridesTmaWarpSpecializedKernel, expandInputRowsKernel, finalizeMoeRoutingKernel, finalizeMoeRoutingNoFillingKernel, and doActivationKernel.
  • csrc/fused_moe/noAuxTcKernels.cu
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in deepseek_v3_topk_kernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in deepseek_v3_topk_kernel.
  • csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/arch/grid_dependency_control.h
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in launch_dependent_grids.
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in wait_on_dependent_grids.
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in quantize_with_block_size and quantize_with_block_size_tma.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in quantize_with_block_size and quantize_with_block_size_tma.
  • csrc/xqa/utils.cuh
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in preExit.
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in acqBulk.
  • include/flashinfer/activation.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in act_and_mul_kernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in act_and_mul_kernel.
  • include/flashinfer/attention/blackwell/kernel/sm100_fmha_fwd_kernel_tma_warpspecialized.hpp
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in operator().
  • include/flashinfer/attention/blackwell/plan.cuh
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in plan_kernel.
  • include/flashinfer/attention/cascade.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in PersistentVariableLengthMergeStatesKernel and PersistentVariableLengthAttentionSumKernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in PersistentVariableLengthMergeStatesKernel and PersistentVariableLengthAttentionSumKernel.
  • include/flashinfer/attention/decode.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in BatchDecodeWithPagedKVCacheDevice and BatchDecodeWithPagedKVCacheKernelMLA.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in BatchDecodeWithPagedKVCacheDevice and BatchDecodeWithPagedKVCacheKernelMLA.
  • include/flashinfer/attention/prefill.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in BatchPrefillWithRaggedKVCacheKernel and BatchPrefillWithPagedKVCacheDevice.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in BatchPrefillWithRaggedKVCacheKernel and BatchPrefillWithPagedKVCacheDevice.
  • include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in moereduce_allreduce_fusion_kernel_oneshot_lamport.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in moereduce_allreduce_fusion_kernel_oneshot_lamport.
  • include/flashinfer/gemm/dsv3_router_gemm.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in router_gemm_kernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in router_gemm_kernel.
  • include/flashinfer/gemm/group_gemm_fp8_groupwise_sm100.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in compute_sm100_cutlass_group_gemm_args.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in compute_sm100_cutlass_group_gemm_args.
  • include/flashinfer/gemm/group_gemm_mxfp4_groupwise_sm100.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in compute_sm100_cutlass_group_gemm_args.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in compute_sm100_cutlass_group_gemm_args.
  • include/flashinfer/norm.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in RMSNormKernel, RMSNormQuantKernel, QKRMSNormKernel, FusedAddRMSNormKernel, and FusedAddRMSNormQuantKernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in RMSNormKernel, RMSNormQuantKernel, QKRMSNormKernel, FusedAddRMSNormKernel, and FusedAddRMSNormQuantKernel.
  • include/flashinfer/pos_enc.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in RopeQuantizeKernel and RopeQuantizeAppendPagedKVCacheKernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in RopeQuantizeKernel and RopeQuantizeAppendPagedKVCacheKernel.
  • include/flashinfer/sampling.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in OnlineSoftmaxFusedKernel, OnlineSoftmaxMapKernel, and OnlineSoftmaxReduceKernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in OnlineSoftmaxFusedKernel, OnlineSoftmaxMapKernel, and OnlineSoftmaxReduceKernel.
  • include/flashinfer/trtllm/fmha/lse.cuh
    • Replaced griddepcontrol.wait with cudaGridDependencySynchronize() in ComputeLSEFromMDKernel.
    • Replaced griddepcontrol.launch_dependents with cudaTriggerProgrammaticLaunchCompletion() in ComputeLSEFromMDKernel.
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.

@coderabbitai
Contributor

coderabbitai bot commented Mar 7, 2026

📝 Walkthrough

Walkthrough

This pull request replaces inline CUDA assembly instructions for grid dependency control with corresponding CUDA runtime API calls across 19 files. The replacements follow a consistent pattern: griddepcontrol.wait; becomes cudaGridDependencySynchronize(); and griddepcontrol.launch_dependents; becomes cudaTriggerProgrammaticLaunchCompletion();. No public API signatures or control flow logic are modified.

Changes

Cohort / File(s) Summary
Flashinfer Attention Kernels
include/flashinfer/attention/cascade.cuh, include/flashinfer/attention/decode.cuh, include/flashinfer/attention/prefill.cuh, include/flashinfer/attention/blackwell/kernel/sm100_fmha_fwd_kernel_tma_warpspecialized.hpp, include/flashinfer/attention/blackwell/plan.cuh
Replaced inline assembly grid dependency instructions with CUDA runtime synchronization calls in multiple attention kernel implementations and Blackwell-specific optimizations.
Flashinfer GEMM Kernels
include/flashinfer/gemm/group_gemm_fp8_groupwise_sm100.cuh, include/flashinfer/gemm/group_gemm_mxfp4_groupwise_sm100.cuh, include/flashinfer/gemm/dsv3_router_gemm.cuh
Replaced grid dependency assembly with CUDA runtime API calls in quantized GEMM and router GEMM kernels.
Flashinfer Utility Kernels
include/flashinfer/norm.cuh, include/flashinfer/pos_enc.cuh, include/flashinfer/activation.cuh, include/flashinfer/sampling.cuh
Replaced inline grid dependency assembly with CUDA runtime equivalents in normalization, position encoding, activation, and sampling kernels.
Flashinfer Communication & FMHA
include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh, include/flashinfer/trtllm/fmha/lse.cuh
Replaced grid dependency assembly with CUDA runtime calls in AllReduce fusion and LSE computation kernels.
TensorRT-LLM Core Infrastructure
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/arch/grid_dependency_control.h
Updated central grid dependency control header wrapper functions to use CUDA runtime APIs instead of inline assembly directives.
TensorRT-LLM Quantization & Kernels
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
Replaced grid dependency assembly with CUDA runtime calls in quantization kernels.
Fused MOE Kernels
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh, csrc/fused_moe/noAuxTcKernels.cu
Replaced inline assembly grid dependency instructions with CUDA runtime API calls across multiple Cutlass-based MOE kernel implementations.
XQA Utilities
csrc/xqa/utils.cuh
Replaced grid dependency assembly with CUDA runtime synchronization for architecture >= 900 conditional paths.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~25 minutes

Suggested labels

run-ci

Suggested reviewers

  • djmmoss
  • bkryu
  • aleozlx
  • yzh119
  • jimmyzho
  • cyx-6
  • yongwww

Poem

🐰 Away with assembly, old and grim,
CUDA runtime calls now swim,
Grid dependencies, clean and bright,
No more dancing with PTX's might!
From chaos tamed to order true,
The kernels dance in morning dew.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The PR description is minimal and does not follow the provided template structure with required sections like Description, Related Issues, Pre-commit Checks, and Tests. Resolution: expand the description to include all template sections: a detailed explanation of changes, an explicit issue link in the suggested format, checklist completion status, and test information.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Title check (✅ Passed): The title clearly and accurately describes the main change: replacing inline PTX griddepcontrol with CUDA runtime API calls across the codebase.
  • Linked Issues check (✅ Passed): All code changes directly address issue #2558 requirements: replacing unsafe inline asm griddepcontrol.wait/launch_dependents with the CUDA runtime APIs cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion across 19 files.
  • Out of Scope Changes check (✅ Passed): All changes are scoped to replacing griddepcontrol inline assembly with CUDA runtime API calls; no unrelated modifications, refactoring, or feature additions are present in the changeset.



@he-yufeng
Author

Details

This replaces all 79 occurrences of inline PTX griddepcontrol assembly with proper CUDA runtime API wrappers across 19 files:

  • asm volatile("griddepcontrol.wait;") -> cudaGridDependencySynchronize()
  • asm volatile("griddepcontrol.launch_dependents;") -> cudaTriggerProgrammaticLaunchCompletion()

Why: The inline PTX lacked a "memory" clobber, which is UB. The compiler can freely reorder global memory accesses across the barrier, potentially breaking PDL synchronization. The CUDA runtime wrappers include proper compiler barriers and have been available since CUDA 12.0.
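
To make the clobber point concrete, a sketch (the helper name is illustrative, not taken from the diff): the inline PTX form is only well-defined with an explicit "memory" clobber, which the runtime wrappers carry implicitly.

```cuda
__device__ void wait_on_dependents_sketch() {
  // Problematic: no "memory" clobber, so the compiler may move global
  // loads/stores across this barrier.
  // asm volatile("griddepcontrol.wait;");

  // Correct PTX form: the "memory" clobber forbids such reordering.
  // asm volatile("griddepcontrol.wait;" ::: "memory");

  // Form adopted by this PR: the CUDA 12.0+ runtime wrapper, which
  // includes the compiler barrier.
  cudaGridDependencySynchronize();
}
```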

Scope: Covers all C/C++ files on current main. The earlier claude/issue-2558-20260216-0413 branch only touched 14 files and is now 71 commits behind, missing new files in csrc/fused_moe/cutlass_backend/ and csrc/xqa/.

Not covered:

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a large-scale refactoring that replaces deprecated griddepcontrol inline PTX assembly instructions with their modern CUDA runtime API equivalents: cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion. This is a valuable improvement for code quality, enhancing readability, maintainability, and ensuring forward compatibility with future CUDA versions. I have reviewed the changes across all 19 files and found them to be correct and consistently applied. No issues were found.

Note: Security Review did not run due to the size of the PR.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh (1)

1074-1079: ⚠️ Potential issue | 🟠 Major

Remove duplicate synchronization calls in moereduce_allreduce_fusion_kernel_oneshot_lamport kernel.

The kernel has redundant calls that execute sequentially:

  • cudaGridDependencySynchronize() at lines 937 and 969
  • cudaTriggerProgrammaticLaunchCompletion() at lines 1074 and 1078

Remove the outer guards at lines 936–938 and 1077–1079, keeping only the calls within the main SM90 block (lines 969 and 1074).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh` around lines 1074 -
1079, In the moereduce_allreduce_fusion_kernel_oneshot_lamport kernel there are
duplicate sync calls; remove the outer conditional guards that call
cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion() so
only the calls inside the SM90-specific block (the guard using
defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) remain; ensure you leave the
cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion
invocations that are inside the SM90 block and delete the redundant ones outside
it.
🧹 Nitpick comments (1)
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (1)

156-158: These PDL runtime API calls are already guarded for SM90+ architectures, making them safe for compilation.

The calls at lines 156-158 and 249-251 are protected by #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) guards. Since cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion() were introduced in CUDA Toolkit 12.0 and SM90 (Hopper) also requires CUDA 12.0+, the implicit guarantee holds.

However, for clarity and explicitness, consider adding an explicit CUDART_VERSION >= 12000 guard alongside the existing SM90 check to make the toolkit version requirement self-documenting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh` around lines
156 - 158, The preprocessor guards around the PDL runtime calls
(cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion) rely
only on SM90 checks; update both conditionals that currently read like "#if
(defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))" to also require the CUDA
runtime version by adding a "&& (defined(CUDART_VERSION) && (CUDART_VERSION >=
12000))" clause so the code explicitly documents and enforces CUDA Toolkit 12.0+
when calling cudaGridDependencySynchronize and
cudaTriggerProgrammaticLaunchCompletion.
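
The combined guard this nitpick asks for would look roughly like the following (a sketch under the review's assumptions; CUDART_VERSION is defined by cuda_runtime_api.h, and the exact macro spelling may differ in the final patch):

```cuda
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) && \
    (defined(CUDART_VERSION) && (CUDART_VERSION >= 12000))
  // Both Hopper-or-newer device code and a CUDA 12.0+ toolkit are
  // required before the PDL runtime calls are compiled in.
  cudaGridDependencySynchronize();
#endif
```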

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0fc62ef-cc14-47bc-a8e6-14621f9bc629

📥 Commits

Reviewing files that changed from the base of the PR and between 65d6e4a and c771490.

📒 Files selected for processing (19)
  • csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
  • csrc/fused_moe/noAuxTcKernels.cu
  • csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/arch/grid_dependency_control.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/xqa/utils.cuh
  • include/flashinfer/activation.cuh
  • include/flashinfer/attention/blackwell/kernel/sm100_fmha_fwd_kernel_tma_warpspecialized.hpp
  • include/flashinfer/attention/blackwell/plan.cuh
  • include/flashinfer/attention/cascade.cuh
  • include/flashinfer/attention/decode.cuh
  • include/flashinfer/attention/prefill.cuh
  • include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh
  • include/flashinfer/gemm/dsv3_router_gemm.cuh
  • include/flashinfer/gemm/group_gemm_fp8_groupwise_sm100.cuh
  • include/flashinfer/gemm/group_gemm_mxfp4_groupwise_sm100.cuh
  • include/flashinfer/norm.cuh
  • include/flashinfer/pos_enc.cuh
  • include/flashinfer/sampling.cuh
  • include/flashinfer/trtllm/fmha/lse.cuh

Comment on lines 769 to 779
 __device__ inline void preExit() {
 #if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
-  asm volatile("griddepcontrol.launch_dependents;\n");
+  cudaTriggerProgrammaticLaunchCompletion();
 #endif
 }

 __device__ inline void acqBulk() {
 #if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
-  asm volatile("griddepcontrol.wait;\n");
+  cudaGridDependencySynchronize();
 #endif
 }
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Programmatic launch API usage and guards =="
rg -n -C2 'cudaGridDependencySynchronize|cudaTriggerProgrammaticLaunchCompletion|__CUDACC_VER_MAJOR__' csrc include

echo
echo "== Repository CUDA-version references =="
fd '^(README.*|CMakeLists\.txt|.*\.cmake|Dockerfile.*)$' . -X rg -n 'CUDA|11\.8|12\.' {} 2>/dev/null || true

Repository: flashinfer-ai/flashinfer

Length of output: 50380


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Repository CUDA version constraints =="
find . -maxdepth 2 \( -name "CMakeLists.txt" -o -name "README.md" -o -name "setup.py" -o -name "*.cmake" \) -type f | head -20 | while read f; do
  echo "--- $f ---"
  grep -i "cuda" "$f" | grep -E "(11|12|version|minimum)" | head -10 || true
done

echo
echo "== CUDA version checks in recently modified kernels =="
rg -n "__CUDACC_VER" csrc/fused_moe/ csrc/xqa/ include/flashinfer/ -A1 -B1 | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 7420


Restore the CUDA 12 guard around these runtime calls.

Lines 771 and 777 switch from inline PTX to CUDA runtime APIs (cudaTriggerProgrammaticLaunchCompletion and cudaGridDependencySynchronize), but only guard on __CUDA_ARCH__. The repository officially supports CUDA 12.6+ and every other use of these APIs in the codebase gates them with __CUDACC_VER_MAJOR__ >= 12 (see include/flashinfer/sampling.cuh, include/flashinfer/norm.cuh, include/flashinfer/trtllm/fmha/lse.cuh). Without the version guard, compilation can fail on older toolchains.

Suggested patch
 __device__ inline void preExit() {
-#if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
+#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
   cudaTriggerProgrammaticLaunchCompletion();
 #endif
 }

 __device__ inline void acqBulk() {
-#if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
+#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
   cudaGridDependencySynchronize();
 #endif
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/xqa/utils.cuh` around lines 769 - 779, The runtime calls in preExit()
and acqBulk() are only guarded by __CUDA_ARCH__ but must also be gated by the
CUDA compiler version; wrap the cudaTriggerProgrammaticLaunchCompletion() call
in preExit and cudaGridDependencySynchronize() call in acqBulk with an
additional compile-time check for __CUDACC_VER_MAJOR__ >= 12 (i.e. require both
(__CUDA_ARCH__ >= 900) and (__CUDACC_VER_MAJOR__ >= 12)) so these APIs are only
used when the toolchain supports CUDA 12+; update the preExit and acqBulk macros
accordingly to match the pattern used elsewhere (e.g.,
include/flashinfer/*.cuh).

@he-yufeng
Author

Ping @yzh119 — any thoughts on this? It's been open a couple weeks. The inline PTX UB was reported by an NVIDIA engineer in #2558 so figured it's worth addressing.



Development

Successfully merging this pull request may close these issues.

griddepcontrol.wait should use "memory" clobber

1 participant