fix: Add SM120 (RTX Blackwell desktop) support for NVFP4 MoE kernels #2725
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses a compatibility issue preventing NVFP4 MoE grouped GEMM kernels from running on newer SM120 (RTX Blackwell desktop) GPUs. By relaxing the compute-capability checks from a strict equality to a greater-than-or-equal comparison, and explicitly adding SM120 to the list of supported architectures for compilation, the changes extend the kernels to this newer hardware. This ensures that users with Blackwell GPUs can leverage the optimized MoE kernels.
Activity
Note: Reviews paused. This branch is under active development, so CodeRabbit automatically paused this review to avoid an influx of comments on new commits.
📝 Walkthrough: Relaxed the TRT-LLM device SM checks to accept SM 10.x or SM 12.x; expanded the fused_moe JIT generators to include CUDA major version 12; added CompilationContext._normalize_cuda_arch to normalize CUDA compute-capability suffixes for nvcc -gencode flags.
Sequence Diagram(s): skipped (the changes are gating, flag generation, and normalization logic without a new multi-component sequential flow).

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request adds support for SM120 (RTX Blackwell desktop) GPUs to the NVFP4 MoE kernels. The changes correctly relax the hardcoded SM100-only checks to allow for newer architectures by changing an equality check to a greater-than-or-equal check in trtllm_fused_moe_kernel_launcher.cu and adding major version 12 to the supported versions in fused_moe.py. My review includes suggestions to improve a misleading error message and remove a redundant device capability check in the C++ code for better maintainability.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 417-418: Replace the >= check using TVM_FFI_ICHECK_GE(major, 10)
with an explicit allowlist check against supported SM major versions (inspect
the existing validated targets and check major against that set, e.g., major ==
X || major == Y) and update the error message to print the full SM as
"<major>.<minor>" and list the supported majors (e.g., "unsupported SM
<major>.<minor>; supported SM majors: ..."); apply the same change to the other
occurrence around lines 1345-1347 so both checks use the explicit allowlist and
the new descriptive error text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 68ea16c3-55df-4006-9354-8bee77ad27c9
📒 Files selected for processing (2)
- csrc/trtllm_fused_moe_kernel_launcher.cu
- flashinfer/jit/fused_moe.py
SM120 desktop Blackwell GPUs (RTX PRO 6000, RTX 5090) are blocked from NVFP4 MoE grouped GEMM due to hardcoded SM100-only checks.

Changes:
- jit/fused_moe.py: Add major version 12 to supported_major_versions
- csrc/trtllm_fused_moe_kernel_launcher.cu: ICHECK_EQ -> ICHECK_GE for the SM check

Tested: Qwen3.5-397B-A17B-NVFP4 on 4x RTX PRO 6000 (SM120, 96GB each)
- compute_120f + CUDA 13.0: 39.0 tok/s (correct output)
- compute_120a + CUDA 12.8: 14.6 tok/s (correct, slow fallback tactics)

Related: CUTLASS flashinfer-ai#2820, flashinfer-ai#2800; vLLM #33416, #33333

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
Force-pushed 35c37db to 15322e6 (Compare)
…compute_120f

- csrc/trtllm_fused_moe_kernel_launcher.cu: Changed ICHECK_GE to an explicit allowlist (major == 10 || major == 12) per reviewer request. Updated error messages to state the 'SM 10.x or SM 12.x' requirement.
- flashinfer/compilation_context.py: SM12.x GPUs now generate 'compute_120f' flags instead of 'compute_120a'. The 'f' suffix enables fast TMA WS grouped GEMM tactics on desktop Blackwell (39 tok/s vs 14.6 tok/s with the 'a' suffix). SM9.x retains 'a', SM10.x+ uses 'f'.

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/compilation_context.py`:
- Around line 44-47: The manual FLASHINFER_CUDA_ARCH_LIST parsing must apply the
same suffix mapping as the autodetection branch so Blackwell (major>=10) becomes
minor+"f" and CUDA 9 becomes minor+"a"; update the code that reads and splits
FLASHINFER_CUDA_ARCH_LIST (the logic that produces major and minor values used
by get_nvcc_flags_list()) to normalize each entry by converting numeric minors
to strings and appending "a" when major==9 or "f" when major>=10, mirroring the
existing autodetect block that modifies minor, so get_nvcc_flags_list() emits
the corrected compute targets.
- Around line 44-47: The code is appending the "f" suffix to SM10+ targets
unconditionally; update CompilationContext (either in __init__ where suffixes
are constructed or in get_nvcc_flags_list()) to filter out "*f" targets unless
the toolchain supports them by calling jit.cpp_ext.is_cuda_version_at_least with
the same CUDA-version gating logic used in aot.detect_sm_capabilities();
specifically, import is_cuda_version_at_least, determine the minimum CUDA
version required for each "*f" target (reuse the mapping/logic from
aot.detect_sm_capabilities()), and only append or emit targets like
"compute_103f" when is_cuda_version_at_least(required_major, required_minor)
returns true.
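Taken together, the two findings above suggest env-var parsing along these lines (a hypothetical sketch, not FlashInfer's actual code: the entry format, the `cuda_at_least_13` flag, and the helper name `parse_arch_list` are assumptions for illustration):

```python
def parse_arch_list(env_value, cuda_at_least_13=True):
    """Sketch: parse FLASHINFER_CUDA_ARCH_LIST entries ('major.minor'
    with an optional 'a'/'f' suffix) and apply the same suffix
    normalization as the autodetection path."""
    archs = []
    for entry in env_value.replace(",", " ").split():
        if entry[-1] in ("a", "f"):
            # Respect a user-provided suffix such as '12.0f' as-is.
            major_s, minor_s = entry[:-1].split(".")
            archs.append((int(major_s), minor_s + entry[-1]))
            continue
        major, minor = (int(x) for x in entry.split("."))
        if major == 9:
            archs.append((major, str(minor) + "a"))
        elif major >= 10:
            # Per the review: Blackwell gets 'f' only when the toolchain supports it.
            archs.append((major, str(minor) + ("f" if cuda_at_least_13 else "a")))
        else:
            archs.append((major, str(minor)))
    return archs
```

A user-supplied entry such as '12.0f' is kept verbatim, while bare entries get the same suffix policy as autodetection.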
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7027ec4f-c30e-47d1-94d4-91a398c68396
📒 Files selected for processing (2)
- csrc/trtllm_fused_moe_kernel_launcher.cu
- flashinfer/compilation_context.py
🚧 Files skipped from review as they are similar to previous changes (1)
- csrc/trtllm_fused_moe_kernel_launcher.cu
Refactor compilation_context.py per reviewer feedback:

- Extract a _normalize_cuda_arch() static method that both the FLASHINFER_CUDA_ARCH_LIST manual parsing and the auto-detection path route through for consistent suffix selection.
- Add CUDA version gating: only append the 'f' suffix to SM10+ targets when CUDA >= 13.0 is available (via is_cuda_version_at_least from flashinfer.jit.cpp_ext). Falls back to the 'a' suffix on older CUDA toolchains or when the import is unavailable.
- Respect user-provided suffixes in FLASHINFER_CUDA_ARCH_LIST (e.g. '12.0f') without re-normalizing.

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
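The suffix policy this commit describes can be sketched as follows (a simplified illustration only; the real `_normalize_cuda_arch` probes the toolchain via `is_cuda_version_at_least`, whereas here the CUDA check is passed in as a boolean):

```python
def normalize_cuda_arch(major, minor, cuda_at_least_13=False):
    """Sketch: SM 9.x gets the 'a' suffix; SM 10+ gets 'f' only when
    the toolchain is CUDA >= 13.0, otherwise falls back to 'a';
    older SMs get no suffix."""
    if major == 9:
        return (major, str(minor) + "a")
    if major >= 10:
        return (major, str(minor) + ("f" if cuda_at_least_13 else "a"))
    return (major, str(minor))
```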
Does that mean that after merging this PR, SM120 (NVIDIA RTX 6000 Pro Blackwell) can also run the MoE kernel from FlashInfer for the GPT-OSS-20B model?
```diff
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
- TVM_FFI_ICHECK_EQ(major, 10) << "MoE kernel requires 10.x architecture. Current device has SM "
+ TVM_FFI_ICHECK(major == 10 || major == 12) << "MoE kernel requires SM 10.x or SM 12.x architecture. Current device has SM "
```
Maybe 11 (Thor) also works? It should be very similar to 10.x.
Possibly. SM120 and Blackwell cards are still not nearly as well supported as they should be; I wouldn't say for sure.
The problem is that vLLM currently cannot use FlashAttention for SM120; it uses Triton attention instead. Will this PR fix that?
/bot run |
Constrain the compute_*f suffix to SM 12.x (RTX Blackwell desktop) specifically, rather than applying it to all major >= 10. This avoids possible performance regressions on SM 10.x (B200) where compute_100a is the validated target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
brandonmmusic-max left a comment:
Thanks — that makes sense. I restricted the f suffix normalization to SM120 specifically to avoid possible regressions on SM100/B200. The PR has been updated.
The device capability check in FusedMoeFP4Launcher::init() is redundant — init_common() already performs the same validation. Remove the duplicated check and the static device_props lambda. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/compilation_context.py`:
- Around line 47-56: The try/except around "from flashinfer.jit.cpp_ext import
is_cuda_version_at_least" should catch RuntimeError and ValueError (or a general
Exception) in addition to ImportError so failures from is_cuda_version_at_least
don't propagate; update the logger.debug message in compilation_context.py to
say something like "Could not determine CUDA version; falling back to 'a' suffix
for SM %d.%d" and include the caught exception details, then return (major,
str(minor) + "a") as the fallback.
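The exception-tolerant fallback requested above can be sketched like this (hypothetical: the `version_probe` callable stands in for `is_cuda_version_at_least`, which may raise `RuntimeError` or `ValueError`):

```python
def select_sm12_suffix(version_probe):
    """Sketch: return '0f' for SM 12.x when the CUDA toolchain reports
    >= 13.0, falling back to '0a' when the probe fails or says no."""
    try:
        if version_probe("13.0"):
            return "0f"
    except (ImportError, RuntimeError, ValueError):
        # nvcc unavailable or malformed version string: degrade gracefully
        pass
    return "0a"
```

In the real code the except branch would also log the failure at debug level before falling back.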
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bf765eee-2187-4f4d-8235-ac00013ccf8b
📒 Files selected for processing (1)
flashinfer/compilation_context.py
is_cuda_version_at_least can raise RuntimeError (nvcc unavailable) or ValueError (malformed version string), not just ImportError. Catch all three to ensure graceful fallback to the 'a' suffix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DGX Spark (SM 12.1) should also compile as compute_120f, not 121f. All SM 12.x variants are now normalized to SM 120 for compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi Ka-Hyun,
I have made the requested changes and pushed them. I also updated the
docstring to reflect these updates, as I thought that might be helpful for
future reference.
Best regards,
Brandon M. Music
…
On Fri, Mar 13, 2026 at 5:12 PM Ka-Hyun Nam ***@***.***>
wrote:
> ***@***.**** requested changes on this pull request.
>
> One last comment, thanks!
> ------------------------------
>
> In flashinfer/compilation_context.py
> <#2725 (comment)>
> :
>
> > @@ -30,20 +30,55 @@ class CompilationContext:
> "-DFLASHINFER_ENABLE_FP4_E2M1",
> ]
>
> + @staticmethod
> + def _normalize_cuda_arch(major: int, minor: int) -> tuple[int, str]:
> + """Normalize a (major, minor) capability pair into a (major, minor_str)
> + tuple with the correct architecture suffix for nvcc.
> +
> + SM 9.x -> 'a' suffix (e.g. compute_90a)
> + SM 12.x -> 'f' suffix (e.g. compute_120f) when the installed CUDA
> + toolchain supports it (CUDA >= 13.0), otherwise 'a'.
> + SM 10+ -> 'a' suffix (e.g. compute_100a)
> + SM < 9 -> no suffix
> + """
> + if major == 9:
> + return (major, str(minor) + "a")
> + elif major == 12:
>
> One last nit after discussing with @nv-yunzheq
> <https://github.com/nv-yunzheq> ; I think we also want to compile DGX
> spark (sm121) to sm120f, instead of 121f.
>
> Can we change the logic to return always major + 0f (or just 120f) on
> line 50?
>
/bot run |
[SUCCESS] Pipeline #46362901: 13/20 passed
Hi, the pre-commit check has failed; could you take a look? The steps for pre-commit are in the PR default template (✅ Pre-commit Checks).
Hi there, pre-commit hooks pass locally — all 14 hooks clean (clang-format v19.1.1, ruff, mypy). I pushed an empty commit to retrigger CI. The 7 test failures in the earlier run appear to be CI runner timeouts (A10G/T4 all fail at exactly 1m47s) rather than code issues; the H100 JIT unittest passed successfully after 4+ hours. If there's anything else I can do, please let me know!
Hi, there is a specific check named "pre-commit" which is marked "Required" in the checklist at the bottom of the page. This blocks the merge because it is required. Specifically, it shows:

```diff
diff --git a/csrc/trtllm_fused_moe_kernel_launcher.cu b/csrc/trtllm_fused_moe_kernel_launcher.cu
index b1af6ee..e6a5cf2 100644
--- a/csrc/trtllm_fused_moe_kernel_launcher.cu
+++ b/csrc/trtllm_fused_moe_kernel_launcher.cu
@@ -414,8 +414,9 @@ void FusedMoeLauncher::init_common(
   int major = 0, minor = 0;
   cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
   cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
-  TVM_FFI_ICHECK(major == 10 || major == 12) << "MoE kernel requires SM 10.x or SM 12.x architecture. Current device has SM "
-                                             << major << minor;
+  TVM_FFI_ICHECK(major == 10 || major == 12)
+      << "MoE kernel requires SM 10.x or SM 12.x architecture. Current device has SM " << major
+      << minor;
   this->device_version = std::make_tuple(major, minor);

   args->routing_logits = routing_logits.has_value() ? routing_logits.value().data_ptr() : nullptr;
diff --git a/flashinfer/compilation_context.py b/flashinfer/compilation_context.py
index 14b7863..c6072fb 100644
--- a/flashinfer/compilation_context.py
+++ b/flashinfer/compilation_context.py
@@ -47,12 +47,15 @@ class CompilationContext:
         elif major == 12:
             try:
                 from flashinfer.jit.cpp_ext import is_cuda_version_at_least
+
                 if is_cuda_version_at_least("13.0"):
                     return (major, "0f")
             except (ImportError, RuntimeError, ValueError):
                 logger.debug(
                     "Could not determine CUDA version; "
-                    "falling back to 'a' suffix for SM %d.%d", major, minor
+                    "falling back to 'a' suffix for SM %d.%d",
+                    major,
+                    minor,
                 )
             return (major, "0a")
         elif major >= 10:
@@ -77,9 +80,7 @@
         try:
             for device in range(torch.cuda.device_count()):
                 major, minor = torch.cuda.get_device_capability(device)
-                self.TARGET_CUDA_ARCHS.add(
-                    self._normalize_cuda_arch(major, minor)
-                )
+                self.TARGET_CUDA_ARCHS.add(self._normalize_cuda_arch(major, minor))
         except Exception as e:
             logger.warning(f"Failed to get device capability: {e}.")
```
- Break long TVM_FFI_ICHECK line per clang-format v19.1.1
- Add blank line after import per ruff format
- Reformat logger.debug args to one-per-line
- Inline single _normalize_cuda_arch call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Head branch was pushed to by a user without write access
looks all good! auto-merge enabled thx for the contrib! |
Hi @aleozlx — thanks for the heads-up on the pre-commit check. I've run the hooks again, and the pre-commit check is now passing (green). All 14 hooks pass locally as well. Regarding the other CI failures: the A10G and T4 JIT unittest jobs all time out at exactly 1m47s, which appears to be a runner provisioning issue rather than a code problem. The H100 JIT unittest passed successfully after a full 4+ hour run, confirming the code works correctly on supported hardware. Please let me know if there's anything else needed. Thank you for reviewing!
<!-- .github/pull_request_template.md -->

## 📌 Description

Bug found in nightly [Spark, 12.9] matrix https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/285092631, where Spark compiles to "120a" (see the "/tmp/.cache/flashinfer/0.6.6/120a/" path in the log below).

```
E RuntimeError: Check failed: (status == cudaSuccess) is false: SingleDecodeWithKVCache kernel launch failed, error: no kernel image is available for execution on the device
/tmp/.cache/flashinfer/0.6.6/120a/generated/single_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_head_dim_qk_128_head_dim_vo_128_posenc_2_use_swa_False_use_logits_cap_False/single_decode.cu:100: RuntimeError: Check failed: (status == cudaSuccess) is false: SingleDecodeWithKVCache kernel launch failed, error: no kernel image is available for execution on the device
```

Root cause was #2725, where we added logic for compiling both Spark and Thor to 120f, but on the condition that the CUDA version is 13 or higher. Lower versions (12.9) default to the 'a' suffix, 120a.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **Bug Fixes**
  * Strengthened CUDA validation for SM 12.x GPUs: now requires CUDA 12.9 or newer and emits a clear error if unmet, replacing the previous silent fallback behavior.
Summary
SM120 desktop Blackwell GPUs (RTX PRO 6000, RTX 5090) are blocked from NVFP4 MoE grouped GEMM due to hardcoded SM100-only checks.
Changes:
- jit/fused_moe.py: Add major version 12 to `supported_major_versions`
- csrc/trtllm_fused_moe_kernel_launcher.cu: `ICHECK_EQ(major, 10)` -> `ICHECK_GE(major, 10)`

Benchmark (Qwen3.5-397B on 4x RTX PRO 6000 SM120):
Root cause: All TMA WS grouped GEMM autotuner tactics fail on `compute_120a`, requiring `compute_120f` (CUDA 13.0). CuTe DSL `admissible_archs` in vendored CUTLASS also needs `sm_120a`/`sm_120f` (cpasync/copy.py, tcgen05/mma.py, arch/mbar.py, etc.).

Related: CUTLASS #2820, #2800; vLLM #33416, #33333; FlashInfer #2577
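For context, the normalized (major, minor-with-suffix) pairs discussed in this PR turn into nvcc `-gencode` flags shaped roughly like this (a sketch; the exact flag builder in `compilation_context.py` may differ):

```python
def gencode_flags(archs):
    """Sketch: map normalized pairs such as (12, "0f") to nvcc flags
    like -gencode=arch=compute_120f,code=sm_120f."""
    flags = []
    for major, minor_str in archs:
        arch = f"{major}{minor_str}"  # e.g. (12, "0f") -> "120f"
        flags.append(f"-gencode=arch=compute_{arch},code=sm_{arch}")
    return flags
```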
Summary by CodeRabbit